
The Bug
The trouble with LLMs is their inconsistency.
In one experiment, researchers asked an AI the same question 1,000 times and got 80 completely different answers. Even more surprising, the answers were identical for the first 102 tokens before diverging: at that point, 992 responses said “Queens, New York” while 8 said “New York City” when describing where physicist Richard Feynman was born.
Just plain weird.
But given the overall value that LLMs offer, we accept that our ‘go to’ LLM is just being ‘creative’. So we’ve learnt to sense-check the output when fidelity matters, which it often does.
Even so, this is irritating. Enough to make us wonder whether anyone really knows why it happens, and whether LLM inconsistency can be fixed.
Turns out that being ‘creative’ with the facts is not the root of the inconsistency. Even when we tell the AI to be completely deterministic (non-creative) by turning the temperature setting down to zero, it still gives different responses to identical questions.
Until now, this was a mystery, with plenty of hypotheses about what was happening under the bonnet.
The clever folk at Thinking Machines have now pinpointed the real culprit.
Your AI’s answer depends on how many other people are using the LLM system at the same time.
Here’s why.
Ever wondered what happens to your prompt/question after you punch in your request? It joins every other request being made at that time.
For greater efficiency, multiple user requests are processed together in “batches”. Batching is dynamic. So, when the system is busy, it might process fifty requests together. When it’s quiet, maybe just five.
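For the curious, here’s a toy sketch of dynamic batching in Python (purely illustrative; real serving engines are far more sophisticated): whatever is queued when the model is ready gets grouped into the next batch, so batch size rises and falls with traffic.

```python
import queue

pending: queue.Queue = queue.Queue()  # requests arriving from all users

def next_batch(max_batch: int = 50) -> list:
    # Grab whatever is waiting right now, up to a cap.
    batch = []
    while not pending.empty() and len(batch) < max_batch:
        batch.append(pending.get())
    return batch

# Quiet moment: only a few requests are waiting.
for i in range(3):
    pending.put(f"req-{i}")
print(len(next_batch()))  # 3

# Busy moment: a burst of traffic fills the whole batch.
for i in range(100):
    pending.put(f"req-{i}")
print(len(next_batch()))  # 50 (capped)
```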
Here’s the punchline.
The mathematical operations inside the AI give slightly different results depending on batch size, even for identical individual requests. That’s because computers round tiny fractions, and the order in which numbers get added changes those roundings; different batch sizes trigger different calculation orders.
Large batches use strategies optimised for many simultaneous requests. Small batches switch to a different approach that’s optimised for fewer.
This dynamic switching delivers excellent performance because the system always uses the most efficient strategy for the current situation. However, it creates unpredictability because the same request can be processed differently depending on server load.
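Here’s a minimal Python sketch of the underlying arithmetic effect (my illustration, not the actual inference kernels): summing the same numbers in chunks versus one long pass, the way different batch strategies do, gives very slightly different totals.

```python
import random

random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

def add_up(xs):
    # Plain left-to-right floating-point addition.
    total = 0.0
    for x in xs:
        total += x
    return total

# "Quiet server" style: one long sequential pass.
sequential = add_up(values)

# "Busy server" style: reduce in chunks, then combine the partial sums,
# the way a parallel strategy would.
partials = [add_up(values[i:i + 1024]) for i in range(0, len(values), 1024)]
chunked = add_up(partials)

print(sequential == chunked)      # very likely False
print(abs(sequential - chunked))  # a tiny, but real, difference
```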
Think of it like a restaurant kitchen: the same recipe might turn out slightly different depending on whether the chef is cooking one meal or preparing a banquet for 100 people, even if the ingredients and techniques are identical.
That’s why LLM output is inconsistent.
How to Fix It
Now that the problem’s understood, researchers have developed techniques that ensure individual requests get the same answer regardless of how busy the system is.
Here’s the hack.
- Ensure calculations happen in the same sequence regardless of batch size
- Use the same mathematical approaches whether processing 1 request or 1,000
- Update how the AI handles memory and calculations
When implemented, these changes make AI responses perfectly consistent.
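To make the fix concrete, here’s a toy contrast in Python (my own illustration of the principle, not the researchers’ actual GPU kernels): one reduction adapts its strategy to how busy the “server” is, the other always uses the same strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
request = rng.standard_normal(100_000).astype(np.float32)  # one request's numbers

def adaptive_sum(row, batch_size):
    # Today's behaviour (simplified): pick a strategy based on load.
    chunk = 4096 if batch_size >= 50 else 64
    total = np.float32(0.0)
    for i in range(0, len(row), chunk):
        total += row[i:i + chunk].sum()
    return total

def invariant_sum(row, batch_size):
    # The fix (simplified): same strategy regardless of load.
    chunk = 256
    total = np.float32(0.0)
    for i in range(0, len(row), chunk):
        total += row[i:i + chunk].sum()
    return total

# Same request, quiet server (batch of 5) vs busy server (batch of 50):
print(adaptive_sum(request, 5) == adaptive_sum(request, 50))    # very likely False
print(invariant_sum(request, 5) == invariant_sum(request, 50))  # always True
```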
Reasons To Be Happy
Here are some scenarios that will improve as this breakthrough is operationalised.
Organisational Use
Reliability: If you’re using AI for critical business decisions, the current inconsistency means you might get different strategic recommendations for identical scenarios depending on when you ask.
Testing and Validation: Companies trying to evaluate AI performance can’t get reliable benchmarks when the same test produces different results each time.
Customer Experience: Users may notice inconsistent responses to similar queries, potentially undermining trust in AI-powered services.
AI Training and Development
Research Reproducibility: Scientists can’t properly validate AI research when experiments aren’t repeatable. This slows down progress across the entire field.
Performance Measurement: It’s impossible to accurately measure whether an AI system is improving if baseline measurements keep changing.
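To see the problem for yourself, here’s a minimal consistency probe, assuming an OpenAI-style Python client (the same pattern works with any LLM API; the model name is just a placeholder): ask one question many times at temperature 0 and count the distinct answers.

```python
from collections import Counter
from openai import OpenAI  # assumes the `openai` package and an API key

client = OpenAI()

def probe(prompt: str, n: int = 100, model: str = "gpt-4o-mini") -> Counter:
    # Ask the identical question n times at "deterministic" settings.
    answers = Counter()
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        answers[resp.choices[0].message.content] += 1
    return answers

results = probe("Where was the physicist Richard Feynman born?")
print(f"{len(results)} distinct answers across {sum(results.values())} runs")
```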
So lots of positives from retuning the engine. But will this resolve our trust issues with LLMs?
Not entirely.
The fix we just explored ensures consistent responses but doesn’t eliminate the issue of plausible-sounding but incorrect information. Closing that accuracy gap still requires RAG (Retrieval-Augmented Generation) capability, which grounds responses in facts from your knowledge base. Don’t sunset your RAG investment just yet!
In fact, greater consistency makes RAG systems better.
Consistent retrieval results: When the same query hits your knowledge base, you get the same retrieved documents every time.
Predictable response quality: Your RAG system becomes more trustworthy for business-critical applications.
Better evaluation: You can accurately measure and improve RAG performance when results are reproducible.
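On that last point, here’s a sketch of the kind of golden-answer regression test that only becomes reliable once responses are deterministic (`ask_rag` below is a hypothetical stand-in for your own RAG pipeline):

```python
import hashlib

def ask_rag(query: str) -> str:
    # Hypothetical stand-in: retrieve documents, then generate an answer.
    return "Richard Feynman was born in Queens, New York."

# Golden answers recorded from a known-good run.
GOLDEN = {
    "Where was Feynman born?": hashlib.sha256(
        "Richard Feynman was born in Queens, New York.".encode()
    ).hexdigest(),
}

def test_rag_is_stable():
    # With deterministic inference, any hash mismatch is a real regression,
    # not batch-size noise.
    for query, expected in GOLDEN.items():
        got = hashlib.sha256(ask_rag(query).encode()).hexdigest()
        assert got == expected, f"Answer drifted for: {query!r}"

test_rag_is_stable()
print("RAG answers are stable")
```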
For customer contact leaders, this is all good news. And hopefully this explanation for non-technical minds will generate a few more IT-Ops conversations as a result.
Anything else to know before inviting IT to go fix our LLMs?
Yes.
Current research shows that making AI deterministic (reliable) costs about 60% more processing time. Here’s why.
The 60% performance penalty occurs because deterministic systems must:
- Give up adaptive optimisations that would speed up processing based on current conditions
- Use conservative strategies that work across all scenarios rather than optimal ones for specific situations
- Accept inefficiencies in low-batch scenarios to maintain consistent behaviour
In summary, consistency costs more.
So, is it worth the extra cost?
Well, remember the headline that introduced this post: researchers asked an AI the same question 1,000 times and got 80 completely different answers. When researchers re-ran that experiment using the new deterministic approach, all 1,000 responses were identical.
You’ll have to decide whether consistency is worth this performance penalty. Each use case needs scrutiny.
In Customer Contact, you might conclude that any high-risk situation needing 100% consistency of information or advice always includes human supervision. That’s currently a sensible argument.
However, if both the accuracy and the consistency of LLM-enabled AI agents could be controlled within an acceptable margin of error (zero tolerance being the safest option), then the degree of supervision, and the expertise needed to supervise, changes for the better. At that point we might have evolved from default supervision of high-risk use cases to exception-based supervision.
On the broader point of trade-offs, what’s likely to emerge are more nuanced deployment strategies, such as running separate “deterministic” AI services for critical applications while using faster, less consistent versions for general use.
The Bottom Line
Thinking Machines’ research reveals that AI inconsistency isn’t an inherent limitation of the technology. Instead, it’s a solvable engineering problem. However, the solution requires trade-offs between speed and consistency. As AI becomes more critical to business operations, understanding and addressing reliability issues will become essential for responsible AI deployment.
At stake is research validity, business reliability, customer trust, and the fundamental question of when we can truly depend on AI systems for important decisions.
For those interested in more detail on the original research, the ‘full fat’ technical explanation and solution is here.
Also, consider inviting in Brainfood for a chat on how this and other AI capabilities impact Customer Contact.
Thank you for your time and attention.