Are LLM Hallucinations Really A Threat To Your CX?


listen to or read this Deep Dive

written by a human – voiced by their AI twin

Big Picture Scene Setting

Looking through the lens of operational leadership, there are ongoing implications as AI is progressively integrated into everyday life. We are a transitional generation. Pivoting from a familiar way of working to something still being formed.

Such is the volume of new things to think about and act on that we are having to learn about things we could once have left to our technically minded colleagues. What was fringe to a successful career is now core. It seems everyone now needs their own foundation understanding of AI to remain part of what’s next.

The mission is simply this.

We’re being enlisted into a collective effort to reinvent our organisations. To figure out through trial and significant error, when and if AI makes sense. And of course, ensure progress is being made before others race too far ahead.

In this context, informed decision-making needs foundation understanding rather than narrow up-skilling in what’s a rapidly evolving set of technologies.

A foundation approach offers broader benefits. Such as understanding how to mitigate the key risks and therefore protect your ability to realise the benefits. Or put more bluntly: not allow ignorance of the downsides to screw up your investment in realising the upsides.

Here’s another important benefit. Having a foundation understanding of how Large Language Models (LLMs) behave allows you to distinguish ‘the art of the possible’ from ‘a hope for the impossible’.

To put this into context, let’s think about one of LLMs’ better-known behaviours: the tendency to ‘hallucinate’. What’s your understanding of this behaviour? Are you a ‘comes with the territory’ type of person? Or are you mystified by the very concept? As in, ‘is psychotropic artificial intelligence even a thing?’

Having a clear understanding of whether you can ever expect to completely remove hallucinations from LLM behaviour does matter. It influences if, when, and how you deploy them, and sets expectations for their management.

Understanding LLM Hallucinations

Hallucinations occur when a model generates information that is false, misleading, or unsupported by facts, while presenting it as truthful and accurate.

These hallucinations can range from minor inaccuracies to completely fabricated information. In customer service contexts, this might manifest as an AI agent providing incorrect product details, misquoting company policies, or inventing non-existent services.

Or maybe something even more consequential, as in this healthcare example.

According to The New York Times, which reported the story, the AI in question ‘drafted a message reassuring a patient she had gotten a hepatitis B vaccine’, when in fact the patient’s records showed ‘no evidence of that vaccination’.

Does this have life-threatening consequences? Maybe not. But it certainly undermines the health provider’s reputation.

Now, you might hear this and, with enough foundation understanding, immediately wonder why the LLM was not plumbed into an approved source of patient information. Or ask why human oversight was not embedded into the patient communication workflow.

In which case, as a leader in that healthcare provider, your issue is not your own level of understanding. It’s that of others who fell short of recognising the risks. Or maybe it’s just a matter of finding someone with enough know-how to troubleshoot and report back on whether the incident was a one-off or systemic.

Alternatively, if the story did surprise you, here’s some relevant foundation understanding to absorb.

‘Human intelligence’ and ‘artificial intelligence’ are not the same. Even though you’ll find the opposite assumption frequently restated in a variety of ways across social media and broadcasting. In the absence of any well-publicised contrary view, it’s easy to end up believing that artificial intelligence is basically the same as our own.

But the differences are crucial to understand.

As a Large Language Model generates output, it is not consciously aware of ‘right’ versus ‘wrong’ answers as a human mind is. Instead, it is predicting what to say next based on statistical probabilities. These are shaped by the completeness and accuracy of its training data and the way in which the model is designed, built and optimised.
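
For those who like to see the mechanics, here’s a deliberately tiny sketch in Python of that ‘predict what comes next’ behaviour. The vocabulary and probabilities are invented purely for illustration; real models work over tens of thousands of tokens, with distributions learned from training data. The point to notice is that nothing in this process checks whether the output is factually true.

```python
import random

# Illustrative only: a toy "model" with an invented vocabulary and invented
# probabilities. A real LLM learns distributions like this from its training
# data, across tens of thousands of possible tokens.
toy_next_word_probs = {
    "The patient received a hepatitis B": [
        ("vaccine", 0.62),
        ("test", 0.25),
        ("diagnosis", 0.13),
    ],
}

def predict_next_word(prompt: str) -> str:
    """Choose the next word by sampling from a probability distribution.

    Nothing here consults patient records or any other source of truth.
    The choice is purely statistical, which is why fluent, confident-sounding
    output can still be factually wrong.
    """
    candidates = toy_next_word_probs[prompt]
    words = [word for word, _ in candidates]
    weights = [prob for _, prob in candidates]
    return random.choices(words, weights=weights, k=1)[0]

print(predict_next_word("The patient received a hepatitis B"))
```

Scale that idea up by billions of parameters and you have the essence of generation: plausible continuation, not verified fact.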

It is also important to appreciate AI’s lack of adaptability relative to our own. Humans excel at adapting to new situations and contexts, often with little to no conscious effort. In contrast, Machine Learning and Deep Learning experts will tell you that choosing the right class of model is crucial. ‘Best fit’ is based on what each type of model is optimised to excel in.

Unsurprisingly, this principle applies to Large Language Model choices as well. This means a great model will underperform when picked for the wrong job. In other words, the use case directly influences the quality of outcome.

As evidence of this, Stanford researchers have found that, even though fine-tuned, legal advice models still hallucinate in about one out of six queries, which is roughly 16% of the time. This still compares favourably against ‘general purpose’ models, which showed failure rates above 50% when tasked with answering legal questions.

It’s back to an old truth. Choosing the right tool matters.

As an operational leader, you should not expect yourself to be qualified to make such choices (unless you choose to ‘go geek’). All you need is enough understanding to know it’s ‘a question that needs raising’ and ‘a choice that needs making’.

Just keep in the back of your mind that LLMs are not created equal. Nor are the latest releases a guarantee of reduced hallucinations, simply because the context of their use is a huge influence on their effectiveness.

This is why organisations need to be actively aware of which LLMs they currently deploy, are about to deploy, and how they are being used.

Unfortunately, tracking them can be difficult, because their presence is often opaque. Especially when they are bundled into larger solutions such as CCaaS (contact centre as a service) or the incoming wave of AI agent solutions.

Here’s a visual from the AI For Non Technical Minds course that illustrates the issue.

[Image: AI For Non Technical Minds course visual]

Trust But Verify

Why does this matter to customer service leaders? It matters because Large Language Models increasingly drive today’s customer service infrastructure. Vendor competition to be first to market with the next ‘bigger-better’ AI capability has been intense ever since day one of Generative AI’s premature market release, which is another story for another time.

As a result, every category, from omni-channel orchestration to real-time conversation intelligence, is now infused with LLM goodness, as the visual shows. It turns out that ‘finding patterns in large datasets in real time’ and ‘what contact centres churn out by the bucketload each day’ make for a ‘perfect marriage’.

So far, results are encouraging. But given everything just said about hallucinations, what unfamiliar threats might service leaders now face from LLMs?

The trouble is you need to poke around quite a bit to find out. Contact centre solution providers seldom, if ever, develop their own Large Language Models given the expense and expertise required.

It’s an intensely competitive space that needs deep pockets. One in which next generation models are launched at breakneck speed in the race to be first. One that is now evolving into a quest for supremacy in Agentic AI, which takes the art of generation into the world of execution.

Do you imagine that contact centre vendors or their channel partners know much about these Large Language Models? Obviously their in-house solution designers do, since they have to select the best ones to integrate. And to be clear, I’m not suggesting that bad choices are being made.

But Large Language Models remain ‘black boxes’ in terms of how they work, even to their original designers and the broader research community that investigates them. To massively understate the challenges involved, making the inner workings of a published model ‘transparent’ and ‘adaptive’ remains work in progress. In fact, we have scarcely got going.

This prompts the question of whether you and your organisation are taking too much on trust.

Why?

We know models behave differently across generations, given ‘design tweaks’ and ‘fine-tuning’ in the quest to reach the next competitive breakthrough. Might a new version therefore impact previously reliable behaviour?

Is it unreasonable to assume contact centre vendors will be drawn or nudged towards embedding the latest version to stay ahead? Cloud solutions are constantly being iteratively improved. It’s a core feature. If that’s the case, what are the risks of introducing a new LLM? As said earlier, it’s best to understand the downsides to protect the upside benefits.

The problem is, what does the typical operational leader in customer service actually know about any of this?

Has anyone thought to ask which models are being used? Whether they were tested against specific use cases in customer service, for your sector’s needs, and potentially for any regulatory demands? Testing all that is a time commitment that’s not so easy when it’s a race to be first.

Also, these questions might need to be directed at a single partner solution. Or more likely, be directed towards the full mix of vendors that make up your customer service ecosystem.

And it is highly unlikely that, in such a young market, there is even the notion, as yet, of a commonly accepted testing framework or set of standards.

But putting that to one side, let’s assume that sufficient due diligence is taking place. Are the results being shared with you?

You and your team need that insight to ensure these models are effectively supervised and managed. No LLM is perfect. They hallucinate, exhibit bias, and can behave in other unexpected ways.

But hampered by a lack of foundation understanding, we’ve learnt to believe in a trinity of safeguards that we were asked to take on trust: guardrails, fine-tuning, and of course RAG.

As a side note: RAG, or Retrieval-Augmented Generation, is a technique that enhances LLMs by integrating them with quality-controlled data sources, such as a curated in-house knowledge base. This enables them to provide more accurate and contextually relevant responses.
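
For the technically curious, here’s a minimal sketch of the RAG pattern in Python. The tiny in-memory knowledge base, the naive keyword retrieval and the call_llm stand-in are all assumptions chosen for illustration; a production system would use a vector database, embeddings and a real model API behind your vendor’s platform.

```python
# Minimal RAG sketch. Everything here is a stand-in for illustration:
# the knowledge base, the keyword-overlap retrieval and call_llm().

KNOWLEDGE_BASE = [
    "Refunds are available within 30 days of purchase with proof of receipt.",
    "Premium support is available 24/7 via phone and chat.",
    "Standard delivery takes 3 to 5 working days within the UK.",
]

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Naive retrieval: rank documents by word overlap with the question."""
    question_words = set(question.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call. Here it simply echoes the grounded prompt."""
    return f"[model response grounded in]\n{prompt}"

def answer_with_rag(question: str) -> str:
    """Ask the model to answer only from retrieved, approved content."""
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using ONLY the context below. If the answer is not in the "
        f"context, say you don't know.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

print(answer_with_rag("How long do I have to claim a refund?"))
```

The curated knowledge base narrows what the model can draw on, which is why RAG reduces hallucinations. It does not change how the model generates text, which is why it cannot remove them.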

The combined impact of guardrails, fine-tuning, and RAG is that hallucinations are certainly reduced. But they are not eliminated.
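
To illustrate why ‘reduced, but not eliminated’ is the honest framing, here’s one simple example of a guardrail: a post-generation check that a draft answer is reasonably supported by approved sources before it is sent. The word-overlap test below is a crude assumption chosen for brevity; real systems use stronger checks such as entailment models, citation verification or human review, and even those are probabilistic.

```python
def supported_by_sources(draft: str, sources: list[str], threshold: float = 0.6) -> bool:
    """Crude guardrail: flag a draft whose wording shares too little with the sources.

    A check like this catches some fabrications and misses others, which is
    exactly why hallucinations are reduced rather than eliminated.
    """
    draft_words = set(draft.lower().split())
    source_words = set(" ".join(sources).lower().split())
    if not draft_words:
        return False
    overlap = len(draft_words & source_words) / len(draft_words)
    return overlap >= threshold

sources = ["Refunds are available within 30 days of purchase with proof of receipt."]
print(supported_by_sources("Refunds are available within 30 days of purchase", sources))   # True
print(supported_by_sources("Refunds are available for a full year, no receipt needed", sources))  # False
```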

So, should ‘passive acceptance’ convert into proactive risk management dialogue and action?

Is it not best to adopt a mindset of ‘Trust But Verify’, and start to ask more penetrating questions next time your vendor’s customer success team visits?

In anticipation, here are a few questions to kickstart that agenda.

  1. Which solutions that we purchase from you use LLMs? Who makes them? What make/model are the ones we are currently using? Are they modified? If so, how? By who? For what benefits?
  2. What’s your upgrade policy when new models are released? How do we collaborate on intended upgrades and acceptance criteria?
  3. What criteria do you use to choose the best LLM to match the use case? How has it been tested to verify this?
  4. What specific measures have you implemented to reduce or mitigate LLM hallucinations in your CCaaS solution? What’s your guidance on expected hallucination rates and your advice in relation to the ‘xyz’ use cases we want to test?
  5. What performance data can you share with us on hallucination rates? Is this historic or kept up to date? Do you have specific data that aligns with our own use cases? How can we collaborate on data sharing and establishing our own benchmarks and reporting?
  6. What capability can you provide to help us manage and minimise hallucinations – from pre-launch testing to operational management/reporting and performance optimisation?
  7. What training or guidance do you provide to help our staff effectively use and oversee the LLM-powered features in your CCaaS solution?

Remember, your purpose is to assess the level of risk you are inheriting and how that is balanced against the benefits brought by LLMs.

Also, approach these sessions in the right spirit. They are going to work best as collaborative rather than confrontational conversations.

Finally, be patient. It might take more than one session to get into the rhythm of productive discussion and ensure the right expertise is in the room.

Concluding Thoughts

Hallucinations are a feature of LLM technology until someone invents another architecture to enable generative AI. They can be reduced based on use case and various safeguards. Some I mentioned. Others I’ll share in a follow-up ‘deep dive’.

Being able to engage in discussion and decision-making on LLMs is becoming an operational responsibility, given the increasingly important role they are set to play in AI-first operating models, in which contact centres and customer service are early adopters.

Of course, it’s going to be a team effort and deep technology expertise is always needed. Therefore investing in a common language and foundation understanding helps operational and technical teams collaborate more effectively.

In fact, that’s why Brainfood Training brought AI For Non Technical Minds to market. We talk about hallucinations and other LLM-related issues as part of ‘AI Readiness’, which is the middle section of the course.

The aim of these ‘deep dive’ articles is to help build on the foundation understanding generated from the initial course. Thank you for reading. Hopefully you can now make more informed decisions about how you manage LLM hallucinations.

share an insight – ask a question