Jeffrey Ip
Cofounder @ Confident, creating companies that tell stories. Ex-Googler (YouTube), Microsoft AI (Office365). Working overtime to enforce responsible AI.

LLM Chatbot Evaluation Explained: Top Metrics and Testing Techniques

October 8, 2024 · 10 min read

Here’s a true story: Just last week, I was sorting out a delayed shipment and spent some time texting my dedicated customer support rep, Johnny. Johnny was great. He was polite and responsive, so much so that I felt bad leaving him on read at times. But as our conversation dragged on, he kept asking the same old questions, and his generic suggestions weren’t helping. Hmm, I thought to myself, maybe this isn’t Johnny after all.

I’m no detective, but it was obvious that Johnny was in fact an LLM chatbot. As much as I appreciated Johnny’s demeanor, he kept forgetting what I told him, and his answers were often long and robotic. This is why LLM chatbot evaluation is imperative to deploying production-grade LLM conversational agents.

In this article, I’ll teach you how to evaluate LLM chatbots so thoroughly that you’ll be able to quantify whether they’re convincing enough to pass as real people. More importantly, you’ll be able to use these evaluation results to identify how to iterate on your LLM chatbot, such as using a different prompt template or LLM.

As the author of DeepEval, the open-source LLM evaluation framework, I’ll share how to evaluate LLM chatbots, drawing on what we’ve learned building conversation evaluation for over 100k users. You’ll learn:

  • LLM chatbot/conversation evaluation vs regular LLM evaluation
  • Different modes of LLM conversation evaluation
  • Different types of LLM chatbot evaluation metrics
  • How to implement LLM conversation evaluation in code using DeepEval

Let’s start talking.

What is LLM Chatbot Evaluation and How is it Different From LLM Evaluation?

LLM chatbot evaluation is the process of evaluating the performance of LLM conversational agents by assessing the quality of responses made by large language models (LLMs) in a conversation. It differs from regular LLM (system) evaluation in one key way: regular LLM evaluation judges individual input-output interactions in isolation, while LLM chatbot evaluation judges each input-output interaction with the prior conversation history as additional context.

Difference between LLM chatbot evaluation and LLM system evaluation

This means that although the evaluation criteria for LLM chatbots can be extremely similar to those for non-conversational LLM applications, the metrics used to carry out these evaluations require a whole new implementation to take prior conversation history into account.

This “conversation history” has a more technical name: turns. A multi-turn conversation is just a fancy way of describing the back-and-forth exchanges between a user and an LLM chatbot.
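To make that concrete, here is a purely illustrative sketch (not tied to any framework) of what a multi-turn conversation looks like as data: an ordered list of user-chatbot exchanges, where the history available at any given turn is simply everything that came before it.


# Purely illustrative: a multi-turn conversation as an ordered list of turns
turns = [
  {"user": "Hi, my shipment is delayed.", "chatbot": "Sorry to hear that! Can I get your order number?"},
  {"user": "It's #4521.", "chatbot": "Thanks, let me look that up for you."},
  {"user": "Any update?", "chatbot": "Your package is now scheduled to arrive tomorrow."},
]

# The "prior conversation history" for turn i is simply turns[:i]
history_for_third_turn = turns[:2]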

So the question becomes, given a list of turns in a conversation, how and what should we be evaluating? Should we look at the conversation as a whole, individual turns in a conversation, or what?

Different Modes of LLM Conversation Evaluation

It turns out (no pun intended) there are two types of LLM conversation evaluation:

  • Entire conversation evaluation: This involves looking at the entire conversation and evaluating it based on all turns in a conversation.
  • Last best response evaluation: This involves evaluating only the last response an LLM chatbot generates in a conversation.

The reason we don’t evaluate every individual turn is that conversations often contain a lot of redundant chit-chat, which is a waste of tokens to evaluate. The principle remains the same: whichever type of conversation evaluation you adopt, it should be carried out with the context of the prior turns in mind.

Entire Conversation Evaluation

Evaluating a conversation as a whole is useful because some evaluation criteria simply require the entire conversation as context. For example, imagine you’re building a conversational agent to help users open a bank account. The LLM agent will have to ask for the user’s name, address, SSN, and much more, but a common problem is that the agent might forget information the user has already supplied, which results in it asking repetitive questions that are frustrating to deal with.

This means that to define a conversational metric that quantifies how well your LLM and prompt templates retain prior knowledge supplied by a user, you’ll have to look at the conversation as a whole. In fact, the first conversational metric I’d like to introduce is exactly that: the knowledge retention metric, which assesses how well an LLM chatbot is able to retain information presented to it throughout a conversation. You can access it in DeepEval, and don’t worry, I’ll demonstrate how in later sections.

Another time evaluating entire conversations is useful is when you want to use all turns in a conversation to construct a final metric score. (For those interested in what I mean by a metric score, I highly recommend reading this article here, which explains everything about what an LLM evaluation metric is.)

For example, let’s say you want to measure how relevant your LLM chatbot’s responses are. You can define the final conversation relevancy metric score as the number of relevant responses divided by the total number of turns in a conversation. But the question becomes, how do you determine whether a response is relevant?

This might sound like a silly question at first; after all, can’t you just ask an LLM judge to count the irrelevant responses by supplying it with the entire turn history? The problem with this approach is that an LLM-as-a-judge can easily hallucinate when a conversation gets lengthy. But an even more important question to ask yourself is which prior turns to consider when measuring response relevancy. Let me explain.

Imagine a conversation with 100 turns (I know, sounds like a grade school math problem). You’re evaluating the 50th turn, and although the response isn’t quite relevant when only considering the previous 2 turns (48th and 49th turn) as supporting context, it is in fact extremely relevant when you take the previous 10 turns (39–49th turn) into account.

Sliding window evaluation approach for taking prior turns into account

But let’s take it to the limit (one more time). What if a response is only relevant when considering the previous 30 turns? 100 turns? 1,000 turns? The question becomes: how many turns prior to the current one should you take into account when evaluating a specific LLM chatbot response? This is why, when evaluating entire conversations, it is sometimes important to decide how much of the previous conversation history to take into account.

Those from a more traditional software engineering background will recognize this as a problem that can be solved using the sliding window approach, where for each response we take the previous min(current turn number, window size) turns into account to determine that response’s relevancy. So back to the conversation relevancy example, to calculate the conversation relevancy score we would simply:

  • Loop through each individual turn in the conversation
  • Using a sliding window approach, grab the last min(current turn number, window size) turns and use them as context to determine whether the current turn’s response is relevant (see the sketch after this list)
  • Add up the number of relevant responses and divide it by the total number of turns
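
Here is a minimal sketch of that scoring loop in plain Python. This is not DeepEval’s internal implementation, and is_relevant is a hypothetical stand-in for an LLM-as-a-judge call that receives the windowed context along with the current response:


from typing import Dict, List

def is_relevant(turn: Dict[str, str], context: List[Dict[str, str]]) -> bool:
  # Hypothetical stand-in: in practice this would prompt an LLM judge with the
  # windowed context plus the current response and parse a yes/no verdict
  raise NotImplementedError

def conversation_relevancy_score(turns: List[Dict[str, str]], window_size: int = 5) -> float:
  relevant_count = 0
  for i, turn in enumerate(turns):
    # Grab up to `window_size` turns immediately before the current one
    window = turns[max(0, i - window_size):i]
    if is_relevant(turn, context=window):
      relevant_count += 1
  # Number of relevant responses divided by the total number of turns
  return relevant_count / len(turns)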

I hope that was clear. Let’s go on to talk about the easier mode of conversation evaluation.

Last Best Response Evaluation

Some users, instead of evaluating the entire conversation, might prefer to evaluate only the last best response in a conversation. That’s not to say prior turns aren’t used as supporting context; it’s just that instead of evaluating responses from the entire conversation, we simply look at the last response and determine its metric score based on the criteria at hand. This also means the last response is often left empty, and only at evaluation time is it generated by your LLM chatbot to be evaluated.

Last best response evaluation approach

The good news for folks looking to evaluate the last best response in a conversation is that you can reuse ALL the non-conversational LLM evaluation metrics used for evaluating individual LLM system responses. Well, not quite: you’ll have to tweak them a little to include some number of prior turns as additional context, but that’s more straightforward than evaluating entire conversations.
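
As a rough example of what that tweak might look like, here is one way you could fold a few prior turns into a regular single-turn test case and score just the final response with DeepEval’s standard AnswerRelevancyMetric. The history formatting below is my own illustrative choice, not a prescribed DeepEval pattern:


from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Prior turns, flattened into plain text to serve as additional context
prior_turns = [
  ("Hi, I'd like to open an account.", "Great! Can I get your full name?"),
  ("It's Jane Doe.", "Thanks Jane. What's your current address?"),
]
history = "\n".join(f"User: {u}\nChatbot: {a}" for u, a in prior_turns)

# Evaluate only the last response, with the history prepended to the input
last_input = "123 Main Street, Springfield."
last_response = "Got it. Could you also confirm your full name?"

test_case = LLMTestCase(
  input=f"{history}\nUser: {last_input}",
  actual_output=last_response,
)

metric = AnswerRelevancyMetric()
metric.measure(test_case)
print(metric.score, metric.reason)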


LLM Conversation Evaluation Using DeepEval

Before we start talking about conversational metrics, I want to introduce DeepEval, the open-source LLM evaluation framework. I’m introducing DeepEval not just because it’s what I work on (ok, maybe a little), but because it offers all the conversational metrics we’re going to go through in the next section, so it will be useful to be familiar with the framework.

In DeepEval, you can evaluate LLM conversations by creating a conversational test case. A conversational test case is made up of a list of turns, which represents your LLM conversation.


pip install deepeval

from deepeval.test_case import ConversationalTestCase, LLMTestCase

convo_test_case = ConversationalTestCase(
  chatbot_role="You are a happy jolly CS agent",
  turns=[
    LLMTestCase(input="Hi", actual_output="Hey how can I help?"),
    LLMTestCase(input="I just want someone to talk to", actual_output="No problem I'm here to listen!")
  ]
)

The conversational turns are instances of LLMTestCase, where the input is the user input and the actual_output is the LLM chatbot’s response. There is also a chatbot_role parameter, which is only required if you want to use the role adherence metric later on. You can learn more about conversational and LLM test cases here.

Once you have packaged a conversation into a conversational test case, all that’s left is to import the metric you would like to use and measure it against the test case. Here’s an example:


from deepeval.metrics import KnowledgeRetentionMetric

...
metric = KnowledgeRetentionMetric(verbose_mode=True)
metric.measure(convo_test_case)
print(metric.score, metric.reason)

It really is as simple as that.

Different Types of LLM Chatbot Evaluation Metrics

In this section, I’ll go through the four most useful conversational metrics you’ll want to consider for evaluating entire conversations, and how to use each one in fewer than 5 lines of code via DeepEval, the open-source LLM evaluation framework.

These metrics are designed to evaluate entire LLM conversations, so for those looking to evaluate conversations using the last-best-response approach, check out this comprehensive article I wrote on LLM evaluation metrics instead. It goes through all the LLM evaluation metrics you need to know for last best response, including RAG, LLM agents, and even fine-tuning use cases, and explains why we use LLM-as-a-judge for all metrics related to LLM evaluation.

(PS. DeepEval also offers 14+ pre-defined LLM evaluation metrics and custom metrics for last-best-response evaluations, which can be found here.)

Role Adherence

The role adherence metric assesses whether your LLM chatbot is able to act the way it is instructed to throughout a conversation. It is particularly useful for role-playing use cases. It is calculated by looping through each turn individually and using an LLM to determine whether that turn adheres to the specified chatbot_role, with previous turns as context. The final role adherence metric score is simply the number of turns that adhered to the specified chatbot role divided by the total number of turns in a conversational test case.


from deepeval.test_case import ConversationalTestCase, LLMTestCase
from deepeval.metrics import RoleAdherenceMetric

convo_test_case = ConversationalTestCase(
  chatbot_role="description of chatbot role goes here",
  turns=[
    LLMTestCase(
      input="user input goes here", 
      actual_output="LLM chatbot repsonse goes here"
    )
  ]
)
metric = RoleAdherenceMetric(verbose_mode=True)
metric.measure(convo_test_case)
print(metric.score, metric.reason)

Conversation Relevancy

The conversation relevancy metric, as introduced earlier, assesses whether your LLM chatbot is able to generate relevant responses throughout a conversation. It is calculated by looping through each turn individually and adopting a sliding window approach to take the last min(current turn number, window size) turns into account when determining whether the current response is relevant. The final conversation relevancy metric score is simply the number of relevant turn responses divided by the total number of turns in a conversational test case.

Similar to the other metrics, you can use the conversation relevancy metric through DeepEval in a few lines of code:


from deepeval.test_case import ConversationalTestCase, LLMTestCase
from deepeval.metrics import ConversationRelevancyMetric

convo_test_case = ConversationalTestCase(
  turns=[
    LLMTestCase(
      input="user input goes here", 
      actual_output="LLM chatbot repsonse goes here"
    )
  ]
)
metric = ConversationRelevancyMetric(window_size=5, verbose_mode=True)
metric.measure(convo_test_case)
print(metric.score, metric.reason)

Knowledge Retention

The knowledge retention metric assesses whether your LLM chatbot is able to retain information presented to it throughout a conversation. It is calculated by first extracting the list of knowledge presented to the chatbot up to a given turn in the conversation, then determining whether the chatbot’s response in that turn asks for information that has already been provided. The knowledge retention score is simply the number of turns without knowledge attrition divided by the total number of turns.


from deepeval.test_case import ConversationalTestCase, LLMTestCase
from deepeval.metrics import KnowledgeRetentionMetric

convo_test_case = ConversationalTestCase(
  turns=[
    LLMTestCase(
      input="user input goes here", 
      actual_output="LLM chatbot repsonse goes here"
    )
  ]
)
metric = KnowledgeRetentionMetric(verbose_mode=True)
metric.measure(convo_test_case)
print(metric.score, metric.reason)

Conversation Completeness

The conversation completeness metric assesses whether your LLM chatbot is able to fulfill user requests throughout a conversation. It is useful because conversation completeness can be used as a proxy to measure user satisfaction and chatbot effectiveness. It is calculated by first using an LLM to extract a list of high level user intentions found in the conversation turns, before using the same LLM to determine whether each intention was met and/or satisfied throughout the conversation.


from deepeval.test_case import ConversationalTestCase, LLMTestCase
from deepeval.metrics import ConversationCompletenessMetric

convo_test_case = ConversationalTestCase(
  turns=[
    LLMTestCase(
      input="user input goes here", 
      actual_output="LLM chatbot repsonse goes here"
    )
  ]
)
metric = ConversationCompletenessMetric(verbose_mode=True)
metric.measure(convo_test_case)
print(metric.score, metric.reason)

Using DeepEval with Confident AI to Regression Test LLM Chatbots

While DeepEval is an LLM evaluation framework, Confident AI is an LLM evaluation platform powered by DeepEval. Using Confident AI is optional, but it offers the ability to run DeepEval’s metrics on the cloud, and it generates test reports you can use to identify LLM regressions between evaluations.

Optionally, log in to DeepEval via the CLI:


deepeval login

And run evaluations to regression test your LLM chatbot using the same architecture we discussed above:


from deepeval import evaluate
from deepeval.test_case import ConversationalTestCase, LLMTestCase
from deepeval.metrics import ConversationRelevancyMetric

convo_test_case = ConversationalTestCase(
  turns=[
    LLMTestCase(
      input="user input goes here", 
      actual_output="LLM chatbot repsonse goes here"
    )
  ]
)
metric = ConversationRelevancyMetric()

evaluate(test_cases=[convo_test_case], metrics=[metric])

You’ll have access to the results locally after running the evaluate function, and if you’re using Confident AI, a neat test report will be generated that you can use to compare test results.

Regression testing LLM chatbots on Confident AI

Conclusion

I hope that was a breeze to read. Now that you’re here, you have everything you need to get started with evaluating LLM chatbots, whether to safeguard against breaking changes or to identify areas of improvement for your conversational agent. In this article we learned the ways in which we can evaluate an LLM conversation, the different techniques we can employ, the metrics to look out for, and how to implement them in DeepEval.

Don’t forget to give ⭐ DeepEval a star on GitHub ⭐ if you found this article useful, and as always, till next time.

* * * * *

Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an “aha!” moment, who knows?
