Most developers don't set up a process to automatically evaluate LLM outputs when building LLM applications, even if that means introducing unnoticed breaking changes, because evaluation can be an extremely challenging task. In this article, you're going to learn how to evaluate LLM outputs the right way. (PS. If you want to learn how to build your own evaluation framework, click here.)

On the agenda:

what are LLMs and why they're difficult to evaluate
different ways to evaluate LLM outputs in Python
how to evaluate LLMs using DeepEval

Enjoy!

What are LLMs and what makes them so hard to evaluate?

To understand why LLMs are difficult to evaluate and why they're oftentimes referred to as a "black box", let's break down what LLMs are and how they work.

GPT-4 is an example of a large language model (LLM) and was trained on huge amounts of data. To put a number on it, around 300 billion words from articles, tweets, r/tifu, Stack Overflow, how-to guides, and other pieces of data that were scraped off the internet.

Anyway, the GPT in "ChatGPT" stands for Generative Pre-trained Transformer. A transformer is a specific neural network architecture which is particularly good at predicting the next few tokens (a token is roughly 4 characters of English text on average for GPT-4, but it can be as short as one character or as long as a word depending on the specific encoding strategy).

So in fact, LLMs don't really "know" anything, but instead "understand" linguistic patterns due to the way in which they were trained, which oftentimes makes them pretty good at figuring out the right thing to say. Pretty manipulative, huh?

All jokes aside, if there's one thing you need to remember, it's this: the process of predicting the next plausible "best" token is probabilistic in nature. This means that LLMs can generate a variety of possible outputs for a given input, instead of always providing the same response. It is exactly this non-deterministic nature of LLMs that makes them challenging to evaluate, as there's often more than one appropriate response.
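To make the probabilistic part concrete, here is a minimal sketch of sampling a next token from a probability distribution. The vocabulary and the raw scores (logits) are made up for illustration and do not come from any real model; a real LLM does the same thing over a vocabulary of tens of thousands of tokens.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, rng, temperature=1.0):
    """Pick one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical candidates for the token that follows "The cat sat on the"
vocab = ["mat", "sofa", "roof", "keyboard"]
logits = [2.0, 1.5, 1.0, 0.5]

rng = random.Random()  # unseeded, so the samples differ between runs
samples = [sample_next_token(vocab, logits, rng) for _ in range(10)]
print(samples)
```

Run it a few times and the list changes, even though the input never does. That is the same behavior that makes a single "correct" expected output hard to pin down.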

Why do we need to evaluate LLM applications?

When I say LLM applications, here are some examples of what I'm referring to:

Chatbots: For customer support, virtual assistants, or general conversational agents.
Code Assistance: Suggesting code completions, fixing code errors, or helping with debugging.
Legal Document Analysis: Helping legal professionals quickly understand the essence of long contracts or legal texts.
Personalized Email Drafting: Helping users draft emails based on context, recipient, and desired tone.

LLM applications usually have one thing in common - they perform better when augmented with proprietary data to help with the task at hand. Want to build an internal chatbot that helps boost your employees' productivity? OpenAI certainly doesn't keep tabs on your company's internal data (hopefully).

This matters because it is now not only OpenAI's job to ensure GPT-4 is performing as expected, but also yours to make sure your LLM application is generating the desired outputs by using the right prompt templates, data retrieval pipelines, model architecture (if you're fine-tuning), etc.

Evaluations (I'll just call them evals from here on) help you measure how well your application is handling the task at hand. Without evals, you risk introducing unnoticed breaking changes and would have to manually inspect all possible LLM outputs each time you iterate on your application, which to me sounds like a terrible idea.

How to evaluate LLM outputs

There are two ways everyone should know about when it comes to evals - with and without LLMs. In fact, you can learn how to build your own evaluation framework in under 20 minutes here.

Evals without LLMs

A nice way to evaluate LLM outputs without using LLMs is to use other machine learning models from the field of NLP. You can use specific models to judge your outputs on different metrics such as factual correctness, relevancy, bias, and helpfulness (just to name a few, the list goes on), despite non-deterministic outputs.

For example, we can use natural language inference (NLI) models (which output an entailment score) to determine how factually correct a response is based on some provided context. The higher the entailment score, the more factually correct an output is, which is particularly helpful if you're evaluating a long output that's not so black and white in terms of factual correctness.

You might also wonder how these models can possibly "know" whether a piece of text is factually correct. It turns out you can provide context to these models for them to take at face value. In fact, we call these contexts ground truths or references. A collection of these references is often referred to as an evaluation dataset.
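As a rough sketch of what an entailment score is, assume an NLI model has already compared a reference (the premise) with your LLM output (the hypothesis) and returned three raw scores. The label order and the logits below are assumptions made for illustration, not the output of a real model; real NLI models document their own label order.

```python
import math

# Assumed label order for the three NLI classes (check your model's docs).
LABELS = ("contradiction", "entailment", "neutral")

def entailment_score(logits):
    """Softmax over the three NLI logits, returning P(entailment)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[LABELS.index("entailment")] / sum(exps)

def is_factually_consistent(logits, minimum_score=0.5):
    """Treat the output as supported if the entailment score clears the bar."""
    return entailment_score(logits) >= minimum_score

# Made-up logits for two (reference, output) pairs.
supported = [-2.0, 3.0, 0.5]    # output agrees with the reference
unsupported = [3.0, -2.0, 0.5]  # output contradicts the reference

print(is_factually_consistent(supported), is_factually_consistent(unsupported))
# True False
```

The 0.5 cut-off is arbitrary here. In practice you would pick it by looking at scores on examples you have labeled yourself.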

But not all metrics require references. For example, relevancy can be calculated using cross-encoder models (another ML model), and all you need to do is supply the input and output for it to determine how relevant they are to each other.

Off the top of my head, here's a list of reference-less metrics:

relevancy
summarization
bias
toxicity
helpfulness
harmlessness
coherence

And here is a list of reference-based metrics:

hallucination
semantic similarity

Note that reference-based metrics don't require you to provide the initial input, as they only judge the output based on the provided context. Click here to learn everything you need to know about LLM evaluation metrics.

Using LLMs for Evals

There's an emerging trend of using state-of-the-art LLMs (aka. GPT-4) to evaluate themselves or even other LLMs.

G-Eval is a Recently Developed Framework that uses LLMs for Evals

I'll attach an image from the research paper that introduced G-Eval below, but in a nutshell G-Eval is a two-part process - the first part generates evaluation steps, and the second uses the generated evaluation steps to output a final score.

Let's run through a concrete example. Firstly, to generate evaluation steps:

introduce an evaluation task to GPT-4 (e.g. rate this summary from 1 - 5 based on relevancy)
introduce the evaluation criteria (e.g. relevancy will be based on the collective quality of all sentences)

Once the evaluation steps have been generated:

concatenate the input, evaluation steps, context, and the actual output
ask it to generate a score between 1 - 5, where 5 is better than 1
(Optional) take the probabilities of the output tokens from the LLM to normalize the score and take their weighted summation as the final result

Step 3 is actually pretty complicated, because to get the probability of the output tokens, you would typically need access to the raw model outputs, not just the final generated text. This step was introduced in the paper because it offers more fine-grained scores that better reflect the quality of outputs.
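The scoring half of the process can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the prompt layout, the function names `build_scoring_prompt` and `weighted_score`, and the token probabilities are all made up, and no LLM is actually called.

```python
def build_scoring_prompt(task, evaluation_steps, input_text, context, actual_output):
    """Concatenate the pieces the LLM judge sees (the layout is an assumption)."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(evaluation_steps, start=1))
    return (
        f"{task}\n\n"
        f"Evaluation steps:\n{steps}\n\n"
        f"Input:\n{input_text}\n\n"
        f"Context:\n{context}\n\n"
        f"Actual output:\n{actual_output}\n\n"
        "Score (1-5):"
    )

def weighted_score(token_probs):
    """Probability-weighted score.

    token_probs maps each candidate score (1-5) to the probability the LLM
    assigned to that score token. The probabilities are normalized first,
    then each score is weighted by its share.
    """
    total = sum(token_probs.values())
    if total == 0:
        raise ValueError("token probabilities must not all be zero")
    return sum(score * p / total for score, p in token_probs.items())

prompt = build_scoring_prompt(
    task="Rate this summary from 1 - 5 based on relevancy.",
    evaluation_steps=["Read the source text.", "Check each summary sentence against it."],
    input_text="Summarize the refund policy.",
    context="All customers are eligible for a 30 day full refund at no extra costs.",
    actual_output="We offer a 30-day full refund at no extra costs.",
)

# Made-up probabilities for the score tokens "1" to "5".
token_probs = {1: 0.02, 2: 0.08, 3: 0.20, 4: 0.45, 5: 0.25}
final_score = weighted_score(token_probs)
print(f"{final_score:.2f}")  # 3.83
```

Without the probabilities the judge would simply answer "4". The weighted version lands on 3.83, which is the kind of fine-grained score the optional step is meant to provide.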

Here's a diagram taken from the paper that can help you visualize what we learnt:

Utilizing GPT-4 with G-Eval outperformed traditional metrics in areas such as coherence, consistency, fluency, and relevancy, but evaluations using LLMs can often be very expensive. So, my recommendation would be to evaluate with G-Eval as a starting point to establish a performance standard and then transition to more cost-effective metrics where suitable.
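One possible way to judge whether a cheaper metric is "suitable" is to check how closely its scores track your G-Eval baseline on the same outputs. The sketch below uses a plain Pearson correlation for that, and both score lists are made up. It is one heuristic among many, not something the G-Eval paper prescribes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for the same five outputs.
g_eval_scores = [4.6, 2.1, 3.8, 1.4, 4.9]    # 1-5 scale, from the LLM judge
cheap_scores = [0.82, 0.35, 0.70, 0.30, 0.91]  # 0-1 scale, from a cheaper metric

r = pearson(g_eval_scores, cheap_scores)
print(f"correlation with the G-Eval baseline: {r:.2f}")
```

A correlation close to 1 on a representative sample suggests the cheaper metric ranks outputs much like G-Eval does. A low value is a hint to keep the LLM judge for that metric.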

For those who are interested, click here to learn more about G-Eval and all the different types of LLM evaluation metrics.

Evaluating LLM outputs in Python

By now, you probably feel inundated by all the jargon and definitely wouldn't want to implement everything from scratch. Imagine having to research what's the best way to compute each individual metric, train your own model for it, and code up an evaluation framework...

Luckily, there are a few open source packages such as ragas and DeepEval that provide an evaluation framework so you don't have to write your own.

As the cofounder of Confident (the company behind DeepEval), I'm going to go ahead and shamelessly show you how you can unit test your LLM applications in CI/CD pipelines using DeepEval 😊 (but seriously, we have an amazing Pytest-like developer experience, are easy to set up, and offer a free platform for you to visualize your evaluation results)

Let's wrap things up with some coding.

Setting up your test environment

To implement our much anticipated evals, create a project folder and initialize a Python virtual environment by running the commands below in your terminal:

mkdir evals-example
cd evals-example
python3 -m venv venv
source venv/bin/activate

Your terminal should now look something like this:

(venv)

Installing dependencies

Run the following command:

pip install deepeval

Setting your OpenAI API Key

Lastly, set your OpenAI API key as an environment variable. We'll need OpenAI for G-Evals later (which basically means using LLMs for evaluation). In your terminal, paste in this with your own API key (get yours here if you don't already have one):

export OPENAI_API_KEY="your-api-key-here"

Writing your first test file

Let's create a file called `test_evals.py` (note that test file names must start with "test_"):

touch test_evals.py

Paste in the following code:

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, GEval, HallucinationMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Note: newer DeepEval releases may name the `minimum_score` argument
# `threshold`; check the version you have installed.

def test_hallucination():
    # Judges the output against the provided context
    hallucination_metric = HallucinationMetric(minimum_score=0.5)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra costs.",
        context=["All customers are eligible for a 30 day full refund at no extra costs."],
    )
    assert_test(test_case, [hallucination_metric])

def test_relevancy():
    # Judges how relevant the output is to the input; no context needed
    answer_relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
    test_case = LLMTestCase(
        input="What does your company do?",
        actual_output="Our company specializes in cloud computing",
    )
    assert_test(test_case, [answer_relevancy_metric])

def test_humor():
    # Custom LLM-based metric (G-Eval) described by a plain-language criteria
    funny_metric = GEval(
        name="Humor",
        criteria="How funny it is",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    )
    test_case = LLMTestCase(
        input="Write me something funny related to programming",
        actual_output="Why did the programmer quit his job? Because he didn't get arrays!",
    )
    assert_test(test_case, [funny_metric])

Now run the test file:

deepeval test run test_evals.py

For each of the test cases, there is a predefined metric provided by DeepEval, and each of these metrics outputs a score from 0 to 1. For example, `HallucinationMetric(minimum_score=0.5)` means we want to evaluate how factually correct an output is, and the test will only pass if the output score is higher than the 0.5 threshold.

Let's go over the test cases one by one:

`test_hallucination` tests how factually correct your LLM output is relative to the provided context.
`test_relevancy` tests how relevant the output is relative to the given input.
`test_humor` tests how funny your LLM output is. This test case uses an LLM for evaluation, more specifically G-Eval.

Notice how there are up to four moving parameters for a single test case - the input, the expected output, the actual output (of your application), and the context (that was used to generate the actual output). Depending on the metric you're testing, some parameters are optional, while some are mandatory.

Lastly, here's how you can aggregate metrics on a single test case:

...

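def test_everything():
    # Metrics are created here so the test does not rely on names defined elsewhere
    hallucination_metric = HallucinationMetric(minimum_score=0.5)
    relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
    humor_metric = GEval(name="Humor", criteria="How funny it is",
                         evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT])
    test_case = LLMTestCase(
        input="What did the cat do?",
        actual_output="The cat climbed up the tree",
        context=["The cat ran up the tree."],
        expected_output="The cat ran up the tree.",
    )
    assert_test(test_case, [hallucination_metric, relevancy_metric, humor_metric])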

Not so hard after all, huh? Write enough of these (10-20), and you'll have much better control over what you're building 🤗

PS. And here's a bonus feature DeepEval offers: Unit Testing LLM Applications in CI/CD pipelines.

Also, you can leverage DeepEval's free platform by running the following command:

deepeval login

Follow the instructions (login, get your API key, paste it in the CLI), and run the test again by typing in the same command:

deepeval test run test_evals.py

Let me know what happens!

Conclusion

In this article, you've learnt:

how LLMs work
examples of LLM applications
why it's hard to evaluate LLM outputs
how to unit test LLM outputs using DeepEval

With evals, you can stop making breaking changes to your LLM application ✅ quickly iterate on your implementation to improve on metrics you care about ✅ and most importantly be confident in the LLM application you build 😇

The source code for this tutorial is available here:
https://github.com/confident-ai/blog-examples/tree/main/evals-example

Thank you for reading, and till next time 🫡