ChatGPT, the most widely used LLM application, has soared in popularity over the past year thanks to the seemingly omniscient GPT-4. Its ability to generate coherent, even poetic, responses to previously unseen prompts has accelerated the development of other foundational large language models (LLMs), such as Anthropic’s Claude, Google’s Bard, and Meta’s open-source LLaMA model. This has enabled ML engineers to build retrieval-based LLM applications around proprietary data like never before. But these applications continue to suffer from hallucinations, struggle to stay up to date with the latest information, and don’t always respond relevantly to prompts.
In this article, as the founder of Confident AI, the world’s first open-source evaluation infrastructure for LLM applications, I will outline how to evaluate LLM and retrieval pipelines, different workflows you can employ for evaluation, and the common pitfalls when building RAG applications that evaluation can solve.
Evaluation is (not) Eyeballing Outputs
Before we begin, does your current approach to evaluation look something like the code snippet below? You loop through a list of prompts, run your LLM application on each one of them, wait a minute or two for it to finish executing, manually inspect everything, and try to evaluate the quality of the output based on each input.
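Something like this, perhaps (a sketch; `llm_app` is a stand-in for however you actually invoke your LLM application):

```python
prompts = [
    "What's your refund policy?",
    "Can I change my shipping address after ordering?",
    # ...more prompts you thought of on the spot
]

for prompt in prompts:
    # `llm_app` is a stand-in for however you call your LLM application
    output = llm_app(prompt)
    print(f"Input: {prompt}")
    print(f"Output: {output}")
    # ...then you squint at the terminal and decide whether it "looks right"
```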
If this sounds familiar, you desperately need this article. (And hopefully, by the end of it, you’ll know how to stop eyeballing results.)
Evaluation as a Multi-Step, Iterative Process
Evaluation is an involved process but has huge downstream benefits as you look to iterate on your LLM application. Building an LLM system without evaluations is akin to building a distributed backend system without any automated testing — although it might work at first, you’ll end up wasting more time fixing breaking changes than building the actual thing. (Fun fact: Did you know that AI-first applications suffer from a much lower one-month retention because users don’t revisit flaky products?)
To evaluate LLMs, you need several components: an evaluation dataset (one that improves over time), a handful of evaluation metrics chosen and implemented for the criteria relevant to your use case, and evaluation infrastructure to continuously run real-time evaluations throughout the lifetime of your LLM application.
By the way, if you're looking to get a better general sense of what LLM evaluation is, here is another great read.
Step One — Creating an Evaluation Dataset
The first step to any successful evaluation workflow for LLM applications is to create an evaluation dataset, or at least have a vague idea of the type of inputs your application is going to get. It might sound fancy and a lot of work, but the truth is you’re probably already doing it as you’re eyeballing outputs.
Let’s consider the eyeballing example above. Correct me if I’m wrong, but what you’re really trying to do is to judge an output based on what you’re expecting. You probably already know something about the knowledge base you’re working with and are likely aware of what retrieval results you expect to see should you also choose to print out the retrieved text chunks in your retrieval pipeline. The initial evals dataset doesn’t have to be comprehensive, but start by writing down a set of QAs with the relevant context:
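Something along these lines works (the questions, answers, and contexts below are placeholders for your own data):

```python
evals_dataset = [
    {
        "input": "What payment methods do you accept?",
        "expected_output": "We accept Visa, Mastercard, and PayPal.",
        "context": "Payments page: we accept Visa, Mastercard, and PayPal.",
    },
    {
        "input": "How long does standard shipping take?",
        "expected_output": "Standard shipping takes 5 to 7 business days.",
        "context": "Shipping FAQ: standard shipping takes 5-7 business days within the US.",
    },
]
```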
Here, the “input” is mandatory, but “expected_output” and “context” are optional (you’ll see why later).
If you wish to automate things, you can try to generate an evals dataset by looping through your knowledge base (which could be in a vector database like Qdrant) and ask GPT-3.5 to generate a set of QAs instead of manually doing it yourself. It’s flexible, versatile, and fast, but limited by the data it was trained on. (Ironically, you’re more likely to care about evaluation if you’re building in a domain that requires deep expertise, since it’s more reliant on the retrieval pipeline rather than the foundational model itself.)
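Here’s a rough sketch of what that could look like (the prompt wording, model choice, and `knowledge_base_chunks` variable are all assumptions; adapt them to your own setup):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def generate_qa(chunk: str) -> str:
    """Ask GPT-3.5 to draft a question-answer pair grounded in a single text chunk."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                "Based strictly on the text below, write one question a user might ask "
                "and the correct answer, formatted as 'Q: ...' and 'A: ...'.\n\n" + chunk
            ),
        }],
    )
    return response.choices[0].message.content

# `knowledge_base_chunks` is assumed to be a list of text chunks pulled
# from your knowledge base (e.g. a Qdrant collection)
synthetic_qas = [generate_qa(chunk) for chunk in knowledge_base_chunks]
```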
Lastly, you might wonder, “Why do I need an evaluation dataset when there are already standard LLM benchmarks out there?”. Well, it’s because public benchmarks like Stanford HELM evaluate foundation models on general tasks; they tell you very little about an LLM application built on your own proprietary data.
Step Two — Identify Relevant Metrics for Evaluation
The next step in evaluating LLM applications is to decide on the set of metrics you want to evaluate your LLM application on. Some examples include:
- factual consistency (how factually correct your LLM application is based on the respective context in your evals dataset)
- answer relevancy (how relevant your LLM application’s outputs are based on the respective inputs in your evals dataset)
- coherence (how logical and consistent your LLM application’s outputs are)
- toxicity (whether your LLM application is outputting harmful content)
- RAGAS (for RAG pipelines)
- bias (pretty self-explanatory)
I’ll write about all the different types of metrics in another article, but as you can see, different metrics require different components in your evals dataset to reference against one another. Factual consistency doesn’t care about the input, and toxicity only cares about the output. (Here, we would call factual consistency a reference-based metric since it requires some sort of grounded context, while toxicity, for example, is a reference-less metric.)
Step Three — Implement a Scorer to Compute Metric Scores
This step involves taking all the relevant metrics you’ve previously identified and implementing a way to compute a score for each data point in your evals dataset. Here’s an example of how you might implement a scorer for factual consistency (code taken from DeepEval):
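A minimal sketch of the idea looks something like this (this is not DeepEval’s exact implementation, and the model name is an assumption; any NLI cross-encoder can be swapped in):

```python
from transformers import pipeline

# An NLI cross-encoder; the specific checkpoint here is an assumption
nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base")

def factual_consistency_score(context: str, actual_output: str) -> float:
    """Score how strongly `context` entails `actual_output`, in the range 0-1."""
    # top_k=None returns scores for every label (entailment, neutral, contradiction)
    scores = nli({"text": context, "text_pair": actual_output}, top_k=None)
    return next(s["score"] for s in scores if s["label"].lower() == "entailment")
```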
Here, we used a natural language inference model from Hugging Face to compute an entailment score ranging from 0–1 to measure factual consistency. It doesn’t have to be this particular implementation, but you get the point — you’ll have to decide how you want to compute a score for each metric and find a way to implement it. One thing to note is that LLM outputs are probabilistic in nature, so your implementation of the scorer should take this into account and not penalize outputs that are equally correct but different from what you expect.
At Confident AI, we use a combination of model-based, statistical, and LLM-based scorers depending on the type of metric we’re trying to evaluate. For example, we use a model-based approach to evaluate metrics such as factual consistency (NLI models) and answer relevancy (cross-encoders), while for more nuanced metrics such as coherence, we implemented a framework called G-Eval (which applies LLMs with chain-of-thought prompting) for evaluation using GPT-4. (If you’re interested, here’s the paper that introduces G-Eval, a robust framework for using LLMs as evaluators.) In fact, the authors of the paper found that G-Eval outperforms traditional scorers such as:
- BLEU (compares n-grams of the machine-generated text to n-grams of a reference translation and counts the number of matches)
- BERTScore (a metric for evaluating text generation based on BERT embeddings)
- ROUGE (a set of metrics for evaluating automatic summarization of texts as well as machine translation)
- MoverScore (computes the distance between the contextual embeddings of words in the machine-generated text and those in a reference text)
If you’re not familiar with these scorers, don’t worry; here's an in-depth article on all the types of LLM evaluation metric scorers.
Lastly, you’ll need to define a passing criterion for each metric; the passing criterion is the threshold a metric score needs to meet for your LLM application’s output to be deemed satisfactory for a given input. For example, a passing criterion for the factual consistency metric implemented above could be 0.6, since the metric outputs a score ranging from 0 to 1. (Similarly, the passing criterion might be 1 for a metric that outputs a binary 0 or 1 score.)
Step Four — Apply each Metric to your Evaluation Dataset
With everything in place, you can now loop through your evaluation dataset and evaluate each data point individually. The algorithm looks something like this (a code sketch follows the list):
- Loop through your evaluation dataset.
- For each data point, run your LLM application based on the given input.
- Once your LLM application has finished generating an output for a given data point, compute a score for each of the metrics you’ve previously defined.
- Identify and log failing metrics (metrics where the passing criteria wasn’t met).
- Iterate on your LLM application based on these failing metrics.
- Repeat steps 1–5 until no metrics are failing.
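In code, a bare-bones version of this loop could look like the following (`llm_app`, `evals_dataset`, and the scoring functions are assumed to be defined along the lines of the earlier steps; `scorers` is a hypothetical dict mapping each metric name to a scoring function and its passing criterion):

```python
failing_results = []

for datapoint in evals_dataset:
    actual_output = llm_app(datapoint["input"])  # run your LLM application

    for metric_name, (score_fn, threshold) in scorers.items():
        score = score_fn(datapoint, actual_output)
        if score < threshold:  # passing criterion not met
            failing_results.append(
                {"input": datapoint["input"], "metric": metric_name, "score": score}
            )

# Iterate on your LLM application based on `failing_results`, then rerun
for result in failing_results:
    print(result)
```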
Now, you can stop eyeballing outputs and ensure that having confidence in your LLM application is as easy as having passing test cases.
Step Five — Integrate Evaluations as Unit Tests in CI/CD Pipelines
Having everything set up is great, but to take automated evaluations a step further, you can run evaluations as unit tests in CI/CD pipelines such as GitHub Actions. You can do this through DeepEval, the open-source LLM evaluation framework I've been working on to help developers stop eyeballing LLM outputs; it offers 14+ LLM evaluation metrics to cover almost any use case you may have.
Here is a great read on how to unit test LLMs in CI/CD pipelines, but as a quick summary, first install DeepEval:
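```bash
pip install -U deepeval
```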
Then, create a test file, just as you would for Pytest:
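```bash
# the file name needs the "test_" prefix so it gets picked up during test discovery
touch test_example.py
```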
Write a simple test case:
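Something along the lines of DeepEval’s quickstart works (metric names and arguments may differ slightly between versions, so treat this as a sketch and check the docs):

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    # In a real test, `actual_output` would come from calling your LLM application
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [metric])
```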
Which you can execute via the CLI:
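```bash
deepeval test run test_example.py
```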
That's all! To take unit testing into CI/CD pipelines, simply include a test file with your test cases and execute the same command in, for example, a YAML workflow file (if you're using GitHub Actions).
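A minimal workflow could look something like this (the workflow name, trigger, and secret name are assumptions; adjust them to your repo, and note that DeepEval’s default LLM-based metrics assume an OpenAI API key is available):

```yaml
name: LLM Evaluations
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -U deepeval
      - run: deepeval test run test_example.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```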
Step Six — Continuous Evaluations in Production
The final step involves evaluating LLM outputs in real time. This is vital because it allows you to be alerted to any unsatisfactory responses and iterate on them as quickly as possible. Unfortunately, there's currently no easy way to do this. But if you would like to have real-time evaluations in production, you can consider Confident AI, where we help you automate manual evaluation in all stages of your LLM development cycle. I'm going to stop shamelessly advertising what I'm working on, but here is the link to sign up for those interested (it's free to try!).
Evaluation Helps You Iterate Towards the Optimal Hyperparameters
There are several benefits of setting up an evaluation framework that would allow you to rapidly iterate and improve on your LLM application/retrieval pipeline:
- Taking a RAG-based application as an example, you can now run several nested for loops to find the combination of hyperparameters (chunk size, top-k retrieval, embedding model, prompt template, and so on) that yields the highest metric scores on your evaluation dataset (see the sketch after this list).
- You’ll be able to make marginal improvements without worrying about unnoticed breaking changes.
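Here’s what that hyperparameter sweep could look like in its simplest form (`build_rag_pipeline` and `run_evaluation` are hypothetical helpers standing in for your own pipeline construction and the evaluation loop from step four):

```python
from itertools import product

chunk_sizes = [256, 512, 1024]
top_ks = [3, 5, 10]
prompt_templates = ["concise_template", "detailed_template"]

best_config, best_score = None, float("-inf")

for chunk_size, top_k, template in product(chunk_sizes, top_ks, prompt_templates):
    # `build_rag_pipeline` wires up your retrieval pipeline with the given
    # hyperparameters; `run_evaluation` returns an average metric score over
    # your evaluation dataset (both are hypothetical helpers)
    rag_pipeline = build_rag_pipeline(
        chunk_size=chunk_size, top_k=top_k, prompt_template=template
    )
    score = run_evaluation(rag_pipeline, evals_dataset)
    if score > best_score:
        best_config, best_score = (chunk_size, top_k, template), score

print(f"Best hyperparameters: {best_config} (score: {best_score:.2f})")
```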
Evaluation is Not Bullet-Proof Though
Although your evaluation framework is now in place, it is flimsy and fragile, especially in the early days of deploying to production. This is because your users will start prompting your application in ways you’ve never expected, but that’s okay. To build a truly robust LLM application, you should:
- Identify unsatisfactory outputs, mark them for reproducibility, and add them to your evaluation dataset. This is known as continuous evaluation, and without it, you’ll find that your LLM application slowly becomes out of touch with what your users care most about. There are several ways to identify bad outputs, but the most foolproof way is to use humans as evaluators.
- Identify, on a component level, which part of your LLM pipeline is causing unsatisfactory outputs. This is known as evaluating with tracing, and without it, you’ll find yourself making unnecessary changes because you “think”, for example, that the retrieval component isn’t retrieving the relevant text chunks when it’s actually the prompt template that’s the problem.
Other Approaches to Evaluation
Another way to evaluate LLM applications could be an auto-evaluation approach where LLMs are used as judges for picking the best output when presented with several different choices. In fact, data from Databricks claims that LLM-as-a-judge agrees with human grading on over 80% of judgments. There are several points to note when using LLM-as-a-judge:
- GPT-3.5 works, but only if you provide an example.
- GPT-4 works well even without an example.
- Use low-precision grading scales like 1–5 or a binary scale instead of something like 1–100; coarser scales produce more consistent judgments.
A possible approach to auto-evaluation (sketched in code after this list) is to:
- Generate outputs on all different combinations of hyperparameters.
- Ask GPT-4 to compare and pick the best set of outputs in a pairwise fashion.
- Identify the set of hyperparameters for the best set of outputs GPT-4 has chosen.
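As a rough sketch, the pairwise comparison step might look like this (the prompt wording and model choice are assumptions):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def pick_better_output(question: str, output_a: str, output_b: str) -> str:
    """Ask GPT-4 to judge which of two outputs answers the question better."""
    prompt = (
        f"Question:\n{question}\n\n"
        f"Answer A:\n{output_a}\n\n"
        f"Answer B:\n{output_b}\n\n"
        "Which answer is better? Reply with exactly 'A' or 'B'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```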
A problem I have with this approach, and why we haven’t implemented a way to do this at Confident AI, is that it leaves nothing actionable for subsequent iteration and improvement.
Conclusion
Evaluating LLM pipelines is essential to building robust applications, but evaluation is an involved and continuous process that requires a lot of work. If you want to do short-lived, untrusted evaluation, print statements are a great choice. However, if you want to employ a robust evaluation infrastructure in your current development workflow, you can use Confident AI. We’ve done all the hard work for you already, and you can find us on GitHub⭐
And it comes with a platform that allows you to log and debug historical evaluation results, centralize evaluation datasets, and run real-time evaluations in production. Thank you for reading, and I’ll be back next week to talk about all the different metrics for LLM evaluation.
Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an “aha!” moment, who knows?
Confident AI: The LLM Evaluation Platform
The all-in-one platform to evaluate and test LLM applications on the cloud, fully integrated with DeepEval.