Test cases, goldens, and datasets are three of the most important primitives to learn about for LLM evaluation. They define how interactions with your LLM app are represented in Confident AI, which is imperative for applying metrics for evaluation.
In summary:
input that will kickstart your LLM app but also any other custom metadata that’s required to invoke your app
These primitives are standardized in Confident AI and are used for all forms of evals.
You need to understand test cases in order to understand what metrics evaluate in the next section.
Test cases capture your LLM app’s runtime inputs and outputs, which metrics use for evaluation. Test cases are:
As a developer, you need to map these arguments into the test case format—either single-turn or multi-turn.
A single-turn test case represents a single, atomic interaction with your LLM app:
In the diagram above, we see that an interaction can include an input, actual_output, retrieval_context (for RAG), tools_called, etc. An interaction can live in both the:
In deepeval, a single-turn test case is represented by an LLMTestCase:
Each parameter represents different aspects of an interaction:
Here’s a quick example of how you would populate the input and actual_output fields of an LLMTestCase during evaluation:
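A minimal sketch of that pattern follows. It uses a small dataclass as a stand-in for deepeval's `LLMTestCase` so that it runs without any dependencies; with deepeval installed, you would import the real class from `deepeval.test_case` instead. `your_llm_app` is a hypothetical placeholder for your own application.

```python
from dataclasses import dataclass


@dataclass
class LLMTestCase:
    """Stand-in for deepeval's LLMTestCase (assumption: only two fields modeled)."""
    input: str
    actual_output: str


def your_llm_app(user_input: str) -> str:
    """Hypothetical LLM app; replace with a call into your own application."""
    return f"Echo: {user_input}"


# The input is hardcoded here purely for illustration.
user_input = "What is your refund policy?"

# actual_output is generated at evaluation time by invoking the app.
test_case = LLMTestCase(input=user_input, actual_output=your_llm_app(user_input))
print(test_case)
```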
In practice, the input is unlikely to be hardcoded as shown in the example; it will almost always come from the single-turn goldens in your dataset.
When we run regression tests in later sections, we will match test cases by either their input or their name to see whether their performance has regressed across test runs.
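To make the matching idea concrete, here is a rough sketch of how test cases from two test runs could be paired by input. It is an illustration only, not Confident AI's implementation; the dictionaries, the scores, and the `find_regressions` helper are all hypothetical.

```python
from typing import Dict, List, Tuple


def find_regressions(
    previous_run: List[dict], current_run: List[dict]
) -> Dict[str, Tuple[float, float]]:
    """Pair test cases by input and report those whose score dropped."""
    previous_scores = {tc["input"]: tc["score"] for tc in previous_run}
    regressions = {}
    for tc in current_run:
        old_score = previous_scores.get(tc["input"])
        # Only compare test cases that appear in both runs.
        if old_score is not None and tc["score"] < old_score:
            regressions[tc["input"]] = (old_score, tc["score"])
    return regressions


# Hypothetical scores from two test runs over the same inputs.
run_a = [
    {"input": "What is your refund policy?", "score": 0.9},
    {"input": "How do I reset my password?", "score": 0.8},
]
run_b = [
    {"input": "What is your refund policy?", "score": 0.6},
    {"input": "How do I reset my password?", "score": 0.85},
]

print(find_regressions(run_a, run_b))
```

Under this sketch, a test case whose input changes between runs no longer pairs with its earlier counterpart, which is one reason to keep inputs (or names) stable across test runs.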
Goldens are extremely similar to test cases - in fact almost identical - for both single and multi-turn. However, goldens are edit-heavy and contain extra fields that give you more flexibility to kickstart your LLM app for evaluation.
When you edit datasets on Confident AI, you are editing goldens, not test cases. Another important thing to remember is that single-turn goldens create single-turn test cases, and multi-turn goldens create multi-turn test cases.
A single-turn golden is represented by the Golden class in deepeval:
Pre-populating the actual output, retrieval context, and tools called in goldens is strongly discouraged, as these are meant to be populated dynamically at evaluation time, and pre-populating them defeats the purpose of evaluation.
You’ll notice that goldens are more opinionated and contain a custom_column_key_values field that you can edit either on the platform or via code.
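As a rough illustration of how a golden feeds a test case, the sketch below uses plain dataclasses as stand-ins for deepeval's `Golden` and `LLMTestCase`, modeling only a subset of fields. The `user_tier` key and the `your_llm_app` function are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Golden:
    """Stand-in for deepeval's Golden; only a subset of fields is modeled."""
    input: str
    custom_column_key_values: Dict[str, str] = field(default_factory=dict)


@dataclass
class LLMTestCase:
    """Stand-in for deepeval's LLMTestCase."""
    input: str
    actual_output: str


def your_llm_app(user_input: str, user_tier: str) -> str:
    """Hypothetical app that needs extra metadata besides the input."""
    return f"[{user_tier}] answer to: {user_input}"


golden = Golden(
    input="What is your refund policy?",
    custom_column_key_values={"user_tier": "premium"},  # hypothetical key
)

# actual_output is not stored on the golden; it is produced at evaluation time.
test_case = LLMTestCase(
    input=golden.input,
    actual_output=your_llm_app(
        golden.input, golden.custom_column_key_values["user_tier"]
    ),
)
print(test_case.actual_output)
```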
Lastly, a dataset is a collection of goldens. A dataset is either multi-turn or single-turn, and cannot be both at the same time. Datasets can be created either:
deepeval)
To create a single-turn dataset, you need to use single-turn goldens, and to create a multi-turn dataset, you need to use multi-turn goldens. At evaluation time, you will need to:
This workflow is extremely important and stays the same regardless of whether you are running end-to-end, component-level, single-turn, or multi-turn evals.
Users often overlook step 3, but it matters because it tells Confident AI which test run belongs to which dataset, which makes it possible for you to compare prompts and models later on.
Quick example of an end-to-end evaluation:
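The sketch below is a dependency-free approximation of such a run, not deepeval's actual API. It assumes a workflow of looping over a dataset's goldens, invoking the app, building test cases, attaching them to the dataset, and scoring them. The dataclasses are stand-ins for deepeval's `Golden`, `LLMTestCase`, and `EvaluationDataset`, while `your_llm_app` and the `non_empty_output` metric are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Golden:
    """Stand-in for deepeval's Golden."""
    input: str


@dataclass
class LLMTestCase:
    """Stand-in for deepeval's LLMTestCase."""
    input: str
    actual_output: str


@dataclass
class EvaluationDataset:
    """Stand-in dataset: holds goldens and the test cases built from them."""
    goldens: List[Golden]
    test_cases: List[LLMTestCase] = field(default_factory=list)

    def add_test_case(self, test_case: LLMTestCase) -> None:
        self.test_cases.append(test_case)


def your_llm_app(user_input: str) -> str:
    """Hypothetical LLM app; replace with your own application."""
    return f"Echo: {user_input}"


def non_empty_output(test_case: LLMTestCase) -> float:
    """Toy metric (hypothetical): 1.0 if the app produced any output."""
    return 1.0 if test_case.actual_output.strip() else 0.0


dataset = EvaluationDataset(
    goldens=[Golden("Hi"), Golden("What is your refund policy?")]
)

for golden in dataset.goldens:
    # Invoke the app with the golden's input to get a fresh actual_output.
    test_case = LLMTestCase(
        input=golden.input, actual_output=your_llm_app(golden.input)
    )
    # Attach the test case to the dataset it was derived from.
    dataset.add_test_case(test_case)

scores = [non_empty_output(tc) for tc in dataset.test_cases]
print(scores)
```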
Looking at both single and multi-turn examples, it should now be clear why multi-turn is much more challenging to evaluate, since we not only have to do the necessary ETL to format goldens into test cases, but also generate a long list of turns as well.
Now that you know what single-turn, multi-turn, end-to-end, and component-level testing are, as well as the primitives involved in evaluation, it’s time to understand: