Test Cases, Goldens, and Datasets

Overview

Test cases, goldens, and datasets are three of the most important primitives to learn for LLM evaluation. They define how interactions with your LLM app are represented in Confident AI, which is imperative for applying metrics during evaluation.

In summary:

Test cases represent either single-turn or multi-turn interactions with your LLM app, which metrics use for evaluation
Goldens are precursors to test cases. When you edit datasets on Confident AI, you are editing goldens, which contain not just the input that kickstarts your LLM app but also any other custom metadata required to invoke your app
Datasets are lists of goldens and orchestrate the entire evaluation process, whether it is single-turn, multi-turn, end-to-end, or component-level testing

These primitives are standardized in Confident AI and are used for all forms of evals.

You need to understand test cases in order to understand what metrics evaluate in the next section.

Test Cases

Test cases capture your LLM app’s runtime inputs and outputs, which metrics use for evaluation. Test cases are:

Found only in test runs, which are produced after evaluation
Assigned a pass/fail status, determined by their metric scores, and
Immutable, meaning they cannot be edited once created
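The three properties above can be illustrated with a small, self-contained sketch. It does not use deepeval: `EvaluatedTestCase`, `metric_scores`, and `thresholds` are hypothetical names, and the rule "pass only if every metric meets its threshold" is an assumption made for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Dict


@dataclass(frozen=True)  # frozen=True mimics "cannot be edited once created"
class EvaluatedTestCase:
    """Hypothetical stand-in for a test case as it appears in a test run."""
    input: str
    actual_output: str
    metric_scores: Dict[str, float]  # metric name -> score
    thresholds: Dict[str, float]     # metric name -> minimum passing score

    @property
    def passed(self) -> bool:
        # Assumed rule: the test case passes only if every metric meets its threshold.
        return all(
            score >= self.thresholds[name]
            for name, score in self.metric_scores.items()
        )


case = EvaluatedTestCase(
    input="What's the date today?",
    actual_output="I don't have access to today's date.",
    metric_scores={"answer_relevancy": 0.82, "faithfulness": 0.4},
    thresholds={"answer_relevancy": 0.5, "faithfulness": 0.5},
)
print(case.passed)  # one metric is below its threshold, so the case fails

try:
    case.actual_output = "edited"
except FrozenInstanceError:
    print("test cases cannot be edited once created")
```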

As a developer, you need to map your LLM app’s runtime inputs and outputs into the test case format, which is either single-turn or multi-turn.

Single-Turn

A single-turn test case represents a single, atomic interaction with your LLM app:

In the diagram above, an interaction can include an input, actual_output, retrieval_context (for RAG), tools_called, and so on. An interaction can live at either of two levels:

End-to-end level: The “observable” system inputs and outputs are piped into a test case
Component-level: An individual component’s interactions are piped into a test case
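To make the two levels concrete, here is a minimal sketch of a toy pipeline. The functions `rewrite_query`, `retrieve`, and `generate` are hypothetical stand-ins for your own components, and plain dictionaries are used in place of deepeval test cases so the snippet runs on its own.

```python
from typing import Dict, List


def rewrite_query(query: str) -> str:
    # Hypothetical component: normalizes the user query before retrieval.
    return query.lower().rstrip("?")


def retrieve(query: str) -> List[str]:
    # Hypothetical retriever returning a fixed chunk.
    return ["Refunds are accepted within 30 days of purchase."]


def generate(query: str, chunks: List[str]) -> str:
    # Hypothetical generator that answers from the first retrieved chunk.
    return f"According to our docs: {chunks[0]}"


user_query = "What is your refund policy?"
rewritten = rewrite_query(user_query)
chunks = retrieve(rewritten)
answer = generate(user_query, chunks)

# End-to-end: only the observable system input and output (plus retrieved chunks).
end_to_end_case: Dict[str, object] = {
    "input": user_query,
    "actual_output": answer,
    "retrieval_context": chunks,
}

# Component-level: the query rewriter's own input and output.
component_case: Dict[str, object] = {
    "input": user_query,
    "actual_output": rewritten,
}
```

The same run of the app produces both test cases; they differ only in which boundary of the system is being observed.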

In deepeval, a single-turn test case is represented by an LLMTestCase:

llm_test_case.py

from typing import List, Optional

from pydantic import BaseModel
from deepeval.test_case import ToolCall  # deepeval's tool call model

class LLMTestCase(BaseModel):
    input: str
    actual_output: Optional[str] = None
    retrieval_context: Optional[List[str]] = None
    tools_called: Optional[List[ToolCall]] = None

    # Static fields that are ported over from goldens
    expected_output: Optional[str] = None
    context: Optional[List[str]] = None
    expected_tools: Optional[List[ToolCall]] = None

    # Not used for evals
    name: Optional[str] = None

Each parameter represents different aspects of an interaction:

Input: The input to your LLM app. This is usually not the entire prompt. If you’re using the OpenAI API, for example, this would be the contents of the last user message.
Actual output: The output of your LLM app for a given input.
Retrieval Context: The dynamic text chunks that were retrieved, especially relevant for RAG use cases.
Tools Called: Any tools that were called for the given input.
Expected Output: The ideal output of your LLM app for a given input.
Context: Any static supporting context that is relevant for your use case.
Expected Tools: The ideal list of tools that should be called for a given input.
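Since the input is usually the last user message rather than the whole prompt, a small helper can extract it from an OpenAI-style message list. This is a sketch only: `last_user_message` is a hypothetical helper, not part of deepeval or the OpenAI SDK.

```python
from typing import Dict, List, Optional


def last_user_message(messages: List[Dict[str, str]]) -> Optional[str]:
    """Return the content of the most recent message with role 'user', if any."""
    for message in reversed(messages):
        if message.get("role") == "user":
            return message.get("content")
    return None


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello, how can I help?"},
    {"role": "user", "content": "What's the date today?"},
]

# This value would be passed as the `input` of a single-turn test case.
test_case_input = last_user_message(messages)
print(test_case_input)
```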

Here’s a quick example of how you would populate the input and actual_output fields of an LLMTestCase during evaluation:

main.py

from openai import OpenAI
from deepeval.test_case import LLMTestCase

client = OpenAI()

def llm_app(query: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": query}
        ]
    ).choices[0].message.content

query = "What's the date today?"
output = llm_app(query)

test_case = LLMTestCase(input=query, actual_output=output)

In practice, the input is unlikely to be hard-coded as it is in this example. It will almost always come from the single-turn goldens in your dataset.

When we run regression tests in later sections, we will match test cases by either their inputs or their names to see whether their performance has regressed across test runs.
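The idea of matching test cases across test runs can be sketched as follows. This is not Confident AI’s actual matching logic: the `Result` shape, the `match_key` rule (prefer the name, fall back to the input), and `find_regressions` are all assumptions made for illustration.

```python
from typing import List, Optional, TypedDict


class Result(TypedDict):
    """Hypothetical summary of one test case in a test run."""
    name: Optional[str]
    input: str
    passed: bool


def match_key(result: Result) -> str:
    # Assumed rule: match by name when one is set, otherwise by input.
    return result["name"] or result["input"]


def find_regressions(baseline: List[Result], candidate: List[Result]) -> List[str]:
    """Return keys of test cases that passed in the baseline run but fail in the candidate run."""
    baseline_by_key = {match_key(r): r for r in baseline}
    regressed = []
    for result in candidate:
        previous = baseline_by_key.get(match_key(result))
        if previous is not None and previous["passed"] and not result["passed"]:
            regressed.append(match_key(result))
    return regressed


baseline: List[Result] = [
    {"name": "refund-policy", "input": "What is your refund policy?", "passed": True},
    {"name": None, "input": "What's the date today?", "passed": True},
]
candidate: List[Result] = [
    {"name": "refund-policy", "input": "Can I get a refund?", "passed": False},
    {"name": None, "input": "What's the date today?", "passed": True},
]

regressions = find_regressions(baseline, candidate)
print(regressions)
```

Giving test cases a stable name lets them be matched even when the wording of the input changes between runs, as with the first test case here.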

Goldens

Goldens are almost identical to test cases, for both single-turn and multi-turn. However, goldens are meant to be edited and contain extra fields that give you more flexibility to kickstart your LLM app for evaluation.

When you edit datasets on Confident AI, you are editing goldens, not test cases. Another important thing to remember is that single-turn goldens create single-turn test cases, and multi-turn goldens create multi-turn test cases.

Single-Turn

A single-turn golden is represented by the Golden class in deepeval:

golden.py

from typing import Dict, List, Optional

from pydantic import BaseModel
from deepeval.test_case import ToolCall  # deepeval's tool call model

class Golden(BaseModel):
    input: str
    expected_output: Optional[str] = None
    context: Optional[List[str]] = None
    expected_tools: Optional[List[ToolCall]] = None

    # Useful metadata for generating test cases
    additional_metadata: Optional[Dict] = None
    comments: Optional[str] = None
    custom_column_key_values: Optional[Dict[str, str]] = None

    # Fields that you should ideally not populate
    actual_output: Optional[str] = None
    retrieval_context: Optional[List[str]] = None
    tools_called: Optional[List[ToolCall]] = None

Pre-populating the actual output, retrieval context, and tools called in goldens is strongly discouraged. These fields are meant to be populated dynamically at evaluation time, and filling them in ahead of time defeats the purpose of evaluation.

You’ll notice that goldens are more opinionated and contain a custom_column_key_values field that you can edit either on the platform or via code.
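As a rough sketch of how a golden turns into a test case, the snippet below uses stand-in dataclasses rather than deepeval’s own classes so that it runs on its own. `GoldenSketch`, `SketchTestCase`, `hypothetical_llm_app`, and the `"language"` custom column are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GoldenSketch:
    """Stand-in mirroring a few fields of a single-turn golden."""
    input: str
    expected_output: Optional[str] = None
    context: Optional[List[str]] = None
    custom_column_key_values: Dict[str, str] = field(default_factory=dict)


@dataclass
class SketchTestCase:
    """Stand-in mirroring a few fields of a single-turn test case."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None
    context: Optional[List[str]] = None


def hypothetical_llm_app(query: str, language: str = "en") -> str:
    # Placeholder for your real LLM app; it just echoes the query.
    return f"[{language}] echo: {query}"


def to_test_case(golden: GoldenSketch) -> SketchTestCase:
    # Custom columns carry whatever extra arguments your app needs to be invoked.
    language = golden.custom_column_key_values.get("language", "en")
    return SketchTestCase(
        input=golden.input,
        # Dynamic field: generated at evaluation time, never stored in the golden.
        actual_output=hypothetical_llm_app(golden.input, language=language),
        # Static fields: ported over from the golden unchanged.
        expected_output=golden.expected_output,
        context=golden.context,
    )


golden = GoldenSketch(
    input="Where is my order?",
    expected_output="A tracking link and an estimated delivery date.",
    custom_column_key_values={"language": "fr"},
)
test_case = to_test_case(golden)
print(test_case.actual_output)
```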

Datasets

Lastly, a dataset is a collection of goldens. A dataset is either multi-turn or single-turn, and cannot be both at the same time. Datasets can be created either:

On the platform directly under Project > Datasets, or
Via Confident AI’s Evals API (also available in deepeval)

To create a single-turn dataset, you need to use single-turn goldens, and likewise multi-turn goldens for a multi-turn dataset. At evaluation time, you will need to:

1. Loop through the goldens in your dataset and invoke your LLM app using the input of each golden
2. Map the correct arguments from your golden and LLM app to create test cases
3. Add these test cases back to your dataset
4. Run evaluation on these test cases

This workflow stays the same whether you are running end-to-end, component-level, single-turn, or multi-turn evals.

Users often overlook step 3, but it is essential: it tells Confident AI which test run belongs to which dataset, which makes it possible for you to compare prompts and models later on.

Here is a quick example of an end-to-end evaluation:

Single-Turn

main.py

from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

# your_llm_app is a placeholder: replace it with the function that invokes your LLM app

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS") # replace with your alias

# step 1.
for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=your_llm_app(golden.input) # step 2.
    )
    # step 3., very important!
    dataset.add_test_case(test_case)

# step 4.
evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])

Comparing single-turn and multi-turn evaluation, it should now be clear why multi-turn is much more challenging to evaluate: we not only have to do the necessary ETL to format goldens into test cases, but also have to generate a long list of turns.

Next Steps

Now that you know what single-turn, multi-turn, end-to-end, and component-level testing are, as well as the primitives involved in evaluation, it’s time to understand:

What are LLM-as-a-Judge metrics
Which metrics are suitable for your use case