Jeffrey Ip
Cofounder @ Confident, creator of DeepEval. Ex-Googler (YouTube), Microsoft AI (Office365). Working overtime to enforce responsible AI.

How I Built Deterministic LLM Evaluation Metrics for DeepEval

February 9, 2025 · 9 min read

A little more than a month ago, I had several calls with a few DeepEval users and noticed a clear divide — those who were happy with the out-of-the-box metrics and those who weren’t.

For context, DeepEval is an open-source LLM evaluation framework I’ve been working on for the past year, and all of its LLM evaluation metrics use LLM-as-a-judge. It’s grown to nearly half a million monthly downloads and close to 5,000 GitHub stars. With over 800k daily evaluations run, engineers nowadays use it to unit test LLM applications such as RAG pipelines, agents, and chatbots.

The users who weren’t satisfied with our metrics had a simple reason: the metrics didn’t fit their use case and they weren’t deterministic enough since they were all evaluated using LLM-as-a-judge. That’s a real problem because the whole point of DeepEval is to eliminate the need for engineers to build their own evaluation metrics and pipelines. If our built-in metrics aren’t usable and people have to go through that effort anyway, then we have no reason to exist.

The more users I talked to, the more I saw codebases filled with hundreds of lines of prompts and logic just to tweak the metrics to be more tailored and deterministic. It was clear that users weren’t just customizing — they were compensating for gaps in what we provided.

This raised an important question: How can we make DeepEval’s built-in metrics flexible and deterministic enough that fewer teams feel the need to roll their own?

Spoiler alert - the solution looks something like this:

[Figure: A deterministic, LLM-powered decision tree]

The Problem With Custom Metrics

To set the stage, DeepEval’s default metrics are ones such as contextual recall, answer relevancy, and answer correctness, where the criteria are more general and (relatively) stable. I say stable because these metrics are mostly based on the question-answer generation (QAG) technique. Since QAG constrains verdicts to a binary “yes” or “no” for close-ended questions, there’s very little room for stochasticity.

And no, we're not using statistical scorers for LLM evaluation metrics, and you can learn why in this separate article.

I’d also argue that metrics like answer relevancy are inherently broad. It’s difficult to define relevancy in absolute terms, but as long as the metric provides a reasonable explanation, most people are willing to accept it. Another reason these metrics worked well is that they had clear and straightforward equations behind them. Whether someone liked a metric or not usually came down to whether they agreed with the algorithm used to calculate it.

For example, the contextual recall metric mentioned earlier assesses a RAG pipeline’s retriever — it determines, for a given input to your LLM application, whether the text chunks your retriever has retrieved are sufficient to produce the ideal expected LLM output. The algorithm was simple and intuitive:

  1. Extract all attributes found in the expected output using an LLM-judge
  2. For each extracted attribute, use an LLM-judge to determine whether it can be inferred from the retrieval context, which is a list of text nodes. This uses QAG, where the determination will be either “yes” or “no” for each extracted attribute.
  3. The final contextual recall score is the proportion of “yes” verdicts out of the total.

It was easy to understand and made intuitive sense — after all, this is exactly how recall should work. And because the LLM-judge was constrained to binary responses in step 2, the evaluation remained mostly stable for any given test case.
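
To make the arithmetic concrete, here is a minimal sketch of that final scoring step in Python. The verdicts themselves would come from the LLM-judge in step 2; the function below is purely illustrative and is not DeepEval’s internal implementation.

def contextual_recall_score(verdicts: list[str]) -> float:
    """Proportion of extracted attributes the LLM-judge marked as
    inferable from the retrieval context ("yes") out of all attributes."""
    if not verdicts:
        return 0.0
    yes_count = sum(1 for v in verdicts if v.strip().lower() == "yes")
    return yes_count / len(verdicts)

# e.g. 4 attributes extracted from the expected output, 3 of them supported
# by the retrieved text chunks -> contextual recall of 0.75
print(contextual_recall_score(["yes", "yes", "no", "yes"]))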

But the problem came when we started looking at evaluation metrics that involved custom criteria, and quite frankly, I also didn’t believe our default metrics were well-rounded enough for use-case-specific evaluation.

You see, when I talk about custom criteria, I don’t mean something simple like:

“Determine if the LLM output is a good summary of the expected output.”

That’s relatively straightforward. What I mean are criteria like:

“Check if the output is in markdown format. If it is, verify that all summary headings are present. If they are, confirm they are in the correct order. Then, assess the quality of the summary itself.”

This kind of evaluation is fundamentally different. It’s no longer just about determining quality — it’s about enforcing a multi-step process with conditional logic, each step introducing new layers of complexity.

Internally, we started calling simpler criteria like “is this a good summary?” toy criteria. These cases didn’t require true deterministic evaluation, and we already had GEval to support them: a SOTA metric that scores custom criteria through CoT prompting via a form-filling paradigm. Sure, users could tweak their criteria’s language to make the metric more or less strict, but when you have a clear use case like summarization and don’t have thousands of test cases to establish statistical significance, you need deterministic evaluation. For most users, this was a deal-breaker, so in these cases it’s not enough to rely on loose, subjective assessments — the evaluation needs to produce consistent, reliable scores that teams can trust to reflect real performance.
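
For reference, here is roughly what a toy criterion looks like with GEval. This is a minimal sketch; the criterion string, evaluation parameters, and placeholder values are illustrative rather than a recommended setup.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# a "toy" custom criterion: subjective, single-step, and not deterministic
summary_quality = GEval(
    name="Summary Quality",
    criteria="Determine if the actual output is a good summary of the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="Summarize the meeting transcript.",
    actual_output="...",    # your LLM application's summary
    expected_output="...",  # the ideal summary
)
summary_quality.measure(test_case)
print(summary_quality.score, summary_quality.reason)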

This brings us back to the horrifying codebases we saw — hundreds of lines dedicated to tweaking and shaping evaluation logic just to make the metrics work for specific use cases.


Finding A Repeatable Solution For All

Whether our users realized it or not, what they were building was essentially a decision tree powered by LLMs. Each step in their evaluation process was a node, where an LLM would make a decision or extract key attributes, and the result would determine what happened next. If a response passed one check, it moved on to the next. If it failed, the scoring logic adjusted accordingly.

This explained why their codebases looked the way they did — long chains of LLM calls, nested conditionals, and rule-based logic, all designed to enforce a structured evaluation process. And while it was clear that users needed this level of control, it was equally clear that writing and maintaining this logic manually was painful.
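
To make that concrete, here is a heavily condensed, hypothetical sketch of the kind of logic we kept seeing. The llm_judge callable and the criteria strings are made up for illustration; they are not taken from any user’s codebase.

# hypothetical hand-rolled evaluation: each step is an LLM call whose answer
# decides which branch runs next, and the score depends on the path taken
def evaluate_summary(actual_output: str, llm_judge) -> float:
    headings = llm_judge(f"Extract all headings from:\n{actual_output}")

    has_all = llm_judge(
        "Do these headings contain 'intro', 'body', and 'conclusion'? "
        f"Answer yes or no.\nHeadings: {headings}"
    )
    if has_all.strip().lower() != "yes":
        return 0.0  # fail fast: required sections are missing

    order = llm_judge(
        "Are the headings ordered intro => body => conclusion? Answer "
        f"'yes', 'two out of order', or 'all out of order'.\nHeadings: {headings}"
    )
    if order.strip().lower() == "yes":
        return 1.0
    elif "two" in order.lower():
        return 0.4
    return 0.2

Multiply this by every criterion and every branch, and you end up with the hundreds of lines of prompts and glue code described above.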

Interestingly, some users were already using GEval within their DAGs, not for full evaluation but as a lightweight way to pass or fail certain checks before moving forward. Sometimes, the goal wasn’t to assign a granular score but simply to verify whether a response met a minimum requirement before proceeding to a more detailed evaluation. This reinforced what we were seeing: users needed flexibility in how they structured their evaluations, and a one-size-fits-all scoring system wasn’t going to cut it.

We concluded that we needed a way for users to easily build DAGs within DeepEval, and that four types of nodes were required:

  1. Nodes that processed test cases into better formats for evaluation (LLM inputs, outputs, expected outputs, etc.)
  2. Nodes that made binary judgements (“yes” or “no”, like QAG) based on the context from their parent nodes.
  3. Nodes that made non-binary judgements (a list of possible string outputs) based on the context from their parent nodes.
  4. Leaf nodes that returned either a hard-coded score or a GEval-computed score based on a parent node’s verdict (e.g. “yes” or “no” if its parent makes binary judgements).

And so, we shipped a DAG metric with these four types of nodes and (obnoxiously) named it the Deep Acyclic Graph (DAG) metric. We even got a comment from a redditor about the naming:

“Confusion isn’t fun for the confused. It (DAG) also runs the risk of giving readers/users the impression that the authors of the tool don’t know what a DAG really is.”

Don’t you worry, I promise we know what a directed acyclic graph is (but we might eventually rename it to double dee-aee-gee, who knows?).

Deterministic, LLM-Powered Decision Trees

As of February 6th, 2025, DeepEval’s DAG metric is the most powerful metric we’ve built so far. It’s not only customizable but also fully deterministic, structured around LLM-powered decision trees to bring clarity and control to evaluation.

What makes DeepEval’s default metrics work so well is that they break an LLM test case (which contains parameters such as the input, actual output, expected output, etc.) into atomic units and constrain an LLM-judge’s verdicts through QAG. This reduces the chance of hallucination and allows for finer control at each step by providing custom examples for in-context learning. We took this principle and applied it directly to the DAG metric, structuring evaluations into four core node types:

  • Task nodes — Break down an LLM test case into atomic units.
  • Binary Judgment nodes — Return “True” or “False”, and decide the next node based on either the raw test case or outputs from parent nodes.
  • Non-Binary Judgment nodes — Return one of a list of possible string outputs, and decide the next node based on either the raw test case or outputs from parent nodes.
  • Verdict nodes — Return a final hard-coded or GEval-computed score based on the full evaluation path.

(All leaf nodes are verdict nodes, and a verdict node cannot be a root node.)

The most impactful use case we’ve seen so far is summarization, especially for structured documents like legal contracts, medical notes, or meeting transcripts. These summaries must follow a specific format, with required sections in the correct order, while also maintaining quality and completeness — each step becoming a node in the DAG.

For example, let’s say you’re evaluating a meeting transcript summary that needs to include three key sections in the correct order: Intro, Body, and Conclusion. Instead of manually writing and managing evaluation prompts for each, you can:

  1. Define a task node as the root node in the DAG to extract the summary headings
  2. Use a binary judgement node to determine whether all three summary headings are present.
  3. If they are, use a non-binary judgement node to determine whether they are in the correct order (or conversely, how out of order they are).

Depending on the evaluation path, the DAG returns a different score. This handles everything that would otherwise require a complex, custom-built evaluation pipeline.

[Figure: A DAG decision tree]

With DeepEval’s DAG metric, all of this can now be done in just a few lines of code, eliminating the need for manual orchestration while ensuring structured and deterministic evaluation.


from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import DAGMetric
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)

correct_order_node = NonBinaryJudgementNode(
    criteria="Are the summary headings in the correct order: 'intro' => 'body' => 'conclusion'?",
    children=[
        VerdictNode(verdict="Yes", score=10),
        VerdictNode(verdict="Two are out of order", score=4),
        VerdictNode(verdict="All out of order", score=2),
    ],
)

correct_headings_node = BinaryJudgementNode(
    criteria="Does the summary headings contain all three: 'intro', 'body', and 'conclusion'?",
    children=[VerdictNode(verdict=False, score=0), correct_order_node],
)

extract_headings_node = TaskNode(
    instructions="Extract all headings in `actual_output`",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    output_label="Summary headings",
    children=[correct_headings_node, correct_order_node],
)

# create the DAG
dag = DeepAcyclicGraph(root_nodes=[extract_headings_node])

# create the metric
metric = DAGMetric(name="Summarization", dag=dag)

# create a test case to evaluate (values here are placeholders for illustration)
test_case = LLMTestCase(
    input="Summarize the meeting transcript.",
    actual_output="# Intro\n...\n# Body\n...\n# Conclusion\n...",
)

# run the metric!
metric.measure(test_case)
print(metric.score, metric.reason)

There are still so many more use cases to explore, but I’ll end this section here and leave the rest to your imagination.


Icing On The Cake, With A Cherry On Top


Beyond solving the core problem, the DAG metric brought some unexpected advantages that made it even more powerful.

Equally Effective with Weaker LLMs

Many users prefer to use their own LLMs for evaluation, but with traditional metrics like GEval, weaker models struggle to provide reliable results. Instead of needing to:

  • Fine-tune a custom model, or
  • Fill the prompt with tons of examples,

the DAG metric gives you the control to break evaluation down into granular steps, making each judgement simple enough for even smaller, less capable models to handle.
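
If you already run evaluations with your own model, the same applies to DAG. Here is a minimal sketch, assuming the dag built in the earlier example and that DAGMetric accepts the same model argument as DeepEval’s other metrics; the model name below is only a placeholder.

# reuse the `dag` from the earlier summarization example; the `model`
# argument is an assumption based on other DeepEval metrics, and
# "gpt-4o-mini" stands in for whichever judge model you actually use
metric = DAGMetric(name="Summarization", dag=dag, model="gpt-4o-mini")
metric.measure(test_case)
print(metric.score)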

Able to Take Advantage of DeepEval’s Ecosystem

Since DAG is fully integrated into DeepEval, it benefits from:

  • Optimized parallel execution — Task nodes at the same level run in parallel.
  • Efficient cost management — No need to manually track API usage or optimize calls.
  • Built-in caching — Previously computed metric results are reused when possible.
  • Error handling — An error that occurs in your DAG is automatically bubbled up for debugging. You can also choose to ignore errors so they don’t halt your entire evaluation.
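
For example, running the DAG metric through DeepEval’s standard evaluate entry point (reusing the metric and test_case from the earlier example) picks these benefits up without extra wiring; a minimal sketch:

from deepeval import evaluate

# caching, parallel execution across test cases, and error handling are
# handled by DeepEval's evaluation loop rather than by your own glue code
evaluate(test_cases=[test_case], metrics=[metric])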

Debug DAGs Like Never Before

Understanding why a test case passes or fails is crucial. With DeepEval’s verbose_mode, the DAG metric:

  • Logs the full evaluation path — You can trace every decision taken.
  • Shows intermediate judgments — Easily debug why a test failed.
  • Provides insights at each step — See where adjustments are needed.
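
Enabling it is typically a single constructor flag on the metric; a minimal sketch reusing the earlier dag:

# verbose_mode logs the evaluation path, intermediate judgements,
# and per-node reasoning while the metric runs
metric = DAGMetric(name="Summarization", dag=dag, verbose_mode=True)
metric.measure(test_case)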

The DAG metric doesn’t just improve evaluation — it makes the entire process more transparent, efficient, and adaptable, regardless of the LLM you’re using.

Final Thoughts

We were originally worried that people wouldn’t love DeepEval because it didn’t give them enough control.

Existing evaluation metrics either lack control (like GEval, which struggles with weaker models) or require extensive prompt engineering and manual orchestration to get reliable results. The latest DAG metric solves this by breaking evaluation into structured, deterministic steps, allowing even small models to perform well while enabling granular customization.

Fully integrated into DeepEval, it runs much faster than any custom-built solution, automating parallel execution, cost tracking, caching, and verbose debugging — eliminating the complexity of building and maintaining your own evaluation framework.

Don’t forget to give ⭐ DeepEval a star on GitHub ⭐ if you found this article insightful, and as always, till next time.

* * * * *

Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our Discord. I might give you an “aha!” moment, who knows?
