Jeffrey Ip
Cofounder @ Confident, creator of DeepEval. Ex-Googler (YouTube), Microsoft AI (Office365). Working overtime to enforce responsible AI.

How I Built Deterministic LLM Evaluation Metrics for DeepEval

February 9, 2025 · 9 min read

A little more than a month ago, I had several calls with a few DeepEval users and noticed a clear divide — those who were happy with the out-of-the-box metrics and those who weren’t.

For context, DeepEval is an open-source LLM evaluation framework I’ve been working on for the past year, and all of its LLM evaluation metrics use LLM-as-a-judge. It’s grown to nearly half a million monthly downloads and close to 5,000 GitHub stars. With over 800k daily evaluations run, engineers nowadays use it to unit test LLM applications such as RAG pipelines, agents, and chatbots.

The users who weren’t satisfied with our metrics had a simple reason: the metrics didn’t fit their use case and they weren’t deterministic enough since they were all evaluated using LLM-as-a-judge. That’s a real problem because the whole point of DeepEval is to eliminate the need for engineers to build their own evaluation metrics and pipelines. If our built-in metrics aren’t usable and people have to go through that effort anyway, then we have no reason to exist.

The more users I talked to, the more I saw codebases filled with hundreds of lines of prompts and logic just to tweak the metrics to be more tailored and deterministic. It was clear that users weren’t just customizing — they were compensating for gaps in what we provided.

This raised an important question: How can we make DeepEval’s built-in metrics flexible and deterministic enough that fewer teams feel the need to roll their own?

Spoiler alert - the solution looks something like this:

[Figure: A deterministic, LLM-powered decision tree]

The Problem With Custom Metrics

To set the stage, DeepEval’s default metrics are ones such as contextual recall, answer relevancy, and answer correctness, where the criteria are more general and (relatively) stable. I say stable because these metrics are mostly based on the question-answer generation (QAG) technique. Since QAG constrains verdicts to a binary “yes” or “no” for close-ended questions, there’s very little room for stochasticity.

And no, we're not using statistical scorers for LLM evaluation metrics, and you can learn why in this separate article.

I’d also argue that metrics like answer relevancy are inherently broad. It’s difficult to define relevancy in absolute terms, but as long as the metric provides a reasonable explanation, most people are willing to accept it. Another reason these metrics worked well is that they had clear and straightforward equations behind them. Whether someone liked a metric or not usually came down to whether they agreed with the algorithm used to calculate it.

For example, the contextual recall metric mentioned earlier assesses a RAG pipeline’s retriever — it determines, for a given input to your LLM application, whether the text chunks your retriever has retrieved are sufficient to produce the ideal expected LLM output. The algorithm was simple and intuitive:

  1. Extract all attributes found in the expected output using an LLM-judge
  2. For each extracted attribute, use an LLM-judge to determine whether it can be inferred from the retrieval context, which is a list of text nodes. This uses QAG, where the determination will be either “yes” or “no” for each extracted attribute.
  3. The final contextual recall score is the proportion of “yes” verdicts out of the total.

It was easy to understand and made intuitive sense — after all, this is exactly how recall should work. And because the LLM-judge was constrained to binary responses in step 2, the evaluation remained mostly stable for any given test case.
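
To make the arithmetic concrete, here is a minimal sketch of that final scoring step in Python. The verdicts themselves would come from the LLM-judge in step 2; the function below is purely illustrative and is not DeepEval’s internal implementation.

def contextual_recall_score(verdicts: list[str]) -> float:
    """Proportion of extracted attributes the LLM-judge marked as
    inferable from the retrieval context ("yes") out of all attributes."""
    if not verdicts:
        return 0.0
    yes_count = sum(1 for v in verdicts if v.strip().lower() == "yes")
    return yes_count / len(verdicts)

# e.g. 4 attributes extracted from the expected output, 3 of them supported
# by the retrieved text chunks -> contextual recall of 0.75
print(contextual_recall_score(["yes", "yes", "no", "yes"]))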

But the problem came when we started looking at evaluation metrics that involved custom criteria, and quite frankly, I also didn’t believe our default metrics were well-rounded enough for use-case-specific evaluation.

You see, when I talk about custom criteria, I don’t mean something simple like:

“Determine if the LLM output is a good summary of the expected output.”

That’s relatively straightforward. What I mean are criteria like:

“Check if the output is in markdown format. If it is, verify that all summary headings are present. If they are, confirm they are in the correct order. Then, assess the quality of the summary itself.”

This kind of evaluation is fundamentally different. It’s no longer just about determining quality — it’s about enforcing a multi-step process with conditional logic, each step introducing new layers of complexity.

Internally, we started calling simpler criteria like “is this a good summary?” toy criteria. These cases didn’t require true deterministic evaluation, and we already had GEval to support them: a SOTA metric that scores custom criteria through CoT prompting via a form-filling paradigm. Sure, users could tweak their criteria’s language to make the metric more or less strict, but when you have a clear use case like summarization and don’t have thousands of test cases to establish statistical significance, you need deterministic evaluation. For most users, this was a deal-breaker, so in these cases it’s not enough to rely on loose, subjective assessments — the evaluation needs to produce consistent, reliable scores that teams can trust to reflect real performance.
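
For reference, here is roughly what a toy criterion looks like with GEval. This is a minimal sketch; the criterion string, evaluation parameters, and placeholder values are illustrative rather than a recommended setup.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# a "toy" custom criterion: subjective, single-step, and not deterministic
summary_quality = GEval(
    name="Summary Quality",
    criteria="Determine if the actual output is a good summary of the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="Summarize the meeting transcript.",
    actual_output="...",    # your LLM application's summary
    expected_output="...",  # the ideal summary
)
summary_quality.measure(test_case)
print(summary_quality.score, summary_quality.reason)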

This brings us back to the horrifying codebases we saw — hundreds of lines dedicated to tweaking and shaping evaluation logic just to make the metrics work for specific use cases.


Finding A Repeatable Solution For All

Whether our users realized it or not, what they were building was essentially a decision tree powered by LLMs. Each step in their evaluation process was a node, where an LLM would make a decision or extract key attributes, and the result would determine what happened next. If a response passed one check, it moved on to the next. If it failed, the scoring logic adjusted accordingly.

This explained why their codebases looked the way they did — long chains of LLM calls, nested conditionals, and rule-based logic, all designed to enforce a structured evaluation process. And while it was clear that users needed this level of control, it was equally clear that writing and maintaining this logic manually was painful.
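
To make that concrete, here is a heavily condensed, hypothetical sketch of the kind of logic we kept seeing. The llm_judge callable and the criteria strings are made up for illustration; they are not taken from any user’s codebase.

# hypothetical hand-rolled evaluation: each step is an LLM call whose answer
# decides which branch runs next, and the score depends on the path taken
def evaluate_summary(actual_output: str, llm_judge) -> float:
    headings = llm_judge(f"Extract all headings from:\n{actual_output}")

    has_all = llm_judge(
        "Do these headings contain 'intro', 'body', and 'conclusion'? "
        f"Answer yes or no.\nHeadings: {headings}"
    )
    if has_all.strip().lower() != "yes":
        return 0.0  # fail fast: required sections are missing

    order = llm_judge(
        "Are the headings ordered intro => body => conclusion? Answer "
        f"'yes', 'two out of order', or 'all out of order'.\nHeadings: {headings}"
    )
    if order.strip().lower() == "yes":
        return 1.0
    elif "two" in order.lower():
        return 0.4
    return 0.2

Multiply this by every criterion and every branch, and you end up with the hundreds of lines of prompts and glue code described above.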

Interestingly, some users were already using GEval within their DAGs, not for full evaluation but as a lightweight way to pass or fail certain checks before moving forward. Sometimes, the goal wasn’t to assign a granular score but simply to verify whether a response met a minimum requirement before proceeding to a more detailed evaluation. This reinforced what we were seeing: users needed flexibility in how they structured their evaluations, and a one-size-fits-all scoring system wasn’t going to cut it.

We concluded that we needed a way for users to easily build DAGs within DeepEval, and that four types of nodes were required:

  1. Nodes that processed test cases into better formats for evaluation (LLM inputs, outputs, expected outputs, etc.)
  2. Nodes that made binary judgements (“yes” or “no”, like QAG) based on the context from their parent nodes.
  3. Nodes that made non-binary judgements (a list of possible string outputs) based on the context from their parent nodes.
  4. Leaf nodes that returned either a hard-coded score or a GEval-computed score based on a parent node’s verdict (e.g. “yes” or “no” if its parent makes binary judgements).

And so, we shipped a DAG metric with these four types of nodes and (obnoxiously) named it the Deep Acyclic Graph (DAG) metric. We even got a comment from a redditor about the naming:

“Confusion isn’t fun for the confused. It (DAG) also runs the risk of giving readers/users the impression that the authors of the tool don’t know what a DAG really is.”

Don’t you worry, I promise we know what a directed acyclic graph is (but we might eventually rename it to double dee-aee-gee, who knows?).

Deterministic, LLM-Powered Decision Trees

As of February 6th, 2025, DeepEval’s DAG metric is the most powerful metric we’ve built so far. It’s not only customizable but also fully deterministic, structured around LLM-powered decision trees to bring clarity and control to evaluation.

What makes DeepEval’s default metrics work so well is that they break an LLM test case (which contains parameters such as the input, actual output, expected output, etc.) into atomic units and constrain an LLM-judge’s verdicts through QAG. This reduces the chance of hallucination and allows for finer control at each step by providing custom examples for in-context learning. We took this principle and applied it directly to the DAG metric, structuring evaluations into four core node types:

  • Task nodes — Break down an LLM test case into atomic units.
  • Binary Judgment nodes — Return “True” or “False”, and decide the next node based on either the raw test case or outputs from parent nodes.
  • Non-Binary Judgment nodes — Return one of a list of possible string outputs, and decide the next node based on either the raw test case or outputs from parent nodes.
  • Verdict nodes — Return a final hard-coded or GEval-computed score based on the full evaluation path.

(All leaf nodes are verdict nodes, and a verdict node cannot be a root node.)

The most impactful use case we’ve seen so far is summarization, especially for structured documents like legal contracts, medical notes, or meeting transcripts. These summaries must follow a specific format, with required sections in the correct order, while also maintaining quality and completeness — each step becoming a node in the DAG.

For example, let’s say you’re evaluating a meeting transcript summary that needs to include three key sections in the correct order: Intro, Body, and Conclusion. Instead of manually writing and managing evaluation prompts for each, you can:

  1. Define a task node as the root node in the DAG to extract the summary headings
  2. Use a binary judgement node to determine whether all three summary headings are present.
  3. If they are, use a non-binary judgement node to determine whether they are in the correct order (or conversely, how out of order they are).

Depending on the evaluation path, the DAG returns a different score. This handles everything that would otherwise require a complex, custom-built evaluation pipeline.

[Figure: A DAG decision tree]

With DeepEval’s DAG metric, all of this can now be done in just a few lines of code, eliminating the need for manual orchestration while ensuring structured and deterministic evaluation.


from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import DAGMetric
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)

correct_order_node = NonBinaryJudgementNode(
    criteria="Are the summary headings in the correct order: 'intro' => 'body' => 'conclusion'?",
    children=[
        VerdictNode(verdict="Yes", score=10),
        VerdictNode(verdict="Two are out of order", score=4),
        VerdictNode(verdict="All out of order", score=2),
    ],
)

correct_headings_node = BinaryJudgementNode(
    criteria="Does the summary headings contain all three: 'intro', 'body', and 'conclusion'?",
    children=[VerdictNode(verdict=False, score=0), correct_order_node],
)

extract_headings_node = TaskNode(
    instructions="Extract all headings in `actual_output`",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    output_label="Summary headings",
    children=[correct_headings_node, correct_order_node],
)

# create the DAG
dag = DeepAcyclicGraph(root_nodes=[extract_headings_node])

# create the metric
metric = DAGMetric(name="Summarization", dag=dag)

# create a test case to evaluate (values here are placeholders for illustration)
test_case = LLMTestCase(
    input="Summarize the meeting transcript.",
    actual_output="# Intro\n...\n# Body\n...\n# Conclusion\n...",
)

# run the metric!
metric.measure(test_case)
print(metric.score, metric.reason)

There are still so many more use cases to explore, but I’ll end this section here and leave the rest to your imagination.


Icing On The Cake, With A Cherry On Top


Beyond solving the core problem, the DAG metric brought some unexpected advantages that made it even more powerful.

Equally Effective with Weaker LLMs

Many users prefer to use their own LLMs for evaluation, but with traditional metrics like GEval, weaker models struggle to provide reliable results. Instead of needing to:

  • Fine-tune a custom model, or
  • Fill the prompt with tons of examples,

the DAG metric gives you the control to break evaluation down into granular steps, making each judgement simple enough for even smaller, less capable models to handle.
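
If you already run evaluations with your own model, the same applies to DAG. Here is a minimal sketch, assuming the dag built in the earlier example and that DAGMetric accepts the same model argument as DeepEval’s other metrics; the model name below is only a placeholder.

# reuse the `dag` from the earlier summarization example; the `model`
# argument is an assumption based on other DeepEval metrics, and
# "gpt-4o-mini" stands in for whichever judge model you actually use
metric = DAGMetric(name="Summarization", dag=dag, model="gpt-4o-mini")
metric.measure(test_case)
print(metric.score)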

Able to Take Advantage of DeepEval’s Ecosystem

Since DAG is fully integrated into DeepEval, it benefits from:

  • Optimized parallel execution — Task nodes at the same level run in parallel.
  • Efficient cost management — No need to manually track API usage or optimize calls.
  • Built-in caching — Previously computed metric results are reused when possible.
  • Error handling — An error that occurs in your DAG is automatically bubbled up for debugging. You can also choose to ignore errors so they don’t halt your entire evaluation.
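
For example, running the DAG metric through DeepEval’s standard evaluate entry point (reusing the metric and test_case from the earlier example) picks these benefits up without extra wiring; a minimal sketch:

from deepeval import evaluate

# caching, parallel execution across test cases, and error handling are
# handled by DeepEval's evaluation loop rather than by your own glue code
evaluate(test_cases=[test_case], metrics=[metric])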

Debug DAGs Like Never Before

Understanding why a test case passes or fails is crucial. With DeepEval’s verbose_mode, the DAG metric:

  • Logs the full evaluation path — You can trace every decision taken.
  • Shows intermediate judgments — Easily debug why a test failed.
  • Provides insights at each step — See where adjustments are needed.
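
Enabling it is typically a single constructor flag on the metric; a minimal sketch reusing the earlier dag:

# verbose_mode logs the evaluation path, intermediate judgements,
# and per-node reasoning while the metric runs
metric = DAGMetric(name="Summarization", dag=dag, verbose_mode=True)
metric.measure(test_case)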

The DAG metric doesn’t just improve evaluation — it makes the entire process more transparent, efficient, and adaptable, regardless of the LLM you’re using.

Final Thoughts

We were originally worried that people wouldn’t love DeepEval because it didn’t give them enough control.

Existing evaluation metrics either lack control (like GEval, which struggles with weaker models) or require extensive prompt engineering and manual orchestration to get reliable results. The latest DAG metric solves this by breaking evaluation into structured, deterministic steps, allowing even small models to perform well while enabling granular customization.

Fully integrated into DeepEval, it runs much faster than any custom-built solution, automating parallel execution, cost tracking, caching, and verbose debugging — eliminating the complexity of building and maintaining your own evaluation framework.

Don’t forget to give ⭐ DeepEval a star on GitHub ⭐ if you found this article insightful, and as always, till next time.

* * * * *

Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our Discord. I might give you an “aha!” moment, who knows?
