Kritin Vongthongsri
Cofounder @ Confident AI | Empowering LLM practitioners with Evals | Previously AI/ML at Fintech Startup | ML + CS @ Princeton

LLM Agent Evaluation: Assessing Tool Use, Task Completion, Agentic Reasoning, and More

January 28, 2025
·
14 min read
LLM agents suck. I spent the past week building a web-crawling LLM agent using a popular Python framework to scrape information on potential leads off the internet. It was a complete letdown.

The agent was slow, inconsistent, and riddled with issues (sounds familiar, Operator @ OpenAI?). It kept making unnecessary function calls and would occasionally get stuck in infinite reasoning loops that made no sense at all. Eventually, I scrapped it for a simple web-scraping script that took 30 minutes to code.

Don’t get me wrong — I’m a huge advocate for LLM agents and fully believe in their potential. But let’s face it: building an effective agent is no easy task. There are countless things that can go wrong, and even minor bottlenecks can make or break your entire user experience.

That said, it’s not all doom and gloom. If you’re able to identify these bottlenecks and implement the right fixes, the possibilities for automation are endless. The key is knowing how to evaluate and improve your LLM agent correctly and effectively.

Fortunately for you, over the past year, I’ve helped hundreds of companies test and refine their agents, pored over every LLM agent benchmark I could find (including new ones popping up constantly), and built an army of agents myself. And today, I’ll walk you through everything you need to know about LLM agent evaluation.

LLM Agent Evaluation vs. LLM Evaluation

To understand how LLM Agent evaluation differs from “traditional” LLM evaluation, it’s important to first establish what makes these LLM Agents unique.

Simple LLM Agent Architecture
  1. LLM Agents can invoke tools and call APIs.
  2. LLM Agents can be highly autonomous.
  3. LLM Agents are powered by reasoning frameworks.

Here’s how these agent-specific attributes shape the way we evaluate LLM Agents.

1. LLM Agents can invoke tools and call APIs

Perhaps the most notable characteristic of LLM Agents is their ability to call and invoke “tools,” such as APIs or functions that can interact with the real world — updating databases, buying stocks, reserving restaurants, and even web scraping.

For obvious reasons, this is fantastic if you’ve perfected your agent engineering. But chances are, there’s behavior you don’t know about: maybe your agent prefers booking certain restaurants on specific days, or perhaps it occasionally calls 10 other totally unrelated tools before performing a simple web-scraping task.

Such complexities and potential errors in tool calling — whether calling the right tools, using the correct tool input parameters, or generating the correct tool output — make Tool Calling Metrics essential to evaluating LLM agents.

2. LLM Agents are significantly more autonomous

LLM Agents operate with a much higher level of autonomy. While a traditional LLM application typically generates a single response to a single user input, an LLM Agent might take multiple reasoning steps, make several tool calls, and only then respond.

As you can imagine, this shift slightly complicates the evaluation process. While a single “test case” might previously have been as simple as an input-output pair, it now also includes the intermediate reasoning steps and the tools called.

Example agent workflow that includes tool-calling and reasoning steps

These intricate workflows (the combination of tool calls, reasoning steps, and agent responses) aren’t so easily captured by traditional RAG metrics like Answer Relevancy, which don’t account for tool calls or reasoning steps. They call for newer metrics tailored to these agentic thought processes.

That’s not to say you shouldn’t be using RAG metrics to evaluate your LLM Agent — quite the opposite, actually. You’ll absolutely want to use them, especially if your agent retrieves information from a knowledge-base (in fact, here’s an excellent guide on RAG evaluation). But you’ll also need additional metrics that are specifically tailored to evaluating agent workflows, which I’ll dive further into in the sections below.

3. LLM Agents are powered by reasoning frameworks.

Finally, LLM agents don’t just act — they reason. Before taking any action, such as calling a tool or crafting a response, agents deliberate on why that action is the appropriate next step. This reasoning process is influenced by various factors, including the underlying LLM and the prompt template (e.g., instructing the agent to perform chain-of-thought (CoT) reasoning).

ReAct Reasoning Framework
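For intuition, here is a heavily stripped-down sketch of what a ReAct-style reason-act-observe loop can look like in plain Python. The llm callable, the tools registry, and the "Action:"/"Final Answer:" prompt format are all hypothetical stand-ins rather than any particular framework's API:

import re

def parse_action(step: str):
    """Parse "Action: ToolName(tool input)" out of the model's response."""
    match = re.search(r"Action:\s*(\w+)\((.*)\)", step)
    if not match:
        raise ValueError(f"Could not parse an action from: {step!r}")
    return match.group(1), match.group(2)

def run_react_agent(llm, tools, user_input: str, max_steps: int = 5) -> str:
    """Minimal ReAct-style loop: reason, act, observe, repeat until a final answer."""
    history = f"Question: {user_input}\n"
    for _ in range(max_steps):
        # Reason: ask the model for its next thought plus either an action or a final answer.
        step = llm(
            "Think step by step. Reply with 'Action: <tool>(<input>)' "
            "or 'Final Answer: <answer>'.\n" + history
        )
        history += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        # Act: invoke the chosen tool with the parsed input.
        tool_name, tool_input = parse_action(step)
        observation = tools[tool_name](tool_input)
        # Observe: feed the tool's result back into the next reasoning step.
        history += f"Observation: {observation}\n"
    return "Stopped after reaching the maximum number of reasoning steps."

Real agent frameworks add structured tool schemas, retries, and guardrails on top of this loop, but the reason-act-observe skeleton is the part your evaluation needs to see.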

Evaluating these intermediate reasoning steps is crucial, as it sheds light on why your agent might struggle to consistently select the correct tools or fall into infinite loops. An agent’s reasoning engine underpins all its decision-making, so ensuring its logic is sound is incredibly important.

So far, we’ve examined how tool-calling, workflows, and reasoning distinguish LLM agents, necessitating their own set of evaluation tools and metrics.

However, it’s important to remember that an LLM agent is still fundamentally an LLM application. As such, it is subject to the same challenges and limitations as any “normal” LLM application. To build the best version of your LLM agent, you’ll need to evaluate it using general-purpose LLM metrics in addition to agent-specific ones. (If you’re new to LLM evaluation, this article on essential LLM evaluation metrics is a great starting point.)

In the next sections, I’ll take a deep dive into the 3 key aspects of agent evaluation we briefly discussed earlier: Tool-Calling Evaluation, Agentic Workflow Evaluation, and Agentic Reasoning Evaluation. By examining relevant metrics and sharing practical examples, I’ll demonstrate why these evaluations are crucial to your LLM agent evaluation pipeline.


Tool-Calling Evaluation

Tool-Calling Evaluation focuses on two critical aspects: Tool Correctness, which determines if the correct tools were called, and Tool-Calling Efficiency, which evaluates whether the tools were used in the most efficient way to achieve the desired results.

Tool Correctness

Tool Correctness assesses whether an agent’s tool-calling behavior aligns with expectations by verifying that all required tools were correctly called. Unlike most LLM evaluation metrics, Tool Correctness is a deterministic measure rather than an LLM-as-a-judge metric.

Tool Correctness Metric

At its most basic level, evaluating tool selection itself is sufficient. But more often than not, you’ll also want to assess the Input Parameters passed into these tools and the Output Accuracy of the results they generate:

  1. Tool Selection: Comparing the tools the agent calls to the ideal set of tools required for a given user input.
  2. Input Parameters: Evaluating the accuracy of the input parameters passed into the tools against ground truth references.
  3. Output Accuracy: Verifying the generated outputs of the tools against the expected ground truth.

It’s important to note that these parameters represent levels of strictness rather than distinct metrics, as evaluating input parameters and output accuracy depends on the correct tools being called. If the wrong tools are used, evaluating these parameters becomes irrelevant.

Furthermore, the Tool Correctness score doesn’t have to be binary or require exact matching:

  • Order Independence: The order of tool calls may not matter as long as all necessary tools are used. In such cases, evaluation can focus on comparing sets of tools rather than exact sequences.
  • Frequency Flexibility: The number of times each tool is called may be less significant than ensuring the correct tools are selected and used effectively.

These considerations all depend on your evaluation criteria, which are strongly tied to your LLM agent’s use case. For example, a medical AI agent responsible for diagnosing a patient might query the “patient symptom checker” tool after retrieving data from the “medical history database” tool, rather than in the reverse order. As long as both tools are used correctly and all relevant information is accounted for, the diagnosis could still be accurate.
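Here’s how that might look in DeepEval, using its ToolCorrectnessMetric: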


from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCallParams, ToolCall

# The agent called an extra QueryTool that the expected tool set doesn't include.
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    tools_called=[ToolCall(name="WebSearchTool"), ToolCall(name="QueryTool")],
    expected_tools=[ToolCall(name="WebSearchTool")],
)

# Evaluate both tool selection and input parameters, and require the correct ordering.
metric = ToolCorrectnessMetric(
    evaluation_params=[ToolCallParams.TOOL, ToolCallParams.INPUT_PARAMETERS],
    should_consider_ordering=True,
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

The same flexibility in scoring applies to Input Parameters and Output Accuracy. If a tool requires multiple input parameters, you might calculate the percentage of correct parameters rather than demand an exact match. Similarly, if the output is a numerical value, you could measure its percentage deviation from the expected result.
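As a rough illustration of that kind of partial credit, you could score a single tool call with a couple of helper functions like the ones below. These are hypothetical helpers for illustration, not part of DeepEval:

def parameter_match_score(expected_params: dict, actual_params: dict) -> float:
    """Fraction of expected input parameters that the agent passed correctly."""
    if not expected_params:
        return 1.0
    correct = sum(
        1 for key, value in expected_params.items() if actual_params.get(key) == value
    )
    return correct / len(expected_params)

def numeric_output_score(expected: float, actual: float, tolerance: float = 0.05) -> float:
    """Score a numerical tool output by its relative deviation from the expected value."""
    if expected == 0:
        return 1.0 if actual == 0 else 0.0
    deviation = abs(actual - expected) / abs(expected)
    return 1.0 if deviation <= tolerance else max(0.0, 1.0 - deviation)

# Example: 2 of 3 parameters correct, and an output within 5% of the reference value.
print(parameter_match_score({"city": "Paris", "days": 3, "budget": 500},
                            {"city": "Paris", "days": 3, "budget": 450}))  # ~0.67
print(numeric_output_score(expected=120.0, actual=118.0))                  # 1.0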

Ultimately, your definition of the Tool Correctness metric should align with your evaluation criteria and use case to ensure it effectively reflects the desired outcomes.

Tool Efficiency

Equally important to tool correctness is tool efficiency. Inefficient tool-calling patterns can increase response times, frustrate users, and significantly raise operational costs.

Think about it: imagine a chatbot helping you book a flight. If it first checks the weather, then converts currency, and only afterward searches for flights, it’s taking an unnecessarily convoluted route. Sure, it might get the job done eventually, but wouldn’t it be far better if it went straight to the flight API?

Let’s explore how tool efficiency can be evaluated, starting with deterministic methods (a minimal sketch follows the list):

  1. Redundant Tool Usage measures how many tools are invoked unnecessarily — those that do not directly contribute to achieving the intended outcome. This can be calculated as the percentage of unnecessary tools relative to the total number of tool invocations.
  2. Tool Frequency evaluates whether tools are being called more often than necessary. This method penalizes tools that exceed a predefined threshold for the number of calls required to complete a task (often this threshold is just 1).
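Here’s a minimal sketch of these two deterministic checks, assuming you log each tool call’s name and can specify which tools the task actually needs. These are illustrative helpers, not DeepEval APIs:

from collections import Counter

def redundant_tool_usage(tools_called: list[str], necessary_tools: set[str]) -> float:
    """Fraction of tool invocations that were unnecessary for the task."""
    if not tools_called:
        return 0.0
    redundant = sum(1 for name in tools_called if name not in necessary_tools)
    return redundant / len(tools_called)

def tool_frequency_violations(tools_called: list[str], max_calls: dict[str, int]) -> dict[str, int]:
    """Tools that exceeded their allowed number of calls (often just 1 per task)."""
    counts = Counter(tools_called)
    return {
        name: counts[name] - limit
        for name, limit in max_calls.items()
        if counts[name] > limit
    }

calls = ["WeatherTool", "CurrencyTool", "FlightSearchTool", "FlightSearchTool"]
print(redundant_tool_usage(calls, necessary_tools={"FlightSearchTool"}))    # 0.5
print(tool_frequency_violations(calls, max_calls={"FlightSearchTool": 1}))  # {'FlightSearchTool': 1}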

While these deterministic metrics provide a solid foundation, evaluating tool efficiency for more complex LLM agents can be challenging. Tool-calling behavior in such agents can quickly become branched, nested, and convoluted (trust me, I’ve tried).

A more flexible approach is to use an LLM as a judge. DeepEval’s method, for example, first extracts the user’s goal (the task the agent needs to accomplish), then evaluates the tool-calling trajectory (the tools called, including each tool’s name, description, input parameters, and output) against a provided list of available tools to determine whether that trajectory was the most efficient way to accomplish the goal.


from deepeval.metrics import ToolEfficiencyMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    tools_called=[ToolCall(name="WebSearchTool")],
)

# The metric judges the tool-calling trajectory against the full set of available tools.
metric = ToolEfficiencyMetric(
    available_tools=[ToolCall(name="WebSearchTool"), ToolCall(name="QueryTool")],
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

This metric not only simplifies efficiency calculation but also avoids the need for rigid specifications, such as a fixed number of tool calls. Instead, it evaluates efficiency based on the tools available and their relevance to the task at hand.


Agentic Workflow Evaluation

While tool-calling metrics are essential for assessing LLM agents, they focus only on tool usage. Effective evaluation requires a broader perspective — one that examines the agent’s entire workflow.

This includes assessing the full process: from the initial user input, through the reasoning steps and tool interactions, to the final response provided to the user.

Task Completion

A critical metric for assessing agent workflows is Task Completion (also known as task success or goal accuracy). This metric measures how effectively an LLM agent completes a user-given task. The definition of “task completion” can vary significantly depending on the task’s context.

Consider AgentBench, which was the first benchmarking tool designed to evaluate the ability of LLMs to act as agents. It tests LLMs across eight distinct environments, each with unique task completion criteria, including:

AgentBench Tasks
  • Digital Card Game: Here, the task-completion criterion is clear and objective — the agent’s goal is to win the game. The corresponding metric is the win rate: the proportion of games the agent wins.
  • Web Shopping: Here, task completion is less straightforward. AgentBench uses a custom metric to evaluate the product purchased by the agent against the ideal product, considering factors such as price similarity and attribute similarity, the latter determined through text matching (sketched below).
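To make that concrete, here’s a rough sketch of how such a custom task-completion score might combine price and attribute similarity. This is purely illustrative (including the 50/50 weighting), not AgentBench’s actual scoring code:

def web_shopping_reward(purchased: dict, ideal: dict) -> float:
    """Illustrative shopping-task score: blend price similarity with attribute overlap."""
    # Price similarity: 1.0 when prices match, decaying with the relative difference.
    price_sim = max(0.0, 1.0 - abs(purchased["price"] - ideal["price"]) / ideal["price"])
    # Attribute similarity: fraction of ideal attributes mentioned in the purchased item.
    ideal_attrs = {attr.lower() for attr in ideal["attributes"]}
    purchased_text = " ".join(purchased["attributes"]).lower()
    attr_sim = sum(1 for attr in ideal_attrs if attr in purchased_text) / len(ideal_attrs)
    return 0.5 * price_sim + 0.5 * attr_sim

print(web_shopping_reward(
    purchased={"price": 45.0, "attributes": ["leather", "brown", "size 10"]},
    ideal={"price": 50.0, "attributes": ["leather", "size 10"]},
))  # 0.95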

Custom metrics like these are highly effective when the scope of tasks is limited and accompanied by a large dataset with ground-truth labels. However, in real-world applications, agents are often required to perform a diverse set of tasks—many of which may lack predefined ground-truth datasets.

For example, an LLM agent equipped with tools like a web browser can perform virtually unlimited web-based tasks. In such cases, collecting and evaluating interactions in production becomes impractical, as ground-truth references cannot be defined for every possible task. This complexity necessitates a more adaptable and scalable evaluation framework.

DeepEval’s Task Completion metric addresses these challenges by leveraging LLMs to:

  1. Determine the task from the user’s input.
  2. Analyze the reasoning steps, tool usage, and final response to assess whether the task was successfully completed.

from deepeval import evaluate
from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

metric = TaskCompletionMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.",
    actual_output=(
        "Day 1: Eiffel Tower, dinner at Le Jules Verne. "
        "Day 2: Louvre Museum, lunch at Angelina Paris. "
        "Day 3: Montmartre, evening at a wine bar."
    ),
    tools_called=[
        ToolCall(
            name="Itinerary Generator",
            description="Creates travel plans based on destination and duration.",
            input_parameters={"destination": "Paris", "days": 3},
            output=[
                "Day 1: Eiffel Tower, Le Jules Verne.",
                "Day 2: Louvre Museum, Angelina Paris.",
                "Day 3: Montmartre, wine bar.",
            ],
        ),
        ToolCall(
            name="Restaurant Finder",
            description="Finds top restaurants in a city.",
            input_parameters={"city": "Paris"},
            output=["Le Jules Verne", "Angelina Paris", "local wine bars"],
        ),
    ],
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

With this approach, you no longer need to rely on predefined ground-truth datasets or rigid custom criteria. Instead, DeepEval gives you the flexibility to evaluate tasks of all kinds.

G-Eval for Custom Agent Evaluation

Sometimes, you’ll want to evaluate something specific about your LLM agent. G-Eval is a framework that leverages LLMs with chain-of-thought (CoT) reasoning to evaluate outputs based on ANY custom criteria.

This means you can define custom metrics in natural language to assess your agent’s workflow.

Consider a Restaurant Booking Assistant. A common issue might arise where the agent tells the user, “The restaurant is fully booked,” but leaves out important context, such as whether it checked alternative dates or nearby restaurants. For users, this can feel incomplete or unhelpful. To ensure the output reflects the full scope of the agent’s efforts and improves user experience, you could define custom evaluation criteria with G-Eval, such as:


from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

transparency_metric = GEval(
    name="Transparency",
    criteria="Determine whether the tool invocation information is captured in the actual output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.TOOLS_CALLED],
)
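You can then measure a test case with this metric just like any other DeepEval metric. The booking scenario below is made up for illustration:

from deepeval.test_case import LLMTestCase, ToolCall

booking_test_case = LLMTestCase(
    input="Book a table for two at Le Petit Bistro tonight at 7pm.",
    actual_output="Le Petit Bistro is fully booked tonight.",
    tools_called=[
        ToolCall(
            name="Reservation Checker",
            description="Checks table availability for a restaurant and time.",
            input_parameters={"restaurant": "Le Petit Bistro", "time": "19:00"},
            output="No availability at 19:00; openings at 21:00 and at nearby Chez Anna.",
        )
    ],
)

transparency_metric.measure(booking_test_case)
print(transparency_metric.score)   # likely low: the alternatives the tool found are omitted
print(transparency_metric.reason)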


Agentic Reasoning Evaluation

We’ve all seen benchmarks like MMLU and reasoning tasks such as BoolQ being used to test an LLM’s ability to handle mathematical, commonsense, and causal reasoning. While these benchmarks are useful, they often assume that a model’s reasoning skills are entirely dependent on its inherent capabilities. But in practice, that’s rarely the whole story.

In real-world scenarios, your LLM agent’s reasoning is shaped by much more than just the model itself. Things like the prompt template (e.g., chain-of-thought reasoning), tool usage, and the agent’s architecture all play critical roles. Testing the model in isolation might give you a starting point, but it won’t tell you how well your agent performs in real-world workflows where these factors come into play.

On top of that, you need to think about your agent’s specific domain. Every task and workflow is different, and tailoring evaluations to your unique use case is the best way to ensure your agent’s reasoning is both accurate and useful.

Here are a few metrics you can use to evaluate agent-specific reasoning (a G-Eval sketch follows the list):

  • Reasoning Relevancy: Is the reasoning behind each tool call clearly tied to what the user is asking for? For example, if the agent queries a restaurant database, it should make sense why it’s doing that — it’s checking availability because the user requested it.
  • Reasoning Coherence: Does the reasoning follow a logical, step-by-step process? Each step should add value and make sense in the context of the task.
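Neither of these ships as a built-in DeepEval metric, but since G-Eval accepts arbitrary natural-language criteria, you can sketch both as custom metrics. The example below assumes the agent’s reasoning is reflected in the recorded tool calls and final output; adapt the evaluation parameters to wherever your agent actually logs its reasoning:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

reasoning_relevancy = GEval(
    name="Reasoning Relevancy",
    criteria=(
        "Determine whether the reasoning behind each tool call in tools_called "
        "is clearly tied to what the user is asking for in the input."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.TOOLS_CALLED],
)

reasoning_coherence = GEval(
    name="Reasoning Coherence",
    criteria=(
        "Determine whether the agent's steps, as reflected in tools_called and the "
        "actual output, follow a logical, step-by-step progression toward the task."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.TOOLS_CALLED,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)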

Conclusion

Can’t believe you made it all the way here! Congratulations on becoming an expert in evaluating LLM agents. To recap, LLM agents stand out from regular LLM applications due to their ability to call tools and perform reasoning.

This means we need to evaluate tool-calling, reasoning steps, and entire agentic workflows that combine these capabilities. Fortunately, DeepEval provides these metrics out of the box and ready to use.

Don’t forget that you’ll also need to focus on other aspects beyond just agentic metrics to get a comprehensive evaluation. Thanks for reading, and don’t forget to give ⭐ DeepEval a star on GitHub ⭐ if you found this article useful.

* * * * *

Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our Discord. I might give you an “aha!” moment, who knows?
