Kritin Vongthongsri
Cofounder @ Confident AI | Empowering LLM practitioners with Evals | Previously AI/ML at Fintech Startup | ML + CS @ Princeton

LLM Agent Evaluation: Assessing Tool Use, Task Completion, Agentic Reasoning, and More

January 28, 2025
·
14 min read
LLM agents suck. I spent the past week building a web-crawling LLM agent using a popular Python framework to scrape information on potential leads off the internet. It was a complete letdown.

The agent was slow, inconsistent, and riddled with issues (sounds familiar, Operator @ OpenAI?). It kept making unnecessary function calls and would occasionally get stuck in infinite reasoning loops that made no sense at all. Eventually, I scrapped it for a simple web-scraping script that took 30 minutes to code.

Don’t get me wrong — I’m a huge advocate for LLM agents and fully believe in their potential. But let’s face it: building an effective agent is no easy task. There are countless things that can go wrong, and even minor bottlenecks can make or break your entire user experience.

That said, it’s not all doom and gloom. If you’re able to identify these bottlenecks and implement the right fixes, the possibilities for automation are endless. The key is knowing how to evaluate and improve your LLM agent correctly and effectively.

Fortunately for you, over the past year, I’ve helped hundreds of companies test and refine their agents, pored over every LLM agent benchmark I could find (including new ones popping up constantly), and built an army of agents myself. And today, I’ll walk you through everything you need to know about LLM agent evaluation.

LLM Agent Evaluation vs. LLM Evaluation

To understand how LLM Agent evaluation differs from “traditional” LLM evaluation, it’s important to first establish what makes these LLM Agents unique.

Simple LLM Agent Architecture
  1. LLM Agents can invoke tools and call APIs.
  2. LLM Agents can be highly autonomous.
  3. LLM Agents are powered by reasoning frameworks.

Here’s how these agent-specific attributes shape the way we evaluate LLM Agents.

1. LLM Agents can invoke tools and call APIs

Perhaps the most notable characteristic of LLM Agents is their ability to call and invoke “tools,” such as APIs or functions that can interact with the real world — updating databases, buying stocks, reserving restaurants, and even web scraping.

For obvious reasons, this is fantastic if you’ve perfected your agent engineering. But chances are, there’s behavior you don’t know about: maybe your agent prefers booking certain restaurants on specific days, or perhaps it occasionally calls 10 other totally unrelated tools before performing a simple web-scraping task.

Such complexities and potential errors in tool calling — whether calling the right tools, using the correct tool input parameters, or generating the correct tool output — make Tool Calling Metrics essential to evaluating LLM agents.

2. LLM Agents are significantly more autonomous

LLM Agents operate with a much higher level of autonomy. While a traditional LLM application typically generates a single response to a single user input, an LLM Agent might take multiple reasoning steps, make several tool calls, and only then respond.

As you can imagine, this shift slightly complicates the evaluation process. While a single “test case” might previously have been as simple as an input-output pair, it now also includes the intermediate reasoning steps and the tools called.

Example agent workflow that includes tool-calling and reasoning steps

These intricate workflows (the combination of tool calls, reasoning steps, and agent responses) aren’t so easily captured by traditional RAG metrics like Answer Relevancy, which don’t account for tool calls or reasoning steps. They call for newer metrics tailored to these agentic thought processes.

That’s not to say you shouldn’t be using RAG metrics to evaluate your LLM Agent — quite the opposite, actually. You’ll absolutely want to use them, especially if your agent retrieves information from a knowledge-base (in fact, here’s an excellent guide on RAG evaluation). But you’ll also need additional metrics that are specifically tailored to evaluating agent workflows, which I’ll dive further into in the sections below.

3. LLM Agents are powered by reasoning frameworks.

Finally, LLM agents don’t just act — they reason. Before taking any action, such as calling a tool or crafting a response, agents deliberate on why that action is the appropriate next step. This reasoning process is influenced by various factors, including the underlying LLM and the prompt template (e.g., instructing the agent to perform chain-of-thought (CoT) reasoning).

ReAct Reasoning Framework
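For intuition, here is a heavily stripped-down sketch of what a ReAct-style reason-act-observe loop can look like in plain Python. The llm callable, the tools registry, and the "Action:"/"Final Answer:" prompt format are all hypothetical stand-ins rather than any particular framework's API:

import re

def parse_action(step: str):
    """Parse "Action: ToolName(tool input)" out of the model's response."""
    match = re.search(r"Action:\s*(\w+)\((.*)\)", step)
    if not match:
        raise ValueError(f"Could not parse an action from: {step!r}")
    return match.group(1), match.group(2)

def run_react_agent(llm, tools, user_input: str, max_steps: int = 5) -> str:
    """Minimal ReAct-style loop: reason, act, observe, repeat until a final answer."""
    history = f"Question: {user_input}\n"
    for _ in range(max_steps):
        # Reason: ask the model for its next thought plus either an action or a final answer.
        step = llm(
            "Think step by step. Reply with 'Action: <tool>(<input>)' "
            "or 'Final Answer: <answer>'.\n" + history
        )
        history += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        # Act: invoke the chosen tool with the parsed input.
        tool_name, tool_input = parse_action(step)
        observation = tools[tool_name](tool_input)
        # Observe: feed the tool's result back into the next reasoning step.
        history += f"Observation: {observation}\n"
    return "Stopped after reaching the maximum number of reasoning steps."

Real agent frameworks add structured tool schemas, retries, and guardrails on top of this loop, but the reason-act-observe skeleton is the part your evaluation needs to see.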

Evaluating these intermediate reasoning steps is crucial, as it sheds light on why your agent might struggle to consistently select the correct tools or fall into infinite loops. An agent’s reasoning engine underpins all its decision-making, so ensuring its logic is sound is incredibly important.

So far, we’ve examined how tool-calling, workflows, and reasoning distinguish LLM agents, necessitating their own set of evaluation tools and metrics.

However, it’s important to remember that an LLM agent is still fundamentally an LLM application. As such, it is subject to the same challenges and limitations as any “normal” LLM application. To build the best version of your LLM agent, you’ll need to evaluate it using general-purpose LLM metrics in addition to agent-specific ones. (If you’re new to LLM evaluation, this article on essential LLM evaluation metrics is a great starting point.)

In the next sections, I’ll take a deep dive into the 3 key aspects of agent evaluation we briefly discussed earlier: Tool-Calling Evaluation, Agentic Workflow Evaluation, and Agentic Reasoning Evaluation. By examining relevant metrics and sharing practical examples, I’ll demonstrate why these evaluations are crucial to your LLM agent evaluation pipeline.


Tool-Calling Evaluation

Tool-Calling Evaluation focuses on two critical aspects: Tool Correctness, which determines if the correct tools were called, and Tool-Calling Efficiency, which evaluates whether the tools were used in the most efficient way to achieve the desired results.

Tool Correctness

Tool Correctness assesses whether an agent’s tool-calling behavior aligns with expectations by verifying that all required tools were correctly called. Unlike most LLM evaluation metrics, Tool Correctness is a deterministic measure rather than an LLM-as-a-judge metric.

Tool Correctness Metric

At its most basic level, evaluating tool selection itself is sufficient. But more often than not, you’ll also want to assess the Input Parameters passed into these tools and the Output Accuracy of the results they generate:

  1. Tool Selection: Comparing the tools the agent calls to the ideal set of tools required for a given user input.
  2. Input Parameters: Evaluating the accuracy of the input parameters passed into the tools against ground truth references.
  3. Output Accuracy: Verifying the generated outputs of the tools against the expected ground truth.

It’s important to note that these parameters represent levels of strictness rather than distinct metrics, as evaluating input parameters and output accuracy depends on the correct tools being called. If the wrong tools are used, evaluating these parameters becomes irrelevant.

Furthermore, the Tool Correctness score doesn’t have to be binary or require exact matching:

  • Order Independence: The order of tool calls may not matter as long as all necessary tools are used. In such cases, evaluation can focus on comparing sets of tools rather than exact sequences.
  • Frequency Flexibility: The number of times each tool is called may be less significant than ensuring the correct tools are selected and used effectively.

These considerations all depend on your evaluation criteria, which are strongly tied to your LLM agent’s use case. For example, a medical AI agent responsible for diagnosing a patient might query the “patient symptom checker” tool after retrieving data from the “medical history database” tool, rather than in the reverse order. As long as both tools are used correctly and all relevant information is accounted for, the diagnosis could still be accurate.
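Here’s how that might look in DeepEval, using its ToolCorrectnessMetric: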


from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCallParams, ToolCall

# The agent called an extra QueryTool that the expected tool set doesn't include.
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    tools_called=[ToolCall(name="WebSearchTool"), ToolCall(name="QueryTool")],
    expected_tools=[ToolCall(name="WebSearchTool")],
)

# Evaluate both tool selection and input parameters, and require the correct ordering.
metric = ToolCorrectnessMetric(
    evaluation_params=[ToolCallParams.TOOL, ToolCallParams.INPUT_PARAMETERS],
    should_consider_ordering=True,
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

The same flexibility in scoring applies to Input Parameters and Output Accuracy. If a tool requires multiple input parameters, you might calculate the percentage of correct parameters rather than demand an exact match. Similarly, if the output is a numerical value, you could measure its percentage deviation from the expected result.
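As a rough illustration of that kind of partial credit, you could score a single tool call with a couple of helper functions like the ones below. These are hypothetical helpers for illustration, not part of DeepEval:

def parameter_match_score(expected_params: dict, actual_params: dict) -> float:
    """Fraction of expected input parameters that the agent passed correctly."""
    if not expected_params:
        return 1.0
    correct = sum(
        1 for key, value in expected_params.items() if actual_params.get(key) == value
    )
    return correct / len(expected_params)

def numeric_output_score(expected: float, actual: float, tolerance: float = 0.05) -> float:
    """Score a numerical tool output by its relative deviation from the expected value."""
    if expected == 0:
        return 1.0 if actual == 0 else 0.0
    deviation = abs(actual - expected) / abs(expected)
    return 1.0 if deviation <= tolerance else max(0.0, 1.0 - deviation)

# Example: 2 of 3 parameters correct, and an output within 5% of the reference value.
print(parameter_match_score({"city": "Paris", "days": 3, "budget": 500},
                            {"city": "Paris", "days": 3, "budget": 450}))  # ~0.67
print(numeric_output_score(expected=120.0, actual=118.0))                  # 1.0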

Ultimately, your definition of the Tool Correctness metric should align with your evaluation criteria and use case to ensure it effectively reflects the desired outcomes.

Tool Efficiency

Equally important to tool correctness is tool efficiency. Inefficient tool-calling patterns can increase response times, frustrate users, and significantly raise operational costs.

Think about it: imagine a chatbot helping you book a flight. If it first checks the weather, then converts currency, and only afterward searches for flights, it’s taking an unnecessarily convoluted route. Sure, it might get the job done eventually, but wouldn’t it be far better if it went straight to the flight API?

Let’s explore how tool efficiency can be evaluated, starting with deterministic methods (a minimal sketch follows the list):

  1. Redundant Tool Usage measures how many tools are invoked unnecessarily — those that do not directly contribute to achieving the intended outcome. This can be calculated as the percentage of unnecessary tools relative to the total number of tool invocations.
  2. Tool Frequency evaluates whether tools are being called more often than necessary. This method penalizes tools that exceed a predefined threshold for the number of calls required to complete a task (often this threshold is just 1).
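Here’s a minimal sketch of these two deterministic checks, assuming you log each tool call’s name and can specify which tools the task actually needs. These are illustrative helpers, not DeepEval APIs:

from collections import Counter

def redundant_tool_usage(tools_called: list[str], necessary_tools: set[str]) -> float:
    """Fraction of tool invocations that were unnecessary for the task."""
    if not tools_called:
        return 0.0
    redundant = sum(1 for name in tools_called if name not in necessary_tools)
    return redundant / len(tools_called)

def tool_frequency_violations(tools_called: list[str], max_calls: dict[str, int]) -> dict[str, int]:
    """Tools that exceeded their allowed number of calls (often just 1 per task)."""
    counts = Counter(tools_called)
    return {
        name: counts[name] - limit
        for name, limit in max_calls.items()
        if counts[name] > limit
    }

calls = ["WeatherTool", "CurrencyTool", "FlightSearchTool", "FlightSearchTool"]
print(redundant_tool_usage(calls, necessary_tools={"FlightSearchTool"}))    # 0.5
print(tool_frequency_violations(calls, max_calls={"FlightSearchTool": 1}))  # {'FlightSearchTool': 1}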

While these deterministic metrics provide a solid foundation, evaluating tool efficiency for more complex LLM agents can be challenging. Tool-calling behavior in such agents can quickly become branched, nested, and convoluted (trust me, I’ve tried).

A more flexible approach is to use an LLM as a judge. DeepEval’s method, for example, first extracts the user’s goal (the task the agent needs to accomplish), then evaluates the tool-calling trajectory (the tools called, including each tool’s name, description, input parameters, and output) against a provided list of available tools to determine whether that trajectory was the most efficient way to accomplish the goal.


from deepeval.metrics import ToolEfficiencyMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    tools_called=[ToolCall(name="WebSearchTool")],
)

# The metric judges the tool-calling trajectory against the full set of available tools.
metric = ToolEfficiencyMetric(
    available_tools=[ToolCall(name="WebSearchTool"), ToolCall(name="QueryTool")],
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

This metric not only simplifies efficiency calculation but also avoids the need for rigid specifications, such as a fixed number of tool calls. Instead, it evaluates efficiency based on the tools available and their relevance to the task at hand.


Agentic Workflow Evaluation

While tool-calling metrics are essential for assessing LLM agents, they focus only on tool usage. Effective evaluation requires a broader perspective — one that examines the agent’s entire workflow.

This includes assessing the full process: from the initial user input, through the reasoning steps and tool interactions, to the final response provided to the user.

Task Completion

A critical metric for assessing agent workflows is Task Completion (also known as task success or goal accuracy). This metric measures how effectively an LLM agent completes a user-given task. The definition of “task completion” can vary significantly depending on the task’s context.

Consider AgentBench, which was the first benchmarking tool designed to evaluate the ability of LLMs to act as agents. It tests LLMs across eight distinct environments, each with unique task completion criteria, including:

AgentBench Tasks
  • Digital Card Game: Here, the task-completion criterion is clear and objective — the agent’s goal is to win the game. The corresponding metric is the win rate: the proportion of games the agent wins.
  • Web Shopping: Here, task completion is less straightforward. AgentBench uses a custom metric to evaluate the product purchased by the agent against the ideal product, considering factors such as price similarity and attribute similarity, the latter determined through text matching (sketched below).
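To make that concrete, here’s a rough sketch of how such a custom task-completion score might combine price and attribute similarity. This is purely illustrative (including the 50/50 weighting), not AgentBench’s actual scoring code:

def web_shopping_reward(purchased: dict, ideal: dict) -> float:
    """Illustrative shopping-task score: blend price similarity with attribute overlap."""
    # Price similarity: 1.0 when prices match, decaying with the relative difference.
    price_sim = max(0.0, 1.0 - abs(purchased["price"] - ideal["price"]) / ideal["price"])
    # Attribute similarity: fraction of ideal attributes mentioned in the purchased item.
    ideal_attrs = {attr.lower() for attr in ideal["attributes"]}
    purchased_text = " ".join(purchased["attributes"]).lower()
    attr_sim = sum(1 for attr in ideal_attrs if attr in purchased_text) / len(ideal_attrs)
    return 0.5 * price_sim + 0.5 * attr_sim

print(web_shopping_reward(
    purchased={"price": 45.0, "attributes": ["leather", "brown", "size 10"]},
    ideal={"price": 50.0, "attributes": ["leather", "size 10"]},
))  # 0.95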

Custom metrics like these are highly effective when the scope of tasks is limited and accompanied by a large dataset with ground-truth labels. However, in real-world applications, agents are often required to perform a diverse set of tasks—many of which may lack predefined ground-truth datasets.

For example, an LLM agent equipped with tools like a web browser can perform virtually unlimited web-based tasks. In such cases, collecting and evaluating interactions in production becomes impractical, as ground-truth references cannot be defined for every possible task. This complexity necessitates a more adaptable and scalable evaluation framework.

DeepEval’s Task Completion metric addresses these challenges by leveraging LLMs to:

  1. Determine the task from the user’s input.
  2. Analyze the reasoning steps, tool usage, and final response to assess whether the task was successfully completed.

from deepeval import evaluate
from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

metric = TaskCompletionMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.",
    actual_output=(
        "Day 1: Eiffel Tower, dinner at Le Jules Verne. "
        "Day 2: Louvre Museum, lunch at Angelina Paris. "
        "Day 3: Montmartre, evening at a wine bar."
    ),
    tools_called=[
        ToolCall(
            name="Itinerary Generator",
            description="Creates travel plans based on destination and duration.",
            input_parameters={"destination": "Paris", "days": 3},
            output=[
                "Day 1: Eiffel Tower, Le Jules Verne.",
                "Day 2: Louvre Museum, Angelina Paris.",
                "Day 3: Montmartre, wine bar.",
            ],
        ),
        ToolCall(
            name="Restaurant Finder",
            description="Finds top restaurants in a city.",
            input_parameters={"city": "Paris"},
            output=["Le Jules Verne", "Angelina Paris", "local wine bars"],
        ),
    ],
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

With this approach, you no longer need to rely on predefined ground-truth datasets or rigid custom criteria. Instead, DeepEval gives you the flexibility to evaluate tasks of all kinds.

G-Eval for Custom Agent Evaluation

Sometimes, you’ll want to evaluate something specific about your LLM agent. G-Eval is a framework that leverages LLMs with chain-of-thought (CoT) reasoning to evaluate outputs based on ANY custom criteria.

This means you can define custom metrics in natural language to assess your agent’s workflow.

Consider a Restaurant Booking Assistant. A common issue might arise where the agent tells the user, “The restaurant is fully booked,” but leaves out important context, such as whether it checked alternative dates or nearby restaurants. For users, this can feel incomplete or unhelpful. To ensure the output reflects the full scope of the agent’s efforts and improves user experience, you could define custom evaluation criteria with G-Eval, such as:


from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

transparency_metric = GEval(
    name="Transparency",
    criteria="Determine whether the tool invocation information is captured in the actual output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.TOOLS_CALLED],
)
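You can then measure a test case with this metric just like any other DeepEval metric. The booking scenario below is made up for illustration:

from deepeval.test_case import LLMTestCase, ToolCall

booking_test_case = LLMTestCase(
    input="Book a table for two at Le Petit Bistro tonight at 7pm.",
    actual_output="Le Petit Bistro is fully booked tonight.",
    tools_called=[
        ToolCall(
            name="Reservation Checker",
            description="Checks table availability for a restaurant and time.",
            input_parameters={"restaurant": "Le Petit Bistro", "time": "19:00"},
            output="No availability at 19:00; openings at 21:00 and at nearby Chez Anna.",
        )
    ],
)

transparency_metric.measure(booking_test_case)
print(transparency_metric.score)   # likely low: the alternatives the tool found are omitted
print(transparency_metric.reason)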


Agentic Reasoning Evaluation

We’ve all seen benchmarks like MMLU and reasoning tasks such as BoolQ being used to test an LLM’s ability to handle mathematical, commonsense, and causal reasoning. While these benchmarks are useful, they often assume that a model’s reasoning skills are entirely dependent on its inherent capabilities. But in practice, that’s rarely the whole story.

In real-world scenarios, your LLM agent’s reasoning is shaped by much more than just the model itself. Things like the prompt template (e.g., chain-of-thought reasoning), tool usage, and the agent’s architecture all play critical roles. Testing the model in isolation might give you a starting point, but it won’t tell you how well your agent performs in real-world workflows where these factors come into play.

On top of that, you need to think about your agent’s specific domain. Every task and workflow is different, and tailoring evaluations to your unique use case is the best way to ensure your agent’s reasoning is both accurate and useful.

Here are a few metrics you can use to evaluate agent-specific reasoning (a G-Eval sketch follows the list):

  • Reasoning Relevancy: Is the reasoning behind each tool call clearly tied to what the user is asking for? For example, if the agent queries a restaurant database, it should make sense why it’s doing that — it’s checking availability because the user requested it.
  • Reasoning Coherence: Does the reasoning follow a logical, step-by-step process? Each step should add value and make sense in the context of the task.
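Neither of these ships as a built-in DeepEval metric, but since G-Eval accepts arbitrary natural-language criteria, you can sketch both as custom metrics. The example below assumes the agent’s reasoning is reflected in the recorded tool calls and final output; adapt the evaluation parameters to wherever your agent actually logs its reasoning:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

reasoning_relevancy = GEval(
    name="Reasoning Relevancy",
    criteria=(
        "Determine whether the reasoning behind each tool call in tools_called "
        "is clearly tied to what the user is asking for in the input."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.TOOLS_CALLED],
)

reasoning_coherence = GEval(
    name="Reasoning Coherence",
    criteria=(
        "Determine whether the agent's steps, as reflected in tools_called and the "
        "actual output, follow a logical, step-by-step progression toward the task."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.TOOLS_CALLED,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)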

Conclusion

Can’t believe you made it all the way here! Congratulations on becoming an expert in evaluating LLM agents. To recap, LLM agents stand out from regular LLM applications due to their ability to call tools and perform reasoning.

This means we need to evaluate tool-calling, reasoning steps, and entire agentic workflows that combine these capabilities. Fortunately, DeepEval provides these metrics out of the box and ready to use.

Don’t forget that you’ll also need to focus on other aspects beyond just agentic metrics to get a comprehensive evaluation. Thanks for reading, and don’t forget to give ⭐ DeepEval a star on GitHub ⭐ if you found this article useful.

* * * * *

Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our Discord. I might give you an “aha!” moment, who knows?
