Confident AI Blog - Resources to help teams stay confident in AI


Three Ways AI Systems Fail Even When Evals Pass

AI systems can pass evals while still behaving incorrectly. This post explores three common failure modes that slip through output-based evaluation.

Brian Neville-O'Neill

Apr 7, 2026 · 12 min read
Your AI Agent Passed Evals. That’s the Problem.

Passing evals doesn’t mean your system works. It means your tests didn’t catch how it fails.

Brian Neville-O'Neill

Apr 6, 2026 · 4 min read
Launch Week Day 5 (5/5): Generate Datasets from Your Data Sources

Your best evaluation data already exists — it's sitting in Google Drive, SharePoint, Notion, and S3. Dataset generation on Confident AI turns your existing documents into evaluation-ready datasets automatically.

Jeffrey Ip

Apr 4, 2026 · 4 min read
Launch Week Day 4 (4/5): Auto-Categorize Traces & Threads

You can't improve what you can't see. Auto-categorization tells you what your users are actually asking, detects response drift, and shows you which categories perform best — and which ones need help.

Jeffrey Ip

Apr 3, 2026 · 4 min read
Launch Week Day 3 (3/5): Auto-Ingest Traces into Datasets & Annotation Queues

Production traces are the best dataset you'll ever get — yet most teams never turn them into one. With auto-ingest, your traces flow straight into datasets and annotation queues, continuously.

Brian Romain

Apr 2, 2026 · 4 min read
Launch Week Day 2 (2/5): Scheduled Evals

Everyone agrees evals should run regularly, but nobody remembers to actually run them. Scheduled Evals fixes that: set the frequency, configure your mappings, and never scramble before a release again.

Kritin Vongthongsri

Apr 1, 2026 · 3 min read
Announcing Launch Week Q1 '26! Day 1: Automated Error Analysis

Error analysis used to mean pulling traces in code, hacking together an LLM to recommend metrics, and hoping for the best. Not anymore.

Jeffrey Ip

Mar 31, 2026 · 4 min read
Multi-Turn LLM Evaluation in 2026: What You Need to Know

This article breaks down multi-turn LLM evaluation: how it differs from single-turn evaluation, which metrics actually matter, and how to implement it.

Jeffrey Ip

Mar 22, 2026 · 14 min read
The Step-By-Step Guide to MCP Evaluation

This article covers everything you need to evaluate MCP-based LLM applications, step by step.

Cale

Oct 25, 2025 · 9 min read
AI Agent Evaluation: Metrics, Traces, Human Review, and Workflows

A practical guide to evaluating AI agents with LLM metrics and tracing — plus when human review matters, how it calibrates LLM judges, and workflows that combine CI, sampling, and production signals.

Jeffrey Ip

Oct 7, 2025 · 20 min read