Jeffrey Ip
Cofounder @ Confident, creator of DeepEval. Ex-Googler (YouTube), Microsoft AI (Office365). Working overtime to enforce responsible AI.

Leveraging LLM-as-a-Judge for Automated and Scalable Evaluation

February 23, 2025
·
9 min read

Recently, I've been hearing the term "LLM as a Judge" more frequently than ever. That might be because I work in LLM evaluation for a living, but LLM judges are also taking over because it is becoming clear that they are a much better alternative to human evaluators, who are slow, costly, and labor-intensive.

But LLM judges do have their limitations, and using them without caution will cause you nothing but frustration. In this article, I'll let you in on everything I know (so far) about using LLM judges for LLM (system) evaluation, including:

  • What exactly "LLM as a Judge" is, and whether it really works
  • The alternatives to LLM judges, and why they fall short
  • The limitations of LLM judges, and techniques to improve their judgments
  • How to use LLM judges in LLM evaluation metrics through DeepEval

Can’t wait? Neither can I.

(Update: You can also now use LLM-as-a-judge for deterministic LLM metrics in DeepEval!)

What exactly is “LLM as a Judge”?

LLM-as-a-Judge is a powerful solution that uses LLMs to evaluate LLM responses based on any specific criteria of your choice; in other words, using LLMs to carry out LLM (system) evaluation. It was introduced in the "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" paper as an alternative to human evaluation, which is often expensive and time-consuming. The three types of LLM judges are:

  • Single Output Scoring (without reference): A judge LLM is given a scoring rubric as the criteria and prompted to assign a score to an LLM response, taking into account factors such as the input to the LLM system, the retrieval context in RAG pipelines, etc.
  • Single Output Scoring (with reference): Same as above, but since LLM judges can get flaky, providing a reference (an ideal, expected output) helps the judge LLM return more consistent scores.
  • Pairwise Comparison: Given two LLM-generated outputs, the judge LLM picks which one is the better generation with respect to the input. This also requires a custom criterion for what counts as "better" (see the example prompt below).
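
To make the pairwise case concrete, here is a minimal example of what a pairwise comparison prompt might look like (the wording is illustrative, not taken from the paper):

pairwise_prompt = """
You will be given a user input and two LLM-generated responses (Response A and Response B).
Your task is to pick the response that better answers the input, judging on helpfulness and factual accuracy.

Input:
{input}

Response A:
{llm_output_a}

Response B:
{llm_output_b}

Which response is better? Answer with exactly "A" or "B".
"""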

The concept is straightforward: provide an LLM with an evaluation criterion, and let it do the grading for you. But how and where exactly would you use LLMs to judge LLM responses?

Using LLM Judges as Scorers in Metrics

“LLM as a Judge” can be used to augment LLM evaluation by using it as a scorer for LLM evaluation metrics (if you don’t know what an LLM evaluation metric is, I highly recommend reading this article here). To get started, simply provide the LLM of your choice with a clear and concise evaluation criterion or a rubric, and use it to compute a metric score (ranging from 0 to 1) based on various parameters, such as the input and generated output of your LLM. Here is an example of an evaluation prompt to an LLM judge to evaluate summary coherence:


prompt = """
You will be given one summary (LLM output) written for a news article. 
Your task is to rate the summary on how coherent it is to the original text (input). 

Original Text:
{input}

Summary:
{llm_output}

Score:
"""

By collecting these metric scores, you can create a comprehensive suite of LLM evaluation results, which can be used to benchmark, evaluate, and even regression test LLM (systems).

The growing trend of using LLMs as scorers for LLM evaluation metrics is catching on because the alternatives just don't cut it. LLM evaluation is vital to quantifying and identifying areas to improve LLM system performance, but human evaluation is slow, and traditional scorers like BERTScore and ROUGE miss the mark by overlooking the deeper semantics in LLM-generated text. Think about it: how could we expect traditional, much smaller NLP models to effectively judge not just paragraphs of open-ended generated text, but also content in formats like Markdown or JSON?

Does It Really Work?

In short, yes, and research (see paper above) on LLM-as-a-judge shows it aligns with human judgments even more than humans agree with each other. And no, you don't need your evaluation models to be better than the one you're using for your app.

At first, it may seem counterintuitive to use an LLM to evaluate text generated by another LLM. If the model is producing the output, why would it be better at judging it—or spotting errors?

The key is separation of tasks. Instead of asking the LLM to redo its work, we use a different prompt—or even a different model—specifically for evaluation. This activates distinct capabilities, often reducing the task to a simple classification problem: assessing quality, coherence, or correctness. Detecting issues is often easier than avoiding them in the first place, because evaluating is simpler than generating: an LLM judge only assesses what's produced, like checking relevance without needing to improve the answer.

Apart from the evaluation prompt being fundamentally different, there are also many techniques to improve LLM-as-a-judge accuracy, like CoT prompting and few-shot learning, which we'll talk more about later. We've also found success in confining LLM-judge outputs to extremes, making metric scores highly deterministic. In DeepEval, we've enabled users to construct decision trees—modeled as DAGs where nodes are LLM judges and edges represent decisions—to create highly deterministic evaluation metrics that precisely fit their criteria. You'll be able to read about this more in the "DAG" section.

(Side story: I actually learnt that LLM-as-a-judge works far better the hard way when building DeepEval. Initially, I relied on traditional non-LLM metrics like ROUGE and BLEU, which compare text based on word overlap. But I quickly saw users complaining that these scores lacked accuracy for even simple sentences. And let's not talk about the lack of explainability.)

Alternatives to LLM Judges

This section really shouldn't exist, but here are two popular alternatives to using LLMs for LLM evaluation, and the common reasons why they are, in my opinion, mistakenly preferred:

  • Human Evaluation: Often seen as the gold standard due to its ability to understand context and nuance. However, it's time-consuming, expensive, and can be inconsistent due to subjective interpretations. It's not unusual for a real-world LLM application to generate approximately 100,000 responses a month. I don't know about you, but it takes me about 45 seconds on average to read through a few paragraphs and make a judgment about them. That adds up to around 4.5 million seconds, or about 52 consecutive days each month — without taking lunch breaks — to evaluate every single generated LLM response.
  • Traditional NLP Evaluation Methods: Traditional scorers such as BERTScore and ROUGE are great — they are fast, cheap, and reliable. However, as I pointed out in one of my previous articles comparing all types of LLM evaluation metric scorers, these methods have two fatal flaws: they require a reference text to compare the generated LLM outputs against, and they are incredibly inaccurate because they overlook semantics in LLM-generated outputs, which are often open to subjective interpretation and come in various complicated formats (e.g., JSON). Given that LLM outputs in production are open-ended and have no reference text, traditional evaluation methods hardly make the cut.

(Also, both human and traditional NLP evaluation methods lack explainability, that is, the ability to explain the evaluation scores they give.)

And so, LLM judges are currently the best available option. They are scalable, can be fine-tuned or prompt-engineered to mitigate bias, are relatively fast and cheap (though this depends on which evaluation method you're comparing against), and most importantly, can understand even extremely complicated pieces of generated text, regardless of the content itself and the format it comes in. With that in mind, let's go through the effectiveness of LLM judges and their pros and cons in LLM evaluation.


LLMs, More Judgemental Than You Think

So the question is, how accurate are LLM judges? After all, LLMs are probabilistic models, and are still susceptible to hallucination, right?

Research has shown that when used correctly, state-of-the-art LLMs such as GPT-4 (yes, still GPT-4) can align with human judgment up to 85% of the time, for both pairwise and single-output scoring. For those who are still skeptical, this number is actually even higher than the agreement among humans themselves (81%).

The fact that GPT-4 achieves this for both pairwise and single-output scoring implies GPT-4 has a relatively stable internal rubric, and this stability can be further improved through chain-of-thought (CoT) prompting.

G-Eval

As introduced in one of my previous articles, G-Eval is a framework that uses CoT prompting to stabilize LLM judges and make them more reliable and accurate when computing metric scores (scroll down to learn more about CoT).

G-Eval algorithm

G-Eval first generates a series of evaluation steps from the original evaluation criteria, then uses the generated steps to determine the final score via a form-filling paradigm (this is just a fancy way of saying G-Eval requires several pieces of information to work). For example, evaluating LLM output coherence using G-Eval involves constructing a prompt that contains the criteria and the text to be evaluated to generate the evaluation steps, before using an LLM to output a score from 1 to 5 based on these steps (for a more detailed explanation, read this article instead).
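
To make the form-filling idea concrete, here's a minimal sketch of the two-step flow, assuming the OpenAI Python SDK (the prompts and model name are illustrative, not the paper's exact wording):

from openai import OpenAI

client = OpenAI()

def g_eval_score(criteria: str, text_to_evaluate: str) -> str:
    # step 1: auto-CoT - expand the high-level criteria into concrete evaluation steps
    steps = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Generate 3-5 concise evaluation steps for the following criteria: {criteria}",
        }],
    ).choices[0].message.content

    # step 2: form-filling - score the text by following the generated steps
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Evaluation steps:\n{steps}\n\n"
                f"Text to evaluate:\n{text_to_evaluate}\n\n"
                "Follow the steps above and return only a score from 1 to 5."
            ),
        }],
    ).choices[0].message.content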

As you’ll learn later, the technique presented in G-Eval actually aligns with various techniques we can use to improve LLM judgements. You can use G-Eval immediately in a few lines of code through DeepEval, the open-source LLM evaluation framework.


pip install deepeval

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

test_case = LLMTestCase(input="input to your LLM", actual_output="your LLM output")
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence - the collective quality of all sentences in the actual output",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

coherence_metric.measure(test_case)
print(coherence_metric.score, coherence_metric.reason)

DAG (directed acyclic graph)

There's a problem with G-Eval though: it is not deterministic. This means that for a given benchmark that uses LLM-as-a-judge metrics, you can't trust it fully. That's not to say G-Eval isn't useful; in fact, it excels at tasks where subjective judgement is required, such as coherence, similarity, answer relevancy, etc. But when you have a clear-cut criterion, such as the format correctness of a text summarization use case, you need determinism.

You can achieve this with LLMs by structuring evaluations as a Directed Acyclic Graph (DAG). In this approach, each node represents an LLM judge handling a specific decision, while edges define the logical flow between decisions. By breaking down an LLM interaction into finer, atomic units, you reduce ambiguity and enforce alignment with your expectations. The more granular the breakdown, the lower the risk of misalignment. (You can read more about how I built the DAG metric for DeepEval here.)

DAG Architecture

For the DAG diagram shown above, which evaluates a meeting summarization use case, here is the corresponding code in DeepEval (you can find the documentation for DAG here):


from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)
from deepeval.metrics import DAGMetric

correct_order_node = NonBinaryJudgementNode(
    criteria="Are the summary headings in the correct order: 'intro' => 'body' => 'conclusion'?",
    children=[
        VerdictNode(verdict="Yes", score=10),
        VerdictNode(verdict="Two are out of order", score=4),
        VerdictNode(verdict="All out of order", score=2),
    ],
)

correct_headings_node = BinaryJudgementNode(
    criteria="Does the summary headings contain all three: 'intro', 'body', and 'conclusion'?",
    children=[
        VerdictNode(verdict=False, score=0),
        VerdictNode(verdict=True, child=correct_order_node),
    ],
)

extract_headings_node = TaskNode(
    instructions="Extract all headings in `actual_output`",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    output_label="Summary headings",
    children=[correct_headings_node, correct_order_node],
)

# create the DAG
dag = DeepAcyclicGraph(root_nodes=[extract_headings_node])

# create the metric
format_correctness = DAGMetric(name="Format Correctness", dag=dag)

# create a test case
test_case = LLMTestCase(input="your-original-text", actual_output="your-summary")

# evaluate
format_correctness.measure(test_case)
print(format_correctness.score, format_correctness.reason)

However, I wouldn't recommend starting off with DAG, simply because it is harder to use and G-Eval takes no time to set up at all. You should first try G-Eval, before slowly migrating to a finer-grained technique like DAG. In fact, if you would like to use DAG to filter out a certain requirement such as format correctness before running G-Eval, you can also do that. There's a full example near the end of this article, where we use G-Eval as a leaf node instead of returning a hard-coded score.

LLMs are Not Perfect Though

As you might expect, LLM judges are not all rainbows and sunshine. They also suffer from several drawbacks, which include:

  • Can't Make Up Their Minds: Their scores are non-deterministic, which means that for a given LLM output they are evaluating, the scores might differ depending on the time of day. You'll need a good approach such as DAG to make them deterministic if you want to rely on them fully.
  • Narcissistic Bias: It has been shown that LLMs may favor the answers generated by themselves. We use the word "may" because research has shown that although GPT-4 and Claude-v1 favor themselves with a 10% and 25% higher win rate respectively, they also favor other models, and GPT-3.5 does not favor itself.
  • More is More: We humans all know the phrase "less is more", but LLM judges tend to prefer more verbose text over more concise text. This is a problem in LLM evaluation because the computed evaluation scores might not accurately reflect the quality of the generated text.
  • Not-so-Fine-Grained Evaluation Scores: LLMs can be reliable judges when making high-level decisions, such as determining binary factual correctness or rating generated text on a simple 1–5 scale. However, as the scoring scale becomes more detailed with finer intervals, LLMs are more likely to produce arbitrary scores, making their judgments less reliable and more prone to randomness.
  • Position Bias: When using LLM judges for pairwise comparisons, it has been shown that LLMs such as GPT-4 generally prefer the first generated LLM output over the second one.

Furthermore, there are other general considerations such as LLM hallucination. However, that’s not to say these can’t be solved. In the next section, we’ll go through some techniques on how to mitigate such limitations.


Improving LLM Judgements

Chain-Of-Thought Prompting

Chain-of-thought (CoT) prompting is an approach where the model is prompted to articulate its reasoning process. In the context of LLM judges, it involves including detailed evaluation steps in the prompt instead of vague, high-level criteria to help a judge LLM perform more accurate and reliable evaluations. This also helps LLMs align better with human expectations.

This is in fact the technique employed in G-Eval, which they call “auto-CoT”, and is of course implemented in DeepEval, which you can use like this:


from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

test_case = LLMTestCase(input="input to your LLM", actual_output="your LLM output")
# providing only `criteria` lets G-Eval auto-generate evaluation steps (auto-CoT)
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence - the collective quality of all sentences in the actual output",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

coherence_metric.measure(test_case)
print(coherence_metric.score, coherence_metric.reason)

Few-Shot Prompting

Few-shot prompting is a simple concept which involves including examples to better guide LLM judgements. It is definitely more computationally expensive as you'll be including more input tokens, but few-shot prompting has been shown to increase GPT-4's consistency from 65.0% to 77.5%.

Other than that, there’s not much to elaborate on here, and if you’ve ever tried playing around with different prompt templates you’ll know that adding a few examples in the prompts is probably the single most helpful thing one could do to steer LLM generated outputs.
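
For illustration, a few-shot version of the earlier coherence prompt might simply prepend a couple of graded examples before the text to be evaluated (the examples below are made up):

few_shot_prompt = """
You will be given one summary written for a news article. Rate how coherent the summary is on a scale from 1 to 5.

Example 1:
Summary: "The merger. Announced profits Tuesday stock rose."
Score: 1

Example 2:
Summary: "The two companies announced their merger on Tuesday, and the stock rose 4% on the news."
Score: 5

Now evaluate the following.

Summary:
{llm_output}

Score:
"""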

Using Probabilities of Output Tokens

To make the computed evaluation score more continuous, instead of asking the judge LLM to output scores on a finer scale (which may introduce arbitrariness into the metric score), we can sample the score multiple times (e.g., 20 generations) and use the probabilities of the LLM's output tokens to compute a weighted summation as the final score. This minimizes bias in LLM scoring and smoothens the final computed metric score, making it more continuous without compromising accuracy.

Bonus: This is also implemented in DeepEval’s G-Eval implementation.
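
If you want to roll this yourself outside of DeepEval, here's a minimal sketch using the OpenAI API's logprobs feature to weight the candidate score tokens (the model name and 1-5 scale are my own assumptions):

import math

from openai import OpenAI

client = OpenAI()

def weighted_score(filled_prompt: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": filled_prompt + "\nReturn only an integer score from 1 to 5."}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,  # the 20 most likely tokens at the score position
    )
    candidates = response.choices[0].logprobs.content[0].top_logprobs

    # probability-weighted sum over the candidate score tokens
    weighted_sum, total_prob = 0.0, 0.0
    for candidate in candidates:
        token = candidate.token.strip()
        if token in {"1", "2", "3", "4", "5"}:
            prob = math.exp(candidate.logprob)
            weighted_sum += int(token) * prob
            total_prob += prob
    return weighted_sum / total_prob if total_prob > 0 else 0.0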

Reference-Guided Judging

Instead of reference-free, single-output judging, providing an expected output as the ideal answer helps a judge LLM better align with human expectations. In your prompt, this can be as simple as incorporating it as an example in few-shot prompting.

Confining LLM Judgements

Instead of giving LLMs the entire generated output to evaluate, you can consider breaking it down into more fine-grained evaluations. For example, you can use LLMs to power question-answer generation (QAG), a powerful technique for computing evaluation metric scores based on yes/no answers to close-ended questions, which makes the scores non-arbitrary. For example, if you would like to calculate the answer relevancy of an LLM output based on a given input, you can first extract all sentences in the LLM output, then determine the proportion of sentences that are relevant to the input. The final answer relevancy score is then the proportion of relevant sentences in the LLM output. In a way, the DAG we talked about earlier also uses QAG (I know, starting to feel a bit silly with the -AGs), especially for nodes where a binary judgement is expected. For a more complete example of QAG, read this article on how to use QAG to compute scores for various different RAG and text summarization metrics.

QAG is a powerful technique because it means LLM scores are no longer arbitrary and can be attributed to a mathematical formula. Breaking down the initial prompt to only include sentences instead of the entire LLM output can also help combat hallucinations as there is now less text to be analyzed.
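
Here's a minimal sketch of the answer relevancy computation described above, using a yes/no judge per sentence (the prompt wording, sentence splitting, and model name are my own simplifications):

from openai import OpenAI

client = OpenAI()

def answer_relevancy(input_text: str, llm_output: str) -> float:
    # naive sentence split; a real implementation would use a proper sentence tokenizer
    sentences = [s.strip() for s in llm_output.split(".") if s.strip()]
    if not sentences:
        return 0.0

    relevant = 0
    for sentence in sentences:
        verdict = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": (
                    f"Input:\n{input_text}\n\nSentence:\n{sentence}\n\n"
                    "Is this sentence relevant to addressing the input? Answer only 'yes' or 'no'."
                ),
            }],
            temperature=0,
        ).choices[0].message.content.strip().lower()
        if verdict.startswith("yes"):
            relevant += 1

    # final score = proportion of relevant sentences
    return relevant / len(sentences)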

Swapping Positions

No rocket science here: we can simply swap the positions of the two outputs to address position bias in pairwise LLM judges, and only declare a win when an answer is preferred in both orders.
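
A sketch of what this looks like in code, assuming a hypothetical pairwise_judge(input, first, second) helper that returns "first" or "second":

def consistent_winner(input_text: str, output_a: str, output_b: str):
    # pairwise_judge is a hypothetical helper wrapping your pairwise judge LLM call
    verdict_1 = pairwise_judge(input_text, first=output_a, second=output_b)
    verdict_2 = pairwise_judge(input_text, first=output_b, second=output_a)

    # only declare a win if the same answer is preferred in both orders
    if verdict_1 == "first" and verdict_2 == "second":
        return "A"
    if verdict_1 == "second" and verdict_2 == "first":
        return "B"
    return None  # inconsistent verdicts, treat as a tie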

Fine-Tuning

For more domain-specific LLM judges, you might consider fine-tuning a custom open-source model like Llama-3.1. This also helps if you would like faster inference times and lower costs for LLM evaluation.


Using LLM Judges in LLM Evaluation Metrics

Lastly, LLM judges can be, and currently most widely are, used to evaluate LLM systems by incorporating them as scorers in LLM evaluation metrics:

A reminder of how an LLM judge can be used as scorer.

A good implementation of an LLM evaluation metric will use all of the mentioned techniques to improve the LLM judge scorer. For example, in DeepEval (give it a star here⭐) we already use QAG to confine LLM judgements in RAG metrics such as contextual precision, auto-CoT and normalized output-token probabilities for custom metrics such as G-Eval, and most importantly few-shot prompting for all metrics to cover a wide variety of edge cases. For a full list of metrics that you can use immediately, click here.

To finish off this article, I'll show you how you can leverage DeepEval's metrics in a few lines of code. You can also find all the implementations on DeepEval's GitHub, which is free and open-source.

Coherence

You've probably seen this a few times by now: a custom metric that you can implement via G-Eval:


from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

test_case = LLMTestCase(input="input to your LLM", actual_output="your LLM output")
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence - the collective quality of all sentences in the actual output",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    verbose_mode=True,  # prints the judge's intermediate steps and judgements
)

coherence_metric.measure(test_case)
print(coherence_metric.score, coherence_metric.reason)

Note that we turned on verbose_mode for G-Eval. When verbose mode is turned on in DeepEval, it prints the internal workings of an LLM judge and allows you to see all the intermediate judgements made.

Text Summarization

Next up is summarization. I love talking about summarization because it is one of those use cases where users typically have a great sense of what the success criteria look like, and formatting in a text summarization use case is a great example. Here, we'll use DeepEval's DAG metric, with a twist. Instead of the DAG code you've seen in the DAG section, we'll use DAG to automatically assign a score of 0 to summaries that don't follow the correct formatting requirement, and use G-Eval as a leaf node inside our DAG to return the final score instead. This means the final score is not hard-coded, while still ensuring your summary meets a certain requirement.

First, create your DAG structure:


from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)
from deepeval.metrics import DAGMetric, GEval

g_eval_summarization = GEval(
    name="Summarization",
    criteria="Determine how good a summary the 'actual output' is to the 'input'",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

correct_order_node = NonBinaryJudgementNode(
    criteria="Are the summary headings in the correct order: 'intro' => 'body' => 'conclusion'?",
    children=[
        VerdictNode(verdict="Yes", g_eval=g_eval_summarization),
        VerdictNode(verdict="Two are out of order", score=0),
        VerdictNode(verdict="All out of order", score=0),
    ],
)

correct_headings_node = BinaryJudgementNode(
    criteria="Does the summary headings contain all three: 'intro', 'body', and 'conclusion'?",
    children=[
        VerdictNode(verdict=False, score=0),
        VerdictNode(verdict=True, child=correct_order_node),
    ],
)

extract_headings_node = TaskNode(
    instructions="Extract all headings in `actual_output`",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    output_label="Summary headings",
    children=[correct_headings_node, correct_order_node],
)

# create the DAG
dag = DeepAcyclicGraph(root_nodes=[extract_headings_node])

Then, create the DAG metric out of this DAG and run an evaluation:


from deepeval.test_case import LLMTestCase
...

# create the metric
summarization = DAGMetric(name="Summarization", dag=dag)

# create a test case for summarization
test_case = LLMTestCase(input="your-original-text", actual_output="your-summary")

# evaluate
summarization.measure(test_case)
print(summarization.score, summarization.reason)

From the DAG structure, you can see that we return a score of 0 for all cases where the formatting is incorrect, and only run G-Eval afterwards when it is correct. You can find the documentation for DAG here.

Contextual Precision

Contextual precision is a RAG metric that determines whether the nodes retrieved in your RAG pipeline are in the correct order. This is important because LLMs tend to pay more attention to nodes that are closer to the end of the prompt (recency bias). Contextual precision is calculated using QAG, where the relevance of each node is determined by an LLM judge looking at the input. The final score is a weighted cumulative precision, and you can view the full explanation here.
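
If you're curious how the weighted cumulative precision works under the hood, here's a small sketch of how I'd compute it given per-node relevance verdicts (a simplified view of the formula, not DeepEval's actual implementation):

def contextual_precision(relevance: list) -> float:
    # relevance[k] is True if the retrieval node at rank k+1 was judged relevant to the input
    relevant_so_far, weighted_sum = 0, 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            weighted_sum += relevant_so_far / k  # precision@k, counted only at relevant ranks
    total_relevant = sum(relevance)
    return weighted_sum / total_relevant if total_relevant else 0.0

print(contextual_precision([True, False, False]))  # 1.0 - relevant node ranked first
print(contextual_precision([False, False, True]))  # ~0.33 - relevant node ranked last

In DeepEval, the metric is available out of the box: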


from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

metric = ContextualPrecisionMetric()
test_case = LLMTestCase(
    input="...",
    actual_output="...",
    expected_output="...",
    retrieval_context=["...", "..."]
)

metric.measure(test_case)
print(metric.score, metric.reason)

Conclusion

You made it! It was a lot on LLM judges, but now at least we know what the different types of LLM judges are, their role in LLM evaluation, pros and cons, and the ways in which you can improve them.

The main objective of an LLM evaluation metric is to quantify the performance of your LLM (application), and to do this we have different scorers, of which LLM judges are currently the best. Sure, there are drawbacks, such as LLMs exhibiting bias in their judgements, but these can be mitigated through prompt engineering techniques like CoT and few-shot prompting.

Don't forget to give ⭐ DeepEval a star on GitHub ⭐ if you found this article useful, and as always, till next time.

* * * * *

Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an “aha!” moment, who knows?

