How Supernormal cut LLM cost by 80% with Confident AI

Company
Supernormal
Headcount
25-50
Location
New York, USA
Customer Since
August 2024
Industry
B2B AI SaaS
"Thanks to Confident AI, we were able to move to a fine-tuned model and cut our LLM costs by 80%. This opens up whole new use cases now to generate better output with more targeted LLM calls."
John Lemmon
AI Lead, Supernormal

Supernormal is shaping the future of work automation

As a pioneer in AI-powered meeting intelligence, Supernormal is building the future of work automation. Every month, Supernormal processes millions of tasks generated from meetings into actionable insights, helping companies save hundreds of hours by automating tedious workflows with AI. Supernormal's mission is to seamlessly connect your conversations to your software and turn spoken thought into completed work.

“The team at Supernormal believes that conversations are the heartbeat of work—where ideas spark, decisions solidify, and tasks are born,” says John Lemmon, the AI Lead at Supernormal.

To do this, John’s team of AI engineers are building advanced LLM-powered meeting intelligence solutions, including:

  1. Real-time meeting notes: Generate structured meeting notes—including summaries, action items, and custom insights—as the live conversation unfolds.
  2. Conversational AI assistants: Interactive voice agents that respond in real time, generating contextual speech based on the meeting’s flow and participant interactions.
  3. Instant “AMA” meeting insights: Deliver precise answers and key takeaways from historical discussions in seconds.

Like many of its customers, Supernormal is scaling fast. With millions of conversations processed each month, John’s team needed quality assurance: a way to ensure Supernormal’s LLM products drive the best possible ROI for customers while simultaneously cutting LLM inference costs.

Supernormal maintains exceptionally high standards

For Supernormal, quality is key: it's the difference between a proof of concept and a genuinely useful product. Being able to evaluate results means John’s team can make small, incremental improvements that add up over time and quickly know whether they can switch to newer models without regressions.

“If you ask an LLM for a summary of a meeting, it can quickly generate one for you, but whether that summary captures all of the main points every time can be difficult to detect,” John explains. “Guaranteeing that the output language, style, and formatting are consistent can be tricky when they fail only 1% of the time.”

For more complicated use cases, like generating useful action items, it can be hard to disambiguate between a simple task that needs to be done after a meeting and a direction decided on during the meeting. Balancing recall and precision is tricky when the answers are so nebulous.

John’s team also often wanted to switch to newer models when they came out, either because they were cheaper or because they performed better. But model behavior is unpredictable, so regressions that had previously been mitigated in their prompts came back with newer models and required some re-tuning.

“Specifically for gpt-4o, we saw the LLM choose the wrong language because it was picking up people's names instead of looking at the spoken text. Being able to quickly know if past issues resurface with newer models let us fix the problem quickly and take advantage of them without worries,” says John.

Supernormal is outgrowing its manual, inefficient LLM testing processes

Before Confident AI, John’s team mainly manually eyeballed a small "golden" dataset where they knew what results to expect.

“But manual eyeballing doesn't scale and the golden dataset is too small and biased to pick up issues that might only appear 1% of the time," John explains.

There also wasn’t an easy way to collaborate within the team. Data had to be shared in Google Sheets, where the format wasn't easy to consume and differences between control and treatment prompts were hard to pinpoint. Keeping these sheets consistent was difficult, and everyone had their own style, which made it harder to “peer review” the results.

John’s team also tried other tools. LangSmith seemed promising but was difficult to run locally, requiring more than a simple pip install and pushing users toward external servers. Other LLM evaluation frameworks follow the same pattern, which was a non-starter whenever the team wanted to implement their own metrics, such as the ratio of output length to input length. Not every metric is a pre-built LLM-as-a-judge, but the AI team at Supernormal wanted to see all of these metrics together for better experimentation.
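A metric like the output-to-input length ratio is fully deterministic and needs no LLM judge. Below is a minimal, framework-free sketch of that idea; the `LengthRatioMetric` class, the `TestCase` shape, and the 0.5 threshold are illustrative assumptions, not Supernormal's actual code, though DeepEval's `BaseMetric` interface (`measure`, `is_successful`) follows a broadly similar shape:

```python
from dataclasses import dataclass


@dataclass
class TestCase:
    # Mirrors the input / actual-output shape used by eval frameworks
    input: str
    actual_output: str


class LengthRatioMetric:
    """Deterministic metric: ratio of output length to input length.

    Flags outputs that are suspiciously long relative to the source
    text (e.g. a meeting summary longer than half the transcript).
    """

    def __init__(self, max_ratio: float = 0.5):
        self.max_ratio = max_ratio
        self.score = None

    def measure(self, test_case: TestCase) -> float:
        # Guard against division by zero on an empty input
        self.score = len(test_case.actual_output) / max(len(test_case.input), 1)
        return self.score

    def is_successful(self) -> bool:
        return self.score is not None and self.score <= self.max_ratio


metric = LengthRatioMetric(max_ratio=0.5)
case = TestCase(input="a" * 1000, actual_output="b" * 200)
metric.measure(case)           # score = 200 / 1000 = 0.2
print(metric.is_successful())  # True: 0.2 <= 0.5
```

Because the class is plain Python, it can sit in the same test run as LLM-judged metrics, which is exactly the "see all these metrics together" workflow the team wanted.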

"There's a lot of LLM eval tools out there now but they're mostly terrible, so sifting through them was a painful process."

John Lemmon
AI Lead, Supernormal

It became clear to John and his team that if they wanted to provide better AI-powered workflow automations for their customers while cutting inference cost, they needed a more effective, standardized LLM testing process. They sought an LLM-evaluation-focused solution that their team could leverage to run any metric, anywhere, while scaling testing through a centralized collaboration platform.

A more tailored, efficient, and reliable solution to experiment with LLM applications

The search led them to Confident AI, which John’s team saw as a true LLM evaluation solution enabling LLM experimentation with performance-reflective metrics.

“What drew us to Confident AI was the combination of customizable open-source metrics provided through DeepEval as well as the collaborative experimentation features on the cloud,” says John.

Confident AI solved the sharing problem, giving John’s team a quick way to show results to other team members for review, while DeepEval, being just a pip package, let them run evals locally and hook into their existing eval pipeline without much trouble. The LLM evals Supernormal had built in-house were also not very good, so having tools like G-Eval out of the box with DeepEval was a great benefit.

“We were about to build this (Confident AI) out ourselves and the platform alone easily saved us at least 100 hours of effort. The framework (DeepEval) also worked well with our existing data in Snowflake, making it adaptable for other non-evaluation use cases,” John added.

Additionally, John found it easy to set up team members to run their own experiments flexibly. Although Supernormal already had a system set up for routine checks, DeepEval made it easy to run evals in a notebook environment for ad-hoc work.

John also loves working with the team at Confident AI: “Whenever I hit any snag at all I can get very quick responses in Discord and usually a fix/improvement by the next day. This is impressive; I've rarely seen responsiveness like this, and it feels like having a whole company on the team helping us build things out quickly.”

80% reduction in LLM cost, countless hours saved, and no more spreadsheets

By centralizing and standardizing Supernormal’s LLM evaluation workflow, Confident AI eliminated hundreds of hours of the team’s manual testing work. Our solution gave Supernormal a standard way to peer-review changes, and rollouts that once took at least a week can now often be done in hours. Using Confident AI, Supernormal was also able to easily switch models and eventually decrease its inference cost by 80%.

"Thanks to Confident AI, we were able to move to a fine-tuned model and cut our LLM costs by 80%. This opens up whole new use cases now to generate better output with more targeted LLM calls."

John Lemmon
AI Lead, Supernormal

Thankfully, this also meant no more Google Sheets. Confident AI’s upcoming capabilities are particularly exciting for Supernormal, especially as they look for better ways to create deterministic, decision-based custom metrics. With their evaluation criteria becoming more specific and precise, DeepEval’s DAG-based metric will make it easy to build deterministic LLM metrics, eliminating the need to manage thousands of lines of evaluation code while automatically integrating with Confident AI for experimentation across the wider team.
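The intuition behind a decision-based (DAG) metric can be sketched without any framework: each node is a deterministic check, the result of one check decides which check runs next, and each path ends in a fixed score. The function below is a hypothetical example for scoring extracted action items; the node rules, word lists, and scores are invented for illustration and are not DeepEval's actual DAG API:

```python
# Hypothetical decision-based metric: a small graph of deterministic
# checks, where each node's outcome selects the next node or a final
# score. (Illustrative only; DeepEval's DAG-based metric composes
# similar nodes inside its evaluation runs.)

def score_action_items(items: list[str]) -> float:
    # Node 1: was anything extracted at all?
    if not items:
        return 0.0  # empty extraction fails outright

    # Node 2: are items phrased as tasks rather than statements?
    # (Crude heuristic: statements tend to open with an article/pronoun.)
    task_like = [i for i in items if i.split()[0].lower() not in {"the", "a", "we"}]
    if len(task_like) < len(items) / 2:
        return 0.3  # mostly statements, not tasks

    # Node 3: reward concise items (<= 15 words) among the task-like ones
    concise = [i for i in task_like if len(i.split()) <= 15]
    return 0.6 + 0.4 * (len(concise) / len(task_like))


print(score_action_items([]))                          # 0.0
print(score_action_items(["Send the recap to Dana"]))  # 1.0
```

Because every branch is explicit code rather than a judge prompt, the score is reproducible run-to-run, which is the appeal of deterministic metrics once evaluation criteria become precise enough to spell out.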

With these challenges lifted, Supernormal can now focus on delivering maximum value to customers at a significantly lower cost, all thanks to Confident AI.

When your AI needs improvement, you need Confident AI.