Using LLMs for Synthetic Data Generation: The Definitive Guide

Got Red? Safeguard LLM Systems Today with Confident AI

The leading platform to red-team LLM applications for your organization, powered by DeepTeam.

Tailored frameworks (e.g. OWASP Top 10)

10+ LLM guardrails to guard malicious I/O

40+ plug-and-play vulnerabilities and 10+ attacks

Guardrails accuracy and latency reporting

Publicly sharable risk assessments.

On-demand custom guards available.

Filtering Synthetic Data

Before you begin evolving your newly generated datasets, it’s essential to conduct thorough quality checks to avoid refining inputs that are inherently flawed. This step is crucial to ensure that no valuable resources are wasted and that your final dataset only contains high-quality goldens.

Filtering occurs at two critical stages of synthetic data generation: initially during context generation, and subsequently during the generation of synthetic inputs from these contexts.

Context Filtering

During context generation, there’s a chance you might randomly select a low-quality chunk. Oftentimes, your knowledge base may contain complex structures or excess whitespace that becomes unintelligible when broken down. Employing LLMs as judges is a robust method for identifying and eliminating these low-quality contexts.

You may customize the criteria for evaluating and filtering out these contexts, but here are some foundational guidelines to consider:

Clarity: Evaluate how clear and understandable the information is.
Depth: Assess the level of detailed analysis and presence of original insights.
Structure: Review the organization and logical progression of the content.
Relevance: Determine the content’s pertinence to the main topic.
Precision: Gauge the accuracy and attention to detail.
Novelty: Assess the uniqueness and originality of the content.
Conciseness: Evaluate the brevity and efficiency of the communication.
Impact: Judge the potential effect of the content on the audience.

You’ll also need to ensure that the remaining chunks are similar enough once you finish filtering out low-quality chunks. This involves a secondary filtering process where you weed out chunks that don’t pass the similarity threshold.

This structured approach to filtering ensures that only high-quality, relevant, and useful contexts proceed to the next stage of synthetic data generation.

Input Filtering

The second filtering stage focuses on the synthetic inputs generated from these contexts. This step is crucial because even well-curated contexts can sometimes lead to the generation of inputs that might not meet the required standards.

Here are a few criteria you may want to judge your synthetic input on:

Self-containment: Ensures the input is complete and can function independently without external references.
Clarity: Checks that the input clearly communicates its intended message or question to avoid misinterpretation.
Consistency: Assures the input aligns with the provided context or background information thematically and factually.
Relevance: Verifies that the input directly pertains to the intended tasks or queries, ensuring it’s purposeful and on-topic.
Completeness: Confirms the input includes all necessary details for effective interaction or query resolution.

Using these criteria helps ensure that synthetic inputs are not only high in quality but also tailored perfectly to their intended applications.

Styling Synthetic Data

Finally, you may want to tailor your queries to specific topics and customize their input and output formats to fit your unique use cases.

For example, if your application involves converting text to SQL, the outputs should accurately reflect SQL statements. In scenarios involving an evaluative LLM, using a JSON format with keys like ‘score’ and ‘reason’ might be more appropriate.

You should plan to apply specific styling during the initial generation, through any evolutionary changes, and after the final outputs are generated. Revisiting the style after the initial generation is crucial, as the evolution of synthetic queries can alter the initial styling intentions.

The extent of style adjustments after the first round will depend on your desired level of control over the final product and the associated cost implications.

Data Survival-Of-The-Fittest

Let’s clarify what data evolution is and why it’s so important to synthetic data generation using LLMs. Data evolution, first introduced in Microsoft’s Evol-Instruct, involves iteratively enhancing an existing set of queries to generate more complex and diverse ones through prompt engineering. This step is crucial for ensuring the quality, comprehensiveness, complexity, and diversity of the dataset. It’s what makes synthetic data superior to public or human-annotated datasets.

In fact, the original authors managed to produce 250,000 instructions from just 175 human-created queries. There are 3 types of data evolution:

In-Depth Evolving: Expands a simple instruction into a more detailed and intricate version.
In-Breadth Evolving: Produces new, diverse instructions to enrich the dataset.
Elimination Evolving: Removes less effective or failed instructions.

In-depth (blue) and in-breadth (red) evolutions of ‘1+1’ query

There are several ways to perform in-depth evolution, such as complicating inputs, increasing the need for reasoning, or adding multiple steps to complete a task. Each approach contributes to a higher level of sophistication in the generated data.

In-depth evolution ensures the creation of nuanced, high-quality queries, while in-breadth evolution enhances diversity and comprehensiveness. By evolving each query or instruction multiple times, we increase its complexity, resulting in a rich and multifaceted dataset. But enough of me talking, let's show you how to put everything in action.

Take this query for example:

What is 1+1?

We can in-depth-evolve it to something like this instead:

In what situation does 1+1 not equal to 2?

Which I hope we can all agree is more complicated and realistic than a generic 1+1. In the next section, we'll show how to actually employ these evolution methods when generating synthetic datasets.

Confident AI: The DeepEval LLM Evaluation Platform

The leading platform to evaluate and test LLM applications on the cloud, native to DeepEval.

Regression test and evaluate LLM apps.

Easily A|B test prompts and models.

Edit and manage datasets on the cloud.

LLM observability with online evals.

Publicly sharable testing reports.

Automated human feedback collection.

Got Red? Safeguard LLM Systems Today with Confident AI

The leading platform to red-team LLM applications for your organization, powered by DeepTeam.

Tailored frameworks (e.g. OWASP Top 10)

10+ LLM guardrails to guard malicious I/O

40+ plug-and-play vulnerabilities and 10+ attacks

Guardrails accuracy and latency reporting

Publicly sharable risk assessments.

On-demand custom guards available.

Step-By-Step Guide: Generating Synthetic Data Using LLMs

Before we begin, lets be reminded of the data synthesizer architecture we'll be building:

You'll notice there are five main steps:

Document Chunking
Context Generation
Query Generation
Data Evolution
Label/Expected Output Generation (optional)

For those that want get something working right away, I've open-sourced this process in DeepEval and you can be ready for synthetic dataset generation with support for filtering and styling (which I'll show you later in the final section) immediately.

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
)

If you're here to learn how it works, please read on.

1. Document Chunking

The first step is to chunk your document. As the name suggests, document chunking means dividing it into smaller, meaningful ‘chunks.’ This way, you can break down larger documents into manageable sub-documents while maintaining their context. Chunking also allows for embedding generation in documents that exceed the token limit of the embedding model.

This step is essential because it helps identify semantically similar chunks and generate queries or tasks based on shared contexts.

There are several chunking strategies like fixed-size chunking and context-aware chunking. You can also adjust hyperparameters such as character size and chunk overlap. In the example below, we’ll use token-based chunking with a character size of 1024 and no overlap. Here’s how you can chunk your document:

pip install langchain langchain_openai

# Step 1. Chunk Documents
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=1024, chunk_overlap=0)
loader = PyPDFLoader("chatbot_information.pdf")
raw_chunks = loader.load_and_split(text_splitter)

Once you have the chunks, convert each one into embeddings. These embeddings capture the semantic meaning of each chunk and are combined with the chunk’s content to form a list of Chunk objects.

from langchain_openai import OpenAIEmbeddings
...

embedding_model = OpenAIEmbeddings(api_key="...")
content = [rc.page_content for rc in raw_chunks]
embeddings = embedding_model.embed_documents(content)

2. Context Generation

To generate context, start by randomly selecting a chunk of data to act as your focal anchor for finding related information.

# Step 2: Generate context by selecting chunks
import random
...

reference_index = random.randint(0, len(embeddings) - 1)
reference_embedding = embeddings[reference_index]
contexts = [content[reference_index]]

Next, set a similarity threshold and use cosine similarity to identify related chunks to build your context:

...

similarity_threshold = 0.8
similar_indices = []
for i, embedding in enumerate(embeddings):
    product = np.dot(reference_embedding, embedding)
    norm = np.linalg.norm(reference_embedding) * np.linalg.norm(embedding)
    similarity = product / norm
    if similarity >= similarity_threshold:
        similar_indices.append(i)

for i in similar_indices:
    contexts.append(content[i])

This step is crucial because it allows you to enhance the robustness of your queries by diversifying sources of information around the same topic. By including multiple chunks of data that share a similar theme, you also provide the model with richer, more nuanced information on the subject.

This ensures that your queries cover the topic comprehensively, resulting in more well-rounded and accurate responses.

3. Query Generation

Now comes the fun part with LLMs. Use a GPT model to generate a series of tasks or queries for the context created using a structured prompt.

Provide a prompt that asks the model to act as a copywriter, generating JSON objects containing an input key, which is the query. Each input should either be a question or statement answerable using the provided context.

# Step 3. Generate a series of queries for similar chunks
from langchain_openai import ChatOpenAI
...

prompt = f"""I want you act as a copywriter. Based on the given context, 
which is list of strings, please generate a list of JSON objects 
with a `input` key. The `input` can either be a question or a 
statement that can be addressed by the given context.

contexts:
{contexts}"""

query = ChatOpenAI(openai_api_key="...").invoke(prompt)

This step forms the basis of your queries, which will be evolved and included in the final dataset.

4. Query Evolution

Finally, we’ll evolve our queries from Step 3 using multiple evolution templates. You can define as many templates as you want, but we’ll focus on three: multi-context understanding, multi-step reasoning, and hypothetical scenario.

# Evolution prompt templates as strings
multi_context_template = f"""
I want you to rewrite the given `input` so that it requires readers to use information from all elements in `Context`.

1. `Input` should require information from all `Context` elements. 
2. `Rewritten Input` must be concise and fully answerable from `Context`. 
3. Do not use phrases like 'based on the provided context.'
4. `Rewritten Input` should not exceed 15 words.

Context: {context}
Input: {original_input}
Rewritten Input:
"""

reasoning_template = f"""
I want you to rewrite the given `input` so that it explicitly requests multi-step reasoning.

1. `Rewritten Input` should require multiple logical connections or inferences.
2. `Rewritten Input` should be concise and understandable.
3. Do not use phrases like 'based on the provided context.'
4. `Rewritten Input` must be fully answerable from `Context`.
5. `Rewritten Input` should not exceed 15 words.

Context: {context}
Input: {original_input}
Rewritten Input:
"""

hypothetical_scenario_template = f"""
I want you to rewrite the given `input` to incorporate a hypothetical or speculative scenario.

1. `Rewritten Input` should encourage applying knowledge from `Context` to deduce outcomes.
2. `Rewritten Input` should be concise and understandable.
3. Do not use phrases like 'based on the provided context.'
4. `Rewritten Input` must be fully answerable from `Context`.
5. `Rewritten Input` should not exceed 15 words.

Context: {context}
Input: {original_input}
Rewritten Input:
"""

You can see that each template imposes specific constraints on the output. Feel free to adjust them based on how you want your evaluation queries to appear in the final dataset. We’ll use these templates to evolve the original queries multiple times, randomly selecting templates each time.

# Step 4. Evolve Queries
...

example_generated_query = "How do chatbots use natural language understanding?"
context = contexts 
original_input = example_generated_query 
evolution_templates = [multi_context_template, reasoning_template, hypothetical_scenario_template]

# Number of evolution steps to apply
num_evolution_steps = 3

# Function to perform random evolution steps
def evolve_query(original_input, context, steps):
    current_input = original_input
    for _ in range(steps):
        # Choose a random (or using custom logic) template from the list
        chosen_template = random.choice(evolution_templates)
        # Replace the placeholders with the current context and input
        evolved_prompt = chosen_template.replace("{context}", str(context)).replace("{original_input}", current_input)
        # Update the current input with the "Rewritten Input" section
        current_input = ChatOpenAI(openai_api_key="...").invoke(evolved_prompt)
    return current_input

# Evolve the input by randomly selecting the evolution type
evolved_query = evolve_query(original_input, context, num_evolution_steps)

And there you have it, our final evolved query! Repeat this process to generate more queries and further refine your dataset. For evaluation purposes, you’ll need to properly format these input queries and contexts into a suitable testing framework.

5. Expected Output Generation

Although this step is optional, I would highly recommend generating expected outputs for each evolved query. This is because it is easier for a human evaluator to correct and annotate expected outputs than to create them from scratch.

# Step 5. Generate Expected Output
...

# Define prompt template
expected_output_template = f"""
I want you to generate an answer for the given `input`. This answer has to be factually aligned to the provided context.

Context: {context}
Input: {evolved_query}
Answer:
"""

# Fill in the values
prompt = expected_output_template.replace("{context}", str(context)).replace("{evolved_query}", evolved_query)

# Generate expected output
expected_output = ChatOpenAI(openai_api_key="...").invoke(prompt)

As a final step to wrap things up, combine the evolved query, context, and expected output as a data row in your synthetic dataset.

from pydantic import BaseModel
from typing import Optional, List
...

class SyntheticData(BaseModel):
	query: str
	expected_output: Optional[str]
	context: List[str]

synthetic_data = SyntheticData(
	query=evolved_query, 
	expected_output=expected_output, 
	context=context
)

# Simple implementation of synthetic dataset
synthetic_dataset = []
synthetic_dataset.append(synthetic_data)

Now all you need to do, is repeat steps 1-5 until you have a reasonably sized synthetic dataset, which you can later use to evaluate and test your LLM (systems) on!

Confident AI: The DeepEval LLM Evaluation Platform

The leading platform to evaluate and test LLM applications on the cloud, native to DeepEval.

Regression test and evaluate LLM apps.

Easily A|B test prompts and models.

Edit and manage datasets on the cloud.

LLM observability with online evals.

Publicly sharable testing reports.

Automated human feedback collection.

Got Red? Safeguard LLM Systems Today with Confident AI

The leading platform to red-team LLM applications for your organization, powered by DeepTeam.

Tailored frameworks (e.g. OWASP Top 10)

10+ LLM guardrails to guard malicious I/O

40+ plug-and-play vulnerabilities and 10+ attacks

Guardrails accuracy and latency reporting

Publicly sharable risk assessments.

On-demand custom guards available.

Generating Synthetic Datasets Using DeepEval

In this final section, I'd like to show you a battle-tested data synthesizer I've open-sourced it in DeepEval. This includes from synthetic data generation to formatting it into test cases ready for LLM evaluation and testing, which you can use in just 2 lines of code. And the best part is, you can leverage ANY LLM of your choice. Here’s how you can use DeepEval for synthetic dataset generation:

pip install deepeval

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
)

You can read more about how to use DeepEval's synthesizer to generate synthetic datasets in DeepEval's docs, but in summary DeepEval takes in your documents, does all the chunking and context generation for you, before generating synthetic "goldens" that are basically data rows that ultimately form your synthetic dataset. Easy enough?

Conclusion

Generating synthetic datasets using LLMs is great because it is a quick and cheap way to get your hands on large amounts of data. However, generated data can look extremely repetitive and often times doesn't represent the underlying data distribution well enough to be deemed useful. In this article, we talked how to solve this problem by first selecting relevant context from documents, before using it to generated queries that can be used to test and evaluate your LLM systems on.

We also explored data evolution, which we used to make synthetic queries more realistic. If you're looking to build a data synthesizer from scratch, this article serves as a great tutorial. However, if you're looking for something more robust and production ready, you can use DeepEval. It is open-source, extremely easy to use (seriously), and has a whole evaluation and testing suite for you to use the generated synthetic dataset to seamlessly test and evaluate your LLM systems on.

Thank you for reading and if you've found this article useful, don't forget to ⭐ DeepEval a star on GitHub ⭐!

* * * * *

Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an “aha!” moment, who knows?

Confident AI: The DeepEval LLM Evaluation Platform

The leading platform to evaluate and test LLM applications on the cloud, native to DeepEval.

Regression test and evaluate LLM apps.

Easily A|B test prompts and models.

Edit and manage datasets on the cloud.

LLM observability with online evals.

Publicly sharable testing reports.

Automated human feedback collection.

Got Red? Safeguard LLM Systems Today with Confident AI

The leading platform to red-team LLM applications for your organization, powered by DeepTeam.

Tailored frameworks (e.g. OWASP Top 10)

10+ LLM guardrails to guard malicious I/O

40+ plug-and-play vulnerabilities and 10+ attacks

Guardrail accuracy and latency reporting

Publicly sharable risk assessments.

On-demand custom guards available.