The ability to use AI to generate data out of thin air is one of those things that seem too good to be true — think about it, you can get your hands on quality data without needing to manually collect, clean, and annotate massive datasets.
But, as you might expect, synthetic data is not without its caveats. Although it is convenient, efficient, and cost effective, the quality of synthetic data is only as good as the method used to generate it. Settle for rudimentary methods, and you’ll end up with unusable datasets that don’t represent real-world data well.
In this article, I’m going to share how we managed to generate realistic textual synthetic data at Confident AI. Let's dive right into it.
What is synthetic data?
First and foremost, synthetic data is artificially generated data in attempt to simulate real-world data. Unlike real-world data that is collected from observations or actual events (e.g., tweets on the platform formally known as Twitter), synthetic data is made up, sometimes entirely, but more commonly based on a small subset of real-world data (also known as data augmentation).
This kind of data is often used for testing, training, and validating machine learning models, especially in scenarios where using real-world data is scarce or difficult to collect.
The struggles in generating textual data
Historically, while the demand for synthetic data continued to rise steadily over the years, advancements in generation methods struggled to keep pace.
Methods available at the time were often simplistic, perhaps relying on basic statistical methods, or they were too domain-specific and hard to generalize, meaning they lacked the complexity to mimic real-world data in a meaningful way.
Let’s take Generative Adversarial Networks (GANs) as an example. GANs employed a novel architecture of two neural networks — a generator and a discriminator — that competed with each other. The competition between these two networks resulted in the generation of highly realistic and complex synthetic data.
However, as one might have guessed from the title of this article, there were still major drawbacks when leveraging GANs to generate textual data.
- Mode Collapse: A phenomenon where the generator starts to produce the same output (or very similar outputs) over and over again.
- Difficult to train: GANs are notoriously hard to train, with issues like vanishing/exploding gradients and oscillations in loss.
- Long-Range Dependencies: Textual data often involve long-range dependencies (e.g., the subject of a sentence affecting a verb that appears much later), and capturing these effectively is a challenge even for advanced GAN architectures.
- Very Needy: They require lots of data to train on (ironically).
Needless to say, there’s a lot of hurdles to overcome and consider when it comes to textual data. Let’s cut to the chase and see why you should use LLMs instead.
Confident AI: The LLM Evaluation Platform
The all-in-one platform to evaluate and test LLM applications on the cloud, fully integrated with DeepEval.
Generating synthetic data with LLMs
Like it or hate it, large language models (LLMs) like GPT-4 has democratized textual synthetic data. Let’s say I want to generate some queries related to the topic of synthetic data. All I have to do is use either ChatGPT or OpenAI’s API to generate a set of tweets. For example, here’s how you can do it in python (note: I’m using GPT-3.5):
Here’s a sample output:
While the generated data is quite varied, it may not accurately reflect real-world conditions, making it less useful for certain applications. Fortunately, by carefully crafting the input prompts, we can improve the authenticity of the synthetic data.
Using Dynamic Prompts Templates to make Synthetic Data Realistic
The pervasive problem with synthetic data generation is there’s often a mismatch between the generative distribution and the distribution of real-world data.
However, due to the versatile and adaptable nature of LLMs, we can easily ground generated data by dynamically changing the prompt (basically string interpolation!). For example, you might want to wrap the OpenAI API call in a function instead and make it accept additional context as parameters :
Here’s a sample output (don’t forget we’re using GPT-3.5!):
As you can see, the output is much morerelevant and significantly improved compared to previous iterations without dynamic prompting.
Conclusion
In this article, we explored ways to contextualize synthetic data effectively. LLMs like GPT-3.5 can offer a simple yet powerful way of generating data through some careful prompt designing.
Stay tuned for our Part 2 guide on diversifying your synthetic data set!
Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an “aha!” moment, who knows?
Confident AI: The LLM Evaluation Platform
The all-in-one platform to evaluate and test LLM applications on the cloud, fully integrated with DeepEval.