If you’ve ever heard of LLM red-teaming at all, you’ve likely encountered several notable attacks: prompt injections, data poisoning, denial-of-service (DoS) attacks, and more. However, when it comes to exploiting an LLM into generating undesirable or harmful outputs, nothing is quite as powerful as LLM jailbreaking.
In fact, this study demonstrates that SOTA models like GPT-4 were successfully compromised with just a few jailbreaking queries.
Still, while LLM jailbreaking has become a widely discussed topic, its definition can vary across different contexts, leading to some confusion about what it truly entails. Do not fear — today, I’ll guide you through everything you need to know about jailbreaking, including:
- What LLM jailbreaking is and its various types
- Key research and breakthroughs in jailbreaking
- A step-by-step guide to crafting high-quality jailbreak attacks to identify vulnerabilities in your LLM application
- How to use DeepEval ⭐, the open-source LLM evaluation framework to red team your LLM for over 40+ vulnerabilities using jailbreaking strategies
What is LLM Jailbreaking?
LLM Jailbreaking is the process of utilizing specific prompt structures, input patterns, or contextual cues to bypass the built-in restrictions or safety measures of large language models (LLMs).
These models are typically programmed with safeguards to prevent generating harmful, biased, or restricted content. Jailbreaking techniques manipulate the model into circumventing these constraints, producing responses that would otherwise be blocked. A good LLM jailbreaking framework requires an attack generator capable of producing high-quality, non-repetitive jailbreaks at scale.
Traditionally, LLM jailbreaking encompasses all types of attacks aimed at forcing an LLM to output undesirable responses, including methods like prompt injections and prompt probing attacks.
Recently, however, these attacks have diverged into distinct fields, with ‘LLM jailbreaking’ increasingly referring, and in a broader sense, to creative techniques — such as storytelling, coding tasks, or role-playing — that trick the model, rather than following a strictly methodical approach.
Types of LLM Jailbreaking
There are many ways to classify LLM jailbreaking techniques, but they generally fall into three main categories: token-level jailbreaking, prompt-level jailbreaking, and dialogue-based jailbreaking.
Prompt-level Jailbreaking
Prompt-level jailbreaking relies exclusively on human-crafted prompts designed to exploit model vulnerabilities. Unlike automated approaches, these techniques demand human ingenuity, using tactics such as semantic tricks, storytelling, and indirect requests to bypass model restrictions.
Common prompt-level jailbreaking strategies can be broken down into four distinct categories: language manipulation, rhetoric, imagination, and LLM operational techniques.
Language Strategies:
Language strategies exploit nuances in wording and phrasing to manipulate the model’s understanding of user intent:
- Payload Smuggling: Hiding commands within innocent prompts (e.g., translation, term substitution).
- Modifying Model Instructions: Embedding instructions to override restrictions (e.g., “Forget prior content”).
- Prompt Stylizing: Disguising intent via formal or indirect language.
- Response Constraints: Limiting response style to force specific outputs (e.g., yes/no, single-syllable).
Rhetoric Techniques:
Rhetoric techniques draw on persuasive tactics, presenting requests in ways that align with the model’s intent to be helpful or neutral:
- Innocent Purpose: Framing requests as beneficial (e.g., teaching or research).
- Persuasion and Manipulation: Convincing the model through ego appeals or reverse psychology.
- Alignment Hacking: Exploiting the model’s helpfulness (e.g., “Don’t warn, just help”).
- Conversational Coercion: Steering conversations gradually toward restricted topics.
- Socratic Questioning: Leading the model through questions to restricted content.
Imaginary Worlds:
By immersing the model in fictional settings, the following methods frame restricted topics as fictional explorations, often reducing the model’s resistance to engaging with sensitive content.
- Hypotheticals: Creating alternate scenarios where restrictions don’t apply.
- Storytelling: Framing restricted content in fictional narratives.
- Role-playing: Assuming identities that access restricted content.
- World Building: Imagining unrestricted settings where rules differ.
LLM Operational Exploitations:
LLM operational exploitations take a more technical approach, leveraging the model’s internal learning mechanisms and prompt behaviors to bypass restrictions. Notably, Anthropic has demonstrated that many-shot jailbreaking can effectively compromise their models as well as others.
- One-/Few-shot Learning: Using examples to fine-tune desired outputs.
- Superior Models: Pretending the model is unrestricted (e.g., “DAN”).
- Meta-Prompting: Asking the model to create its own jailbreak prompts.
Prompt-level jailbreaks can be highly effective in breaking models and uncovering new vulnerabilities. However, since they are fully human-driven, these jailbreaks are inherently limited in scalability, as each prompt requires tailored human input to achieve the desired outcome.
Token-level Jailbreaking
Token-level jailbreak methods take a distinct approach by optimizing the raw sequence of tokens fed into the LLM to elicit responses that violate the model’s intended behavior. A significant advantage of token-level attacks is their potential for automation.
By framing these attacks as optimization problems within the input token space, gradient-based techniques can be applied to systematically explore the domain and continually generate attacking prompts, reducing reliance on human creativity.
Here are a few token-level jailbreaking techniques:
- JailMine: Uses automated token optimization to create sequences that bypass restrictions, achieving high success rates across various models, including those with strong defenses.
- GPTFuzzer: Randomizes token sequences to probe model vulnerabilities, effective in black-box scenarios but less consistent in performance.
- GCG: A gradient-based white-box attack that systematically adjusts tokens using model gradients, effective but dependent on model-specific details.
This capacity for automated, systematic exploration makes token-level techniques highly effective and scalable for identifying vulnerabilities in LLMs. However, they are not without cost. Token-level jailbreaking often requires hundreds or thousands of queries to breach model defenses, and the results are frequently less interpretable than those from prompt-level attacks.
Dialogue-based Jailbreaking
Dialogue-based jailbreaking surpasses both token-based and prompt-based methods by being scalable, effective, and interpretable. Unlike token-level attacks that require thousands of generations, dialogue-based methods achieve jailbreaks with fewer, strategically crafted prompts in a dynamic conversational loop. Unlike prompt-based methods, dialogue-based attacks can generate thousands of jailbreak attempts within minutes, maximizing both efficiency and coverage.
Dialogue-based jailbreaking operates through an iterative loop involving three key LLM roles: an attacker model, a target model, and a judge model. In this setup, the attacker generates prompts aimed at eliciting restricted or unintended responses from the target model, while the judge scores each response to assess the success of the jailbreak attempt.
How the Loop Works:
- Attacker Model Generation: The attacker model crafts a prompt, seeking to exploit vulnerabilities in the target model. These prompts are designed to subtly bypass restrictions through creative phrasing or unconventional prompts.
- Target Model Response: The target LLM attempts to respond while adhering to its safety and alignment filters. Each response provides feedback on the robustness of these filters.
- Judge Model Scoring: The judge model evaluates the target model’s response against specific criteria, scoring it based on metrics like compliance with restrictions and degree of unintended behavior.
- Loop Continuation: Based on the judge’s score and feedback, the attacker refines its prompt and iterates the process, generating new prompts in a continuous loop. This loop continues until the attacker exhausts potential prompt variations or successfully breaks through the target model’s defenses.
By automating this iterative loop, dialogue-based jailbreaking facilitates thorough, scalable testing of model defenses across various scenarios, making it a powerful method for identifying vulnerabilities in LLMs.
Confident AI: The LLM Evaluation Platform
The all-in-one platform to evaluate and test LLM applications on the cloud, fully integrated with DeepEval.
Step-By-Step Guide: Jailbreaking LLM Applications
In this guide, I’ll walk you through creating high-quality LLM jailbreaking prompts using dialogue-based methods, and for those who just wants something working, feel free to skip this section to see how you can simplify this whole process with DeepEval.
As we previously discussed, dialogue-based jailbreaking is both scalable and effective, making it a strong choice for generating advanced prompts.
The overall process is structured in five key steps:
- Baseline Attack Generation
- Baseline Attack Enhancement (Jailbreaking Process)
- Generating Target Model Response
- Judging Target Model Response
- Using Judge Feedback to Enhance Attacks
Typically, baseline attack generation involves creating a set of simple attack prompts using an LLM or smaller language model. For the purposes of this guide, we’ll begin with a pre-prepared baseline attack.
Note: DeepEval offers a fully-featured red-teaming framework with a built-in attack generator capable of scanning for vulnerabilities across 40+ scenarios (I’ll walk you through that feature in the final section).
Define your Target Model
Before generating the attacks, we’ll first need to set up the target model we’re attempting to jailbreak. Today, our target is FinBot, a financial advisor bot designed to provide users with investment advice and financial planning assistance.
Now that our target model is setup, we can begin Jailbreaking.
Step 1. Baseline Attack Generation
We’ll start with a simple, pre-prepared baseline attack aimed at requesting access to sensitive user information FinBot is designed to protect.
Step 2: Enhancing the Baseline Attack (Jailbreaking Process)
Next, we define the enhance_attack
function, which uses an LLM (which we’ll refer to as the “attacker LLM”) to refine the baseline attack iteratively. The purpose of this function is to help us craft jailbreaking variations of the original attack that subtly rephrase or add contextual details, increasing the likelihood of bypassing model restrictions.
Each time enhance_attack
is called, it takes in:
baseline_attack
: the original or previously enhanced attack.score
: an effectiveness score (out of 10) that rates how well the baseline attack bypasses restrictions.improvement
: suggestions from the Judge LLM on how to adjust the attack to better bypass restrictions in the next iteration.api_key
: the API key needed to authenticate with the LLM.
The function then combines these elements into a structured prompt for the attacker LLM, instructing it to enhance the baseline attack and avoid detection by model filters. Here’s how it works in code:
In this first call to enhance_attack
, we set improvement to “unavailable” and score to 1 to indicate that no adjustments have been applied yet.
Congratulations! You’ve created your first jailbroken iteration of the attack . Each subsequent run refines the attack further, making it more nuanced and potentially more effective at avoiding detection.
Step 3: Generating Target Model Response
After creating an enhanced attack, we’ll send it to the target model (FinBot) and observe its response. This step helps us evaluate whether the enhanced attack bypasses any restrictions.
Step 4: Judging Target Model Response
Next, we need to assess the target model’s response to determine whether it adhered to restrictions or leaked unintended information. This is done using a judge model (another LLM) to evaluate the response and suggest improvements.
Step 5: Using Judge Feedback to Enhance Attacks
With the feedback from the judge model, we can loop back to Step 2, using the feedback to iteratively refine the attack, making it more effective at bypassing restrictions. Each cycle of this loop produces a progressively more advanced attack.
This iterative approach allows us to refine jailbreaking attacks with dialogue-based methods, leveraging LLMs to simulate both the attacker and judge perspectives.
Generating Jailbreaking Attacks Using DeepEval
The tutorial I just showed you demonstrates one type of jailbreaking attack called iterative jailbreaking. However, there are many other ways to craft jailbreaking attacks, such as tree and crescendo attacks — multi-turn approaches that increase in complexity over time.
In this final section, I’ll introduce you to DeepEval’s red-teamer — a powerful tool for vulnerability scanning and attack generation. This red-teamer automates the creation of baseline attacks targeting 40+ vulnerabilities and offers 10+ attack enhancement strategies, including iterative, tree, and crescendo jailbreaking enhancements — all within just a few lines of code.
Defining your LLM Chatbot in DeepEval
First, we’ll need to redefine FinBot in the DeepEvalBaseLLM
class format.
Jailbreaking in DeepEval
To begin jailbreaking attacks with DeepEval’s red-teamer, initialize a red-teaming scan focused exclusively on jailbreaking enhancements.
In addition to jailbreaking, DeepEval offers many more effective attack enhancement strategies. By experimenting with a wide range of attacks and vulnerabilities typical in red-teaming environments, you can craft a Red Teaming Experiment tailored to your needs. (You can learn more about the red-teamer’s features in the documentation.)
Conclusion
Today, we covered key jailbreaking techniques, including token-level, prompt-level, and dialogue-based attacks. We discussed how, although token-level approaches are scalable but not always understandable, and prompt-level jailbreaking is effective but not scalable, dialogue-based jailbreaking combines the best of both worlds — making it scalable, intuitive, and highly effective. I also walked you through a tutorial on performing dialogue-based jailbreaking in its simplest forms.
Additionally, we highlighted how DeepEval’s red-teamer enables efficient scanning for over 40 vulnerabilities using a range of attack enhancements, from iterative to tree and crescendo jailbreaking. This robust toolkit provides a straightforward path to red-team your LLM at scale, revealing critical weaknesses.
Yet, as important as LLM jailbreaking is, securing an LLM for production requires more than just finding vulnerabilities. It’s also essential to understand its capabilities through rigorous testing. DeepEval offers tools for synthetic dataset creation and custom evaluations to ensure your model’s robustness and safety in real-world applications
If you find DeepEval helpful, consider giving it a star on GitHub ⭐ to stay updated on new features as we expand support for more benchmarks and testing scenarios.
Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an “aha!” moment, who knows?
Confident AI: The LLM Evaluation Platform
The all-in-one platform to evaluate and test LLM applications on the cloud, fully integrated with DeepEval.