The Ultimate Guide to Fine-Tune LLaMA 3, With LLM Evaluations

Got Red? Safeguard LLM Systems Today with Confident AI

The leading platform to red-team LLM applications for your organization, powered by DeepTeam.

Tailored frameworks (e.g. OWASP Top 10)

10+ LLM guardrails to guard malicious I/O

40+ plug-and-play vulnerabilities and 10+ attacks

Guardrails accuracy and latency reporting

Publicly sharable risk assessments.

On-demand custom guards available.

Common Pitfalls in Fine-Tuning

Poor Training Data

The previous statement on RLHF sheds light on a very important point: The quality of the training dataset is the most important factor when it comes to fine-tuning. In fact, the LIMA paper shown that fine-tuning a 65B LLaMA (1) on 1,000 high-quality samples can outperform OpenAI’s DaVinci003.

Consider another example, which is a fine-tuned gpt-3.5-turbo on 140k slack messages:

It’s pretty hilarious, but maybe only because it’s not coming from my LLM.

Using The Wrong Prompt Template

This actually only matters if you’re using a specific models that was trained on a specific prompt template, such as LLaMA-2’s chat models. In a nutshell, Meta used the following template when training the LLaMA-2 chat models, and you’ll ideally need to have your training data in this format.

[s][INST] [[SYS]]
System prompt
[[/SYS]]

User prompt [/INST] Model answer [/s]

For these reasons, we’ll be using the mlabonne/guanaco-llama2–1k dataset for fine-tuning. It is a set of 1000 high quality instruction-response dataset (derived from the timdettmers/openassistant-guanaco dataset) in LLaMA-2's prompt template reformat.

A Step-By-Step Guide to Fine-Tune LLaMA-3

Step 1 — Installation

To begin, create a new Google Colab notebook. That’s right, we’ll be doing everything in a Colab notebook.

Then, install and import the required libraries:

!pip install transformers peft bitsandbytes trl deepeval

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from peft import LoraConfig
from trl import SFTTrainer

Here, we’re using libraries from the Hugging Face and Confident AI ecosystem:

transformers: to load models, tokenizers, etc.
peft: to perform parameter efficient fine-tuning
bitsandbytes: to setup 4-bit quantization
trl: for supervised fine-tuning
deepeval: to evaluate the fine-tuned LLM

Step 2— Quantization Setup

To optimize Colab RAM usage during LLaMA-3 8B fine-tuning, we use QLoRA (quantized low-rank approximation). Here’s a breakdown of its key principles:

4-Bit Quantization: QLoRA compresses the pre-trained LLaMA-3 8B model by representing weights with only 4 bits (as opposed to standard 32-bit floating-point). This significantly shrinks the model’s memory footprint.
Frozen Pre-trained Model: After quantization, the vast majority of LLaMA-3’s parameters are frozen. This prevents direct updates to the core model during fine-tuning.
Low-Rank Adapters: QLoRA introduces lightweight, trainable adapter layers into the model’s architecture. These adapters capture task-specific knowledge without drastically increasing the number of parameters.
Gradient-Based Fine-tuning: During the fine-tuning process, gradients flow through the frozen 4-bit quantized model but are used to update solely the parameters within the low-rank adapters. This isolated optimization greatly reduces computational overhead.

Here’s a visual representation of QLoRA from the original paper.

As an implementation, we can take advantage of the bitsandbytes library:

...

#################################
### Setup Quantization Config ###
#################################
compute_dtype = getattr(torch, "float16")
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

Confident AI: The DeepEval LLM Evaluation Platform

The leading platform to evaluate and test LLM applications on the cloud, native to DeepEval.

Regression test and evaluate LLM apps.

Easily A|B test prompts and models.

Edit and manage datasets on the cloud.

LLM observability with online evals.

Publicly sharable testing reports.

Automated human feedback collection.

Got Red? Safeguard LLM Systems Today with Confident AI

The leading platform to red-team LLM applications for your organization, powered by DeepTeam.

Tailored frameworks (e.g. OWASP Top 10)

10+ LLM guardrails to guard malicious I/O

40+ plug-and-play vulnerabilities and 10+ attacks

Guardrails accuracy and latency reporting

Publicly sharable risk assessments.

On-demand custom guards available.

Step 3 — Load LLaMA-3 with QLoRA Configuration

This step is pretty straightforward. We will simply load the LLaMA-3 8B model from Hugging Face.

Note that although LLaMA-3 is open-source and available on Hugging Face, you’ll have to send a request to Meta to gain access which typically takes up to a week.

...

#######################
### Load Base Model ###
#######################
base_model_name = "meta-llama/Meta-Llama-3-8B"
llama_3 = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=quant_config,
    device_map={"": 0}
)

Step 4 — Load Tokenizer

When an LLM reads text, it first has to convert the text to a readable format. This process is known a tokenization, which is carried out by a tokenizer.

Tokenizers are usually designed to work with their respective models. Copy the following code to load the tokenizer for LLaMA-3:

...

######################
### Load Tokenizer ###
######################
tokenizer = AutoTokenizer.from_pretrained(
  base_model_name, 
  trust_remote_code=True
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Step 5 — Load Dataset

As explained in the previous section, we’ll be using the mlabonne/guanaco-llama2–1k dataset for fine-tuning given its high quality data labels and compatibility with LLaMA-3's prompt templates.

...

####################
### Load Dataset ###
####################
train_dataset_name = "mlabonne/guanaco-llama2-1k"
train_dataset = load_dataset(train_dataset_name, split="train")

Step 6 — Load LoRA Configurations for PEFT

I’m not going to go into the detailed differences between QLoRA and LoRA, but LoRA is essentially a less memory-efficient version of QLoRA since it does not use quantization, but may yield slightly higher accuracy. (You can read more about LoRA here.)

In this step, we configure LoRA for parameter efficient fine-tuning (PEFT), which updates a small subset of parameters in contrast to normal fine-tuning, where all model parameters are updated instead.

...

#########################################
### Load LoRA Configurations for PEFT ###
#########################################
peft_config = LoraConfig(
    lora_alpha = 16
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

Step 7 — Set Training Arguments and SFT Parameters

We’re almost there, all that’s left is to set the arguments required for training and the supervised fine-tuning (SFT) parameters for the trainer:

...

##############################
### Set Training Arguments ###
##############################
training_arguments = TrainingArguments(
    output_dir="./tuning_results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant"
)


##########################
### Set SFT Parameters ###
##########################
trainer = SFTTrainer(
    model=llama_3,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

I’m going to spare you from what all these parameters mean, but if you’re interested you can check out Hugging Face’s documentation:

Step 8 — Fine-Tune and Save Model

Run the following code to start fine-tuning:

...

#######################
### Fine-Tune Model ###
#######################
trainer.train()

You should expect the training to last up to an hour. Here’s an image of a wild LLaMA party to keep you entertained in the meantime :)

Once fine-tuning has completed, save your model and tokenizer, and you can start testing the results of your fine-tuned model immediately!

...

##################
### Save Model ###
##################
new_model = "tuned-llama-3-8b"
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

#################
### Try Model ###
#################
prompt = "What is a large language model?"
pipe = pipeline(
  task="text-generation", 
  model=llama_3, 
  tokenizer=tokenizer, 
  max_length=200
)
result = pipe(f"[s][INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Confident AI: The DeepEval LLM Evaluation Platform

The leading platform to evaluate and test LLM applications on the cloud, native to DeepEval.

Regression test and evaluate LLM apps.

Easily A|B test prompts and models.

Edit and manage datasets on the cloud.

LLM observability with online evals.

Publicly sharable testing reports.

Automated human feedback collection.

Got Red? Safeguard LLM Systems Today with Confident AI

The leading platform to red-team LLM applications for your organization, powered by DeepTeam.

Tailored frameworks (e.g. OWASP Top 10)

10+ LLM guardrails to guard malicious I/O

40+ plug-and-play vulnerabilities and 10+ attacks

Guardrails accuracy and latency reporting

Publicly sharable risk assessments.

On-demand custom guards available.

Evaluating a Fine-Tuned LLM with DeepEval

I know what you’re thinking (or at least I hope I do). You’re expecting some plot of loss over the course of fine-tuning on something like tensorboard, but fortunately I’m not going to bore you with this “evaluation” approach. Instead, we’ll be using DeepEval, an open-source LLM evaluation framework for LLMs.

Since we fine-tuned LLaMA-3 8B to essentially make them useful as assistants, we’ll evaluate our model based on 3 metrics: bias, toxicity, and helpfulness. In DeepEval, these metrics are evaluated using LLMs using a mixture of careful prompt engineering and frameworks such as QAG and G-Eval.

For those interested, here is another great read on the rationale behind why we’re using LLMs as evaluators. To begin, set your OpenAI API key and define LLM evaluation metrics:

%env OPENAI_API_KEY=your-openai-api-key

from deepeval.metrics import GEval, BiasMetric, ToxicityMetric
from deepeval.test_case import LLMTestCaseParams

helpfulness_metric = GEval(
    name="Helpfulness",
    criteria="Helpfulness - determine if how helpful the actual output is in response with the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5
)
bias_metric = BiasMetric(threshold=0.5)
toxicity_metric = ToxicityMetric(threshold=0.5)

DeepEval’s metrics returns a score (0–1) and provides a reason for its score. A metric is only considered successful if the computed score passes the threshold (which depending on the metric can either be a maximum or minimum threshold). For those wondering how these metrics are implemented, you can poke around ⭐ DeepEval’s open-source repo ⭐, or learn about everything you need to know about scoring an LLM evaluation metric.

To wrap things up, create a list of inputs you want to evaluate your model’s outputs on by creating test cases using DeepEval's synthetic data generator:

from deepeval.synthesizer import Synthesizer
from deepeval.test_case import LLMTestCase
...

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
		# Generate queries from your documents
    document_paths=['example_1.txt', 'example_2.docx', 'example_3.pdf'],
    max_goldens_per_document=2
)

pipe = pipeline(
  task="text-generation", 
  model=llama_3, 
  tokenizer=tokenizer, 
  max_length=200
)

test_cases = []
for golden in synthesizer.synthetic_goldens:
  actual_output = pipe(f"[s][INST] {input} [/INST]")[0]['generated_text']
  test_case = LLMTestCase(input=golden.input, acutal_output=actual_output)
  test_cases.append(test_case)

We hardcoded the inputs for simplicity, but you get the point. Lastly, create and evaluate your dataset using the LLM evaluation metrics you previously defined:

from deepeval.dataset import EvaluationDataset
...

evaluation_dataset = EvaluationDataset(test_cases=test_cases)
evaluation_dataset.evaluate([bias_metric, helpfulness_metric, toxicity_metric])

And you’re done! Congratulations for making to the end of this tutorial, but with this setup, you’ll be able to add additional metrics and test cases to further evaluate and iterate on your fine-tuned LLaMA-3.

PS. DeepEval also integrates with Hugging Face to allow real-time evaluations during fine-tuning.

Conclusion

In this article, we explored what LLaMA-3 is, why you should fine-tune and what it involves, and things to watch out for when fine-tuning, including using the right dataset, and formatting them to fit the prompt templates the base model was trained in.

We also saw how we can use the Hugging Face ecosystem to seamlessly carry out fine-tuning in a Google Colab notebook, using quantization techniques such as QLoRA. Lastly, we saw how the fine-tuned model can be evaluated using DeepEval ⭐. We’ve done all the hard work for you already, and offers an entire ecosystem for LLM fine-tuning evaluation.

Thank you for reading and as always, till next time.

* * * * *

Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an “aha!” moment, who knows?

Confident AI: The DeepEval LLM Evaluation Platform

The leading platform to evaluate and test LLM applications on the cloud, native to DeepEval.

Regression test and evaluate LLM apps.

Easily A|B test prompts and models.

Edit and manage datasets on the cloud.

LLM observability with online evals.

Publicly sharable testing reports.

Automated human feedback collection.

Got Red? Safeguard LLM Systems Today with Confident AI

The leading platform to red-team LLM applications for your organization, powered by DeepTeam.

Tailored frameworks (e.g. OWASP Top 10)

10+ LLM guardrails to guard malicious I/O

40+ plug-and-play vulnerabilities and 10+ attacks

Guardrail accuracy and latency reporting

Publicly sharable risk assessments.

On-demand custom guards available.