Fine-Tuning Llama 2 with Hugging Face PEFT Library

What Is LLaMA 2?

LLaMA 2, introduced by Meta in 2023, is a family of open-source large language models (LLMs). It includes models with 7 billion, 13 billion, or 70 billion parameters. The number of parameters in an LLM strongly influences the model’s capacity to learn from data and the quality of the responses it can generate.

LLaMA 2 is trained on a dataset comprising 2 trillion tokens, offering a context length of 4,096 tokens—double that of its predecessor, LLaMA 1. Context length is vital for understanding and generating coherent and contextually relevant responses.

LLaMA 2 includes models fine-tuned for specific applications. LLaMA Chat is optimized for dialogue use cases, trained with over 1 million human annotations to improve conversational abilities. Another variant, Code LLaMA, is geared toward code generation across multiple programming languages such as Python, Java, and C++, and was trained on 500 billion tokens of code.

**Note:** In April 2024, Meta released a more capable open-source model, LLaMA 3.

This is part of a series of articles about fine-tuning LLMs.

What Is LLaMA 2 Fine-Tuning?

LLaMA 2 fine-tuning involves adjusting the model’s parameters specifically to perform better on a given dataset or task. This process takes the pre-trained LLaMA 2 model and continues its training on a smaller, more specialized dataset. This additional training phase allows the model to adapt its responses and predictions to the specific requirements of the target task or domain.

The fine-tuning process leverages the general understanding of language that LLaMA 2 has developed during its initial training phase, covering multiple topics and text styles. By further training it on a focused dataset, fine-tuning enhances the model’s ability to grasp and generate content for a specific use case or industry.

Key Concepts in LLM Fine-Tuning

Supervised Fine-Tuning (SFT)

SFT involves refining a pre-trained language model on a specific, smaller dataset under human supervision. This technique aims to tailor the broad knowledge of the model to particular tasks or domains.

For example, to specialize LLaMA 2 for medical data analysis, it would undergo SFT using medical texts and patient records as training data. The model learns from examples that include both input (e.g., questions or statements) and their corresponding outputs (answers or continuations), enhancing its ability to generate accurate and relevant responses.

Key parameters such as learning rate, batch size, and number of training epochs are adjusted during SFT to prevent overfitting on the task-specific dataset. Overfitting could diminish the model’s performance on more general tasks by making it too focused on the specifics of the training data. Evaluation metrics like accuracy or F1 score measure the model’s effectiveness.
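
To make the input/output structure concrete, here is a minimal sketch of turning one supervised example into a single training string; the field names and the medical Q&A record are purely illustrative, not from a real dataset:

```
def format_example(example):
    # Combine the input (a question) and the expected output (an answer)
    # into one text string that the model is trained to complete.
    return (
        f"### Question:\n{example['question']}\n\n"
        f"### Answer:\n{example['answer']}"
    )

sample = {
    "question": "What are common symptoms of iron deficiency?",
    "answer": "Typical symptoms include fatigue, pale skin, and shortness of breath.",
}
print(format_example(sample))
```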

Reinforcement Learning from Human Feedback (RLHF)

RLHF is used to refine the responses of language models such as LLaMA 2. It relies on human evaluators to interact with the model by providing inputs and assessing the outputs. Their feedback lets the model learn which types of responses are more appropriate or accurate in given contexts.

Through this iterative process, RLHF enables the model to align its outputs more closely with human expectations, enhancing its ability to generate context-sensitive responses. It is particularly effective for tasks that require a nuanced understanding of human language and preferences, such as conversation generation or creative writing.
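
The feedback loop can be summarized schematically. In the toy sketch below, every function is a hypothetical placeholder—a real pipeline would use a trained reward model and a reinforcement learning algorithm such as PPO (for example via the TRL library)—so the code only illustrates the generate–score–update cycle:

```
import random

def generate_response(prompt):
    # Placeholder for sampling a response from the current model (policy).
    return f"(model response to: {prompt})"

def score_response(prompt, response):
    # Placeholder for a human rater or a learned reward model assigning a score.
    return random.uniform(0.0, 1.0)

def update_policy(prompt, response, reward):
    # Placeholder for the RL update (e.g. PPO) that reinforces highly rated responses.
    print(f"reward={reward:.2f} -> reinforce responses like this one")

prompts = ["Explain RLHF in one sentence.", "Write a polite meeting reminder."]
for prompt in prompts:
    response = generate_response(prompt)
    reward = score_response(prompt, response)
    update_policy(prompt, response, reward)
```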

Prompt Template

Prompt templates guide the model to produce outputs in a specific format or style. This is useful for tasks that require outputs to follow a certain structure. A prompt template typically consists of a fixed portion that outlines the desired output’s format and a variable portion where the model inserts information based on the given input.

For example, in generating weather forecasts, a template might start with “The weather forecast for [location] on [date] is:”, which the model then completes with forecast details. The efficiency of prompt templates largely depends on their design and how effectively they can elicit the intended response from the model.
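
A minimal sketch of that idea in code, using the weather example above (the template string and inputs are illustrative):

```
# Fixed portion: the template text. Variable portion: the {location} and {date} slots.
template = "The weather forecast for {location} on {date} is:"

def build_prompt(location, date):
    return template.format(location=location, date=date)

prompt = build_prompt("Berlin", "2024-06-01")
print(prompt)  # The model then completes this prompt with the forecast details.
```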

Parameter-Efficient Fine-Tuning (PEFT) with LoRA or QLoRA

PEFT allows for the customization of large language models without adjusting all model parameters; instead, only a small subset of parameters (or a small set of added parameters) is modified. PEFT uses techniques such as LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) to adapt the model to specific tasks or domains while minimizing resource usage.

LoRA alters only the weights of certain layers by learning small low-rank matrices that adjust those weights during the model’s forward pass; the original weights stay frozen, which keeps the number of trainable parameters small and simplifies deployment. QLoRA extends this concept by quantizing the base model’s parameters, lowering their precision while maintaining overall performance. This quantization reduces the model’s memory footprint and computational requirements.
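
Conceptually, LoRA keeps a pre-trained weight matrix W frozen and learns two small matrices A and B whose product forms a low-rank update, so the layer’s effective weight becomes W + (alpha / r) · B · A. The PyTorch sketch below illustrates only this idea; it is not the actual PEFT implementation:

```
import torch

d, r, alpha = 4096, 8, 8      # hidden size, LoRA rank, scaling factor
W = torch.randn(d, d)          # frozen pre-trained weight (not updated during fine-tuning)
A = torch.randn(r, d) * 0.01   # trainable low-rank factor, shape (r, d)
B = torch.zeros(d, r)          # trainable low-rank factor, shape (d, r), initialized to zero

x = torch.randn(1, d)          # one input activation
# Forward pass: frozen path plus the scaled low-rank correction.
y = x @ W.T + (alpha / r) * (x @ A.T @ B.T)
print(y.shape)  # torch.Size([1, 4096])
```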

Tutorial: Fine-Tuning LLaMA 2 with PEFT LoRA

Fine-tuning LLaMA 2 using the Hugging Face PEFT library with LoRA (Low-Rank Adaptation) allows you to customize the model efficiently. Here is a step-by-step guide to get you started.

  1. Ensure you have the necessary libraries installed (bitsandbytes is required for the 4-bit quantization used below, and accelerate for automatic device mapping):

    pip install transformers datasets peft trl bitsandbytes accelerate
  2. Make sure you have a Hugging Face account with access to the gated meta-llama models (access is requested on the model page and approved by Meta). Install the Hugging Face CLI using the command:

    pip install -U "huggingface_hub[cli]"

    Log in to Hugging Face using the login command:

    huggingface-cli login
  3. Use the Hugging Face transformers library to load the pre-trained LLaMA 2 model and tokenizer:

    ```
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
        TrainingArguments,
        logging,
    )
    import torch
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from trl import SFTTrainer

    model_name = "meta-llama/Llama-2-7b-hf"

    # Tokenizer setup
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    compute_dtype = getattr(torch, "float16")

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=False,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",  # Automatically maps layers to available GPUs
    )
    model.config.use_cache = False  # To conserve memory
    model.config.pretraining_tp = 1  # Turn off tensor parallelism for simplified computation
    ```
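
    As an optional sanity check, you can generate a short completion with the freshly loaded model before fine-tuning; the prompt below is just an illustrative placeholder:

    ```
    # Optional: confirm the quantized model loads and generates text.
    prompt = "Fine-tuning a language model means"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    ```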
  4. Load and preprocess your dataset. For this example, we'll use a sample dataset from Hugging Face:

    ```
    from datasets import load_dataset

    dataset_name = "mlabonne/guanaco-llama2-1k"
    dataset = load_dataset(dataset_name, split='train')
    ```
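
    It can be useful to inspect one record before training; this dataset exposes a single "text" field, which is the field the trainer consumes in step 6:

    ```
    # Peek at the dataset size and one formatted training example.
    print(dataset)
    print(dataset[0]["text"][:500])
    ```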
  5. Configure PEFT with LoRA settings:

    ```
    from peft import LoraConfig, get_peft_model

    lora_config = LoraConfig(
        lora_alpha=8,
        lora_dropout=0.5,
        r=8,
        bias="none",
        task_type="CAUSAL_LM",
    )
    peft_model = get_peft_model(model, lora_config)
    ```
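
    To see how little of the model is actually being trained, you can print the trainable parameter count; with r=8 it is only a small fraction of the full 7B parameters:

    ```
    # Only the LoRA adapter weights are trainable; the base model stays frozen.
    peft_model.print_trainable_parameters()
    ```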
  6. Set up the training arguments and start the training process:

    ```
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="no",
        learning_rate=2e-5,
        per_device_train_batch_size=1,
        num_train_epochs=3,
        fp16=True,
        save_total_limit=2,
        logging_steps=1000,
        save_steps=250,
    )

    trainer = SFTTrainer(
        model=peft_model,       # LoRA-wrapped model from the previous step
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=512,     # Set max sequence length
        tokenizer=tokenizer,
        args=training_args,
        packing=False,
    )

    # Clear CUDA cache before training
    torch.cuda.empty_cache()
    trainer.train()
    ```
  7. After training, save your fine-tuned model for future use:

    peft_model.save_pretrained("./fine-tuned-llama2")
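
    To use the adapter later, reload the base model and attach the saved LoRA weights (the path matches the save step above; merging is optional and is usually done with a non-quantized base model):

    ```
    from peft import PeftModel

    # Reload the base model, then attach the fine-tuned LoRA adapter.
    base_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",
    )
    tuned_model = PeftModel.from_pretrained(base_model, "./fine-tuned-llama2")
    # Optionally merge the adapter into the base weights for standalone deployment:
    # tuned_model = tuned_model.merge_and_unload()
    ```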

Building LLM Applications with LLaMA and Acorn

To get started building your LLM applications, check out GPTScript, Acorn’s framework that allows LLMs to operate and interact with various systems using natural language prompts.