Parameter-Efficient Fine-Tuning (PEFT): The Basics and a Quick Tutorial
Parameter-efficient fine-tuning (PEFT) modifies a subset of parameters in pre-trained neural networks, rather than updating all model parameters. Traditional fine-tuning methods can be computationally intensive, requiring significant resources and storage. PEFT aims to mitigate these challenges, focusing on adjusting a limited number of parameters to achieve similar or better performance with reduced computational overhead.
By targeting only specific layers or components of the model, PEFT can significantly reduce the training time and resources needed. This approach is especially beneficial for training large language models (LLMs): state-of-the-art models have billions or even trillions of parameters and can only be trained with massive amounts of computational power. PEFT makes it possible to fine-tune LLMs with a tiny fraction of the computational power required to train a full foundation model.
In this article:
- PEFT Benefits
- Fine-Tuning vs. PEFT
- Parameter-Efficient Fine-Tuning Techniques
- Tutorial: Training an Open Source LLM with LoRA
PEFT Benefits {#peft-benefits}
Parameter-efficient fine-tuning (PEFT) offers substantial benefits, particularly in terms of computational efficiency and resource savings. One of the most significant advantages is the dramatic reduction in memory and storage requirements, making PEFT suitable for a wide range of use cases, especially on consumer hardware.
PEFT methods can often reduce the memory required for LLM fine-tuning to roughly a third of what full fine-tuning demands. PEFT also delivers substantial storage savings: instead of storing full model copies that can be several gigabytes in size, PEFT techniques often produce much smaller checkpoints, sometimes just a few megabytes, without sacrificing performance.
PEFT’s efficiency doesn’t compromise model performance. In many cases, models fine-tuned with PEFT deliver performance comparable to fully fine-tuned models, and in some cases even superior to traditional fine-tuning.
Quantization, another technique that reduces memory requirements by lowering data precision, can also be combined with PEFT to further optimize model training and deployment, with only a negligible impact on model performance.
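As an illustration, the sketch below shows how a 4-bit quantized model (loaded through the bitsandbytes integration in transformers) can be combined with a LoRA adapter via the peft library. This is a minimal sketch, not a prescribed recipe: the model name and hyperparameters are only examples, and it assumes a CUDA-capable GPU with the bitsandbytes package installed.

import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
from peft import get_peft_model, prepare_model_for_kbit_training, LoraConfig, TaskType

# Load the base model in 4-bit precision to cut memory usage (illustrative settings)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large", quantization_config=bnb_config)

# Prepare the quantized model for training and attach a LoRA adapter
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=32, lora_dropout=0.1))
model.print_trainable_parameters()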
Fine-Tuning vs. PEFT {#fine-tuning-vs-peft}
Fine-tuning and parameter-efficient fine-tuning (PEFT) both aim to adapt pre-trained models to specific tasks, but they differ significantly in their approaches and resource requirements.
Fine-tuning involves updating all the parameters of a pre-trained model using new data specific to the task at hand. This process requires a substantial amount of computational power and time, especially for large models. The entire model, including all layers, is retrained, making fine-tuning computationally intensive. It is well-suited for scenarios where there is abundant data and resources, often resulting in high performance on the specific task. However, the extensive modification of the model can lead to overfitting, particularly when the available data is limited.
PEFT, in contrast, focuses on adjusting only a small subset of the model’s parameters. By selectively updating the most crucial parameters, PEFT drastically reduces the computational load and training time. This method is particularly advantageous in low-resource settings or when working with very large models where traditional fine-tuning would be impractical. Although PEFT may not always match the performance of full fine-tuning, it strikes a balance by delivering good performance with significantly less resource usage. Additionally, PEFT is less prone to overfitting due to the minimal changes made to the model.
Parameter-Efficient Fine-Tuning Techniques {#parameter-efficient-fine-tuning-techniques}
1. LoRA
Low-Rank Adaptation (LoRA) is a method for fine-tuning large language models that involves adding small, trainable rank decomposition matrices into the model’s architecture. These matrices are injected into each layer of the transformer, working in parallel with the model’s feed-forward layers. By focusing on only these additional matrices and keeping the original model weights frozen, LoRA significantly reduces the number of trainable parameters.
In practical terms, LoRA can dramatically reduce the number of trainable parameters, often to as little as 0.01% of the full model’s parameter count. It also lowers GPU memory requirements, sometimes to a third of the original requirements, while maintaining or even improving model performance on various tasks. LoRA achieves this by adding a low-rank incremental change to the hidden representations.
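The core idea can be written as h = Wx + BAx, where W is the frozen pre-trained weight and B and A are the small trainable matrices. The following plain PyTorch sketch illustrates this update; the dimensions and initialization are illustrative and not tied to any particular model.

import torch

d, r = 1024, 8                       # hidden size and low rank (r much smaller than d); illustrative values
W = torch.randn(d, d)                # frozen pre-trained weight matrix
A = torch.randn(r, d) * 0.01         # trainable down-projection (small random init)
B = torch.zeros(d, r)                # trainable up-projection (zero init, so the update starts at zero)
x = torch.randn(d)

h = W @ x + B @ (A @ x)              # pre-trained output plus the low-rank incremental change (B A) x

Only A and B (2 * d * r values) are trained, instead of the d * d values in W.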
2. Adapter
Adapters are small neural network modules added to pre-trained language models to modify their hidden representations during fine-tuning. These modules are inserted after the multi-head attention and feed-forward layers within the transformer’s architecture. Unlike traditional fine-tuning, which updates all model parameters, adapters focus solely on updating their own parameters, leaving the rest of the model frozen.
The adapter module comprises two feed-forward layers connected by a non-linear activation function. The first layer reduces the dimensionality of the input, which is the hidden representation from the preceding layer, before passing it through an activation function. The second layer then projects this output back to the original dimensionality. This process creates an incremental change that, when added to the original hidden representation via a skip connection, modifies the model’s output.
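To make the structure concrete, here is a minimal PyTorch sketch of a bottleneck adapter module; the hidden and bottleneck dimensions, and the choice of GELU as the non-linearity, are illustrative assumptions.

import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, apply a non-linearity, up-project, then add a skip connection."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):   # sizes are illustrative
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)    # reduce dimensionality
        self.act = nn.GELU()                                 # non-linear activation
        self.up = nn.Linear(bottleneck_dim, hidden_dim)      # project back to the original dimensionality

    def forward(self, hidden_states):
        # Incremental change added back to the original representation via a skip connection
        return hidden_states + self.up(self.act(self.down(hidden_states)))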
3. Prefix Tuning
Prefix tuning is a technique for fine-tuning large language models, particularly for text generation tasks. Instead of adjusting all model parameters, prefix tuning optimizes a small set of continuous vectors known as the “prefix.” The prefix acts like a sequence of virtual tokens that guides the model’s attention and output during the generation process.
The prefix is prepended to the input sequence, and the model then processes this extended sequence through its transformer layers. Only the prefix parameters are updated during fine-tuning, while the rest of the model remains fixed. This approach drastically reduces the number of parameters that need to be trained, which not only saves computational resources but also enhances the model’s ability to generalize from limited data.
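In practice, the peft library offers a PrefixTuningConfig that manages the prefix parameters. The sketch below shows one plausible setup; the model name and num_virtual_tokens value are chosen purely for illustration.

from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model, PrefixTuningConfig, TaskType

base = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large")
prefix_config = PrefixTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=20      # length of the learned prefix; illustrative value
)
model = get_peft_model(base, prefix_config)
model.print_trainable_parameters()   # only the prefix parameters are trainable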
4. P-Tuning
P-tuning enhances the performance of language models on natural language understanding (NLU) tasks by employing continuous prompt embeddings. Unlike standard prompt engineering, which involves manually crafting prompts, P-tuning automatically optimizes these embeddings during training. This technique is especially effective for models like GPTs, which traditionally struggle with certain NLU tasks.
P-tuning operates by modifying the input embeddings, adjusting them based on downstream task requirements. The optimized prompts are then used to guide the model’s predictions, improving accuracy and generalization without the need for extensive manual prompt design. This method has shown significant performance gains on benchmarks like SuperGLUE, making it a valuable approach for refining language models.
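The peft library implements P-tuning through its PromptEncoderConfig. The sketch below shows one possible configuration for a classification-style NLU task; the base model and hyperparameters are assumptions chosen for illustration only.

from transformers import AutoModelForSequenceClassification
from peft import get_peft_model, PromptEncoderConfig, TaskType

base = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
ptuning_config = PromptEncoderConfig(
    task_type=TaskType.SEQ_CLS,
    num_virtual_tokens=20,       # number of continuous prompt embeddings; illustrative
    encoder_hidden_size=128      # hidden size of the prompt encoder; illustrative
)
model = get_peft_model(base, ptuning_config)
model.print_trainable_parameters()   # only the prompt encoder parameters are trainable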
5. Infused Adapter (IA3)
Infused Adapter by Inhibiting and Amplifying Inner Activations (IA3) is a PEFT technique that optimizes the fine-tuning process by focusing on rescaling the inner activations of a pre-trained model. IA3 introduces learned vectors into the model’s attention and feed-forward layers, which are the only components updated during fine-tuning. The original model weights remain frozen, which significantly reduces the number of trainable parameters.
IA3 maintains the model’s performance while enhancing fine-tuning efficiency. Additionally, IA3 does not introduce any additional inference latency, ensuring that models fine-tuned with this technique remain practical for real-time applications.
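A minimal setup with the peft library’s IA3Config might look like the following sketch, assuming peft’s built-in default target modules for the chosen architecture; the model name is illustrative.

from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model, IA3Config, TaskType

base = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large")
ia3_config = IA3Config(task_type=TaskType.SEQ_2_SEQ_LM)   # peft selects default target modules for known architectures
model = get_peft_model(base, ia3_config)
model.print_trainable_parameters()   # the learned rescaling vectors are the only trainable parameters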
Tutorial: Training an Open Source LLM with LoRA {#tutorial-training-an-open-source-llm-with-lora}
This tutorial was adapted from a blog by Hugging Face. We’ll show the process of fine-tuning a simple LLM, the bigscience/mt0-large model, using the LoRA technique.
1. Setting Up the Environment
First, you need to set up your environment by importing the necessary libraries. We will be using the transformers library by Hugging Face and the peft library to apply the PEFT methods.
from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model, LoraConfig, TaskType
2. Loading the Pre-Trained Model and Tokenizer
Next, load the pre-trained model and tokenizer:
model_name_or_path = "bigscience/mt0-large"
tokenizer_name_or_path = "bigscience/mt0-large"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
3. Configuring LoRA for PEFT
Now, configure the LoRA settings. This involves specifying the task type (sequence-to-sequence in this case) and other LoRA-specific parameters such as the rank (r), lora_alpha, and lora_dropout.
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1
)
4. Applying the PEFT Configuration to the Model
With the LoRA configuration ready, wrap the base model with the PEFT model using get_peft_model. This step integrates the LoRA method into the model’s architecture.
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
The output shows that only a small fraction of the model’s parameters (roughly 0.19%) are trainable, highlighting the efficiency of PEFT in this case.
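For the mt0-large example, the printed summary looks roughly like the line below; the exact counts and formatting depend on the model and the peft version installed.

trainable params: 2359296 || all params: 1231940608 || trainable%: 0.19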
5. Training the Model
The training loop remains the same as with traditional fine-tuning. You can proceed with your standard training process, applying optimizers and loss functions as usual. The difference here is that only the LoRA-specific parameters will be updated during training, significantly reducing the computational load.
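For reference, a minimal PyTorch training loop might look like the following sketch. It assumes train_dataset is already tokenized and yields input_ids, attention_mask, and labels, and that a CUDA GPU is available; the hyperparameters are illustrative.

import torch
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)   # only the LoRA parameters have requires_grad=True
model = model.to("cuda")
model.train()

for epoch in range(3):
    for batch in train_loader:
        batch = {k: v.to("cuda") for k, v in batch.items()}
        outputs = model(**batch)          # the model returns the loss when labels are provided
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()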
6. Saving the Fine-Tuned Model
Once training is complete, save the fine-tuned model. Only the LoRA weights will be saved, which are much smaller in size compared to saving the entire model.
model.save_pretrained("output_dir")
7. Loading the Model for Inference
To load the fine-tuned model for inference, you will need to reload both the base model and the LoRA-adapted weights.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel, PeftConfig
peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id)
8. Running Inference
Finally, move the model to the appropriate device (e.g., GPU) and prepare it for inference. You can then input data into the model and generate outputs as needed.
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = model.to("cuda")
model.eval()
inputs = tokenizer(
    "Tweet text: @HondaCustSvc Your customer service has been horrible during the recall process. I will never purchase a Honda again. Label:",
    return_tensors="pt"
)
with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0])
# Expected output: 'complaint'
This concludes the tutorial on fine-tuning LLMs using PEFT with LoRA. By following these steps, you can efficiently adapt large models to specific tasks while significantly reducing resource requirements.
Building Fine-Tuned LLMs with Acorn
To get started building your LLM applications, check out GPTScript: Build AI assistants that interact with your systems, Acorn’s framework that allows LLMs to operate and interact with various systems using natural language prompts.