How to Fine-Tune Open Source LLMs for My Specific Purpose

Anson Park

10 min read

Dec 20, 2023

The latest LLaMA models, known as Llama2, come with a commercial license, making them accessible to a broader range of organizations. Moreover, new methods enable fine-tuning on consumer GPUs with limited memory.

This democratization of AI technology is essential for its widespread adoption. By removing entry barriers, it allows even small businesses to develop tailored models that align with their specific requirements and financial constraints.


Fine-tuning an open-source Large Language Model (LLM) like Llama2 typically involves the following steps:

  1. Understanding Llama2: Llama2 is a family of second-generation open-source LLMs from Meta, designed for a wide range of natural language processing tasks.

  2. Setting Up: Install necessary libraries such as accelerate, peft, bitsandbytes, transformers, and trl.

  3. Model Configuration: Choose a base model from Hugging Face, such as NousResearch’s LLaMA-2-7b-chat-hf, and select a suitable dataset for fine-tuning.

  4. Loading Dataset and Model: Load the chosen dataset and the Llama2 model using 4-bit precision for efficient training.

  5. Quantization Configuration: Apply 4-bit quantization via QLoRA, which involves adding trainable Low-Rank Adapter layers to the model.

  6. Training Parameters: Configure various hyperparameters like batch size, learning rate, and training epochs using TrainingArguments.

  7. Model Fine-Tuning: Use the SFT Trainer from the TRL library for Supervised Fine-Tuning (SFT), providing the model, dataset, PEFT configuration, and tokenizer.

  8. Saving and Evaluation: After training, save the model and tokenizer, and evaluate the model's performance.

This process requires a balance of technical knowledge in machine learning, programming skills, and understanding of the specific requirements of the task at hand.

Let's take a closer look. The fine-tuning example below comes from a tutorial on brev.dev. Great work 👍🏻👍🏻


Data Preparation

Data is crucial! For your custom LLM to align with its intended use, you need to carefully prepare fine-tuning data tailored to the characteristics of your business, product, or service. This process is vital yet challenging. Fine-tuning data can be viewed as pairs of inputs and outputs.

{"input": "How do you make a cup of coffee?", "output": "You brew it using ground coffee beans and hot water."}
{"input": "What's the fastest way to learn a new language?", "output": "Immersive practice and consistent study."}


Install the necessary packages

Install the necessary packages as follows. The open-source ecosystem for LLM utilization is evolving rapidly.

!pip install --upgrade pip
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U datasets wandb


Hugging Face Accelerate

Hugging Face's Accelerate library is a tool designed to simplify and optimize the training and usage of PyTorch models across various hardware configurations, including multi-GPU, TPU, and mixed-precision setups.

from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
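Note that the accelerator object isn't referenced again in the snippets below. For the FSDP plugin to actually take effect, you would typically wrap the model once it has been loaded, roughly like this (a sketch that assumes the model variable from the later sections):

model = accelerator.prepare_model(model)  # applies the FSDP sharding configured above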


Weights & Biases for tracking metrics

Weights & Biases (W&B) is an AI developer platform designed to enhance the efficiency and effectiveness of building machine learning models.

import wandb, os
wandb.login()

wandb_project = "deepnatural-finetune"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project


Load the prepared dataset

Load the prepared fine-tuning data.

from datasets import load_dataset

train_dataset = load_dataset('json', data_files='dataset.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='dataset_validation.jsonl', split='train')


Load Base Model

Load the base model of your choice. I have selected mistralai/Mixtral-8x7B-v0.1.

The Mixtral-8x7B LLM, a part of the Mistral AI series, is a significant advancement in generative language models. Here are some key details about the model:

  1. Architecture and Performance: Mixtral-8x7B is a pretrained generative model, specifically a Sparse Mixture of Experts. It has been tested and shown to outperform the Llama2 70B model on most benchmarks.

  2. Compatibility and Usage: The repository provides weights compatible with both vLLM serving and the Hugging Face Transformers library.

  3. Optimizations and Precision: By default, the Transformers library loads the model in full precision. However, optimizations available in the Hugging Face ecosystem can further reduce memory requirements for running the model.

  4. Model Specifications: The Mixtral-8x7B model is a pretrained base model without any moderation mechanisms, and it features a substantial size of 46.7 billion parameters.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "mistralai/Mixtral-8x7B-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    use_auth_token=True
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True)


Prompt Template & Tokenization

Prepare the tokenizer. After reviewing the length distribution of samples in your dataset, set max_length accordingly (a short sketch for checking this follows the code below). The prompt format can be tailored to the specific objectives of your LLM.

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token

# After examining the length distribution of samples in the dataset,
# it's important to set an appropriate value accordingly.
max_length = 512

# Here is a typical prompt template.
# You can design the format of the prompt according to the purpose of
# using the LLM.
def formatting_func(example):
    text = f"### Question: {example['input']}\n ### Answer: {example['output']}"
    return text

def generate_and_tokenize_prompt(prompt):
    result = tokenizer(
        formatting_func(prompt),
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)
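To choose a good max_length, tokenize the untruncated prompts once and inspect the resulting length distribution. Here is a quick sketch using the tokenizer and train_dataset defined above:

# Tokenize without truncation or padding to measure true prompt lengths.
lengths = sorted(
    len(tokenizer(formatting_func(example))["input_ids"])
    for example in train_dataset
)

# A max_length near the 95th percentile usually covers most samples
# without wasting memory on padding.
print(f"max: {lengths[-1]}, p95: {lengths[int(0.95 * len(lengths))]}")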


Set Up LoRA

In language model development, fine-tuning an existing model for a specific task on task-specific data is standard practice. This might entail adding a task-specific head and updating the network's weights through backpropagation, unlike training from scratch, where weights start randomly initialized. Full fine-tuning, which updates all layers, yields the best results but is resource-heavy.

Parameter-efficient methods like Low-Rank Adaptation (LoRA) offer a compelling alternative, sometimes outperforming full fine-tuning by avoiding catastrophic forgetting. LoRA trains two smaller matrices that approximate an update to the pre-trained model's larger weight matrix.

QLoRA, a more memory-efficient variant, uses quantized 4-bit weights for the pretrained model.

Both LoRA and QLoRA are part of the Hugging Face PEFT library, and their combination with the TRL library streamlines the process of supervised fine-tuning.

from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model

base_model.gradient_checkpointing_enable()  # trade compute for memory during backprop
model = prepare_model_for_kbit_training(base_model)

config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "w1",
        "w2",
        "w3",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # PEFT helper: reports trainable vs. total parameters


Low-Rank Adaptation (LoRA) is an efficient technique used for fine-tuning Large Language Models (LLMs). Here's a detailed look at LoRA:

  1. Concept and Application: LoRA is a method for fine-tuning LLMs in a more efficient manner compared to traditional fine-tuning methods. It addresses the challenge of adjusting millions or billions of parameters in LLMs, which is computationally intensive. LoRA proposes adding and adjusting a small set of new parameters, instead of altering all the parameters of the model.

  2. Mechanism: Practically, LoRA involves introducing two smaller matrices A and B in place of adjusting the entire weight matrix W of an LLM layer. For example, for a layer with a 40,000 x 10,000 weight matrix, LoRA would use A: 40,000 x 2 and B: 2 x 10,000, resulting in a significant reduction in the number of parameters to tune. This approach simplifies the fine-tuning process by focusing only on A and B. The updated weight matrix is represented as W' = W + A * B, introducing a low-rank structure that captures the necessary changes for the new task and reduces the training parameters.

  3. Advantages: LoRA's efficiency stems from training only a small number of parameters, making it faster and less memory-intensive than conventional fine-tuning. Its flexibility allows for selective adaptation of different model parts, such as specific layers or components, offering a modular approach to model customization.

LoRA stands out as a 'smart hack' in machine learning for updating large models efficiently by focusing on a manageable set of changes, streamlining the process through linear algebra and matrix operations.
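To make the arithmetic above concrete, here is a small NumPy sketch of the update W' = W + A * B, using the toy dimensions from the text (the tiny 8 x 6 demo matrix is only an illustration):

import numpy as np

# Toy dimensions from the text: a 40,000 x 10,000 weight matrix with rank r = 2.
d_out, d_in, r = 40_000, 10_000, 2
full_params = d_out * d_in          # 400,000,000 parameters in W
lora_params = d_out * r + r * d_in  # 80,000 + 20,000 = 100,000 in A and B
print(f"full: {full_params:,}  LoRA: {lora_params:,}  reduction: {full_params // lora_params}x")

# The update itself, on a tiny matrix so it runs cheaply:
W = np.random.randn(8, 6).astype(np.float32)  # frozen pretrained weights
A = np.random.randn(8, r).astype(np.float32)  # trainable
B = np.zeros((r, 6), dtype=np.float32)        # trainable; zero-initialized so W' = W at the start
W_prime = W + A @ B                           # W' = W + A * B
assert np.allclose(W_prime, W)                # no change until B receives gradient updates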


Start Training!

Let's start the training now!

if torch.cuda.device_count() > 1: # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True

import transformers
from datetime import datetime

project = "deepnaturalfinetune"
base_model_name = "mixtral"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=1,
        gradient_checkpointing=True,
        max_steps=300,
        learning_rate=2.5e-5,
        fp16=True, 
        optim="paged_adamw_8bit",
        logging_steps=25,
        logging_dir="./logs",
        save_strategy="steps",
        save_steps=25,
        evaluation_strategy="steps",
        eval_steps=25,
        do_eval=True,
        report_to="wandb",
        run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False
trainer.train()
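Step 8 of the outline above, saving and evaluation, isn't shown in the tutorial code. A minimal sketch might look like the following; the save path and the sample prompt are placeholders:

# Save the trained LoRA adapter and the tokenizer (the path is a placeholder).
trainer.save_model(f"{output_dir}/final")     # writes the PEFT adapter weights, not full model weights
tokenizer.save_pretrained(f"{output_dir}/final")

# Quick sanity check: generate from the fine-tuned model.
model.config.use_cache = True  # re-enable the KV cache for generation
prompt = "### Question: How do you make a cup of coffee?\n ### Answer: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))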


In this post, we've explored the overall process and concepts of how to fine-tune open-source LLMs.

In practice, fine-tuning doesn't end with just one attempt. It involves preparing various types of datasets, setting different combinations of hyperparameters, fine-tuning numerous model versions, and comparing their performances. Ultimately, this process helps in selecting the most suitable model for production deployment. It's indeed a long and challenging journey.

At DeepNatural, we plan to offer a tool through LangNode that enables even those without programming skills to fine-tune open-source LLMs. Stay tuned for more updates!



Written by Anson Park

CEO of DeepNatural. MSc in Computer Science from KAIST & TU Berlin. Specialized in Machine Learning and Natural Language Processing.