
LoRA, QLoRA & PEFT

LoRA lets you adapt a large model by training only a tiny fraction of its parameters — keeping the base weights frozen and adding small trainable matrices on top. This module covers the mechanics, the quantised variant QLoRA, and what production adapter serving actually looks like.

Layer 1: Surface

Standard fine-tuning updates every weight in the model. For a 7 billion parameter model, that means storing and computing gradients for 7 billion numbers — which requires substantial GPU memory and time. LoRA (Low-Rank Adaptation) sidesteps this by keeping all original weights frozen and injecting small trainable matrices alongside specific layers.

The key insight: the changes needed to adapt a model to a new task have low intrinsic dimensionality. You don’t need to move 7 billion weights; you need to move a much smaller set of parameters in a targeted way.

What LoRA adds to a layer:

Original weight matrix W (frozen):  7680 × 7680 = ~59M parameters
LoRA matrices A and B (trainable):  7680 × 8 + 8 × 7680 = ~123K parameters

During training:  output = W·x + B·A·x
After merging:    output = (W + B·A)·x   ← same as fine-tuning, zero overhead

The number 8 in the example is the rank — the key hyperparameter. Lower rank = fewer trainable parameters = faster training but less capacity to adapt. Typical ranks: 4, 8, 16, 32.
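To see how rank drives the trainable parameter count, here is a tiny sketch using the 7680-wide layer from the example above (illustrative arithmetic only):

d = 7680                              # width of the frozen weight matrix W
frozen = d * d                        # ~59M frozen parameters

for r in (4, 8, 16, 32):
    trainable = d * r + r * d         # A is r x d, B is d x r
    print(f"rank {r:2d}: {trainable:,} trainable params "
          f"({trainable / frozen:.2%} of the frozen layer)")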

Method          | What trains                      | VRAM needed                   | Quality vs. full FT
Full fine-tune  | All weights                      | Very high (2–4× model size)   | Baseline
LoRA            | Low-rank adapter matrices only   | ~1.2× model size              | Close (5–10% gap)
QLoRA           | Low-rank adapters on 4-bit base  | ~0.5× model size              | Slightly below LoRA

QLoRA combines 4-bit quantisation of the base model (to reduce VRAM) with LoRA adapters trained in higher precision. This is what makes fine-tuning a 13B parameter model on a consumer GPU feasible.
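A rough memory sketch shows why; the numbers below are illustrative only, since real usage also depends on batch size, sequence length, optimiser state and activations:

params = 13e9                              # 13B-parameter model

fp16_weights = params * 2 / 1e9            # ~26 GB just to hold fp16 weights
nf4_weights  = params * 0.5 / 1e9          # ~6.5 GB once the base is 4-bit quantised

print(f"fp16 weights alone:   ~{fp16_weights:.0f} GB  (already exceeds a 24 GB GPU)")
print(f"4-bit base (QLoRA):   ~{nf4_weights:.1f} GB, leaving room for adapters, "
      f"gradients and optimiser state")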

Production Gotcha: Running multiple LoRA adapters on a single base model is an emerging production pattern that most serving infrastructure does not support out of the box. Plan for adapter versioning and hot-swap capability from day one, or you will re-architect under pressure later.


Layer 2: Guided

Training a LoRA adapter with PEFT

The Hugging Face PEFT library is the standard interface for LoRA/QLoRA training. The following trains a LoRA adapter on a causal language model for a classification task.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
from datasets import Dataset

# Step 1: Load a quantised base model (QLoRA — 4-bit base + LoRA adapters)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
tokenizer.pad_token = tokenizer.eos_token

# Step 2: Configure the LoRA adapter
lora_config = LoraConfig(
    r=16,                      # rank — the core hyperparameter
    lora_alpha=32,             # scaling factor (usually 2x rank)
    target_modules=[           # which layers to apply LoRA to
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Step 3: Wrap the model — only adapter parameters will train
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 20,971,520 || all params: 7,262,330,880 || trainable%: 0.29
# Less than 0.3% of parameters are being trained

# Step 4: Train with standard Transformers Trainer
from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq

training_args = TrainingArguments(
    output_dir="./adapter-checkpoint",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size = 16
    learning_rate=2e-4,
    bf16=True,                       # match the bf16 compute dtype set in bnb_config
    logging_steps=10,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,     # pre-tokenised Dataset objects, prepared separately
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True),
)

trainer.train()

# Step 5: Save the adapter — only ~40MB for a 7B model
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")

Two serving strategies: merged vs. adapter-on-base

After training, you have a choice.

Option A — Merge the adapter into the base model weights:

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model in full precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load adapter and merge
peft_model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
merged_model = peft_model.merge_and_unload()

# Save the merged model — now a standard model, no LoRA overhead
merged_model.save_pretrained("./merged-model")

After merging, the model is indistinguishable from a standard fine-tuned model. No runtime overhead. But you’ve lost the ability to swap adapters at serving time.

Option B — Serve the base model and load adapters dynamically:

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3", ...)

# Wrap the base model once, then register each task adapter under a name
model = PeftModel.from_pretrained(base_model, "./adapter-support", adapter_name="support")
model.load_adapter("./adapter-legal", adapter_name="legal")
model.load_adapter("./adapter-code", adapter_name="code")

# Activate the adapter that matches the incoming request
model.set_adapter("legal")

This shares the base model VRAM across all adapters. A 7B base model uses ~14GB VRAM; each LoRA adapter adds ~80MB. You can serve dozens of task-specific adapters on a single GPU. This is the multi-adapter pattern — powerful but requires infrastructure that handles adapter routing.

Adapter versioning

Treat adapters like software artefacts: version, test, and roll back independently of the base model.

./adapters/
  support/
    v1/      ← 2026-01-15, training run: run-001
    v2/      ← 2026-02-10, training run: run-007  (current)
  legal/
    v1/      ← 2026-03-01, training run: run-012

When the base model is updated, adapters trained on the old base are not directly compatible — you must re-adapt. This is the base model update lock-in risk: if you’re on a provider’s managed API, adapter compatibility across model versions is your problem to manage.
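One way to reduce that risk in your own serving code is to check which base model an adapter was trained against before loading it. A minimal sketch, assuming the adapter was saved with PEFT (which records base_model_name_or_path in adapter_config.json); the expected_base string is whatever identifier your deployment pins:

from peft import PeftConfig, PeftModel

def load_adapter_checked(base_model, adapter_path: str, expected_base: str):
    # PEFT stores the base model identifier the adapter was trained against
    adapter_cfg = PeftConfig.from_pretrained(adapter_path)
    if adapter_cfg.base_model_name_or_path != expected_base:
        raise ValueError(
            f"{adapter_path} was trained against "
            f"{adapter_cfg.base_model_name_or_path}, not {expected_base}"
        )
    return PeftModel.from_pretrained(base_model, adapter_path)

Note that this only catches a changed model identifier; if a provider updates weights under the same name, you still need to re-validate the adapter against the new base.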


Layer 3: Deep Dive

The linear algebra of low-rank adaptation

A weight matrix in a transformer attention layer is typically large: for a 7B model, q_proj might be W ∈ R^(4096×4096). Fine-tuning updates every element of W. LoRA instead parameterises the update as:

ΔW = B · A

where:
  A ∈ R^(r×d)    — the "down-projection" matrix, initialised from N(0, σ²)
  B ∈ R^(d×r)    — the "up-projection" matrix, initialised to zero
  r << d         — rank is much smaller than the weight dimension

B is initialised to zero so that at the start of training, ΔW = 0 — the adapter contributes nothing and training begins from the original weights. A is initialised randomly to break symmetry.

The adapted forward pass is:

h = W·x + (B·A)·x · (α/r)

The scaling factor α/r (where α is lora_alpha in the config) controls the magnitude of the adapter’s contribution relative to the frozen weights. Setting α = 2r is a common default that keeps adapter contributions at a reasonable scale.
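The forward pass maps directly onto a few lines of PyTorch. This is a minimal sketch of the idea, not the PEFT implementation; initialisation details are simplified (PEFT uses a Kaiming-style init for A):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # freeze W (and bias)
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init, so the update starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        # h = W·x + (B·A)·x · (α/r)
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

Only A and B receive gradients; at rank 8 on a 4096 × 4096 projection that is 2 × 8 × 4096 ≈ 66K trainable parameters instead of roughly 16.8M.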

Why does this work? The hypothesis — supported empirically by Hu et al. (2021) — is that the adaptation task has low intrinsic dimensionality: the model only needs to move through a small subspace of parameter space to acquire the new behaviour. A rank-8 update captures the dominant directions of that subspace. The remaining directions matter less for task adaptation.

Why not just use fewer layers? Adapting fewer layers (or different layers) is an alternative. Empirically, attention projection matrices (q, k, v, o projections) respond better to LoRA adaptation than MLP layers for most instruction-following tasks. This is an area of active research.

QLoRA: quantisation-aware LoRA training

QLoRA (Dettmers et al., 2023) adds three key innovations on top of LoRA:

  1. 4-bit NormalFloat (NF4) quantisation — a data type specifically designed for normally distributed weights. Standard int4 quantisation introduces error at the extremes of the weight distribution; NF4 minimises quantisation error for weights following a normal distribution (which pre-trained transformers typically do).

  2. Double quantisation — quantise the quantisation constants themselves, recovering ~0.37 bits per parameter in memory savings.

  3. Paged optimisers — use unified memory to move optimiser state (the largest memory consumer during training) between GPU and CPU as needed, preventing OOM during training spikes.

The practical result: a 65B parameter model that would normally require ~130GB VRAM for fine-tuning can be fine-tuned on a single 48GB GPU with QLoRA.
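For reference, the first two innovations correspond to the BitsAndBytesConfig flags used in Layer 2, and the paged optimiser is selected through the Trainer's optim argument. A sketch of the relevant settings, assuming bitsandbytes is installed; the values shown are common QLoRA defaults, not the only valid choices:

import torch
from transformers import BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # 1. NormalFloat-4 quantisation of the frozen base
    bnb_4bit_use_double_quant=True,          # 2. double quantisation of the quantisation constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

training_args = TrainingArguments(
    output_dir="./qlora-run",
    optim="paged_adamw_8bit",                # 3. paged optimiser absorbs memory spikes during training
    learning_rate=2e-4,
    bf16=True,
)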

Named failure modes

Rank too low. If the adapter rank is too small relative to the complexity of the adaptation task, the model will converge to a lower accuracy ceiling. Increasing rank helps — at the cost of more trainable parameters. If your training loss plateaus early and accuracy is below expectations, try doubling the rank.

Alpha misconfigured. Getting lora_alpha wrong (e.g., equal to r instead of 2r) effectively halves or doubles the learning rate on the adapter, causing instability or slow convergence. Keep alpha = 2 × r as a default until you have a reason to deviate.

Adapter-base mismatch. Loading an adapter trained on model version A onto model version B fails silently: the adapter matrices have the right shapes, so the load succeeds, but they were trained against different base weights and output quality degrades. Add model version assertions to your adapter loading code.

Merging in the wrong precision. Merging a LoRA adapter onto a 4-bit quantised base model and then saving the result bakes the quantisation error of the 4-bit base into the merged weights. Merge in fp16 or bf16 using the full-precision base, not the quantised one.

Multi-adapter serving without infrastructure support. Standard serving stacks (vLLM, TGI as of early 2026) have limited support for dynamic adapter switching under concurrent load. If you plan to serve dozens of adapters with low latency, validate your serving stack against your concurrency requirements before committing to the pattern.

Further reading


LoRA, QLoRA & PEFT — Check your understanding

Q1

You're fine-tuning a 13B parameter model but your GPU only has 24GB VRAM. Full fine-tuning requires roughly 2× the model size in VRAM. Which approach makes this feasible?

Q2

Your team has trained three LoRA adapters — one for support, one for legal, and one for code generation — all on the same 7B base model. The plan is to serve all three from a single GPU. What is the key advantage of this architecture?

Q3

You update your base model from version 1 to version 2 and load your existing LoRA adapter onto it. The model's outputs are degraded and inconsistent. What is the most likely cause?

Q4

You train a LoRA adapter with rank 8 and lora_alpha 8. Training converges but evaluation accuracy plateaus 8 percentage points below your target. A colleague suggests the issue is the alpha configuration. Why?

Q5

You merge a LoRA adapter onto a 4-bit quantised base model and save the result. Later you notice the merged model produces more errors than the adapter-on-base configuration during validation. What went wrong?