Layer 1: Surface
Standard fine-tuning updates every weight in the model. For a 7-billion-parameter model, that means storing and computing gradients for 7 billion numbers, which requires substantial GPU memory and time. LoRA (Low-Rank Adaptation) sidesteps this by keeping all original weights frozen and injecting small trainable matrices alongside specific layers.
The key insight: the changes needed to adapt a model to a new task have low intrinsic dimensionality. You don't need to move 7 billion weights; you need to move a much smaller set of parameters in a targeted way.
What LoRA adds to a layer:
- Original weight matrix W (frozen): 7680 × 7680 ≈ 59M parameters
- LoRA matrices A and B (trainable): 7680 × 8 + 8 × 7680 ≈ 123K parameters
- During training: output = W·x + B·A·x
- After merging: output = (W + B·A)·x, the same as fine-tuning, with zero runtime overhead
The number 8 in the example is the rank, the key hyperparameter. Lower rank means fewer trainable parameters and faster training, but less capacity to adapt. Typical ranks: 4, 8, 16, 32.
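The parameter arithmetic and the merge identity above can be checked with plain numpy. A small sketch (the forward-pass equivalence is demonstrated at a reduced size to keep it cheap, and B is given nonzero values so the check is non-trivial):

```python
import numpy as np

# Parameter counts from the example above (d = 7680, rank r = 8)
d, r = 7680, 8
print(d * d)          # 58,982,400 frozen parameters (~59M)
print(r * d + d * r)  # 122,880 trainable parameters (~123K)

# The training-time and merged forward passes agree; shown at a small size
d, r = 64, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = rng.standard_normal((d, r))          # trainable up-projection
x = rng.standard_normal(d)

y_train = W @ x + B @ (A @ x)    # during training: W·x + B·A·x
y_merged = (W + B @ A) @ x       # after merging: (W + B·A)·x
assert np.allclose(y_train, y_merged)
```

Note the order of operations on the training path: B·(A·x) costs two thin matrix-vector products instead of materialising the d × d matrix B·A.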
| Method | What trains | VRAM needed | Quality vs. full FT |
|---|---|---|---|
| Full fine-tune | All weights | Very high (2-4× model size) | Baseline |
| LoRA | Low-rank adapter matrices only | ~1.2× model size | Close (5-10% gap) |
| QLoRA | Low-rank adapters on 4-bit base | ~0.5× model size | Slightly below LoRA |
QLoRA combines 4-bit quantisation of the base model (to reduce VRAM) with LoRA adapters trained in higher precision. This is what makes fine-tuning a 13B parameter model on a consumer GPU feasible.
Production Gotcha: Running multiple LoRA adapters on a single base model is an emerging production pattern that most serving infrastructure does not support out of the box. Plan for adapter versioning and hot-swap capability from day one, or you will re-architect under pressure later.
Layer 2: Guided
Training a LoRA adapter with PEFT
The Hugging Face PEFT library is the standard interface for LoRA/QLoRA training. The following trains a LoRA adapter on a causal language model for a classification task.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Step 1: Load a quantised base model (QLoRA: 4-bit base + LoRA adapters)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
tokenizer.pad_token = tokenizer.eos_token

# Prepare the quantised model for training (casts norm layers, enables
# gradient checkpointing and input gradients)
model = prepare_model_for_kbit_training(model)

# Step 2: Configure the LoRA adapter
lora_config = LoraConfig(
    r=16,                 # rank -- the core hyperparameter
    lora_alpha=32,        # scaling factor (usually 2x rank)
    target_modules=[      # which layers to apply LoRA to
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Step 3: Wrap the model -- only adapter parameters will train
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Illustrative output (exact counts depend on the model architecture):
# trainable params: 20,971,520 || all params: 7,262,330,880 || trainable%: 0.29
# Less than 0.3% of parameters are being trained

# Step 4: Train with the standard Transformers Trainer
# (train_dataset / eval_dataset are assumed to be tokenised datasets)
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./adapter-checkpoint",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size = 16
    learning_rate=2e-4,
    bf16=True,                       # match the bf16 compute dtype above
    logging_steps=10,
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Step 5: Save the adapter -- only ~40MB for a 7B model
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")
```
Two serving strategies: merged vs. adapter-on-base
After training, you have a choice.
Option A: merge the adapter into the base model weights.
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model in full precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load the adapter and fold it into the base weights
peft_model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
merged_model = peft_model.merge_and_unload()

# Save the merged model -- now a standard model, no LoRA overhead
merged_model.save_pretrained("./merged-model")
```
After merging, the model is indistinguishable from a standard fine-tuned model, with no runtime overhead. But you've lost the ability to swap adapters at serving time.
Option B: serve the base model and load adapters dynamically.
```python
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3", ...)

# Attach multiple task adapters to one shared base, then switch between them
model = PeftModel.from_pretrained(base_model, "./adapter-support", adapter_name="support")
model.load_adapter("./adapter-legal", adapter_name="legal")
model.load_adapter("./adapter-code", adapter_name="code")

model.set_adapter("legal")  # subsequent requests run through the legal adapter
```
This shares the base model's VRAM across all adapters. A 7B base model uses ~14GB of VRAM; each LoRA adapter adds ~80MB. You can serve dozens of task-specific adapters on a single GPU. This is the multi-adapter pattern: powerful, but it requires infrastructure that handles adapter routing.
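A quick sketch makes the shared-base arithmetic concrete, using the ~14GB base and ~80MB-per-adapter figures from the paragraph above, versus keeping one separately merged model per task:

```python
# Rough VRAM budget: shared base + N adapters vs. N merged full models.
# Figures from the text: ~14 GB for a 7B fp16 base, ~80 MB per adapter.
BASE_GB = 14.0
ADAPTER_GB = 0.08

def multi_adapter_gb(n_adapters: int) -> float:
    """One shared base plus n small adapters on the same GPU."""
    return BASE_GB + n_adapters * ADAPTER_GB

def merged_models_gb(n_tasks: int) -> float:
    """A separately merged full model per task."""
    return BASE_GB * n_tasks

for n in (1, 10, 50):
    print(f"{n:>2} tasks: {multi_adapter_gb(n):6.1f} GB shared "
          f"vs {merged_models_gb(n):6.1f} GB merged")
# 50 tasks fit in ~18 GB shared, vs ~700 GB as separate merged models
```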
Adapter versioning
Treat adapters like software artefacts: version, test, and roll back independently of the base model.
./adapters/
  support/
    v1/  - 2026-01-15, training run: run-001
    v2/  - 2026-02-10, training run: run-007 (current)
  legal/
    v1/  - 2026-03-01, training run: run-012
When the base model is updated, adapters trained on the old base are not directly compatible; you must re-adapt. This is the base model update lock-in risk: if you're on a provider's managed API, adapter compatibility across model versions is your problem to manage.
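One mitigation can be sketched directly: PEFT records the base model identifier in `adapter_config.json` alongside the adapter weights, so a loader can refuse mismatched pairs up front. The helper name and the stand-in directory below are illustrative:

```python
import json
import tempfile
from pathlib import Path

def assert_adapter_matches(adapter_dir: str, expected_base: str) -> None:
    """Fail fast if an adapter was trained against a different base model.

    PEFT writes the base model identifier into adapter_config.json when an
    adapter is saved, so the check is a one-line JSON lookup.
    """
    config = json.loads((Path(adapter_dir) / "adapter_config.json").read_text())
    trained_on = config["base_model_name_or_path"]
    if trained_on != expected_base:
        raise RuntimeError(
            f"Adapter in {adapter_dir} was trained on {trained_on!r}, "
            f"but the serving base is {expected_base!r}"
        )

# Demo against a stand-in adapter directory (a real one is produced by
# model.save_pretrained); the base model name matches the earlier example
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adapter_config.json").write_text(
        json.dumps({"base_model_name_or_path": "mistralai/Mistral-7B-v0.3", "r": 16})
    )
    assert_adapter_matches(tmp, "mistralai/Mistral-7B-v0.3")  # passes silently
```

This catches loading against the wrong model name; it cannot catch a silently re-uploaded base under the same name, so pinning a revision hash is stricter where available.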
Layer 3: Deep Dive
The linear algebra of low-rank adaptation
A weight matrix in a transformer attention layer is typically large: for a 7B model, q_proj might be W ∈ R^(4096×4096). Fine-tuning updates every element of W. LoRA instead parameterises the update as:
ΔW = B·A
where:
- A ∈ R^(r×d), the "down-projection" matrix, initialised from N(0, σ²)
- B ∈ R^(d×r), the "up-projection" matrix, initialised to zero
- r ≪ d, i.e. the rank is much smaller than the weight dimension
B is initialised to zero so that at the start of training ΔW = 0: the adapter contributes nothing and training begins from the original weights. A is initialised randomly to break symmetry.
The adapted forward pass is:
h = W·x + (B·A)·x · (α/r)
The scaling factor α/r (where α is lora_alpha in the config) controls the magnitude of the adapter's contribution relative to the frozen weights. Setting α = 2r is a common default that keeps adapter contributions at a reasonable scale.
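The two facts above, zero-init B giving ΔW = 0 at step 0, and α = 2r fixing the scale across ranks, can be checked numerically. A toy numpy sketch (sizes are arbitrary):

```python
import numpy as np

d = 32
rng = np.random.default_rng(1)
W = rng.standard_normal((d, d))   # frozen base weight
x = rng.standard_normal(d)

def lora_forward(W, x, A, B, alpha, r):
    # h = W·x + (B·A)·x · (α/r)
    return W @ x + (alpha / r) * (B @ (A @ x))

for r in (4, 8, 16):
    A = rng.standard_normal((r, d)) * 0.01   # random init breaks symmetry
    B = np.zeros((d, r))                     # zero init: ΔW = B·A = 0 at step 0
    h = lora_forward(W, x, A, B, alpha=2 * r, r=r)
    assert np.allclose(h, W @ x)   # adapter contributes nothing before training
    assert (2 * r) / r == 2.0      # α = 2r keeps the scale fixed across ranks
```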
Why does this work? The hypothesis, supported empirically by Hu et al. (2021), is that the adaptation task has low intrinsic dimensionality: the model only needs to move through a small subspace of parameter space to acquire the new behaviour. A rank-8 update captures the dominant directions of that subspace. The remaining directions matter less for task adaptation.
Why not just use fewer layers? Adapting fewer layers (or different layers) is an alternative. Empirically, attention projection matrices (q, k, v, o projections) respond better to LoRA adaptation than MLP layers for most instruction-following tasks. This is an area of active research.
QLoRA: quantisation-aware LoRA training
QLoRA (Dettmers et al., 2023) adds three key innovations on top of LoRA:
- 4-bit NormalFloat (NF4) quantisation: a data type specifically designed for normally distributed weights. Standard int4 quantisation introduces error at the extremes of the weight distribution; NF4 minimises quantisation error for weights following a normal distribution (which pre-trained transformers typically do).
- Double quantisation: quantise the quantisation constants themselves, recovering ~0.37 bits per parameter in memory savings.
- Paged optimisers: use unified memory to move optimiser state (the largest memory consumer during training) between GPU and CPU as needed, preventing OOM during training spikes.
The practical result: a 65B parameter model that would normally require ~130GB VRAM for fine-tuning can be fine-tuned on a single 48GB GPU with QLoRA.
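The headline numbers are easy to sanity-check with weight-only arithmetic. The sketch below ignores activations, gradients, and optimiser state, which is exactly the memory the paged-optimiser trick exists to manage:

```python
params = 65e9  # 65B parameters

fp16_weights_gb = params * 2 / 1e9         # 2 bytes/param -> 130 GB, weights alone
nf4_weights_gb = params * 0.5 / 1e9        # 4 bits/param  -> ~32.5 GB
double_quant_gb = params * 0.37 / 8 / 1e9  # ~0.37 bits/param recovered -> ~3 GB

print(fp16_weights_gb, nf4_weights_gb, round(double_quant_gb, 1))
# The 4-bit base leaves room on a 48 GB card for adapters, activations,
# and (paged) optimiser state -- which fp16 weights alone would not
```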
Named failure modes
Rank too low. If the adapter rank is too small relative to the complexity of the adaptation task, the model will converge to a lower accuracy ceiling. Increasing rank helps, at the cost of more trainable parameters. If your training loss plateaus early and accuracy is below expectations, try doubling the rank.
Alpha misconfigured. Getting lora_alpha wrong (e.g., equal to r instead of 2r) effectively halves or doubles the learning rate on the adapter, causing instability or slow convergence. Keep alpha = 2 × r as a default until you have a reason to deviate.
Adapter-base mismatch. Loading an adapter trained on model version A onto model version B silently fails: the adapter matrices have the same shape but were trained against different base weights. Add model version assertions to your adapter loading code.
Merging in the wrong precision. Merging a LoRA adapter onto a 4-bit quantised base model and then saving produces a weight-compressed merged model where the quantisation error is baked in. Merge in fp16 or bf16 using the full-precision base, not the quantised one.
Multi-adapter serving without infrastructure support. Standard serving stacks (vLLM, TGI as of early 2026) have limited support for dynamic adapter switching under concurrent load. If you plan to serve dozens of adapters with low latency, validate your serving stack against your concurrency requirements before committing to the pattern.
Further reading
- LoRA: Low-Rank Adaptation of Large Language Models; Hu et al., 2021. The original paper: explains the rank hypothesis and includes ablations over rank, alpha, and which layers to adapt.
- QLoRA: Efficient Finetuning of Quantized LLMs; Dettmers et al., 2023. Introduces NF4, double quantisation, and paged optimisers; the paper that made fine-tuning 65B models on a single GPU feasible.
- PEFT: State-of-the-Art Parameter-Efficient Fine-Tuning; Hugging Face, 2023. The library documentation for the implementation used in Layer 2; covers additional PEFT methods beyond LoRA (prefix tuning, IA3, etc.).
- LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models; Chen et al., 2023. Extends LoRA to context-length adaptation; relevant if you need both parameter efficiency and extended context.