You've heard the hype: fine-tune an open-source model on your own data and get GPT-4-level performance at a fraction of the cost. Sometimes that's true. But if you try to fine-tune without understanding the fundamentals, you'll burn cloud credits and get mediocre results. This guide gives you the full picture — from the math that makes LoRA possible to the practical cost of running a 7B fine-tune on a real cloud GPU.

What Is Fine-Tuning?

Before we compare methods, we need to be precise about what fine-tuning actually is — and what it isn't. People use the term loosely to mean anything from swapping a system prompt to re-training billions of parameters.

Pre-Training vs Fine-Tuning

Think of it like education. Pre-training is the university degree: the model reads hundreds of billions to trillions of tokens scraped from the internet, books, and code, learning general language patterns, factual knowledge, and reasoning abilities. This costs millions of dollars and months of compute; OpenAI, Meta, and Mistral do this so you don't have to.

Fine-tuning is on-the-job training: you take a freshly graduated model and teach it the specific skills, tone, and knowledge your business needs. Instead of reading all of Wikipedia, it reads your 2,000 customer-support transcripts. Instead of answering everything generically, it learns to reply in your company voice, follow your escalation rules, and cite your product documentation.

Supervised Fine-Tuning (SFT)

The most common variant is Supervised Fine-Tuning (SFT), where you provide labelled input-output pairs and minimize the cross-entropy loss between the model's predictions and your ground-truth outputs. Formally, for a sequence of tokens y₁…yₙ given context x, we minimize:

Math
ℒ(θ) = −∑ᵢ₌₁ⁿ log P_θ(yᵢ | y₁…yᵢ₋₁, x)
# In plain English: maximize the probability the model assigns
# to each correct next token, given everything before it.
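To make the loss concrete, here is a minimal sketch in plain Python. The function name `sft_loss` and the toy probabilities are illustrative assumptions; a real trainer computes the same quantity from logits over the whole vocabulary and usually averages it over tokens.

```python
import math

def sft_loss(correct_token_probs):
    """Negative log-likelihood of one target sequence.

    correct_token_probs[i] is the probability the model assigned to the
    correct token y_i, given the context x and the tokens before it.
    """
    return -sum(math.log(p) for p in correct_token_probs)

# Toy 3-token answers with made-up probabilities.
confident = sft_loss([0.9, 0.8, 0.95])  # model mostly right: small loss
unsure = sft_loss([0.2, 0.1, 0.3])      # model mostly wrong: large loss
print(f"confident: {confident:.3f}, unsure: {unsure:.3f}")
```

Training pushes the probabilities of the correct tokens toward 1, which drives this sum toward 0.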

Fine-Tuning vs Prompting vs RAG

These three are not interchangeable. Each solves a different problem:

  • Prompting — changes the model's instructions but not its weights. Fast, cheap, zero training. Works when the model already knows how to do the task.
  • RAG (Retrieval-Augmented Generation) — keeps weights frozen but plugs in external documents at inference time. Perfect for keeping facts up-to-date without retraining.
  • Fine-Tuning — actually changes the model weights. Use it when you need a new skill, a specific style, or consistently structured output that prompting can't reliably produce.

⚠️ Don't fine-tune as a first resort

Try prompt engineering and RAG first. Fine-tuning is expensive, time-consuming, and can cause catastrophic forgetting of the model's general capabilities. Reserve it for cases where the simpler methods genuinely fail.

Fine-Tuning Methods Compared

Not all fine-tuning is equal. The spectrum runs from re-training every single parameter (maximum control, maximum cost) to updating tiny adapter matrices of just a few million parameters (surprisingly powerful, affordable on a single consumer GPU). Here's the full comparison:

| Method | Trainable Params | VRAM (7B model) | Speed | Quality | Est. Cost* | Best For |
|---|---|---|---|---|---|---|
| Full Fine-Tuning | 100% (~7B) | ~80 GB+ | Slow | Highest | $200–800 | Domain-critical tasks, max accuracy |
| LoRA | 0.1–1% (~7M) | ~16 GB | Fast | Very Good | $10–40 | Most production use cases |
| QLoRA | 0.1–1% + 4-bit base | ~6–8 GB | Moderate | Good | $5–20 | Free Colab, consumer GPUs |
| IA³ | <0.01% (~100K) | ~14 GB | Very Fast | Good | $3–10 | Many tasks / low data |
| Prefix Tuning | <0.1% | ~14 GB | Fast | Moderate | $3–10 | Text generation style tasks |

* Approximate cloud GPU costs for a 7B model on 1,000 examples. A100 80GB at ~$3.50/hr.

💡 The practical winner for 90% of projects

LoRA hits the sweet spot: near-full-fine-tuning quality at roughly a fifth of the memory and a small fraction of the cost (see the table above). Start here unless you have a specific reason to go elsewhere. If you're on a consumer GPU or free Colab, start with QLoRA instead.

LoRA Explained Simply

LoRA (Low-Rank Adaptation) is the most important fine-tuning idea of the last few years. The original paper (Hu et al., 2021) showed that the change in weights during fine-tuning has a low intrinsic rank, meaning you don't need to update billions of parameters to capture a new skill. You just need to learn a small "delta" on top of the frozen original.

The Analogy

Imagine you're an expert chef (the pre-trained model). Instead of re-learning all of cooking from scratch to add "master sushi chef" to your repertoire, you learn only the differences — the specific knife techniques, rice preparation ratios, and presentation styles that are new. That delta is LoRA.

The Math

For any weight matrix W₀ (say, a 4096×4096 attention projection), LoRA keeps W₀ frozen and learns two tiny matrices B and A such that:

Math
# Forward pass with LoRA applied:
W = W₀ + B × A

# Where:
#   W₀ → original frozen weight (4096 × 4096), NOT updated
#   B  → low-rank matrix (4096 × r), learned
#   A  → low-rank matrix (r × 4096), learned
#   r  → rank hyperparameter (typically 4–64)
#
# Total new params at r = 16: 2 × 4096 × 16 = 131,072 (vs 16,777,216 original)
# That's a 128× reduction!

The rank r is your main knob. Lower rank = fewer parameters = faster training and less memory, but potentially lower quality. A rank of 16 is a solid default. Ranks of 64–128 start to approach full fine-tuning territory.
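The parameter arithmetic and the update itself can be checked with a few lines of plain Python. This is a sketch under stated assumptions: the helper names (`lora_param_count`, `matmul`, `lora_weight`) and the tiny 3×3 matrices are made up for illustration, and the scaling factor that libraries apply to B × A is left out to keep the shapes in focus.

```python
def lora_param_count(d_out, d_in, r):
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return d_out * r + r * d_in

def matmul(X, Y):
    """Plain-Python matrix product, fine for tiny demo matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_weight(W0, B, A):
    """Effective weight W = W0 + B @ A (scaling omitted for clarity)."""
    delta = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, delta)]

# Parameter count for the 4096 x 4096 projection at rank 16.
full = 4096 * 4096
adapter = lora_param_count(4096, 4096, r=16)
print(adapter, full // adapter)  # 131072 128

# Tiny rank-1 example: a 3x3 identity "pre-trained" weight plus a learned delta.
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [2.0], [3.0]]   # 3 x 1
A = [[0.1, 0.2, 0.3]]       # 1 x 3
W = lora_weight(W0, B, A)   # 9 weights changed using only 6 learned numbers
```

Doubling the rank doubles the adapter size, which is why r trades memory against capacity so directly.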

LoRA in Code: Using HuggingFace PEFT

The peft library from HuggingFace wraps any transformer model with LoRA adapters in just a few lines:

Python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# 1. Load the base model
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype="auto",
    device_map="auto",
)

# 2. Define the LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # or SEQ_CLS for classification
    r=16,                          # rank: controls adapter size
    lora_alpha=32,                 # scaling factor (usually 2×r)
    lora_dropout=0.05,             # regularization
    target_modules=[               # which attention projections to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
    ],
    bias="none",
)

# 3. Wrap the model: only adapter params are trainable
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. For this config on Mistral-7B,
# expect roughly 13.6M trainable parameters, about 0.19% of the total
# (k_proj and v_proj are smaller than q_proj because of grouped-query attention).

# 4. Train with standard HuggingFace Trainer or TRL's SFTTrainer
# (see the Tools section below for SFTTrainer example)

ℹ️ Merging LoRA adapters for inference

After training, you can merge the LoRA matrices back into the base weights with merged = model.merge_and_unload(), which returns the base model with the adapters folded in. Saving that model gives you a single set of weights with zero inference overhead: the LoRA math is baked in permanently.
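Why merging costs nothing at inference time can be seen on a toy example: folding the scaled product B × A into W₀ gives exactly the same output as running the base path and the adapter path separately. The sketch below assumes the usual lora_alpha / r scaling of the update; the helper names and the 2×2 numbers are hypothetical.

```python
def matvec(M, v):
    """Matrix-vector product for small list-of-lists matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def merge(W0, B, A, alpha, r):
    """Fold the adapter into the base weight: W0 + (alpha / r) * B @ A."""
    scale = alpha / r
    return [
        [W0[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
         for j in range(len(W0[0]))]
        for i in range(len(W0))
    ]

W0 = [[0.5, -1.0], [2.0, 0.25]]  # frozen base weight (2 x 2)
B = [[1.0], [-2.0]]              # 2 x 1
A = [[0.5, 4.0]]                 # 1 x 2
alpha, r = 2, 1
x = [3.0, -1.0]

# Unmerged: base path plus scaled adapter path (two extra small matmuls per call).
base = matvec(W0, x)
adapter_out = matvec(B, matvec(A, x))
unmerged = [b + (alpha / r) * a for b, a in zip(base, adapter_out)]

# Merged: a single matmul with the folded weight.
merged = matvec(merge(W0, B, A, alpha, r), x)
print(unmerged, merged)
```

The trade-off is flexibility: once merged, the adapter can no longer be swapped out or stacked with others without reloading the original base weights.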

RLHF & Alignment

SFT trains a model to imitate examples. But imitation alone has a ceiling — the model can learn to generate plausible-sounding text without actually being helpful, honest, and harmless. This is where alignment techniques come in. They teach the model to prefer good responses over bad ones, as judged by humans.

The Classic RLHF Pipeline (3 Stages)

RLHF (Reinforcement Learning from Human Feedback) was the breakthrough behind ChatGPT's conversational quality. It works in three stages:

  • Stage 1 — Supervised Fine-Tuning (SFT): Fine-tune the base model on high-quality demonstrations. This gives it a starting policy that generates reasonable outputs.
  • Stage 2 — Reward Model Training: Human labelers rank multiple model outputs for the same prompt (e.g., "Response A is better than B"). A separate model — the reward model — is trained to predict these human preferences and assign a scalar score to any response.
  • Stage 3 — RL Optimization (PPO): Use the reward model as a feedback signal to further fine-tune the SFT model via Proximal Policy Optimization (PPO). The model learns to generate outputs that score high on the reward model while not drifting too far from its SFT baseline (controlled by a KL penalty).
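Stages 2 and 3 each reduce to a small formula. The sketch below uses one common formulation, a pairwise logistic loss for the reward model and a reward minus a KL-style penalty for the RL step; the function names and the default beta of 0.1 are assumptions for illustration, not details taken from any specific lab's pipeline.

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    """Stage 2: -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one.
    """
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

def kl_penalized_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Stage 3: reward minus beta * (log pi(y|x) - log pi_sft(y|x)).

    The penalty grows as the policy assigns much higher probability to a
    response than the SFT baseline did, which discourages drifting.
    """
    return reward - beta * (logp_policy - logp_sft)

print(pairwise_reward_loss(2.0, -1.0))         # preferred response ranked higher: low loss
print(pairwise_reward_loss(-1.0, 2.0))         # ranking inverted: high loss
print(kl_penalized_reward(1.0, -10.0, -12.0))  # reward reduced by the drift penalty
```

PPO then maximizes the penalized reward in expectation over sampled responses, which is where most of the engineering complexity lives.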

DPO: The Simpler Alternative

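RLHF works, but it's notoriously unstable and expensive: you train a separate reward model, then juggle several models at once during PPO (the policy, a frozen reference, the reward model, and usually a value model) while managing the RL dynamics. In 2023, DPO (Direct Preference Optimization) offered a cleaner alternative.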

Instead of training a separate reward model, DPO directly uses preference pairs (chosen response, rejected response) to update the model. The training objective rearranges the RLHF math to show that you can skip the reward model entirely and just fine-tune on a specially-derived loss:

Math
# DPO loss (simplified):
ℒ_DPO = −log σ( β · (log π_θ(y_w|x) − log π_ref(y_w|x))
              − β · (log π_θ(y_l|x) − log π_ref(y_l|x)) )

# Where:
#   y_w   = the "chosen" (winning) response
#   y_l   = the "losing" (rejected) response
#   π_θ   = model being trained
#   π_ref = frozen reference model (the SFT checkpoint)
#   β     = temperature controlling deviation from reference
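The same loss for a single preference pair, written as a plain-Python sketch. The inputs are summed log-probabilities of each full response under the trained model and the frozen reference; the function name, argument names, and example numbers are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    chosen_shift = logp_chosen - ref_logp_chosen        # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_shift = logp_rejected - ref_logp_rejected  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    margin = beta * (chosen_shift - rejected_shift)
    return math.log1p(math.exp(-margin))                # equals -log(sigmoid(margin))

# At the start of training the model equals the reference, so the margin is 0.
start = dpo_loss(-20.0, -25.0, -20.0, -25.0)

# After some training: chosen response more likely, rejected response less likely.
later = dpo_loss(-15.0, -30.0, -20.0, -25.0)
print(start, later)
```

The loss starts at log 2 and falls as the model raises the chosen response's probability relative to the reference faster than the rejected one's.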

Your dataset is simply pairs of responses to the same prompt, one labelled chosen and one rejected. You can create this data with human labelers or — a common shortcut — by using a stronger model (like GPT-4) to judge which of two outputs is better.
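A preference dataset is typically stored as one JSON object per line. The sketch below assumes the "prompt" / "chosen" / "rejected" key names used by TRL's DPOTrainer; the example text and the `validate_pair` helper are made up for illustration.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_pair(record):
    """Return a list of problems; an empty list means the pair looks usable."""
    problems = [
        f"missing or empty '{key}'"
        for key in REQUIRED_KEYS
        if not str(record.get(key, "")).strip()
    ]
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems

pair = {
    "prompt": "Where is my order? It's been 2 weeks.",
    "chosen": "I'm sorry for the delay! Let me pull up your order details.",
    "rejected": "Orders can take a while. Please be patient.",
}

print(validate_pair(pair))  # no problems found for this pair
print(json.dumps(pair))     # one line of the JSONL training file
```

Running a check like this over the whole file before training catches empty or duplicated responses, which otherwise contribute no preference signal.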

ℹ️ DPO has largely replaced full RLHF for most teams

Many open-weight aligned models released in 2024–2025, including Llama 3 and Qwen 2.5, report using DPO or a variant of it in post-training, often alongside other techniques. Full PPO-based RLHF is primarily used by labs with dedicated RL infrastructure. For a practical project, start with DPO: it's stable, cheap, and surprisingly effective.

Data Requirements

Data is almost always the bottleneck. The architecture, method, and hardware matter far less than whether your training examples actually represent the task you want the model to learn. Here's a realistic guide to how much you need:

| Method | Minimum Examples | Sweet Spot | Diminishing Returns After | Notes |
|---|---|---|---|---|
| Full Fine-Tuning | 10,000+ | 50k–200k | 500k+ | Needs diverse data to prevent forgetting |
| LoRA | 500–1,000 | 2k–10k | 50k+ | Great quality-to-data ratio |
| QLoRA | 500–1,000 | 2k–10k | 50k+ | Same as LoRA, just quantized base |
| DPO | 1,000 pairs | 5k–20k pairs | 100k pairs | Pairs must cover diverse preference scenarios |
| IA³ / Prefix | 100–500 | 500–2k | 10k+ | Fewer params = fewer data needed |

🌟 Quality beats quantity

A few hundred carefully hand-written, diverse, high-quality examples will often outperform 5,000 scraped, noisy, or repetitive ones. Before you go hunting for more data, deduplicate, filter for quality, and manually review a 50-example random sample. Bad data trains confidently wrong models.
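The deduplicate, filter, and sample routine can be sketched in a few lines. The field names follow the Alpaca-style records used later in this guide; the 20-character minimum and the helper names are arbitrary assumptions to adapt to your own data.

```python
import random

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def clean_dataset(examples, min_output_chars=20):
    """Drop near-verbatim duplicates and examples with very short outputs."""
    seen = set()
    kept = []
    for ex in examples:
        key = (normalize(ex["instruction"]), normalize(ex["output"]))
        if key in seen:
            continue  # duplicate after normalization
        if len(ex["output"].strip()) < min_output_chars:
            continue  # output too short to teach much
        seen.add(key)
        kept.append(ex)
    return kept

def review_sample(examples, k=50, seed=0):
    """Pick a reproducible random sample for manual review."""
    rng = random.Random(seed)
    return rng.sample(examples, min(k, len(examples)))

raw = [
    {"instruction": "Summarize", "output": "Customer reports a delayed laptop order."},
    {"instruction": "summarize ", "output": "Customer  reports a delayed laptop order."},
    {"instruction": "Translate", "output": "ok"},
]
cleaned = clean_dataset(raw)
print(len(raw), "->", len(cleaned))
for ex in review_sample(cleaned):
    print(ex["instruction"], "|", ex["output"])
```

Exact-match deduplication only catches the easy cases; paraphrased duplicates need fuzzy or embedding-based matching, which is outside the scope of this sketch.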

Data Formats

The two most common formats for instruction fine-tuning datasets are:

JSON: Alpaca Format
{
  "instruction": "Summarize the following customer complaint in one sentence.",
  "input": "I ordered a laptop two weeks ago and it still hasn't arrived...",
  "output": "Customer reports a delayed laptop order placed two weeks ago."
}

JSON: ShareGPT / Conversation Format
{
  "conversations": [
    { "from": "system", "value": "You are a helpful customer support agent for Acme Inc." },
    { "from": "human", "value": "Where is my order? It's been 2 weeks." },
    { "from": "gpt", "value": "I'm sorry for the delay! Let me pull up your order details..." }
  ]
}

ShareGPT format is preferred for multi-turn conversations and is supported by most modern fine-tuning frameworks, either directly or after a simple conversion to the role/content "messages" layout. Alpaca format works well for single-instruction tasks.
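Converting between the two layouts is mechanical. The sketch below turns an Alpaca record into a single-turn ShareGPT conversation; the function name, the way the instruction and input are joined with a blank line, and the optional system prompt are assumptions rather than a fixed standard.

```python
def alpaca_to_sharegpt(record, system_prompt=None):
    """Convert one Alpaca-style record into a ShareGPT-style conversation."""
    user_text = record["instruction"]
    if record.get("input"):
        # Alpaca keeps the task and its payload separate; join them for chat.
        user_text += "\n\n" + record["input"]

    turns = []
    if system_prompt:
        turns.append({"from": "system", "value": system_prompt})
    turns.append({"from": "human", "value": user_text})
    turns.append({"from": "gpt", "value": record["output"]})
    return {"conversations": turns}

alpaca = {
    "instruction": "Summarize the following customer complaint in one sentence.",
    "input": "I ordered a laptop two weeks ago and it still hasn't arrived...",
    "output": "Customer reports a delayed laptop order placed two weeks ago.",
}
converted = alpaca_to_sharegpt(alpaca, system_prompt="You are a helpful customer support agent for Acme Inc.")
for turn in converted["conversations"]:
    print(turn["from"], ":", turn["value"][:60])
```

Keeping everything in one layout before training avoids maintaining two formatting code paths in the pipeline.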

Tools & Platforms

The ecosystem for fine-tuning has matured rapidly. You no longer need to write training loops from scratch — the hard parts are abstracted by well-maintained libraries. Here are the tools that actually matter in 2025:

🤗 HuggingFace TRL: SFTTrainer
The reference implementation. SFTTrainer wraps the HuggingFace Trainer with dataset formatting, packing, and native PEFT integration. Works with any model on the Hub. Best for production pipelines.

Unsloth: 2× Faster, Free Colab
Hand-written GPU kernels (implemented in Triton) that the project reports make LoRA/QLoRA about 2× faster with 40–70% less VRAM. Works on free Google Colab T4 GPUs. The go-to choice for anyone without paid cloud GPU access. Highly recommended for beginners.

🪓 Axolotl: Config-Driven Training
Define your entire fine-tuning run in a YAML config file, no Python required. Supports LoRA, QLoRA, full FT, Flash Attention, multi-GPU, and dozens of dataset formats out of the box. Excellent for teams wanting reproducibility.

☁️ Modal: Serverless Cloud GPUs
Run fine-tuning jobs on A100/H100 GPUs with a simple Python decorator. No cluster management. Pay per second of actual GPU usage. Great for burst workloads: spin up 4 GPUs for 3 hours and shut them off automatically.

🔗 Together AI Fine-Tuning API
Upload a JSONL dataset and fine-tune Llama 3, Mistral, or Qwen via a REST API. No infrastructure setup. Competitive pricing (~$3/M tokens). The easiest path if you want a managed fine-tune without touching GPUs yourself.

🧠 OpenAI Fine-Tuning API
Fine-tune GPT-4o-mini with your data via the OpenAI API. Easiest DX, proprietary model. Training was listed at about $3/M tokens when GPT-4o-mini fine-tuning launched; check current pricing. Best if you're already on the OpenAI stack and need a specialized ChatGPT-style model without managing open-source infra.

Quick Start: TRL SFTTrainer

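Here's a compact fine-tuning script using TRL's SFTTrainer with QLoRA (a 4-bit base model plus LoRA adapters). Argument names have shifted between TRL releases, so treat it as a starting point and check it against your installed version: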

Python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, TaskType
from trl import SFTTrainer, SFTConfig

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"

# ─── 1. QLoRA: load model in 4-bit precision ───
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on GPUs without bfloat16 (e.g. T4)
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# ─── 2. LoRA adapters ───
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# ─── 3. Load dataset (ShareGPT format) and convert to role/content messages ───
# SFTTrainer applies the chat template to a "messages" column, so map the
# ShareGPT "from"/"value" keys onto "role"/"content" first. Check that your
# model's chat template accepts a system role if your data contains one.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_messages(example):
    return {"messages": [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in example["conversations"]
    ]}

dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(to_messages, remove_columns=["conversations"])

# ─── 4. Training config ───
training_args = SFTConfig(
    output_dir="./mistral-7b-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch = 8
    warmup_ratio=0.03,
    learning_rate=2e-4,
    bf16=True,                      # set fp16=True instead on GPUs without bfloat16
    logging_steps=10,
    save_strategy="epoch",
    max_seq_length=2048,            # named max_length in newer TRL releases
)

# ─── 5. Train! ───
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=lora_config,
    tokenizer=tokenizer,            # named processing_class in newer TRL releases
)
trainer.train()
trainer.save_model()  # saves LoRA adapters only
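Before launching a run, it helps to know how many optimizer steps the config implies. The estimate below plugs in the batch size, accumulation, epoch, and warmup values from the script; the 1,000-example dataset size is a hypothetical, and the count assumes one example per sequence (packing would change it) and approximates how the Trainer rounds partial batches.

```python
import math

def training_schedule(num_examples, epochs, per_device_batch, grad_accum,
                      num_gpus=1, warmup_ratio=0.03):
    """Rough optimizer-step counts for a Trainer-style run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

schedule = training_schedule(
    num_examples=1_000,   # hypothetical dataset size
    epochs=3,
    per_device_batch=2,
    grad_accum=4,
)
print(schedule)
```

With logging_steps=10, a run of this size produces only a few dozen loss readings, so short runs are worth watching closely for divergence.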

Real Cost Estimates

Cloud GPU pricing fluctuates, but these numbers give you a realistic ballpark based on A100-80GB pricing (~$3.50/hr) and H100 pricing (~$5.00/hr). QLoRA numbers assume a single A100; full fine-tuning often requires multi-GPU with gradient checkpointing.

| Model Size | Method | Dataset Size | GPU Hours | Cost (A100) | Practical GPU |
|---|---|---|---|---|---|
| 7B | QLoRA | 1,000 examples | ~0.5 hr | ~$2 | T4, A10G, free Colab |
| 7B | QLoRA | 10,000 examples | ~3–4 hr | ~$12 | A10G, A100 |
| 7B | LoRA | 1,000 examples | ~0.3 hr | ~$1 | A10G (24 GB) |
| 7B | LoRA | 10,000 examples | ~2–3 hr | ~$9 | A100-40GB |
| 13B | QLoRA | 5,000 examples | ~3–5 hr | ~$15 | A100-80GB |
| 13B | LoRA | 10,000 examples | ~5–8 hr | ~$25 | A100-80GB |
| 70B | QLoRA | 5,000 examples | ~12–18 hr | ~$55 | 2× A100-80GB |
| 70B | Full FT | 10,000 examples | ~80–120 hr | ~$350+ | 8× A100-80GB |

💰 Cut costs by 2.5–5× with spot instances

AWS, GCP, and Azure all offer interruptible/spot GPU instances at a 60–80% discount vs on-demand. Use them with checkpoint saving every epoch so you can resume if the instance is preempted. Modal and RunPod also offer competitive spot pricing for single-job fine-tunes.
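The table figures are simply GPU hours multiplied by the hourly rate, and a spot discount scales the result. A minimal sketch of that arithmetic, with a hypothetical `run_cost` helper and the midpoint of the "~3–4 hr" row as the example:

```python
def run_cost(gpu_hours, hourly_rate, spot_discount=0.0):
    """Cost of a run in dollars; spot_discount is a fraction such as 0.7 for 70% off."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    return gpu_hours * hourly_rate * (1.0 - spot_discount)

A100_RATE = 3.50  # $/hr, the on-demand figure assumed in this guide

# 7B QLoRA on 10,000 examples, taking 3.5 GPU hours.
on_demand = run_cost(3.5, A100_RATE)
spot = run_cost(3.5, A100_RATE, spot_discount=0.7)
print(f"on-demand: ${on_demand:.2f}, spot: ${spot:.2f}")
```

Budget extra for preemptions on spot instances: work done since the last checkpoint is lost and has to be repeated.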

Deciding When to Fine-Tune vs Use an API

Fine-tuning pays off when your monthly API bill exceeds what you would spend on self-hosting plus the one-time training cost spread over a few months. A rough formula:

Math / Business Logic
# Should you fine-tune?
Monthly_API_cost = (daily_requests × avg_tokens_per_request × 30) × api_price_per_1M_tokens / 1_000_000

# If Monthly_API_cost > Monthly_hosting_cost + Fine_tune_cost / 6
# (training cost amortized over 6 months) → fine-tuning likely pays off
# Also consider: latency, data privacy, offline requirements, control

# Example:
#   10,000 requests/day × 1,500 tokens × 30 days = 450M tokens/month
#   GPT-4o-mini at $0.60/1M = $270/month
#   LoRA fine-tune of Mistral-7B on 5k examples ≈ $20 one-time
#   Self-hosting on A10G ≈ $200/month
#   → Self-hosted total ≈ $203/month vs $270/month for the API at this volume
#   → API wins at low volume; self-hosted fine-tune wins at high volume
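The comparison is easy to script so you can plug in your own traffic. This sketch reuses the example's numbers and counts the monthly hosting bill alongside the amortized training cost; the function names and the 6-month amortization window are assumptions to adjust.

```python
def monthly_api_cost(daily_requests, avg_tokens_per_request, price_per_1m_tokens, days=30):
    """Monthly API spend in dollars."""
    tokens = daily_requests * avg_tokens_per_request * days
    return tokens * price_per_1m_tokens / 1_000_000

def monthly_self_host_cost(hosting_per_month, fine_tune_cost, amortize_months=6):
    """Monthly self-hosting spend with the one-time training cost spread out."""
    return hosting_per_month + fine_tune_cost / amortize_months

self_hosted = monthly_self_host_cost(hosting_per_month=200, fine_tune_cost=20)

for daily_requests in (1_000, 10_000, 100_000):
    api = monthly_api_cost(daily_requests, 1_500, 0.60)
    winner = "self-hosted fine-tune" if api > self_hosted else "API"
    print(f"{daily_requests:>7} req/day: API ${api:,.0f}/mo vs self-hosted ${self_hosted:,.0f}/mo -> {winner}")
```

The self-hosted figure assumes one GPU can serve the whole load; at higher volumes the hosting cost scales with the number of GPUs, so rerun the comparison with a realistic hosting estimate.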