---
name: finetuning
description: Finetuning Foundation Models - when to finetune, LoRA, QLoRA, PEFT techniques, memory optimization, model merging. Use when adapting models to specific domains, reducing costs, or improving performance.
---
# Finetuning
Adapting Foundation Models for specific tasks.
## When to Finetune
### DO Finetune
- Improve quality on specific domain
- Reduce latency (smaller model)
- Reduce cost (fewer tokens)
- Ensure consistent style
- Add specialized capabilities
### DON'T Finetune
- Prompt engineering is enough
- Insufficient data (<1000 examples)
- Need frequent updates
- RAG can solve the problem
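The two checklists above can be condensed into a rough rule-of-thumb helper. This is only a sketch of the decision logic; the function name, arguments, and the 1000-example threshold are assumptions mirroring the lists, not part of any library:

```python
def should_finetune(num_examples, prompting_suffices=False,
                    rag_suffices=False, needs_frequent_updates=False):
    """Heuristic mirroring the DO/DON'T checklists above."""
    if prompting_suffices or rag_suffices:
        return False  # a cheaper approach already solves the task
    if num_examples < 1000:
        return False  # insufficient data
    if needs_frequent_updates:
        return False  # retraining cadence would be too costly
    return True
```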
## Memory Requirements
```python
def training_memory_gb(num_params_billion, precision="fp16"):
    bytes_per = {"fp32": 4, "fp16": 2, "int8": 1}
    n = num_params_billion * 1e9
    model = n * bytes_per[precision]
    optimizer = n * 4 * 2  # AdamW: two fp32 states per param
    gradients = n * bytes_per[precision]
    # Mixed-precision training also keeps fp32 master weights
    master = n * 4 if precision != "fp32" else 0
    return (model + optimizer + gradients + master) / 1e9

# 7B model full finetuning: ~112 GB (before activations)!
# With LoRA: ~16 GB
# With QLoRA: ~6 GB
```
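For comparison, the ~16 GB LoRA figure follows from freezing the base weights: optimizer states and gradients are only kept for the tiny adapter fraction. A back-of-the-envelope sketch (the helper name and the ~0.1% trainable fraction are assumptions; activations are ignored):

```python
def lora_training_memory_gb(num_params_billion, trainable_fraction=0.001,
                            bytes_per_param=2):
    n = num_params_billion * 1e9
    model = n * bytes_per_param                         # frozen fp16 base weights
    optimizer = n * trainable_fraction * 4 * 2          # AdamW states, adapters only
    gradients = n * trainable_fraction * bytes_per_param
    return (model + optimizer + gradients) / 1e9

# 7B: ~14 GB of frozen weights plus tiny adapter overhead,
# fitting around 16 GB once activations are included
```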
## LoRA (Low-Rank Adaptation)
```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,                # Rank (lower = fewer params)
    lora_alpha=32,      # Scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)

# ~0.06% of 7B trainable!
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```
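The ~0.06% figure can be sanity-checked by hand: each adapted projection adds two low-rank matrices, A (d×r) and B (r×d). A quick count (the dimensions below assume a Llama-7B-like architecture: hidden size 4096, 32 layers, with q_proj and v_proj adapted):

```python
def lora_param_count(hidden_dim, rank, num_layers, modules_per_layer=2):
    per_module = 2 * hidden_dim * rank  # A: d x r, plus B: r x d
    return per_module * modules_per_layer * num_layers

n = lora_param_count(4096, 8, 32)  # q_proj + v_proj across 32 layers
# ~4.2M trainable params, i.e. ~0.06% of 7B
```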
## QLoRA (4-bit + LoRA)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(model, lora_config)  # reuse the LoraConfig from above
# 7B on 16GB GPU!
```
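The "7B on a 16 GB GPU" claim follows from the weight footprint alone. A rough sketch (the helper is hypothetical and ignores the small double-quantization overhead):

```python
def quantized_weight_memory_gb(num_params_billion, bits=4):
    return num_params_billion * 1e9 * bits / 8 / 1e9

# 7B at 4-bit: 3.5 GB of weights, leaving room on a 16 GB card
# for LoRA adapters, optimizer states, and activations
```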
## Training
```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_steps=100,
    fp16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    eval_dataset=eval_data,
)
trainer.train()

# Merge LoRA weights back into the base model
merged = model.merge_and_unload()
merged.save_pretrained("./finetuned")
```
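One detail worth noting in this config: gradient accumulation multiplies the effective batch size seen by the optimizer. A quick check (the helper is hypothetical):

```python
def effective_batch_size(per_device, accumulation_steps, num_gpus=1):
    return per_device * accumulation_steps * num_gpus

# per_device_train_batch_size=4 with gradient_accumulation_steps=4
# gives 16 examples per optimizer step on a single GPU
```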
## Model Merging
### Task Arithmetic
```python
def task_vector_merge(base, finetuned_models, scale=0.3):
    base_sd = base.state_dict()
    # Clone so task vectors are always computed against the original base,
    # not against the partially merged weights
    merged = {k: v.clone() for k, v in base_sd.items()}
    for ft in finetuned_models:
        ft_sd = ft.state_dict()
        for key in merged:
            task_vector = ft_sd[key] - base_sd[key]
            merged[key] += scale * task_vector
    return merged
```
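A minimal numeric illustration of the same idea, with plain dicts standing in for state_dicts (the key name and values are made up):

```python
base = {"w": 1.0}
ft_a = {"w": 2.0}  # task vector: +1.0
ft_b = {"w": 0.0}  # task vector: -1.0

merged = dict(base)
for ft in (ft_a, ft_b):
    for k in merged:
        merged[k] += 0.3 * (ft[k] - base[k])

# Opposite task vectors cancel: merged["w"] ends up back at 1.0
```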
## Best Practices
1. Start with small rank (r=8)
2. Use QLoRA for limited GPU
3. Monitor validation loss
4. Test merged models carefully
5. Keep base model for comparison
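For point 3, a minimal early-stopping tracker sketch in plain Python (in a real run you would attach transformers' `EarlyStoppingCallback` to the Trainer instead):

```python
class EarlyStopper:
    """Signal a stop when validation loss hasn't improved for `patience` evals."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience  # True -> stop training
```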
## FAQ

**How much GPU memory do I need?**
Full finetuning can require hundreds of GB for large models; LoRA typically reduces that to ~16 GB for a 7B model, and QLoRA can work in roughly 6–16 GB depending on config.

**When should I not finetune?**
Avoid finetuning when prompt engineering or RAG solves the task, when data is very limited (<1000 examples), or when you need very frequent updates.