
This skill helps you determine when to finetune foundation models and apply LoRA, QLoRA, and PEFT techniques to improve domain performance and cost efficiency.

npx playbooks add skill doanchienthangdev/omgkit --skill finetuning

---
name: finetuning
description: Finetuning Foundation Models - when to finetune, LoRA, QLoRA, PEFT techniques, memory optimization, model merging. Use when adapting models to specific domains, reducing costs, or improving performance.
---

# Finetuning

Adapting Foundation Models for specific tasks.

## When to Finetune

### DO Finetune
- Improve quality on a specific domain
- Reduce latency (a smaller finetuned model can match a larger general one)
- Reduce cost (shorter prompts, fewer tokens)
- Ensure a consistent style
- Add specialized capabilities

### DON'T Finetune
- Prompt engineering already gets you there
- Insufficient data (fewer than ~1,000 examples)
- The knowledge changes often enough to need frequent retraining
- RAG can solve the problem
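The checklist above can be folded into a quick triage helper. A minimal sketch — the function name and thresholds are illustrative rules of thumb, not from any library:

```python
def should_finetune(num_examples, prompting_sufficient,
                    rag_sufficient, needs_frequent_updates):
    """Encode the DO/DON'T checklist; the 1000-example cutoff is a rule of thumb."""
    if prompting_sufficient or rag_sufficient:
        return False  # a cheaper technique already solves the problem
    if num_examples < 1000:
        return False  # too little data to finetune reliably
    if needs_frequent_updates:
        return False  # retraining churn; prefer RAG for fast-moving knowledge
    return True

print(should_finetune(5000, False, False, False))  # True
```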

## Memory Requirements

```python
def training_memory_gb(num_params_billion, precision="fp16"):
    """Rough estimate for full finetuning with AdamW, ignoring activations."""
    bytes_per = {"fp32": 4, "fp16": 2, "int8": 1}
    params = num_params_billion * 1e9

    model = params * bytes_per[precision]
    gradients = params * bytes_per[precision]
    optimizer = params * 4 * 2  # AdamW m and v states in fp32
    master = params * 4 if precision != "fp32" else 0  # fp32 master weights (mixed precision)

    return (model + gradients + optimizer + master) / 1e9

# 7B full finetuning in fp16: 14 + 14 + 56 + 28 = ~112 GB (before activations!)
# With LoRA:  ~16 GB
# With QLoRA: ~6 GB
```
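The LoRA and QLoRA figures in those comments can be sanity-checked the same way. A rough sketch under the assumption that ~0.1% of parameters are trainable adapters; activations and CUDA overhead are ignored, which is why real runs need a few GB more:

```python
def lora_memory_gb(num_params_billion, trainable_fraction=0.001, quantized=False):
    """Rough (Q)LoRA training memory: frozen base + adapter training state."""
    params = num_params_billion * 1e9
    # Frozen base: fp16 (2 bytes/param) for LoRA, ~0.5 bytes/param for 4-bit QLoRA
    base = params * (0.5 if quantized else 2)
    # Adapter params carry fp16 weights + grads, fp32 master copy, AdamW states
    adapters = params * trainable_fraction * (2 + 2 + 4 + 8)
    return (base + adapters) / 1e9

print(round(lora_memory_gb(7)))                  # 14 -> ~16 GB with activations
print(round(lora_memory_gb(7, quantized=True)))  # 4  -> ~6 GB with activations
```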

## LoRA (Low-Rank Adaptation)

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,                          # Rank (lower = fewer params)
    lora_alpha=32,                # Scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = get_peft_model(base_model, config)

# ~0.06% of 7B trainable!
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```
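What `get_peft_model` injects is a low-rank update to each targeted weight matrix: the frozen `W` is augmented with a trainable product `B @ A`, scaled by `alpha / r`. A minimal numpy sketch of the idea (illustrative only, not the peft internals):

```python
import numpy as np

d, r, alpha = 4096, 8, 32           # hidden size, LoRA rank, scaling factor
W = np.random.randn(d, d) * 0.02    # frozen pretrained weight
A = np.random.randn(r, d) * 0.01    # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialized

def lora_forward(x):
    # Base path plus scaled low-rank update: Wx + (alpha/r) * B(Ax)
    return W @ x + (alpha / r) * (B @ (A @ x))

# At initialization B is zero, so the adapted model matches the base exactly
x = np.ones(d)
assert np.allclose(lora_forward(x), W @ x)

# Trainable params per adapted matrix: 2*d*r instead of d*d
print(2 * d * r / (d * d))  # 0.00390625, i.e. ~0.4% of one matrix
```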

## QLoRA (4-bit + LoRA)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

model = get_peft_model(model, lora_config)  # the LoraConfig from the section above
# A 7B model now trains on a single 16 GB GPU!
```
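The 4-bit base weights use blockwise quantization: each block of weights shares one scale derived from the block's absolute maximum. A toy numpy sketch of that idea — the real NF4 format uses a 16-value normal-quantile codebook and additionally quantizes the per-block scales themselves, which is what `bnb_4bit_use_double_quant` enables:

```python
import numpy as np
np.random.seed(0)

def quantize_blockwise(w, block=64):
    """Toy 4-bit absmax quantization: one scale per block of 64 weights."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True)  # per-block absmax
    q = np.round(w / scale * 7).astype(np.int8)   # 4-bit signed range [-7, 7]
    return q, scale

def dequantize_blockwise(q, scale):
    return q / 7.0 * scale

w = np.random.randn(4, 64).astype(np.float32)
q, scale = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, scale).reshape(w.shape)
print(np.abs(w - w_hat).max())  # small reconstruction error, <= scale/14 per block
```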

## Training

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,           # conservative; LoRA runs often use 1e-4 to 2e-4
    warmup_steps=100,
    fp16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit"
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    eval_dataset=eval_data
)

trainer.train()

# Merge LoRA weights back into the base model
# (for QLoRA, reload the base model in fp16/bf16 first; merging into 4-bit weights loses precision)
merged = model.merge_and_unload()
merged.save_pretrained("./finetuned")
```
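The `warmup_steps=100` above feeds transformers' default "linear" scheduler: linear warmup to the peak learning rate, then linear decay to zero over the remaining steps. A minimal sketch of that shape (`total_steps` here is a hypothetical run length):

```python
def linear_schedule(step, total_steps, base_lr=2e-5, warmup=100):
    """Linear warmup then linear decay to zero (transformers' default)."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup))

print(linear_schedule(50, 1000))    # half the peak LR, mid-warmup
print(linear_schedule(100, 1000))   # peak LR reached at the end of warmup
print(linear_schedule(1000, 1000))  # 0.0 at the final step
```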

## Model Merging

### Task Arithmetic
```python
def task_vector_merge(base, finetuned_models, scale=0.3):
    """Add scaled task vectors (finetuned - base) from each model to the base weights."""
    base_sd = base.state_dict()
    merged = {k: v.clone() for k, v in base_sd.items()}
    for ft in finetuned_models:
        ft_sd = ft.state_dict()
        for key in merged:
            # Each task vector is measured against the ORIGINAL base, not the running merge
            merged[key] += scale * (ft_sd[key] - base_sd[key])
    return merged
```

## Best Practices

1. Start with small rank (r=8)
2. Use QLoRA for limited GPU
3. Monitor validation loss
4. Test merged models carefully
5. Keep base model for comparison

Overview

This skill explains practical finetuning techniques for foundation models, including LoRA, QLoRA, PEFT, memory optimization, and model merging. It helps you decide when to finetune versus using prompts or retrieval, and gives hands-on guidance to reduce GPU requirements and costs. The content is focused on actionable configuration and workflows for adapting models to specific domains.

How this skill works

The skill describes how parameter-efficient finetuning (PEFT) methods like LoRA inject low-rank adapters so only a tiny fraction of parameters are trained. It shows how QLoRA combines 4-bit quantization with LoRA to run finetuning on limited GPUs and how to configure training arguments for stable runs. It also covers merging techniques (including task-vector arithmetic) to combine or persist finetuned weights into a base model.

When to use it

  • Improving quality on a narrow domain where base model outputs are inconsistent
  • Reducing latency or inference cost by finetuning a smaller model
  • Enforcing a consistent style, specialized behaviors, or new capabilities
  • Going beyond what prompt engineering and RAG can deliver
  • Working with sufficient training data (typically thousands of examples)

Best practices

  • Start with small LoRA rank (r=4–8) and increase only if needed
  • Prefer QLoRA for constrained GPU memory (4-bit + LoRA) to fit larger models on 16 GB-class GPUs
  • Monitor validation loss and hold out a reliable eval set to detect overfitting
  • Use gradient checkpointing, fp16, and optimizer choices (paged AdamW / 8-bit optimizers) to reduce memory
  • Keep the original base model and test merges carefully before deployment

Example use cases

  • Adapt a 7B model to legal or medical terminology for improved accuracy and consistent tone
  • Train a customer-support assistant to follow company policy and canned responses without changing the base model
  • Finetune a small model to reduce inference cost and latency for on-device or real-time use
  • Use QLoRA to finetune large models on a single 16 GB GPU for prototype experiments
  • Merge multiple task adapters to create a multi-skill model using task-vector arithmetic

FAQ

How much GPU memory do I need?

Full finetuning is expensive: a 7B model needs roughly 112 GB for weights, gradients, and optimizer states, and larger models need proportionally more. LoRA typically brings a 7B model down to ~16 GB, and QLoRA to ~6 GB, depending on configuration.

When should I not finetune?

Avoid finetuning when prompt engineering or RAG solves the task, data is very limited (<1000 examples), or you need very frequent updates.