
This skill enables rapid text generation, classification, and QA using HuggingFace transformers pipelines with automatic model loading from the hub.

npx playbooks add skill eyadsibai/ltk --skill transformers

---
name: transformers
description: Use when working with "HuggingFace Transformers", "pre-trained models", or the "pipeline API", or when asked about "text generation", "text classification", "question answering", "NER", "fine-tuning transformers", "AutoModel", or the "Trainer API"
version: 1.0.0
---

# HuggingFace Transformers

Access thousands of pre-trained models for NLP, vision, audio, and multimodal tasks.

## When to Use

- Quick inference with pipelines
- Text generation, classification, QA, NER
- Image classification, object detection
- Fine-tuning on custom datasets
- Loading pre-trained models from HuggingFace Hub

---

## Pipeline Tasks

### NLP Tasks

| Task | Pipeline Name | Output |
|------|---------------|--------|
| **Text Generation** | `text-generation` | Completed text |
| **Classification** | `text-classification` | Label + confidence |
| **Question Answering** | `question-answering` | Answer span |
| **Summarization** | `summarization` | Shorter text |
| **Translation** | `translation_XX_to_YY` (e.g. `translation_en_to_fr`) | Translated text |
| **NER** | `ner` (alias of `token-classification`) | Entity spans + types |
| **Fill Mask** | `fill-mask` | Predicted tokens |
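
A minimal sketch of task-level NLP inference with the pipeline API; the model names are illustrative choices, and omitting `model` lets the library fall back to a task default.

```python
from transformers import pipeline

# Text generation with a causal LM ("gpt2" is an illustrative checkpoint)
generator = pipeline("text-generation", model="gpt2")
print(generator("The quick brown fox", max_new_tokens=30)[0]["generated_text"])

# Sentiment classification; returns a label plus confidence score
classifier = pipeline("text-classification")
print(classifier("This library is fantastic!"))

# Extractive question answering over a context paragraph
qa = pipeline("question-answering")
print(qa(question="Who hosts the Model Hub?",
         context="The Model Hub is hosted by Hugging Face."))
```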

### Vision Tasks

| Task | Pipeline Name | Output |
|------|---------------|--------|
| **Image Classification** | `image-classification` | Label + confidence |
| **Object Detection** | `object-detection` | Bounding boxes |
| **Image Segmentation** | `image-segmentation` | Pixel masks |

### Audio Tasks

| Task | Pipeline Name | Output |
|------|---------------|--------|
| **Speech Recognition** | `automatic-speech-recognition` | Transcribed text |
| **Audio Classification** | `audio-classification` | Label + confidence |
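
A short sketch of the vision and audio pipelines; the file paths are placeholders and `openai/whisper-tiny` is just one illustrative Hub checkpoint.

```python
from transformers import pipeline

# Image classification accepts a local path, URL, or PIL image ("cat.jpg" is a placeholder)
image_classifier = pipeline("image-classification")
print(image_classifier("cat.jpg"))  # list of {'label': ..., 'score': ...} dicts

# Speech-to-text; returns a dict with the transcribed "text"
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
print(asr("meeting.wav")["text"])  # "meeting.wav" is a placeholder audio file
```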

---

## Model Loading Patterns

### Auto Classes

| Class | Use Case |
|-------|----------|
| **AutoModel** | Base model (embeddings) |
| **AutoModelForCausalLM** | Text generation (GPT-style) |
| **AutoModelForSeq2SeqLM** | Encoder-decoder (T5, BART) |
| **AutoModelForSequenceClassification** | Classification head |
| **AutoModelForTokenClassification** | NER, POS tagging |
| **AutoModelForQuestionAnswering** | Extractive QA |

**Key concept**: Always use Auto classes unless you need a specific architecture—they handle model detection automatically.
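
A brief sketch of the Auto-class loading pattern; the checkpoint name is illustrative, and any sequence-classification model from the Hub works the same way.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The Auto classes read the checkpoint's config and build the matching architecture.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("Transformers makes this easy.", return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(dim=-1).item()])
```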

---

## Generation Parameters

| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| **max_new_tokens** | Output length | 50-500 |
| **temperature** | Randomness (lower = more deterministic) | 0.1-1.0 |
| **top_p** | Nucleus sampling threshold | 0.9-0.95 |
| **top_k** | Limit vocabulary per step | 50 |
| **num_beams** | Beam search (disable sampling) | 4-8 |
| **repetition_penalty** | Discourage repetition | 1.1-1.3 |

**Key concept**: Higher temperature = more creative but less coherent. For factual tasks, use low temperature (0.1-0.3).
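
A generation sketch wiring the parameters above into `model.generate`; "gpt2" is an illustrative checkpoint and the values shown are starting points, not recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Write a haiku about GPUs:", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=60,                    # cap output length
    do_sample=True,                       # sampling must be on for temperature/top_p to apply
    temperature=0.7,                      # lower = more deterministic
    top_p=0.9,                            # nucleus sampling
    repetition_penalty=1.2,               # discourage loops
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```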

---

## Memory Management

### Device Placement Options

| Option | When to Use |
|--------|-------------|
| **device_map="auto"** | Let library decide GPU allocation |
| **device_map="cuda:0"** | Specific GPU |
| **device_map="cpu"** | CPU only |

### Quantization Options

| Method | Memory Reduction | Quality Impact |
|--------|------------------|----------------|
| **8-bit** | ~50% | Minimal |
| **4-bit** | ~75% | Small for most tasks |
| **GPTQ** | ~75% | Requires calibration |
| **AWQ** | ~75% | Activation-aware |

**Key concept**: Use `torch_dtype="auto"` to automatically use the model's native precision (often bfloat16).
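
A hedged sketch of 4-bit loading; it assumes a CUDA GPU and the `bitsandbytes` package, and the 7B checkpoint name is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while storing weights in 4-bit
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",    # illustrative 7B checkpoint
    quantization_config=bnb_config,
    device_map="auto",              # spread layers across available devices
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
```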

---

## Fine-Tuning Concepts

### Trainer Arguments

| Argument | Purpose | Typical Value |
|----------|---------|---------------|
| **num_train_epochs** | Training passes | 3-5 |
| **per_device_train_batch_size** | Samples per GPU | 8-32 |
| **learning_rate** | Step size | 2e-5 for fine-tuning |
| **weight_decay** | Regularization | 0.01 |
| **warmup_ratio** | LR warmup | 0.1 |
| **evaluation_strategy** | When to eval (`eval_strategy` in newer transformers versions) | "epoch" or "steps" |
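
A minimal Trainer sketch using the arguments above; `train_dataset` and `eval_dataset` are placeholders for tokenized splits you provide (for example, built with the datasets library).

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # illustrative base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: your tokenized training split
    eval_dataset=eval_dataset,    # placeholder: your tokenized validation split
)
trainer.train()
```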

### Fine-Tuning Strategies

| Strategy | Memory | Quality | Use Case |
|----------|--------|---------|----------|
| **Full fine-tuning** | High | Best | Small models, enough data |
| **LoRA** | Low | Good | Large models, limited GPU |
| **QLoRA** | Very Low | Good | 7B+ models on consumer GPU |
| **Prefix tuning** | Low | Moderate | When you can't modify weights |

---

## Tokenization Concepts

| Parameter | Purpose |
|-----------|---------|
| **padding** | Make sequences same length |
| **truncation** | Cut sequences to max_length |
| **max_length** | Maximum tokens (model-specific) |
| **return_tensors** | Output format ("pt", "tf", "np") |

**Key concept**: Always use the tokenizer that matches the model—different models use different vocabularies.
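
A tokenization sketch showing the parameters above; "bert-base-uncased" is an illustrative checkpoint.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["A short sentence.", "A much longer sentence that may need to be truncated."],
    padding=True,         # pad to the longest sequence in the batch
    truncation=True,      # cut anything beyond max_length
    max_length=32,
    return_tensors="pt",  # PyTorch tensors ("tf" and "np" also work)
)
print(batch["input_ids"].shape)    # (batch_size, sequence_length)
print(batch["attention_mask"][0])  # 1 = real token, 0 = padding
```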

---

## Best Practices

| Practice | Why |
|----------|-----|
| Use pipelines for inference | Handles preprocessing automatically |
| Use device_map="auto" | Optimal GPU memory distribution |
| Batch inputs | Better throughput |
| Use quantization for large models | Run 7B+ on consumer GPUs |
| Match tokenizer to model | Vocabularies differ between models |
| Use Trainer for fine-tuning | Built-in best practices |
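
A small sketch of batched pipeline inference from the table above; the input texts are placeholders and the right `batch_size` depends on your hardware.

```python
from transformers import pipeline

classifier = pipeline("text-classification")

texts = ["Great product!", "Terrible support.", "It was okay."]  # placeholder inputs
# Passing a list (with batch_size) lets the pipeline batch inputs for better throughput.
for result in classifier(texts, batch_size=8):
    print(result["label"], round(result["score"], 3))
```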

## Resources

- Docs: <https://huggingface.co/docs/transformers>
- Model Hub: <https://huggingface.co/models>
- Course: <https://huggingface.co/course>

Overview

This skill helps you leverage HuggingFace Transformers for quick inference, model loading, and fine-tuning across NLP, vision, and audio tasks. It focuses on pipelines, AutoModel patterns, generation controls, device placement, quantization, and Trainer-based fine-tuning. The guidance emphasizes practical defaults and memory-aware strategies for running models locally or in the cloud.

How this skill works

It covers common Transformers usage patterns: pipeline selection for task-level inference, Auto classes for safe model loading, and generation/tuning parameters for controllable outputs. It outlines device_map and quantization options to reduce memory, plus Trainer and LoRA/QLoRA strategies for fine-tuning. It also summarizes tokenizer handling and batching best practices to avoid common runtime errors.

When to use it

  • Run quick inference with the pipeline API, without writing custom preprocessing
  • Generate text, classify text, answer questions, or perform NER
  • Deploy vision or audio models using image/audio pipelines
  • Fine-tune a pretrained model on a custom dataset
  • Load models from the HuggingFace Hub with automatic architecture detection
  • Optimize large models for limited-GPU environments with quantization or LoRA

Best practices

  • Prefer pipelines for production inference to handle preprocessing and postprocessing automatically
  • Use AutoModel/AutoTokenizer to let the library detect architecture and vocabularies
  • Set device_map="auto" or a specific CUDA device to control GPU placement
  • Batch inputs to improve throughput and reduce overhead
  • Apply 8-bit/4-bit quantization or QLoRA for running 7B+ models on consumer GPUs
  • For generation, tune max_new_tokens, temperature, top_p/top_k, and num_beams to balance creativity and fidelity

Example use cases

  • Text generation using AutoModelForCausalLM with controlled temperature and max_new_tokens
  • Question answering with a fine-tuned AutoModelForQuestionAnswering and pipeline('question-answering')
  • Named entity recognition via pipeline('ner') and AutoModelForTokenClassification
  • Fine-tuning a seq2seq model (T5/BART) using Trainer with per_device_train_batch_size and warmup_ratio
  • Deploying an image-classification pipeline that returns labels and confidence scores

FAQ

Which Auto class should I pick for my task?

Use the AutoModelFor... class that matches your task: causal LM for generation, seq2seq LM for encoder-decoder generation, sequence classification for labels, token classification for NER, and question answering for extractive QA.

How do I reduce memory to run large models locally?

Use device_map='auto', quantization (8-bit/4-bit), GPTQ/AWQ when available, or parameter-efficient fine-tuning like LoRA/QLoRA to lower memory while maintaining quality.

What generation settings produce reliable factual outputs?

Use low temperature (0.1–0.3), moderate max_new_tokens, and disable heavy sampling (lower top_p/top_k) or use beam search (num_beams 4–8) for more deterministic results.