
This skill enables rapid text generation, classification, and QA using HuggingFace transformers pipelines with automatic model loading from the hub.

npx playbooks add skill eyadsibai/ltk --skill transformers

---
name: transformers
description: Use when working with "HuggingFace Transformers", "pre-trained models", or the "pipeline API", or when asked about "text generation", "text classification", "question answering", "NER", "fine-tuning transformers", "AutoModel", or the "Trainer API"
version: 1.0.0
---

# HuggingFace Transformers

Access thousands of pre-trained models for NLP, vision, audio, and multimodal tasks.

## When to Use

- Quick inference with pipelines
- Text generation, classification, QA, NER
- Image classification, object detection
- Fine-tuning on custom datasets
- Loading pre-trained models from HuggingFace Hub

---

## Pipeline Tasks

### NLP Tasks

| Task | Pipeline Name | Output |
|------|---------------|--------|
| **Text Generation** | `text-generation` | Completed text |
| **Classification** | `text-classification` | Label + confidence |
| **Question Answering** | `question-answering` | Answer span |
| **Summarization** | `summarization` | Shorter text |
| **Translation** | `translation_XX_to_YY` (e.g. `translation_en_to_fr`) | Translated text |
| **NER** | `ner` (alias of `token-classification`) | Entity spans + types |
| **Fill Mask** | `fill-mask` | Predicted tokens |
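
A minimal sketch of task-level NLP inference with the pipeline API; the model names are illustrative choices, and omitting `model` lets the library fall back to a task default.

```python
from transformers import pipeline

# Text generation with a causal LM ("gpt2" is an illustrative checkpoint)
generator = pipeline("text-generation", model="gpt2")
print(generator("The quick brown fox", max_new_tokens=30)[0]["generated_text"])

# Sentiment classification; returns a label plus confidence score
classifier = pipeline("text-classification")
print(classifier("This library is fantastic!"))

# Extractive question answering over a context paragraph
qa = pipeline("question-answering")
print(qa(question="Who hosts the Model Hub?",
         context="The Model Hub is hosted by Hugging Face."))
```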

### Vision Tasks

| Task | Pipeline Name | Output |
|------|---------------|--------|
| **Image Classification** | `image-classification` | Label + confidence |
| **Object Detection** | `object-detection` | Bounding boxes |
| **Image Segmentation** | `image-segmentation` | Pixel masks |

### Audio Tasks

| Task | Pipeline Name | Output |
|------|---------------|--------|
| **Speech Recognition** | `automatic-speech-recognition` | Transcribed text |
| **Audio Classification** | `audio-classification` | Label + confidence |
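
A short sketch of the vision and audio pipelines; the file paths are placeholders and `openai/whisper-tiny` is just one illustrative Hub checkpoint.

```python
from transformers import pipeline

# Image classification accepts a local path, URL, or PIL image ("cat.jpg" is a placeholder)
image_classifier = pipeline("image-classification")
print(image_classifier("cat.jpg"))  # list of {'label': ..., 'score': ...} dicts

# Speech-to-text; returns a dict with the transcribed "text"
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
print(asr("meeting.wav")["text"])  # "meeting.wav" is a placeholder audio file
```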

---

## Model Loading Patterns

### Auto Classes

| Class | Use Case |
|-------|----------|
| **AutoModel** | Base model (embeddings) |
| **AutoModelForCausalLM** | Text generation (GPT-style) |
| **AutoModelForSeq2SeqLM** | Encoder-decoder (T5, BART) |
| **AutoModelForSequenceClassification** | Classification head |
| **AutoModelForTokenClassification** | NER, POS tagging |
| **AutoModelForQuestionAnswering** | Extractive QA |

**Key concept**: Always use Auto classes unless you need a specific architecture—they handle model detection automatically.
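
A brief sketch of the Auto-class loading pattern; the checkpoint name is illustrative, and any sequence-classification model from the Hub works the same way.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The Auto classes read the checkpoint's config and build the matching architecture.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("Transformers makes this easy.", return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(dim=-1).item()])
```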

---

## Generation Parameters

| Parameter | Effect | Typical Values |
|-----------|--------|----------------|
| **max_new_tokens** | Output length | 50-500 |
| **temperature** | Randomness (lower = more deterministic) | 0.1-1.0 |
| **top_p** | Nucleus sampling threshold | 0.9-0.95 |
| **top_k** | Limit vocabulary per step | 50 |
| **num_beams** | Beam search (disable sampling) | 4-8 |
| **repetition_penalty** | Discourage repetition | 1.1-1.3 |

**Key concept**: Higher temperature = more creative but less coherent. For factual tasks, use low temperature (0.1-0.3).
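
A generation sketch wiring the parameters above into `model.generate`; "gpt2" is an illustrative checkpoint and the values shown are starting points, not recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Write a haiku about GPUs:", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=60,                    # cap output length
    do_sample=True,                       # sampling must be on for temperature/top_p to apply
    temperature=0.7,                      # lower = more deterministic
    top_p=0.9,                            # nucleus sampling
    repetition_penalty=1.2,               # discourage loops
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```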

---

## Memory Management

### Device Placement Options

| Option | When to Use |
|--------|-------------|
| **device_map="auto"** | Let library decide GPU allocation |
| **device_map="cuda:0"** | Specific GPU |
| **device_map="cpu"** | CPU only |

### Quantization Options

| Method | Memory Reduction | Quality Impact |
|--------|------------------|----------------|
| **8-bit** | ~50% | Minimal |
| **4-bit** | ~75% | Small for most tasks |
| **GPTQ** | ~75% | Requires calibration |
| **AWQ** | ~75% | Activation-aware |

**Key concept**: Use `torch_dtype="auto"` to automatically use the model's native precision (often bfloat16).
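
A hedged sketch of 4-bit loading; it assumes a CUDA GPU and the `bitsandbytes` package, and the 7B checkpoint name is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while storing weights in 4-bit
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",    # illustrative 7B checkpoint
    quantization_config=bnb_config,
    device_map="auto",              # spread layers across available devices
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
```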

---

## Fine-Tuning Concepts

### Trainer Arguments

| Argument | Purpose | Typical Value |
|----------|---------|---------------|
| **num_train_epochs** | Training passes | 3-5 |
| **per_device_train_batch_size** | Samples per GPU | 8-32 |
| **learning_rate** | Step size | 2e-5 for fine-tuning |
| **weight_decay** | Regularization | 0.01 |
| **warmup_ratio** | LR warmup | 0.1 |
| **evaluation_strategy** | When to eval (`eval_strategy` in newer transformers versions) | "epoch" or "steps" |
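
A minimal Trainer sketch using the arguments above; `train_dataset` and `eval_dataset` are placeholders for tokenized splits you provide (for example, built with the datasets library).

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # illustrative base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: your tokenized training split
    eval_dataset=eval_dataset,    # placeholder: your tokenized validation split
)
trainer.train()
```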

### Fine-Tuning Strategies

| Strategy | Memory | Quality | Use Case |
|----------|--------|---------|----------|
| **Full fine-tuning** | High | Best | Small models, enough data |
| **LoRA** | Low | Good | Large models, limited GPU |
| **QLoRA** | Very Low | Good | 7B+ models on consumer GPU |
| **Prefix tuning** | Low | Moderate | When you can't modify weights |

---

## Tokenization Concepts

| Parameter | Purpose |
|-----------|---------|
| **padding** | Make sequences same length |
| **truncation** | Cut sequences to max_length |
| **max_length** | Maximum tokens (model-specific) |
| **return_tensors** | Output format ("pt", "tf", "np") |

**Key concept**: Always use the tokenizer that matches the model—different models use different vocabularies.
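
A tokenization sketch showing the parameters above; "bert-base-uncased" is an illustrative checkpoint.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["A short sentence.", "A much longer sentence that may need to be truncated."],
    padding=True,         # pad to the longest sequence in the batch
    truncation=True,      # cut anything beyond max_length
    max_length=32,
    return_tensors="pt",  # PyTorch tensors ("tf" and "np" also work)
)
print(batch["input_ids"].shape)    # (batch_size, sequence_length)
print(batch["attention_mask"][0])  # 1 = real token, 0 = padding
```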

---

## Best Practices

| Practice | Why |
|----------|-----|
| Use pipelines for inference | Handles preprocessing automatically |
| Use device_map="auto" | Optimal GPU memory distribution |
| Batch inputs | Better throughput |
| Use quantization for large models | Run 7B+ on consumer GPUs |
| Match tokenizer to model | Vocabularies differ between models |
| Use Trainer for fine-tuning | Built-in best practices |
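
A small sketch of batched pipeline inference from the table above; the input texts are placeholders and the right `batch_size` depends on your hardware.

```python
from transformers import pipeline

classifier = pipeline("text-classification")

texts = ["Great product!", "Terrible support.", "It was okay."]  # placeholder inputs
# Passing a list (with batch_size) lets the pipeline batch inputs for better throughput.
for result in classifier(texts, batch_size=8):
    print(result["label"], round(result["score"], 3))
```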

## Resources

- Docs: <https://huggingface.co/docs/transformers>
- Model Hub: <https://huggingface.co/models>
- Course: <https://huggingface.co/course>

Overview

This skill helps you leverage HuggingFace Transformers for quick inference, model loading, and fine-tuning across NLP, vision, and audio tasks. It focuses on pipelines, AutoModel patterns, generation controls, device placement, quantization, and Trainer-based fine-tuning. The guidance emphasizes practical defaults and memory-aware strategies for running models locally or in the cloud.

How this skill works

It covers common Transformers usage patterns: pipeline selection for task-level inference, Auto classes for safe model loading, and generation/tuning parameters for controllable outputs. It outlines device_map and quantization options to reduce memory, plus Trainer and LoRA/QLoRA strategies for fine-tuning. It also summarizes tokenizer handling and batching best practices to avoid common runtime errors.

When to use it

  • Run quick inference with the pipeline API, without writing custom preprocessing
  • Generate text, classify text, answer questions, or perform NER
  • Deploy vision or audio models using image/audio pipelines
  • Fine-tune a pretrained model on a custom dataset
  • Load models from the HuggingFace Hub with automatic architecture detection
  • Optimize large models for limited-GPU environments with quantization or LoRA

Best practices

  • Prefer pipelines for production inference to handle preprocessing and postprocessing automatically
  • Use AutoModel/AutoTokenizer to let the library detect architecture and vocabularies
  • Set device_map="auto" or a specific CUDA device to control GPU placement
  • Batch inputs to improve throughput and reduce overhead
  • Apply 8-bit/4-bit quantization or QLoRA for running 7B+ models on consumer GPUs
  • For generation, tune max_new_tokens, temperature, top_p/top_k, and num_beams to balance creativity and fidelity

Example use cases

  • Text generation using AutoModelForCausalLM with controlled temperature and max_new_tokens
  • Question answering with a fine-tuned AutoModelForQuestionAnswering and pipeline('question-answering')
  • Named entity recognition via pipeline('ner') and AutoModelForTokenClassification
  • Fine-tuning a seq2seq model (T5/BART) using Trainer with per_device_train_batch_size and warmup_ratio
  • Deploying an image-classification pipeline that returns labels and confidence scores

FAQ

Which Auto class should I pick for my task?

Use the AutoModelFor... class that matches your task: causal LM for generation, seq2seq LM for encoder-decoder generation, sequence classification for labels, token classification for NER, and question answering for extractive QA.

How do I reduce memory to run large models locally?

Use device_map='auto', quantization (8-bit/4-bit), GPTQ/AWQ when available, or parameter-efficient fine-tuning like LoRA/QLoRA to lower memory while maintaining quality.

What generation settings produce reliable factual outputs?

Use low temperature (0.1–0.3), moderate max_new_tokens, and disable heavy sampling (lower top_p/top_k) or use beam search (num_beams 4–8) for more deterministic results.