This skill quantizes models without calibration data, enabling fast, deployment-ready 4/3/2-bit quantization for HuggingFace and vLLM workflows.

This is most likely a fork of the hqq skill from orchestra-research.

npx playbooks add skill davila7/claude-code-templates --skill optimization-hqq

Review the files below or copy the command above to add this skill to your agents.

Files (3): SKILL.md (11.2 KB)
---
name: hqq-quantization
description: Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Quantization, HQQ, Optimization, Memory Efficiency, Inference, Model Compression]
dependencies: [hqq>=0.2.0, torch>=2.0.0]
---

# HQQ - Half-Quadratic Quantization

Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized backends.

## When to use HQQ

**Use HQQ when:**
- Quantizing models without calibration data (no dataset needed)
- Need fast quantization (minutes vs hours for GPTQ/AWQ)
- Deploying with vLLM or HuggingFace Transformers
- Fine-tuning quantized models with LoRA/PEFT
- Experimenting with extreme quantization (2-bit, 1-bit)

**Key advantages:**
- **No calibration**: Quantize any model instantly without sample data
- **Multiple backends**: PyTorch, ATEN, TorchAO, Marlin, BitBlas for optimized inference
- **Flexible precision**: 8/4/3/2/1-bit with configurable group sizes
- **Framework integration**: Native HuggingFace and vLLM support
- **PEFT compatible**: Fine-tune quantized models with LoRA

**Use alternatives instead:**
- **AWQ**: calibration-based accuracy for production serving
- **GPTQ**: maximum accuracy when calibration data is available
- **bitsandbytes**: simple 8-bit/4-bit loading without custom backends
- **llama.cpp/GGUF**: CPU inference and Apple Silicon deployment

## Quick start

### Installation

```bash
pip install hqq

# With specific backend
pip install hqq[torch]      # PyTorch backend
pip install hqq[torchao]    # TorchAO int4 backend
pip install hqq[bitblas]    # BitBlas backend
pip install hqq[marlin]     # Marlin backend
```

### Basic quantization

```python
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
import torch
import torch.nn as nn

# Configure quantization
config = BaseQuantizeConfig(
    nbits=4,           # 4-bit quantization
    group_size=64,     # Group size for quantization
    axis=1             # Quantize along output dimension
)

# Quantize a linear layer
linear = nn.Linear(4096, 4096)
hqq_linear = HQQLinear(linear, config)

# Use like a regular linear layer (match the layer's compute device/dtype)
input_tensor = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
output = hqq_linear(input_tensor)
```

### Quantize full model with HuggingFace

```python
from transformers import AutoModelForCausalLM, HqqConfig

# Configure HQQ
quantization_config = HqqConfig(
    nbits=4,
    group_size=64,
    axis=1
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="auto"
)

# Model is quantized and ready to use
```

## Core concepts

### Quantization configuration

HQQ uses `BaseQuantizeConfig` to define quantization parameters:

```python
from hqq.core.quantize import BaseQuantizeConfig

# Standard 4-bit config
config_4bit = BaseQuantizeConfig(
    nbits=4,           # Bits per weight (1-8)
    group_size=64,     # Weights per quantization group
    axis=1             # 0=input dim, 1=output dim
)

# Aggressive 2-bit config
config_2bit = BaseQuantizeConfig(
    nbits=2,
    group_size=16,     # Smaller groups for low-bit
    axis=1
)

# Mixed precision per layer type
layer_configs = {
    "self_attn.q_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.k_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.v_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "mlp.gate_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.up_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.down_proj": BaseQuantizeConfig(nbits=4, group_size=64),
}
```

### HQQLinear layer

The core quantized layer that replaces `nn.Linear`:

```python
from hqq.core.quantize import HQQLinear
import torch

# Create quantized layer
linear = torch.nn.Linear(4096, 4096)
hqq_layer = HQQLinear(linear, config)

# Access quantized weights
W_q = hqq_layer.W_q           # Quantized weights
scale = hqq_layer.scale       # Scale factors
zero = hqq_layer.zero         # Zero points

# Dequantize for inspection
W_dequant = hqq_layer.dequantize()
```
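
To sanity-check the compression ratio, you can compare the packed tensors against the original fp16 weights. A minimal sketch reusing the `W_q`, `scale`, and `zero` attributes shown above (the exact storage layout may differ across hqq versions):

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Build a quantized layer as above, then compare its storage to fp16 weights
config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)
linear = torch.nn.Linear(4096, 4096, bias=False)
hqq_layer = HQQLinear(linear, config)

fp16_bytes = linear.in_features * linear.out_features * 2   # fp16 = 2 bytes per weight

def nbytes(t: torch.Tensor) -> int:
    return t.numel() * t.element_size()

quant_bytes = nbytes(hqq_layer.W_q) + nbytes(hqq_layer.scale) + nbytes(hqq_layer.zero)
print(f"fp16: {fp16_bytes / 1e6:.1f} MB  ->  HQQ 4-bit: {quant_bytes / 1e6:.1f} MB")
```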

### Backends

HQQ supports multiple inference backends for different hardware:

```python
from hqq.core.quantize import HQQLinear

# Available backends
backends = [
    "pytorch",          # Pure PyTorch (default)
    "pytorch_compile",  # torch.compile optimized
    "aten",            # Custom CUDA kernels
    "torchao_int4",    # TorchAO int4 matmul
    "gemlite",         # GemLite CUDA kernels
    "bitblas",         # BitBlas optimized
    "marlin",          # Marlin 4-bit kernels
]

# Set backend globally
HQQLinear.set_backend("torchao_int4")

# Or per layer
hqq_layer.set_backend("marlin")
```

**Backend selection guide:**

| Backend | Best For | Requirements |
|---------|----------|--------------|
| pytorch | Compatibility | Any GPU |
| pytorch_compile | Moderate speedup | torch>=2.0 |
| aten | Good balance | CUDA GPU |
| torchao_int4 | 4-bit inference | torchao installed |
| marlin | Maximum 4-bit speed | Ampere+ GPU |
| bitblas | Flexible bit-widths | bitblas installed |
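
As a rough starting point, the backend can be keyed off the GPU's compute capability. A hypothetical `pick_backend` helper, following the string-based `set_backend` usage shown above (availability checks are simplified; adjust for the extensions you actually installed):

```python
import torch
from hqq.core.quantize import HQQLinear

def pick_backend() -> str:
    # No CUDA device: fall back to the pure-PyTorch path
    if not torch.cuda.is_available():
        return "pytorch"
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:            # Ampere or newer
        return "marlin"       # fastest 4-bit path per the table above
    return "aten"             # reasonable balance on older CUDA GPUs

HQQLinear.set_backend(pick_backend())
```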

## HuggingFace integration

### Load pre-quantized models

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load HQQ-quantized model from Hub
model = AutoModelForCausalLM.from_pretrained(
    "mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Use normally
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
```

### Quantize and save

```python
from transformers import AutoModelForCausalLM, HqqConfig

# Quantize
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)

# Save quantized model
model.save_pretrained("./llama-8b-hqq-4bit")

# Push to Hub
model.push_to_hub("my-org/Llama-3.1-8B-HQQ-4bit")
```

### Mixed precision quantization

```python
from transformers import AutoModelForCausalLM, HqqConfig

# Different precision per layer type
config = HqqConfig(
    nbits=4,
    group_size=64,
    # Attention layers: higher precision
    # MLP layers: lower precision for memory savings
    dynamic_config={
        "attn": {"nbits": 4, "group_size": 64},
        "mlp": {"nbits": 2, "group_size": 32}
    }
)
```
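
The config above only declares the per-module settings; it is applied through the same `from_pretrained` call as before. A sketch, assuming the `dynamic_config` keys match the model's module names, with a quick size check via the standard `get_memory_footprint()` helper:

```python
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,   # mixed-precision HqqConfig from above
    device_map="auto"
)

# Rough check that the heavier MLP compression paid off
print(f"Footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```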

## vLLM integration

### Serve HQQ models with vLLM

```python
from vllm import LLM, SamplingParams

# Load HQQ-quantized model
llm = LLM(
    model="mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    quantization="hqq",
    dtype="float16"
)

# Generate
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is machine learning?"], sampling_params)
```

### vLLM with custom HQQ config

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    quantization="hqq",
    quantization_config={
        "nbits": 4,
        "group_size": 64
    }
)
```

## PEFT/LoRA fine-tuning

### Fine-tune quantized models

```python
from transformers import AutoModelForCausalLM, HqqConfig
from peft import LoraConfig, get_peft_model

# Load quantized model
quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto"
)

# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# Train normally with Trainer or custom loop
```
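
A quick sanity check after wrapping the model: PEFT's `print_trainable_parameters()` confirms that only the LoRA adapters, not the quantized base weights, will receive gradients.

```python
# Prints trainable vs. total parameter counts; only LoRA weights should be trainable
model.print_trainable_parameters()
```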

### QLoRA-style training

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./hqq-lora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator
)

trainer.train()
```
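
The Trainer call above assumes `train_dataset` and `data_collator` already exist. A minimal sketch of how they might be built, using a placeholder instruction dataset and the standard causal-LM collator:

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

# Placeholder dataset; swap in your own corpus with a "text" column
raw = load_dataset("tatsu-lab/alpaca", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```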

## Quantization workflows

### Workflow 1: Quick model compression

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# 1. Configure quantization
config = HqqConfig(nbits=4, group_size=64)

# 2. Load and quantize (no calibration needed!)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# 3. Verify quality
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))

# 4. Save
model.save_pretrained("./llama-8b-hqq")
tokenizer.save_pretrained("./llama-8b-hqq")
```

### Workflow 2: Optimize for inference speed

```python
from hqq.core.quantize import HQQLinear
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# 1. Quantize with optimal backend
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# 2. Set fast backend
HQQLinear.set_backend("marlin")  # or "torchao_int4"

# 3. Compile for additional speedup
import torch
model = torch.compile(model)

# 4. Benchmark
import time
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
start = time.time()
for _ in range(10):
    model.generate(**inputs, max_new_tokens=100)
print(f"Avg time: {(time.time() - start) / 10:.2f}s")
```

## Best practices

1. **Start with 4-bit**: Best quality/size tradeoff for most models
2. **Use group_size=64**: Good balance; use smaller groups for extreme low-bit quantization (see the footprint estimate below)
3. **Choose backend wisely**: Marlin for 4-bit Ampere+, TorchAO for flexibility
4. **Verify quality**: Always test generation quality after quantization
5. **Mixed precision**: Keep attention at higher precision, compress MLP more
6. **PEFT training**: Use LoRA r=16-32 for good fine-tuning results
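
For back-of-the-envelope sizing, the stored cost per weight is roughly `nbits` plus the per-group metadata amortized over `group_size`. A sketch assuming an fp16 scale and zero point per group (actual packing overhead varies by backend and hqq version):

```python
def effective_bits(nbits: int, group_size: int,
                   scale_bits: int = 16, zero_bits: int = 16) -> float:
    """Approximate stored bits per weight, amortizing per-group scale/zero metadata."""
    return nbits + (scale_bits + zero_bits) / group_size

print(effective_bits(4, 64))   # ~4.5 bits/weight: roughly 4.5 GB of weights for an 8B model
print(effective_bits(2, 16))   # ~4.0 bits/weight: small groups erase much of the 2-bit saving
```

Note how 2-bit with group_size=16 stores nearly as many bits per weight as 4-bit with group_size=64, which is why very low-bit configs mainly pay off with larger group sizes.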

## Common issues

**Out of memory during quantization:**
```python
# Quantize layer-by-layer
from hqq.models.hf.base import AutoHQQHFModel

model = AutoHQQHFModel.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="sequential"  # Load layers sequentially
)
```

**Slow inference:**
```python
# Switch to optimized backend
from hqq.core.quantize import HQQLinear
HQQLinear.set_backend("marlin")  # Requires Ampere+ GPU

# Or compile the model (requires torch>=2.0)
import torch
model = torch.compile(model, mode="reduce-overhead")
```

**Poor quality at 2-bit:**
```python
# Use a smaller group size
from hqq.core.quantize import BaseQuantizeConfig

config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,  # Smaller groups help at low bits
    axis=1
)
```

## References

- **[Advanced Usage](references/advanced-usage.md)** - Custom backends, mixed precision, optimization
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks

## Resources

- **Repository**: https://github.com/mobiusml/hqq
- **Paper**: Half-Quadratic Quantization
- **HuggingFace Models**: https://huggingface.co/mobiuslabsgmbh
- **Version**: 0.2.0+
- **License**: Apache 2.0

Overview

This skill provides Half-Quadratic Quantization (HQQ) tooling to quantize large language models to 8/4/3/2/1-bit precision without any calibration data. It builds on the hqq Python library for fast, calibration-free compression and integrates with HuggingFace Transformers and vLLM for deployment. Use it to reduce model size and speed up inference while preserving the ability to fine-tune with PEFT/LoRA.

How this skill works

HQQ replaces linear layers with HQQLinear that stores quantized weights, scale factors, and zero points and performs dequantized matmuls via multiple optimized backends. Quantization is configured with BaseQuantizeConfig or HqqConfig (nbits, group_size, axis) and can be applied per-layer or as mixed precision. Backends like pytorch, torchao_int4, marlin, and bitblas provide hardware-tuned kernels for fast inference.
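
Conceptually, dequantization is a per-group affine map: each group of group_size codes is shifted by its zero point and multiplied by its scale. A toy sketch of that arithmetic in plain PyTorch (illustrative only, not HQQ's packed kernels):

```python
import torch

group_size = 64
W_q   = torch.randint(0, 16, (4096 * 4096 // group_size, group_size))  # 4-bit codes, grouped
scale = torch.rand(W_q.shape[0], 1).half()                              # one scale per group
zero  = torch.rand(W_q.shape[0], 1).half()                              # one zero point per group

W_dequant = (W_q.to(torch.float16) - zero) * scale   # affine dequantization per group
W_dequant = W_dequant.reshape(4096, 4096)            # back to the linear layer's shape
```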

When to use it

  • You need to quantize a model but don’t have calibration datasets.
  • You want fast quantization workflows (minutes vs hours for calibration-based methods).
  • Deploying quantized models with HuggingFace Transformers or vLLM.
  • Fine-tuning compressed models with LoRA/PEFT or QLoRA-style training.
  • Experimenting with extreme low-bit regimes (2-bit, 1-bit) and mixed precision strategies.

Best practices

  • Start with 4-bit for a good quality/size tradeoff before trying lower bits.
  • Use group_size=64 as a sensible default; reduce group size for aggressive (2-bit) configs.
  • Keep attention layers at higher precision and compress MLP layers more (mixed precision).
  • Select the backend that matches your hardware (marlin on Ampere+ for max 4-bit speed, torchao_int4 for int4 workloads).
  • Verify generation quality after quantization and run benchmarks to choose backend and compilation settings.

Example use cases

  • Quickly compress a 7–13B model to 4-bit and save it for edge or multi-GPU deployment with HuggingFace.
  • Serve an HQQ-quantized Llama model in vLLM for low-latency inference with reduced memory footprint.
  • Apply LoRA on top of an HQQ-quantized base model to fine-tune for a downstream task with minimal additional memory.
  • Benchmark different backends (marlin, torchao_int4, bitblas) and group sizes to optimize throughput on your GPU fleet (a sweep is sketched after this list).
  • Create mixed-precision quantization profiles to keep attention at 4-bit and MLPs at 2-bit for maximum compression while preserving quality.
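
For the benchmarking use case, a simple sweep can reuse the string-based `set_backend` call from the SKILL.md above. A sketch, assuming `model` and `tokenizer` are an already-quantized HQQ model on a CUDA device and its tokenizer:

```python
import time
import torch
from hqq.core.quantize import HQQLinear

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)

# Sweep the backends listed above; skip any that are not installed on this machine
for backend in ["pytorch", "aten", "torchao_int4", "marlin"]:
    try:
        HQQLinear.set_backend(backend)
    except Exception as err:
        print(f"{backend}: skipped ({err})")
        continue
    torch.cuda.synchronize()
    start = time.time()
    model.generate(**inputs, max_new_tokens=100)
    torch.cuda.synchronize()
    print(f"{backend}: {time.time() - start:.2f}s")
```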

FAQ

Do I need calibration data for HQQ?

No. HQQ is calibration-free and designed to quantize weights without sample datasets.

Can I fine-tune HQQ-quantized models?

Yes. HQQ is compatible with PEFT/LoRA and supports QLoRA-style training workflows.