
This skill helps you implement language-independent tokenization with SentencePiece to support multilingual models and reproducible vocabularies.

npx playbooks add skill orchestra-research/ai-research-skills --skill sentencepiece

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
5.5 KB
---
name: sentencepiece
description: Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Tokenization, SentencePiece, Language-Independent, BPE, Unigram, Multilingual, CJK Languages, Unicode, Deterministic, Google]
dependencies: [sentencepiece, transformers]
---

# SentencePiece - Language-Independent Tokenization

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

## When to use SentencePiece

**Use SentencePiece when:**
- Building multilingual models (no language-specific rules)
- Working with CJK languages (Chinese, Japanese, Korean)
- Need reproducible tokenization (deterministic vocabulary)
- Want to train on raw text (no pre-tokenization needed)
- Require lightweight deployment (6MB memory, 50k sentences/sec)

**Performance**:
- **Speed**: 50,000 sentences/sec
- **Memory**: ~6 MB for a loaded model
- **Languages**: All (language-independent)

**Prefer an alternative when**:
- **HuggingFace Tokenizers**: you need faster training or a more flexible pipeline
- **tiktoken**: you are targeting OpenAI models (GPT-3.5/4)
- **BERT WordPiece**: the task is English-centric

## Quick start

### Installation

```bash
# Python
pip install sentencepiece

# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
```

### Train model

```bash
# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
```

```python
# Python API
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)
```

**Training time**: ~1-2 minutes for a 100 MB corpus
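
If the corpus is already in memory (for example, streamed or filtered lines), recent sentencepiece releases can also train from an iterator and write the model into a buffer instead of a file. A minimal sketch reusing `data.txt` from above; `sentence_iterator` and `model_writer` are the relevant parameters, so verify they exist in your installed version:

```python
import io
import sentencepiece as spm

# Assume data.txt holds one sentence per line (same corpus as the quick start)
with open('data.txt', encoding='utf-8') as f:
    sentences = [line.strip() for line in f]

model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(sentences),  # any iterator of str works
    model_writer=model,                 # serialized model proto lands here
    vocab_size=8000,
    model_type='bpe'
)

# Load directly from the serialized proto; no .model file on disk
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
print(sp.encode('This is a test', out_type=str))
```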

### Encode and decode

```python
import sentencepiece as spm

# Load model
sp = spm.SentencePieceProcessor(model_file='m.model')

# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces)  # ['▁This', '▁is', '▁a', '▁test']

# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids)  # [284, 47, 11, 1243]

# Decode
text = sp.decode(ids)
print(text)  # "This is a test"
```

## Language-independent design

### Whitespace as symbol (▁)

```python
text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces)  # ['▁Hello', '▁world']

# Decode preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded)  # "Hello world"
```

**Key principle**: text is treated as a raw Unicode sequence, and whitespace is encoded as the meta symbol ▁ (U+2581), so decoding restores the original spacing exactly.
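
The same property is what makes CJK text work without a separate word segmenter. A sketch, assuming a hypothetical `ja.model` trained on Japanese text with `character_coverage=1.0` (see the CJK training example below); the exact pieces depend on the training corpus:

```python
import sentencepiece as spm

# Assumed: ja.model was trained on Japanese text with character_coverage=1.0
sp_ja = spm.SentencePieceProcessor(model_file='ja.model')

text = "これはテストです"
print(sp_ja.encode(text, out_type=str))  # pieces are corpus-dependent

# The round trip is lossless (up to NFKC normalization)
assert sp_ja.decode(sp_ja.encode(text)) == text
```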

## Tokenization algorithms

### BPE (Byte-Pair Encoding)

```python
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='bpe_model',
    vocab_size=16000,
    model_type='bpe'
)
```

**Used by**: mBART

### Unigram (default)

```python
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='unigram_model',
    vocab_size=8000,
    model_type='unigram'
)
```

**Used by**: T5, ALBERT, XLNet
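
Unigram training assigns each piece a log-probability score, which is what drives subword sampling and n-best segmentation. The scores are written to the `.vocab` file alongside the model, so they are easy to inspect (assumes the `unigram_model` training call above has run):

```python
# unigram_model.vocab is tab-separated: <piece>\t<log probability>
with open('unigram_model.vocab', encoding='utf-8') as f:
    for line in list(f)[:10]:  # first 10 vocabulary entries
        piece, score = line.rstrip('\n').split('\t')
        print(f'{piece}\t{score}')
```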

## Training configuration

### Essential parameters

```python
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=0.9995,  # 1.0 for CJK
    user_defined_symbols=['[SEP]', '[CLS]'],
    unk_piece='<unk>',
    num_threads=16
)
```
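
After training, it is worth confirming that the user-defined symbols were registered as single, unsplittable pieces and that the reserved IDs are what you expect. A quick check against the `m.model` produced by the call above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# User-defined symbols are kept as atomic pieces and never split
print(sp.piece_to_id('[SEP]'), sp.piece_to_id('[CLS]'))
print(sp.encode('[CLS] first segment [SEP] second segment', out_type=str))

# Reserved special-token IDs (pad_id is -1 unless set at training time)
print(sp.unk_id(), sp.bos_id(), sp.eos_id(), sp.pad_id())
```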

### Character coverage

| Language Type | `character_coverage` | Rationale |
|---------------|----------------------|-----------|
| English       | 0.9995               | Covers all common characters; rare symbols map to `<unk>` |
| CJK (Chinese, Japanese, Korean) | 1.0 | Every character must be representable; none should fall to `<unk>` |
| Multilingual  | 0.9995               | Balances coverage against vocabulary size |
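
For CJK corpora, full coverage keeps rare ideographs out of `<unk>`. A sketch with an assumed corpus file `ja_corpus.txt` (one sentence per line):

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='ja_corpus.txt',       # assumed corpus: one Japanese sentence per line
    model_prefix='ja',
    vocab_size=16000,
    model_type='unigram',
    character_coverage=1.0       # 0.9995 would map rare kanji to <unk>
)
```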

## Encoding options

### Subword regularization

```python
# Sample different tokenizations
for _ in range(3):
    pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
    print(pieces)

# Example output (varies between runs):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
```

**Use case**: Data augmentation for robustness.
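
In practice that means emitting several sampled segmentations of each training sentence. A minimal sketch under the assumption that `sp` is a Unigram model; `augment` is an illustrative helper, not part of the SentencePiece API:

```python
def augment(sp, sentence, num_samples=4, alpha=0.1):
    """Return several sampled tokenizations of one sentence (illustrative helper)."""
    return [
        sp.encode(sentence, out_type=int, enable_sampling=True,
                  alpha=alpha, nbest_size=-1)  # nbest_size=-1: sample from the full lattice
        for _ in range(num_samples)
    ]

# Each epoch can see a different segmentation of the same text
for ids in augment(sp, 'tokenization improves with augmentation'):
    print(ids)
```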

## Common patterns

### T5-style training

```python
spm.SentencePieceTrainer.train(
    input='c4_corpus.txt',
    model_prefix='t5',
    vocab_size=32000,
    model_type='unigram',
    user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
    pad_id=0,
    eos_id=1,
    unk_id=2,
    bos_id=-1  # T5 has no BOS token; -1 disables it (the default bos_id=1 would collide with eos_id)
)
```

### Integration with transformers

```python
from transformers import T5Tokenizer

# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')
```
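
A custom-trained model can be wrapped the same way. A sketch using the `t5.model` produced by the T5-style training above; it assumes the model already defines `<pad>`/`</s>`/`<unk>` and the `<extra_id_*>` pieces, so `extra_ids=0` avoids adding a second set of sentinel tokens (check ID alignment before using it for pretraining):

```python
from transformers import T5Tokenizer

# Wrap the custom SentencePiece model trained above (t5.model)
tokenizer = T5Tokenizer('t5.model', extra_ids=0)

print(tokenizer.tokenize('translate English to French: Hello'))
print(tokenizer('Hello world').input_ids)  # eos (</s>) is appended automatically
```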

## Performance benchmarks

### Training speed

| Corpus | BPE (16k) | Unigram (8k) |
|--------|-----------|--------------|
| 100 MB | 1-2 min   | 3-4 min      |
| 1 GB   | 10-15 min | 30-40 min    |

### Tokenization speed

- **SentencePiece**: 50,000 sentences/sec
- **HF Tokenizers**: 200,000 sentences/sec (4× faster)
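
These numbers vary with hardware, sentence length, and vocabulary size, so measure on your own data. A rough benchmarking sketch, reusing `m.model` and `data.txt` from the quick start:

```python
import time
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

with open('data.txt', encoding='utf-8') as f:
    sentences = [line.strip() for line in f][:50_000]

start = time.perf_counter()
ids = sp.encode(sentences)  # encode accepts a list of strings and returns a list of ID lists
elapsed = time.perf_counter() - start
print(f'{len(sentences) / elapsed:,.0f} sentences/sec')
```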

## Supported models

- **T5 family**: `t5-base`, `t5-large` (32k vocab, Unigram)
- **ALBERT**: `albert-base-v2` (30k vocab, Unigram)
- **XLNet**: `xlnet-base-cased` (32k vocab, Unigram)
- **mBART**: `facebook/mbart-large-50` (250k vocab, BPE)

## References

- **[Training Guide](references/training.md)** - Detailed options, corpus preparation
- **[Algorithms](references/algorithms.md)** - BPE vs Unigram, subword regularization

## Resources

- **GitHub**: https://github.com/google/sentencepiece ⭐ 10,000+
- **Paper**: https://arxiv.org/abs/1808.06226 (EMNLP 2018)
- **Version**: 0.2.0+


Overview

This skill packages SentencePiece, a language-independent tokenizer that treats text as raw Unicode and supports BPE and Unigram algorithms. It is lightweight, fast, and deterministic, making it suitable for multilingual and CJK applications and reproducible preprocessing in research workflows. The skill exposes training, encoding, decoding, and subword-regularization features for production and research use.

How this skill works

SentencePiece trains a subword vocabulary directly on raw text without language-specific preprocessing. It represents whitespace as a special symbol and produces deterministic piece IDs. You can train BPE or Unigram models, load the compact ~6MB model, encode text to pieces or IDs, decode back to text, and enable sampling for subword regularization.

When to use it

  • Building multilingual or cross-lingual models where language-specific rules are undesirable.
  • Processing CJK (Chinese, Japanese, Korean) text where character coverage must be complete.
  • When you need reproducible tokenization and a deterministic vocabulary across runs.
  • Training tokenizers directly on raw text without pre-tokenization or complex pipelines.
  • Deploying lightweight tokenization in resource-constrained environments (low memory, high throughput).

Best practices

  • Choose character_coverage=1.0 for CJK corpora and ~0.9995 for Latin-dominant multilingual corpora.
  • Pick Unigram for models like T5/ALBERT/XLNet and BPE when emulating mBART-style vocabularies.
  • Set user_defined_symbols for special tokens (e.g., <extra_id_*> or task markers) before training.
  • Use multiple threads (num_threads) and an appropriate vocab_size for corpus scale to reduce training time.
  • Enable sampling (enable_sampling, alpha) only for data augmentation during training, not at inference.

Example use cases

  • Train a 32k Unigram tokenizer on a massive multilingual corpus for a transformer pretraining run.
  • Create a compact 8k BPE model to tokenize CJK-heavy datasets for a multilingual translation system.
  • Integrate SentencePiece with Transformers (T5 tokenizer) to ensure exact token ID alignment between training and inference.
  • Use subword regularization during classifier training to improve robustness to tokenization variance.
  • Deploy a tiny SentencePiece model in an edge service to tokenize user inputs with minimal memory footprint.

FAQ

How fast and memory-efficient is SentencePiece?

Typical tokenization runs at ~50k sentences/sec and a loaded model is around 6MB, making it suitable for high-throughput and low-memory environments.

When should I prefer Unigram vs BPE?

Use Unigram for T5/ALBERT/XLNet-style models and when you want probabilistic tokenization options; use BPE when mirroring mBART or when BPE-specific subword behavior is required.