
This skill helps you tokenize text with SentencePiece, delivering fast, deterministic subword vocabularies for multilingual and CJK models.

This skill appears to be a fork of the sentencepiece skill from orchestra-research.
npx playbooks add skill davila7/claude-code-templates --skill tokenization-sentencepiece

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
5.5 KB
---
name: sentencepiece
description: Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Tokenization, SentencePiece, Language-Independent, BPE, Unigram, Multilingual, CJK Languages, Unicode, Deterministic, Google]
dependencies: [sentencepiece, transformers]
---

# SentencePiece - Language-Independent Tokenization

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

## When to use SentencePiece

**Use SentencePiece when:**
- Building multilingual models (no language-specific rules)
- Working with CJK languages (Chinese, Japanese, Korean)
- Need reproducible tokenization (deterministic vocabulary)
- Want to train on raw text (no pre-tokenization needed)
- Require lightweight deployment (6MB memory, 50k sentences/sec)

**Performance**:
- **Speed**: 50,000 sentences/sec
- **Memory**: ~6MB for loaded model
- **Languages**: All (language-independent)

**Consider alternatives instead:**
- **HuggingFace Tokenizers**: Faster training, more flexibility
- **tiktoken**: OpenAI models (GPT-3.5/4)
- **BERT WordPiece**: English-centric tasks

## Quick start

### Installation

```bash
# Python
pip install sentencepiece

# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
```

### Train model

```bash
# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
```

```python
# Python API (equivalent)
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)
```

**Training time**: ~1-2 minutes for a 100 MB corpus

### Encode and decode

```python
import sentencepiece as spm

# Load model
sp = spm.SentencePieceProcessor(model_file='m.model')

# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces)  # ['▁This', '▁is', '▁a', '▁test']

# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids)  # [284, 47, 11, 1243]

# Decode
text = sp.decode(ids)
print(text)  # "This is a test"
```

## Language-independent design

### Whitespace as symbol (▁)

```python
text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces)  # ['▁Hello', '▁world']

# Decode preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded)  # "Hello world"
```

**Key principle**: Treat text as a raw Unicode stream; whitespace is encoded as the meta symbol ▁ (U+2581), so decoding restores the original spacing without language-specific rules.
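
Because no pre-tokenization is required, text without whitespace works the same way. A minimal sketch, assuming the loaded model was trained on a corpus that includes Japanese (the exact pieces depend on your training data):

```python
# Unsegmented CJK input needs no whitespace splitting or language rules.
pieces = sp.encode('これはテストです', out_type=str)
print(pieces)  # e.g. ['▁これ', 'は', 'テスト', 'です'] -- actual pieces depend on the model

# Round-trips back to the original string with no extra spaces inserted.
print(sp.decode_pieces(pieces))  # 'これはテストです'
```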

## Tokenization algorithms

### BPE (Byte-Pair Encoding)

```python
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='bpe_model',
    vocab_size=16000,
    model_type='bpe'
)
```

**Used by**: mBART

### Unigram (default)

```python
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='unigram_model',
    vocab_size=8000,
    model_type='unigram'
)
```

**Used by**: T5, ALBERT, XLNet
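
To compare how the two algorithms segment the same input, a sketch assuming both models above (`bpe_model.model`, `unigram_model.model`) have already been trained; the resulting pieces depend on your corpus:

```python
import sentencepiece as spm

bpe = spm.SentencePieceProcessor(model_file='bpe_model.model')
uni = spm.SentencePieceProcessor(model_file='unigram_model.model')

text = 'internationalization'
print(bpe.encode(text, out_type=str))  # BPE: segmentation built from frequent pair merges
print(uni.encode(text, out_type=str))  # Unigram: most probable segmentation under the learned model
```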

## Training configuration

### Essential parameters

```python
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=0.9995,  # 1.0 for CJK
    user_defined_symbols=['[SEP]', '[CLS]'],
    unk_piece='<unk>',
    num_threads=16
)
```

### Character coverage

| Language Type | Coverage | Rationale |
|---------------|----------|-----------|
| English       | 0.9995   | Covers common characters; rare symbols map to `<unk>` |
| CJK (Chinese, Japanese, Korean) | 1.0 | Large character inventories; every character must be kept |
| Multilingual  | 0.9995   | Balances coverage against vocabulary size |
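
As a concrete illustration of the CJK row, a hedged training sketch with an assumed corpus path (`zh_ja_corpus.txt`, raw text, one sentence per line):

```python
import sentencepiece as spm

# character_coverage=1.0 keeps every observed CJK character in the model
# instead of mapping rare ones to <unk>.
spm.SentencePieceTrainer.train(
    input='zh_ja_corpus.txt',
    model_prefix='cjk',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=1.0
)
```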

## Encoding options

### Subword regularization

```python
# Sample different tokenizations
for _ in range(3):
    pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
    print(pieces)

# Possible outputs (vary between runs):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
```

**Use case**: Data augmentation for robustness.
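
A sketch of how sampling could drive augmentation, reusing the Unigram model loaded earlier as `sp`; `nbest_size=-1` samples from all segmentation hypotheses and `alpha` smooths the sampling distribution:

```python
# Generate several sampled segmentations per sentence as augmented inputs.
sentences = ['This is a test', 'SentencePiece handles raw text']
augmented = []
for sent in sentences:
    for _ in range(4):  # number of samples per sentence is an arbitrary choice
        augmented.append(
            sp.encode(sent, out_type=int, enable_sampling=True, alpha=0.1, nbest_size=-1)
        )
```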

## Common patterns

### T5-style training

```python
spm.SentencePieceTrainer.train(
    input='c4_corpus.txt',
    model_prefix='t5',
    vocab_size=32000,
    model_type='unigram',
    user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
    unk_id=2,
    eos_id=1,
    pad_id=0,
    bos_id=-1  # disable BOS (T5 does not use one) and avoid an ID clash with eos_id=1
)
```
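
A quick sanity check (sketch) that the trained model's ID layout matches what a T5-style model expects, using the `t5.model` file produced above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='t5.model')
assert sp.pad_id() == 0 and sp.eos_id() == 1 and sp.unk_id() == 2

# Sentinel tokens registered via user_defined_symbols resolve to real IDs.
print(sp.piece_to_id('<extra_id_0>'))
print(sp.id_to_piece(sp.eos_id()))  # '</s>' unless eos_piece was overridden
```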

### Integration with transformers

```python
from transformers import T5Tokenizer

# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')
```
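
As a follow-up to the snippet above (reusing `tokenizer` and `inputs`), a sketch that exposes the underlying SentencePiece pieces behind the transformers API:

```python
# The token strings carry the ▁ whitespace marker from SentencePiece.
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
print(tokens)  # e.g. ['▁translate', '▁English', '▁to', '▁French', ':', '▁Hello', '</s>']
print(tokenizer.decode(inputs['input_ids'][0], skip_special_tokens=True))
```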

## Performance benchmarks

### Training speed

| Corpus | BPE (16k) | Unigram (8k) |
|--------|-----------|--------------|
| 100 MB | 1-2 min   | 3-4 min      |
| 1 GB   | 10-15 min | 30-40 min    |

### Tokenization speed

- **SentencePiece**: 50,000 sentences/sec
- **HF Tokenizers**: 200,000 sentences/sec (4× faster)
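
These figures are ballpark numbers; throughput depends on hardware, sentence length, and vocabulary size. A rough measurement sketch using the quick-start model (`m.model`):

```python
import time
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')
sentences = ['This is a test sentence for benchmarking.'] * 50_000

start = time.perf_counter()
for s in sentences:
    sp.encode(s, out_type=int)
elapsed = time.perf_counter() - start
print(f'{len(sentences) / elapsed:,.0f} sentences/sec')
```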

## Supported models

- **T5 family**: `t5-base`, `t5-large` (32k vocab, Unigram)
- **ALBERT**: `albert-base-v2` (30k vocab, Unigram)
- **XLNet**: `xlnet-base-cased` (32k vocab, Unigram)
- **mBART**: `facebook/mbart-large-50` (250k vocab, BPE)

## References

- **[Training Guide](references/training.md)** - Detailed options, corpus preparation
- **[Algorithms](references/algorithms.md)** - BPE vs Unigram, subword regularization

## Resources

- **GitHub**: https://github.com/google/sentencepiece ⭐ 10,000+
- **Paper**: https://arxiv.org/abs/1808.06226 (EMNLP 2018)
- **Version**: 0.2.0+


Overview

This skill provides a concise guide to SentencePiece, a language-independent tokenizer that treats text as raw Unicode and uses whitespace as a meta symbol. It supports BPE and Unigram algorithms, offers deterministic vocabularies, and is optimized for multilingual and CJK use cases. The implementation is lightweight and fast, suitable for training on raw corpora without pre-tokenization.

How this skill works

SentencePiece trains a subword vocabulary from raw Unicode text using either BPE or Unigram models, producing a deterministic mapping from text to piece IDs. It represents spaces with the meta symbol ▁ (U+2581) so tokenization and decoding preserve the original spacing. The library exposes command-line tools and a Python API to train models, encode and decode text as pieces or IDs, and enable features such as subword regularization for sampling-based tokenization.
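
A minimal end-to-end sketch of that workflow, under assumed paths (a raw-text corpus at corpus.txt, one sentence per line):

```python
import sentencepiece as spm

# Train a small Unigram model, then round-trip a sentence through it.
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='demo', vocab_size=8000, model_type='unigram'
)
sp = spm.SentencePieceProcessor(model_file='demo.model')

ids = sp.encode('SentencePiece works on raw text', out_type=int)
print(sp.decode(ids))  # original text restored, including spacing
```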

When to use it

  • Training multilingual models where language-specific preprocessing is undesirable
  • Handling CJK languages (Chinese, Japanese, Korean) that require full character coverage
  • Needing reproducible, deterministic tokenization and stable vocabularies
  • Training directly on raw text without pre-tokenization or whitespace splitting
  • Deploying a lightweight tokenizer (small memory footprint, high throughput)

Best practices

  • Choose Unigram for T5-style models and BPE for some mBART setups; match model type to downstream expectations
  • Set character_coverage to 1.0 for CJK corpora and ~0.9995 for Latin-dominant multilingual data
  • Include user_defined_symbols for special tokens (e.g., <pad>, <eos>, <extra_id_*>), and set unk/eos/pad IDs explicitly for model compatibility (see the sketch after this list)
  • Use multiple threads during training and tune vocab_size based on model capacity and target language mix
  • Enable subword regularization (sampling) for data augmentation and robustness during training
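
A training sketch that applies these practices together; the corpus path, vocabulary size, and special tokens are assumptions to adapt to your setup:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='multilingual_corpus.txt',          # assumed raw-text corpus, one sentence per line
    model_prefix='mixed',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=1.0,                   # CJK-heavy data; ~0.9995 for Latin-dominant text
    user_defined_symbols=['[SEP]', '[CLS]'],
    pad_id=0, eos_id=1, unk_id=2, bos_id=-1,  # explicit IDs for downstream compatibility
    num_threads=16
)
```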

Example use cases

  • Train a 32k Unigram vocabulary on a multilingual corpus for a T5-style encoder-decoder model
  • Create a small 8k BPE tokenizer for resource-constrained on-device inference supporting CJK scripts
  • Preprocess raw web-scale corpora without any language-specific tokenization rules
  • Generate sampled tokenizations for augmentation when training robust language models
  • Export piece IDs for integration with frameworks like Transformers or custom model inputs

FAQ

Should I use Unigram or BPE?

Use Unigram for T5/ALBERT/XLNet-style setups and when you prefer its probabilistic training; choose BPE for models or pipelines that expect pairwise merges or for certain mBART configurations.

How do I handle CJK characters?

Set character_coverage to 1.0 so every CJK character is included in the vocabulary, and skip pre-tokenization entirely; because SentencePiece operates directly on raw Unicode, unsegmented CJK text is handled naturally.

Can I train on raw mixed-language corpora?

Yes. SentencePiece is language-independent and designed to train directly on raw mixed-language corpora without language-specific preprocessing.