SentencePiece - Language-Independent Tokenization

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

25 stars

Best use case

SentencePiece - Language-Independent Tokenization is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

Teams using SentencePiece - Language-Independent Tokenization should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/sentencepiece/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/Orchestra-Research/AI-Research-SKILLs/sentencepiece/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/sentencepiece/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How SentencePiece - Language-Independent Tokenization Compares

| Feature / Agent         | SentencePiece - Language-Independent Tokenization | Standard Approach |
|-------------------------|----------------------------------------------------|-------------------|
| Platform Support        | Not specified                                      | Limited / Varies  |
| Context Awareness       | High                                               | Baseline          |
| Installation Complexity | Unknown                                            | N/A               |

Frequently Asked Questions

What does this skill do?

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# SentencePiece - Language-Independent Tokenization

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

## When to use SentencePiece

**Use SentencePiece when:**
- Building multilingual models (no language-specific rules)
- Working with CJK languages (Chinese, Japanese, Korean)
- Need reproducible tokenization (deterministic vocabulary)
- Want to train on raw text (no pre-tokenization needed)
- Require lightweight deployment (6MB memory, 50k sentences/sec)

**Performance**:
- **Speed**: 50,000 sentences/sec
- **Memory**: ~6MB for loaded model
- **Languages**: All (language-independent)

**Use alternatives instead**:
- **HuggingFace Tokenizers**: Faster training, more flexibility
- **tiktoken**: OpenAI models (GPT-3.5/4)
- **BERT WordPiece**: English-centric tasks

## Quick start

### Installation

```bash
# Python
pip install sentencepiece

# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
```

### Train model

```bash
# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
```

```python
# Python API
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)
```

**Training time**: ~1-2 minutes for 100MB corpus

### Encode and decode

```python
import sentencepiece as spm

# Load model
sp = spm.SentencePieceProcessor(model_file='m.model')

# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces)  # ['▁This', '▁is', '▁a', '▁test']

# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids)  # [284, 47, 11, 1243]

# Decode
text = sp.decode(ids)
print(text)  # "This is a test"
```

## Language-independent design

### Whitespace as symbol (▁)

```python
text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces)  # ['▁Hello', '▁world']

# Decode preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded)  # "Hello world"
```

**Key principle**: Treat text as a raw Unicode sequence; whitespace is encoded as the meta symbol ▁ (U+2581), so tokenization is fully reversible with no language-specific detokenization rules.
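
Because whitespace survives as ▁, detokenization is a plain string operation. A minimal sketch, assuming the `sp` processor loaded earlier:

```python
pieces = sp.encode('Hello world', out_type=str)

# Lossless detokenization: concatenate pieces, map the meta symbol back to a space
detokenized = ''.join(pieces).replace('\u2581', ' ').lstrip()
print(detokenized)  # "Hello world"

# For simple inputs this matches the built-in decoder
assert detokenized == sp.decode_pieces(pieces)
```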

## Tokenization algorithms

### BPE (Byte-Pair Encoding)

```python
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='bpe_model',
    vocab_size=16000,
    model_type='bpe'
)
```

**Used by**: mBART

### Unigram (default)

```python
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='unigram_model',
    vocab_size=8000,
    model_type='unigram'
)
```

**Used by**: T5, ALBERT, XLNet

## Training configuration

### Essential parameters

```python
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=0.9995,  # 1.0 for CJK
    user_defined_symbols=['[SEP]', '[CLS]'],
    unk_piece='<unk>',
    num_threads=16
)
```
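
Symbols passed via `user_defined_symbols` are guaranteed to surface as single pieces and are never split into subwords. A minimal sketch, assuming the model trained with the options above was saved as `m.model`:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# [SEP] and [CLS] were registered as user_defined_symbols, so each stays one piece
pieces = sp.encode('[CLS] some input text [SEP]', out_type=str)
print(pieces)  # e.g. ['▁', '[CLS]', '▁some', '▁input', '▁text', '▁', '[SEP]']
```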

### Character coverage

| Language Type | Coverage | Rationale |
|---------------|----------|-----------|
| English       | 0.9995   | Most common chars |
| CJK (Chinese) | 1.0      | All characters needed |
| Multilingual  | 0.9995   | Balance |
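
For CJK corpora, the table above amounts to a single flag change. A minimal sketch (the corpus path `zh_corpus.txt` is an assumed placeholder):

```python
import sentencepiece as spm

# Full character coverage so rare CJK characters are not mapped to <unk>
spm.SentencePieceTrainer.train(
    input='zh_corpus.txt',
    model_prefix='zh',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=1.0
)
```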

## Encoding options

### Subword regularization

```python
# Sample different tokenizations
for _ in range(3):
    pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
    print(pieces)

# Output (different each time):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
```

**Use case**: Data augmentation for robustness.
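
One way to apply this is to emit several sampled segmentations of each training sentence. A minimal sketch, assuming a Unigram model loaded as `sp` (sampling of this form requires a Unigram model):

```python
def augment(sentences, sp, copies=3, alpha=0.1):
    """Yield several sampled tokenizations of each sentence."""
    for sentence in sentences:
        for _ in range(copies):
            yield sp.encode(sentence, out_type=int,
                            enable_sampling=True, alpha=alpha, nbest_size=-1)

# Each sentence appears `copies` times, each with a (usually) different segmentation
augmented = list(augment(['This is a test'], sp))
```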

## Common patterns

### T5-style training

```python
spm.SentencePieceTrainer.train(
    input='c4_corpus.txt',
    model_prefix='t5',
    vocab_size=32000,
    model_type='unigram',
    user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
    unk_id=2,
    eos_id=1,
    pad_id=0,
    bos_id=-1  # disable BOS so its default ID (1) does not collide with eos_id
)
```

### Integration with transformers

```python
from transformers import T5Tokenizer

# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')
```

## Performance benchmarks

### Training speed

| Corpus | BPE (16k) | Unigram (8k) |
|--------|-----------|--------------|
| 100 MB | 1-2 min   | 3-4 min      |
| 1 GB   | 10-15 min | 30-40 min    |

### Tokenization speed

- **SentencePiece**: 50,000 sentences/sec
- **HF Tokenizers**: 200,000 sentences/sec (4× faster)
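
Throughput depends on hardware, sentence length, and vocabulary size, so it is worth measuring on your own corpus. A rough single-threaded sketch (the repeated placeholder corpus and `m.model` path are assumptions):

```python
import time
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')
lines = ['This is a test sentence.'] * 50_000  # placeholder corpus

start = time.perf_counter()
for line in lines:
    sp.encode(line, out_type=int)
elapsed = time.perf_counter() - start
print(f'{len(lines) / elapsed:,.0f} sentences/sec')
```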

## Supported models

- **T5 family**: `t5-base`, `t5-large` (32k vocab, Unigram)
- **ALBERT**: `albert-base-v2` (30k vocab, Unigram)
- **XLNet**: `xlnet-base-cased` (32k vocab, Unigram)
- **mBART**: `facebook/mbart-large-50` (250k vocab, BPE)
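
These vocabulary sizes can be checked from the slow (SentencePiece-backed) tokenizers in `transformers`. A minimal sketch (models download on first call; reported sizes may include extra special tokens):

```python
from transformers import T5Tokenizer, AlbertTokenizer

for name, cls in [('t5-base', T5Tokenizer), ('albert-base-v2', AlbertTokenizer)]:
    tok = cls.from_pretrained(name)
    print(name, tok.vocab_size)  # e.g. t5-base reports 32100 (32k pieces + 100 extra_ids)
```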

## References

- **[Training Guide](references/training.md)** - Detailed options, corpus preparation
- **[Algorithms](references/algorithms.md)** - BPE vs Unigram, subword regularization

## Resources

- **GitHub**: https://github.com/google/sentencepiece ⭐ 10,000+
- **Paper**: https://arxiv.org/abs/1808.06226 (EMNLP 2018)
- **Version**: 0.2.0+

Related Skills

next-intl-add-language

25
from ComeOnOliver/skillshub

Add new language to a Next.js + next-intl application

answering-natural-language-questions-with-dbt

25
from ComeOnOliver/skillshub

Writes and executes SQL queries against the data warehouse using dbt's Semantic Layer or ad-hoc SQL to answer business questions. Use when a user asks about analytics, metrics, KPIs, or data (e.g., "What were total sales last quarter?", "Show me top customers by revenue"). NOT for validating, testing, or building dbt models during development.

Zig — Modern Systems Programming Language

25
from ComeOnOliver/skillshub


Stagehand — AI Browser Automation in Natural Language

25
from ComeOnOliver/skillshub

You are an expert in Stagehand by BrowserBase, the AI-powered browser automation framework that lets you control web pages using natural language instructions. You help developers build web automations that act, extract data, and observe pages using plain English commands instead of brittle CSS selectors — powered by GPT-4o or Claude for visual understanding of page layouts.

Mojo — Python-Speed Systems Language for AI

25
from ComeOnOliver/skillshub


Malloy — Semantic Data Language

25
from ComeOnOliver/skillshub


Gleam — Type-Safe Language for the BEAM

25
from ComeOnOliver/skillshub


Elixir — Functional Language for Scalable Applications

25
from ComeOnOliver/skillshub

You are an expert in Elixir, the functional programming language built on the Erlang VM (BEAM). You help developers build highly concurrent, fault-tolerant, and distributed systems using Elixir's process model, pattern matching, GenServer, supervision trees, Phoenix web framework, LiveView for real-time UI, and Ecto for database interactions — achieving massive concurrency with lightweight processes and "let it crash" reliability.

LLaVA - Large Language and Vision Assistant

25
from ComeOnOliver/skillshub

Open-source vision-language model for conversational image understanding.

HuggingFace Tokenizers - Fast Tokenization for NLP

25
from ComeOnOliver/skillshub

Fast, production-ready tokenizers with Rust performance and Python ease-of-use.

CLIP - Contrastive Language-Image Pre-Training

25
from ComeOnOliver/skillshub

OpenAI's model that understands images from natural language.

BLIP-2: Vision-Language Pre-training

25
from ComeOnOliver/skillshub

Comprehensive guide to using Salesforce's BLIP-2 for vision-language tasks with frozen image encoders and large language models.