tokenizer-guide

Reference guide for cl100k_base tokenization and LLM cost optimization. Use when you need to understand how the GPT tokenizer works, why text splits the way it does, estimate token counts for common content types, understand context window limits, or find strategies for reducing token usage and API costs. Triggers include "how does tokenization work", "why is my token count high", "BPE", "byte-pair encoding", "context window", "reduce tokens", "prompt optimization", "cost optimization", or any question about how LLMs count text.

7 stars

byheldernoid

View on GitHub Installation ↓

Best use case

tokenizer-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using tokenizer-guide should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/tokenizer-guide/SKILL.md --create-dirs "https://raw.githubusercontent.com/heldernoid/agentic-build-templates/main/projects/ai-llm-tools/token-counter/skills/tokenizer-guide/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/tokenizer-guide/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How tokenizer-guide Compares

Feature / Agent	tokenizer-guide	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# tokenizer-guide

Reference guide for cl100k_base (the GPT tokenizer) and strategies for reducing token usage in LLM applications.

## What is a token?

A token is the basic unit of text that LLMs process. Tokens are not words, characters, or bytes - they are subword fragments produced by Byte-Pair Encoding (BPE). On average in English text:

- 1 token = approximately 4 characters
- 1,000 tokens = approximately 750 words
- 1 page of text = approximately 500-800 tokens

## cl100k_base encoding

cl100k_base is the tokenizer used by:

- GPT-4 (all variants)
- GPT-4o
- GPT-3.5-turbo
- GPT-4-turbo
- text-embedding-ada-002

It has a vocabulary of ~100,000 tokens (hence "100k"). Spaces are typically merged with the following word to form one token (" hello" is different from "hello").

## How BPE works

1. Start with individual UTF-8 bytes as the initial vocabulary
2. Count the most frequently occurring byte pair in the training corpus
3. Merge that pair into a new single token
4. Repeat until the target vocabulary size is reached

The resulting tokenizer efficiently represents common words as single tokens while splitting rare words into multiple subword tokens.

## Tokenization rules

### Case sensitivity

Uppercase and lowercase letters can map to different tokens. "Python" (capitalized) and "python" (lowercase) are often different token IDs.

### Leading spaces

" word" (with leading space) and "word" (without) are usually different tokens. This is a quirk of how GPT tokenizers are trained on concatenated text.

### Numbers

Individual digits are single tokens. Multi-digit numbers may be split:
- "123" - likely 1 token
- "12345" - may be 2-3 tokens

### Punctuation

Most punctuation marks are single tokens. Some combinations (like "...") may be one token.

### Code and special characters

Code tokenizes differently from prose. Variable names with underscores, camelCase, and symbols often split in non-obvious ways.

## Common token counts by content type

| Content | Approximate tokens |
|---|---|
| Single common word | 1 |
| Single rare/long word | 2-4 |
| One sentence (~15 words) | 15-25 |
| One paragraph (~100 words) | 100-150 |
| System prompt (concise) | 50-150 |
| System prompt (verbose) | 300-800 |
| Short story (1,000 words) | 1,200-1,500 |
| 1,000-row CSV (compact) | 5,000-15,000 |
| Large codebase file | 2,000-10,000 |

## Context windows by model

| Model | Context window | Approximate pages |
|---|---|---|
| gpt-4 | 8,192 tokens | 6 pages |
| gpt-3.5-turbo | 16,385 tokens | 12 pages |
| gpt-4-turbo | 128,000 tokens | 96 pages |
| gpt-4o | 128,000 tokens | 96 pages |
| claude-3-5-sonnet | 200,000 tokens | 150 pages |
| claude-3-opus | 200,000 tokens | 150 pages |
| gemini-1.5-pro | 1,000,000 tokens | 750 pages |

## Reducing token usage

### System prompts

Keep system prompts under 200 tokens when possible. Every request includes the full system prompt - a 300-token reduction at 1,000 requests/day saves 300,000 tokens/day.

Remove redundant instructions: "Always be helpful and professional and polite and respond in a timely manner" can become "Be concise and professional."

### JSON payloads

Compact JSON uses significantly fewer tokens than pretty-printed:

```json
{"key":"value","count":1}
```

vs:

```json
{
  "key": "value",
  "count": 1
}
```

### RAG context

Limit retrieved document chunks to the minimum needed. Truncate each chunk to the first N characters. Use overlap only when semantic continuity is important.

### Conversation history

Summarize older messages rather than passing the full history. Use a rolling window that keeps only the last N turns.

### Whitespace

Remove trailing spaces, multiple blank lines, and repeated whitespace. This has a measurable impact on large prompts.

## Cost formula

```
input_cost  = (input_tokens  / 1_000_000) * input_price_per_million
output_cost = (output_tokens / 1_000_000) * output_price_per_million
total_cost  = input_cost + output_cost
```

Monthly projection:

```
monthly_cost = total_cost * requests_per_day * 30
```

## Token counting in code (Node.js)

```typescript
import { get_encoding } from 'js-tiktoken';

const enc = get_encoding('cl100k_base');
const tokens = enc.encode('Hello, world!');
console.log(tokens.length);  // 4
enc.free();
```

## Token counting in code (Python)

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Hello, world!")
print(len(tokens))  # 4
```

Related Skills

Skill: Uptime Monitoring

from heldernoid/agentic-build-templates

## Overview

Skill: Status Page

from heldernoid/agentic-build-templates

## Overview

Skill: unit-conversion

from heldernoid/agentic-build-templates

## Overview

Skill: recipe-scaler

from heldernoid/agentic-build-templates

## Overview

reading-list

from heldernoid/agentic-build-templates

Operate the reading-list API to save, manage, tag, search, and export articles.

email-digest

from heldernoid/agentic-build-templates

Configure, test, and troubleshoot the reading-list daily email digest delivered via nodemailer.

websocket-realtime

from heldernoid/agentic-build-templates

Use the WebSocket connection in poll-builder to receive live vote updates. Use when you need to stream real-time poll results, monitor a poll for new votes, or build a live dashboard. Triggers include "live results", "real-time updates", "stream votes", "watch poll", or "WebSocket".

poll-builder

from heldernoid/agentic-build-templates

Self-hosted poll creation tool with real-time results. Use when you need to create a poll, check vote counts, close a poll, export results, or get the shareable link for a poll. Triggers include "create poll", "vote", "poll results", "survey", "collect votes", "share poll", or any task involving polling or voting.

Skill: personal-finance

from heldernoid/agentic-build-templates

## Overview

Skill: csv-import

from heldernoid/agentic-build-templates

## Overview

Skill: Syntax Highlighting

from heldernoid/agentic-build-templates

## Purpose

Skill: Pastebin Core

from heldernoid/agentic-build-templates

## Purpose