jsonl-format
JSONL format guide for LLM fine-tuning. Covers OpenAI, Anthropic, and Llama formats, format validation rules, conversion between formats, and quality checklist.
Best use case
jsonl-format is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using jsonl-format can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/jsonl-validation/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How jsonl-format Compares
| Feature / Agent | jsonl-format | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
JSONL format guide for LLM fine-tuning. Covers OpenAI, Anthropic, and Llama formats, format validation rules, conversion between formats, and quality checklist.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# JSONL Format Guide for LLM Fine-Tuning
## What is JSONL
JSONL (JSON Lines) is a text format where each line is a valid JSON object. Fine-tuning datasets use JSONL because it is easy to stream line-by-line without loading the entire file into memory.
```
{"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}]}
{"messages": [{"role": "user", "content": "How are you?"}, {"role": "assistant", "content": "I am doing well."}]}
```
Rules:
- One JSON object per line
- No trailing commas
- Each object must fit on a single line (no pretty-printed JSON)
- Avoid empty lines: many parsers skip them, but strict validators may reject them
- UTF-8 encoding
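The rules above can be checked mechanically. The following is a minimal sketch using only the Python standard library; the function name `read_jsonl` and the error-message wording are illustrative choices, not part of any fine-tuning API.

```python
import json

def read_jsonl(lines):
    """Parse an iterable of text lines; return (records, errors)."""
    records, errors = [], []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank line: skipped in this sketch
        try:
            obj = json.loads(text)
        except json.JSONDecodeError as exc:
            # Trailing commas and pretty-printed fragments both land here.
            errors.append(f"line {number}: {exc.msg}")
            continue
        if isinstance(obj, dict):
            records.append(obj)
        else:
            errors.append(f"line {number}: expected a JSON object")
    return records, errors
```

To run it on a file, open the file with `encoding="utf-8"` and pass the file object as `lines`, so the dataset is read one line at a time rather than loaded whole.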
## OpenAI Format (ChatML)
Used for GPT-3.5-turbo, GPT-4, and any model that accepts the messages API.
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ]
}
```
Rules:
- `messages` key required, must be an array
- Each message must have `role` and `content`
- Valid roles: `system`, `user`, `assistant`
- Array must end with an `assistant` message
- `system` message is optional but must appear first if present
- `content` must not be an empty string
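As a rough illustration, the rules above translate into a small validator. This is a sketch, not an official OpenAI check; `validate_openai` is a hypothetical helper name, and it additionally treats an empty `messages` array as invalid, since such an array cannot end with an `assistant` message.

```python
VALID_ROLES = {"system", "user", "assistant"}

def validate_openai(sample):
    """Return a list of rule violations for one chat sample (empty if valid)."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["`messages` must be a non-empty array"]
    errors = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"messages[{i}] must be an object")
            continue
        role, content = msg.get("role"), msg.get("content")
        if role not in VALID_ROLES:
            errors.append(f"messages[{i}]: invalid role {role!r}")
        if not isinstance(content, str) or not content:
            errors.append(f"messages[{i}]: content must be a non-empty string")
        if role == "system" and i != 0:
            errors.append(f"messages[{i}]: system message must come first")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        errors.append("last message must have role `assistant`")
    return errors
```

Returning a list of messages instead of raising on the first problem makes it easier to report every issue in a sample at once.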
## Anthropic Format
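Legacy prompt/completion format built on Anthropic's `Human:`/`Assistant:` text-completion convention. Confirm the schema your fine-tuning provider expects before using it, since Claude fine-tuning offerings may require a messages-style format instead.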
```json
{
  "prompt": "Human: What is the capital of France?\n\nAssistant:",
  "completion": " The capital of France is Paris."
}
```
Rules:
- `prompt` key required, must be a string
- `completion` key required, must be a string
- `prompt` should end with `\n\nAssistant:`
- `completion` conventionally starts with a space character
- Human turns use the prefix `Human:` or `\n\nHuman:`
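A minimal sketch of these checks follows. `validate_anthropic` is a hypothetical helper, and the requirement that the prompt starts with a Human turn is one interpretation of the last rule rather than something a provider is known to enforce.

```python
def validate_anthropic(sample):
    """Return a list of rule violations for one prompt/completion sample."""
    errors = []
    prompt, completion = sample.get("prompt"), sample.get("completion")
    if not isinstance(prompt, str):
        errors.append("`prompt` must be a string")
    else:
        if not prompt.endswith("\n\nAssistant:"):
            errors.append("`prompt` should end with '\\n\\nAssistant:'")
        # Assumption: the first turn is a Human turn, with or without
        # the leading blank line.
        if not prompt.startswith(("Human:", "\n\nHuman:")):
            errors.append("`prompt` should start with a Human turn")
    if not isinstance(completion, str):
        errors.append("`completion` must be a string")
    elif not completion.startswith(" "):
        errors.append("`completion` conventionally starts with a space")
    return errors
```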
## Llama Format (Instruction-Following)
Used for Llama, Mistral, and similar instruction-tuned models.
```json
{
  "instruction": "Translate the following sentence to French.",
  "input": "The weather is beautiful today.",
  "output": "Le temps est magnifique aujourd'hui."
}
```
Rules:
- `instruction` key required, non-empty string
- `input` key optional; use empty string if no context needed
- `output` key required, non-empty string
- Typically formatted with Alpaca template at training time:
`### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}`
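To show how a sample becomes training text, here is a small sketch that validates the required keys and fills in the template above. `render_alpaca` is a hypothetical helper; dropping the `### Input:` section when `input` is empty is an assumption modeled on common Alpaca-style usage, so check what your training framework actually does.

```python
ALPACA_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)
# Assumption: samples without context omit the Input section entirely.
ALPACA_NO_INPUT = "### Instruction:\n{instruction}\n\n### Response:\n{output}"

def render_alpaca(sample):
    """Validate one instruction sample and render it as training text."""
    for key in ("instruction", "output"):
        value = sample.get(key)
        if not isinstance(value, str) or not value:
            raise ValueError(f"`{key}` must be a non-empty string")
    extra = sample.get("input", "")  # optional; empty string means no context
    if extra:
        return ALPACA_WITH_INPUT.format(
            instruction=sample["instruction"], input=extra, output=sample["output"]
        )
    return ALPACA_NO_INPUT.format(
        instruction=sample["instruction"], output=sample["output"]
    )
```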
## Format Conversion
### OpenAI to Anthropic
```
messages[0] (system) -> prepend to first Human turn
messages[1] (user) -> "Human: {content}\n\nAssistant:"
messages[2] (asst) -> " {content}"
```
Multi-turn:
```
"Human: {turn1}\n\nAssistant: {turn1_response}\n\nHuman: {turn2}\n\nAssistant:"
```
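The mapping above, including the multi-turn case, could be sketched as follows. `openai_to_anthropic` is a hypothetical helper, and joining the system text to the first Human turn with a blank line is an assumption, since the mapping only says to prepend it.

```python
def openai_to_anthropic(sample):
    """Convert one messages-style sample to a prompt/completion pair."""
    messages = list(sample["messages"])
    system = ""
    if messages and messages[0]["role"] == "system":
        system = messages.pop(0)["content"]
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("sample must end with an assistant message")
    # The final assistant turn becomes the completion, with a leading space.
    completion = " " + messages.pop()["content"]
    parts = []
    for msg in messages:
        if msg["role"] == "user":
            text = msg["content"]
            if system:
                # Assumption: system text and user text are separated
                # by a blank line inside the first Human turn.
                text = f"{system}\n\n{text}"
                system = ""
            parts.append(f"Human: {text}")
        else:
            parts.append(f"Assistant: {msg['content']}")
    prompt = "\n\n".join(parts) + "\n\nAssistant:"
    return {"prompt": prompt, "completion": completion}
```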
### OpenAI to Llama
```
messages[0] (system) -> instruction
messages[1] (user) -> input
messages[-1] (asst) -> output
```
Note: multi-turn conversations lose context when converted to Llama single-turn format.
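One way to implement this mapping is sketched below. `openai_to_llama` is a hypothetical helper with two design choices that are assumptions, not rules from this guide: it rejects multi-turn samples instead of silently dropping context, and when there is no system message it uses the user text as the instruction.

```python
def openai_to_llama(sample):
    """Convert a single-turn messages sample to instruction/input/output."""
    messages = sample["messages"]
    system = [m["content"] for m in messages if m["role"] == "system"]
    users = [m["content"] for m in messages if m["role"] == "user"]
    assistants = [m["content"] for m in messages if m["role"] == "assistant"]
    if len(users) != 1 or len(assistants) != 1:
        raise ValueError("only single-turn samples convert without losing context")
    if system:
        return {"instruction": system[0], "input": users[0], "output": assistants[0]}
    # Assumption: with no system message, the user text becomes the instruction.
    return {"instruction": users[0], "input": "", "output": assistants[0]}
```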
### Anthropic to OpenAI
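Split `prompt` on the `\n\nHuman:` and `\n\nAssistant:` boundaries to reconstruct the `messages` array, then append `completion` (with its leading space stripped) as the final `assistant` message.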
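A minimal sketch of that split uses a regular expression over the turn markers. `anthropic_to_openai` is a hypothetical helper. It cannot recover a system message that was folded into a Human turn, and it will split incorrectly if a message body itself contains `\n\nHuman:` or `\n\nAssistant:`.

```python
import re

# Matches a turn marker at the start of the prompt or after a blank line.
TURN = re.compile(r"(?:^|\n\n)(Human|Assistant):")
ROLES = {"Human": "user", "Assistant": "assistant"}

def anthropic_to_openai(sample):
    """Rebuild a messages array from a prompt/completion pair."""
    # With one capturing group, split() returns
    # [prefix, speaker1, text1, speaker2, text2, ...].
    pieces = TURN.split(sample["prompt"])
    messages = []
    for speaker, text in zip(pieces[1::2], pieces[2::2]):
        text = text.strip()
        if text:  # the trailing "Assistant:" marker has no text yet
            messages.append({"role": ROLES[speaker], "content": text})
    messages.append({"role": "assistant", "content": sample["completion"].strip()})
    return {"messages": messages}
```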
## Quality Checklist
Before submitting for training:
- [ ] Every line is valid JSON (no trailing commas, no syntax errors)
- [ ] Every sample matches the target format exactly
- [ ] No empty `content` or `output` fields
- [ ] Assistant turns are non-trivial (not just "OK" or "I see")
- [ ] No near-duplicate pairs (similarity > 0.8)
- [ ] Response length is appropriate (10-2000 chars for most tasks)
- [ ] Token count per sample is within model context window
- [ ] Dataset has diverse vocabulary (not the same phrasing repeated)
- [ ] Train/eval split created with fixed random seed for reproducibility
- [ ] No PII (names, emails, phone numbers) unless the dataset purpose requires it
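Two of the checklist items lend themselves to a short script: near-duplicate detection and a reproducible train/eval split. The sketch below assumes the similarity score is the character-level ratio from Python's `difflib`, which this guide does not specify; embedding-based similarity is another common choice. The function names, `eval_fraction`, and `seed` defaults are illustrative.

```python
import random
from difflib import SequenceMatcher

def near_duplicates(texts, threshold=0.8):
    """Return index pairs (i, j) whose similarity ratio exceeds the threshold."""
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if SequenceMatcher(None, texts[i], texts[j]).ratio() > threshold:
                pairs.append((i, j))
    return pairs

def split_dataset(samples, eval_fraction=0.1, seed=42):
    """Shuffle with a fixed seed and return (train, eval) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # private RNG, so the split is repeatable
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

The pairwise comparison is quadratic in the number of samples, so it suits small datasets; larger ones would need hashing or an approximate method.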
## Token Count Estimation
Quick estimation (not exact):
```
tokens ≈ len(text_in_chars) / 4
```
More accurate: use the model's own tokenizer, such as the `tiktoken` library for GPT models or a Hugging Face `transformers` tokenizer for Llama.
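Both approaches can be sketched side by side. The helper names are hypothetical; the `tiktoken` function needs the package installed, and `cl100k_base` is an assumed encoding that may not match your target model. Neither count includes the per-message overhead tokens that chat formats add, so leave some headroom.

```python
def estimate_tokens(text):
    """Rough estimate: about four characters per token, rounded up."""
    return (len(text) + 3) // 4

def count_tokens_tiktoken(text, encoding_name="cl100k_base"):
    """Tokenizer-based count; requires `pip install tiktoken`."""
    import tiktoken  # imported lazily so the heuristic works without it
    return len(tiktoken.get_encoding(encoding_name).encode(text))

def fits_context(sample, limit, count=estimate_tokens):
    """Check whether the summed message content stays within `limit` tokens."""
    total = sum(count(m["content"]) for m in sample["messages"])
    return total <= limit
```

Pass `count=count_tokens_tiktoken` to `fits_context` to swap the heuristic for the tokenizer-based count.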
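Typical limits (these change between model versions, so confirm them in the provider's current documentation):
- GPT-3.5-turbo fine-tune: 4,096 tokens per sample
- GPT-4 fine-tune: 4,096 tokens per sample
- Claude: 200,000-token context window; training-data limits vary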
- Llama 3 8B: 8,192-token context window