jsonl-format

JSONL format guide for LLM fine-tuning. Covers OpenAI, Anthropic, and Llama formats, format validation rules, conversion between formats, and a quality checklist.

by heldernoid

Best use case

jsonl-format is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using jsonl-format can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/jsonl-validation/SKILL.md --create-dirs "https://raw.githubusercontent.com/heldernoid/agentic-build-templates/main/projects/ai-llm-tools/finetune-data-curator/skills/jsonl-validation/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/jsonl-validation/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How jsonl-format Compares

Feature / Agent	jsonl-format	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

It is a JSONL format guide for LLM fine-tuning. It covers OpenAI, Anthropic, and Llama formats, format validation rules, conversion between formats, and a quality checklist.

Where can I find the source code?

The source is in the GitHub repository heldernoid/agentic-build-templates, under projects/ai-llm-tools/finetune-data-curator/skills/jsonl-validation/.

SKILL.md Source

# JSONL Format Guide for LLM Fine-Tuning

## What is JSONL

JSONL (JSON Lines) is a text format where each line is a valid JSON object. Fine-tuning datasets use JSONL because it is easy to stream line-by-line without loading the entire file into memory.

```
{"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}]}
{"messages": [{"role": "user", "content": "How are you?"}, {"role": "assistant", "content": "I am doing well."}]}
```

Rules:
- One JSON object per line
- No trailing commas
- Each object must fit on a single line (no pretty-printed JSON)
- Empty lines are skipped by most readers, but are best left out
- UTF-8 encoding
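
The rules above can be checked while streaming. The following is a minimal sketch, not a definitive implementation: `iter_jsonl` is a hypothetical helper name, and skipping blank lines is an assumption that mirrors the rule list.

```python
import io
import json
from typing import Iterator, TextIO, Tuple


def iter_jsonl(stream: TextIO) -> Iterator[Tuple[int, dict]]:
    """Yield (line_number, object) pairs from a JSONL text stream."""
    for line_number, line in enumerate(stream, start=1):
        stripped = line.strip()
        if not stripped:
            continue  # assumption: blank lines are skipped, not rejected
        try:
            record = json.loads(stripped)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        yield line_number, record


sample = io.StringIO(
    '{"messages": [{"role": "user", "content": "Hello"}]}\n'
    "\n"
    '{"messages": [{"role": "user", "content": "How are you?"}]}\n'
)
records = list(iter_jsonl(sample))
print([number for number, _ in records])  # prints [1, 3]
```

For a real dataset, pass a file opened with `open(path, encoding="utf-8")` so the UTF-8 rule is enforced at read time. Because the function is a generator, only one line is held in memory at a time.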

## OpenAI Format (ChatML)

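Used for GPT-3.5-turbo, GPT-4, and any model that accepts the messages API. The examples in this guide are pretty-printed for readability; in a JSONL file each sample occupies a single line.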

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ]
}
```

Rules:
- `messages` key required, must be an array
- Each message must have `role` and `content`
- Valid roles: `system`, `user`, `assistant`
- Array must end with an `assistant` message
- `system` message is optional but must appear first if present
- `content` must not be an empty string
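
As a rough illustration, the rules above could be checked per sample like this. `validate_openai_sample` is a hypothetical name, and treating an empty `messages` array as invalid is an assumption that follows from the "must end with an `assistant` message" rule.

```python
from typing import List

VALID_ROLES = {"system", "user", "assistant"}


def validate_openai_sample(sample: dict) -> List[str]:
    """Return a list of rule violations; an empty list means the sample passed."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["`messages` must be a non-empty array"]

    errors = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            errors.append(f"messages[{index}] must be an object")
            continue
        role = message.get("role")
        content = message.get("content")
        if not isinstance(role, str) or role not in VALID_ROLES:
            errors.append(f"messages[{index}] has invalid role {role!r}")
        if role == "system" and index != 0:
            errors.append(f"messages[{index}]: system message must come first")
        if not isinstance(content, str) or content == "":
            errors.append(f"messages[{index}] has missing or empty content")

    last = messages[-1]
    if not (isinstance(last, dict) and last.get("role") == "assistant"):
        errors.append("last message must have role `assistant`")
    return errors


example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}
print(validate_openai_sample(example))  # prints []
```

Returning a list of messages rather than raising on the first problem makes it easier to report every issue in a line at once.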

## Anthropic Format

Prompt/completion pairs for Claude, written with the `Human:` / `Assistant:` turn markers of Anthropic's legacy Text Completions API.

```json
{
  "prompt": "Human: What is the capital of France?\n\nAssistant:",
  "completion": " The capital of France is Paris."
}
```

Rules:
- `prompt` key required, must be a string
- `completion` key required, must be a string
- `prompt` should end with `\n\nAssistant:`
- `completion` conventionally starts with a space character
- The first human turn starts with `Human:`; later human turns are prefixed with `\n\nHuman:`
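
A minimal sketch of these checks follows. `validate_anthropic_sample` is a hypothetical helper, and it reports the conventions (trailing `\n\nAssistant:`, leading space) alongside the hard requirements, which is a design assumption rather than something the format mandates.

```python
from typing import List

ASSISTANT_SUFFIX = "\n\nAssistant:"


def validate_anthropic_sample(sample: dict) -> List[str]:
    """Return a list of issues; an empty list means the sample passed."""
    issues = []
    prompt = sample.get("prompt")
    completion = sample.get("completion")

    if not isinstance(prompt, str):
        issues.append("`prompt` must be a string")
    else:
        if not prompt.startswith(("Human:", "\n\nHuman:")):
            issues.append("`prompt` should start with a Human turn")
        if not prompt.endswith(ASSISTANT_SUFFIX):
            issues.append("`prompt` should end with '\\n\\nAssistant:'")

    if not isinstance(completion, str):
        issues.append("`completion` must be a string")
    elif not completion.startswith(" "):
        issues.append("`completion` conventionally starts with a space")
    return issues


example = {
    "prompt": "Human: What is the capital of France?\n\nAssistant:",
    "completion": " The capital of France is Paris.",
}
print(validate_anthropic_sample(example))  # prints []
```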

## Llama Format (Instruction-Following)

Alpaca-style instruction format, commonly used to fine-tune Llama, Mistral, and similar open-weight models.

```json
{
  "instruction": "Translate the following sentence to French.",
  "input": "The weather is beautiful today.",
  "output": "Le temps est magnifique aujourd'hui."
}
```

Rules:
- `instruction` key required, non-empty string
- `input` key optional; use empty string if no context needed
- `output` key required, non-empty string
- Typically formatted with Alpaca template at training time:
  `### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}`
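
To illustrate how the template is applied at training time, here is a small sketch. `render_alpaca` is a hypothetical name; it uses the single template shown above and keeps the `### Input:` section even when `input` is empty, which is an assumption (some training scripts use a separate no-input template instead).

```python
ALPACA_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def render_alpaca(sample: dict) -> str:
    """Validate one instruction sample and render it with the template."""
    instruction = sample.get("instruction")
    output = sample.get("output")
    if not isinstance(instruction, str) or not instruction:
        raise ValueError("`instruction` must be a non-empty string")
    if not isinstance(output, str) or not output:
        raise ValueError("`output` must be a non-empty string")

    # `input` is optional; a missing key is treated like an empty string.
    input_text = sample.get("input", "")
    if not isinstance(input_text, str):
        raise ValueError("`input` must be a string when present")

    return ALPACA_TEMPLATE.format(
        instruction=instruction, input=input_text, output=output
    )


print(render_alpaca({
    "instruction": "Translate the following sentence to French.",
    "input": "The weather is beautiful today.",
    "output": "Le temps est magnifique aujourd'hui.",
}))
```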

## Format Conversion

### OpenAI to Anthropic

```
messages[0] (system) -> prepend to first Human turn
messages[1] (user)   -> "Human: {content}\n\nAssistant:"
messages[2] (asst)   -> " {content}"
```

Multi-turn:
```
"Human: {turn1}\n\nAssistant: {turn1_response}\n\nHuman: {turn2}\n\nAssistant:"
```
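
The mapping above can be sketched as a small function. `openai_to_anthropic` is a hypothetical name, and joining the system text to the first user turn with a blank line is an assumption; the mapping only says to prepend it.

```python
def openai_to_anthropic(sample: dict) -> dict:
    """Convert one OpenAI-style sample into a prompt/completion pair."""
    messages = list(sample["messages"])
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("conversation must end with an assistant message")

    system_text = ""
    if messages[0]["role"] == "system":
        system_text = messages[0]["content"]
        messages = messages[1:]

    turns = []
    for message in messages[:-1]:  # the final assistant turn becomes the completion
        role, content = message["role"], message["content"]
        if role == "user":
            if system_text:
                # Assumption: system text and first user turn are separated by a blank line.
                content = f"{system_text}\n\n{content}"
                system_text = ""
            turns.append(f"Human: {content}")
        elif role == "assistant":
            turns.append(f"Assistant: {content}")
        else:
            raise ValueError(f"unexpected role {role!r}")
    if not turns:
        raise ValueError("conversation needs at least one user message")

    prompt = "\n\n".join(turns) + "\n\nAssistant:"
    return {"prompt": prompt, "completion": " " + messages[-1]["content"]}


example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}
print(openai_to_anthropic(example))
```

Multi-turn input produces the alternating `Human:` / `Assistant:` prompt shown above, with the last assistant reply moved into `completion`.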

### OpenAI to Llama

```
messages[0] (system) -> instruction
messages[1] (user)   -> input
messages[-1] (asst)  -> output
```

Note: multi-turn conversations lose context when converted to Llama single-turn format.
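
One possible way to apply this mapping is sketched below. `openai_to_llama` is a hypothetical name. Two choices here are assumptions, not part of the mapping above: multi-turn conversations are rejected instead of being silently truncated, and when no system message exists the user text becomes `instruction` with an empty `input`.

```python
def openai_to_llama(sample: dict) -> dict:
    """Convert a single-turn OpenAI-style sample to instruction format."""
    messages = sample["messages"]
    system = [m["content"] for m in messages if m["role"] == "system"]
    users = [m["content"] for m in messages if m["role"] == "user"]
    assistants = [m["content"] for m in messages if m["role"] == "assistant"]

    # Assumption: refuse multi-turn input rather than drop turns without notice.
    if len(users) != 1 or len(assistants) != 1:
        raise ValueError("only single-turn conversations convert without losing context")

    if system:
        return {"instruction": system[0], "input": users[0], "output": assistants[0]}
    # Assumption: with no system message, the user text serves as the instruction.
    return {"instruction": users[0], "input": "", "output": assistants[0]}


print(openai_to_llama({
    "messages": [
        {"role": "system", "content": "Translate the following sentence to French."},
        {"role": "user", "content": "The weather is beautiful today."},
        {"role": "assistant", "content": "Le temps est magnifique aujourd'hui."},
    ]
}))
```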

### Anthropic to OpenAI

Split `prompt` on `\n\nHuman:` and `\n\nAssistant:` boundaries to reconstruct messages array.
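
A sketch of that split using a regular expression is shown below. `anthropic_to_openai` is a hypothetical name, and treating any text before the first turn marker as a system message is an assumption. Content that itself contains `\n\nHuman:` or `\n\nAssistant:` would be split incorrectly.

```python
import re

# Matches a turn marker at the start of the prompt or after a blank line.
TURN_PATTERN = re.compile(r"(?:^|\n\n)(Human|Assistant):")


def anthropic_to_openai(sample: dict) -> dict:
    """Rebuild a messages array from a prompt/completion pair."""
    # With one capture group, split() returns:
    # [leading_text, speaker1, text1, speaker2, text2, ...]
    pieces = TURN_PATTERN.split(sample["prompt"])

    messages = []
    leading = pieces[0].strip()
    if leading:
        # Assumption: text before the first marker is a system message.
        messages.append({"role": "system", "content": leading})

    for speaker, text in zip(pieces[1::2], pieces[2::2]):
        text = text.strip()
        if not text:
            continue  # the trailing "Assistant:" marker has no text yet
        role = "user" if speaker == "Human" else "assistant"
        messages.append({"role": role, "content": text})

    messages.append({"role": "assistant", "content": sample["completion"].strip()})
    return {"messages": messages}


print(anthropic_to_openai({
    "prompt": "Human: What is the capital of France?\n\nAssistant:",
    "completion": " The capital of France is Paris.",
}))
```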

## Quality Checklist

Before submitting for training:

- [ ] Every line is valid JSON (no trailing commas, no syntax errors)
- [ ] Every sample matches the target format exactly
- [ ] No empty `content` or `output` fields
- [ ] Assistant turns are non-trivial (not just "OK" or "I see")
- [ ] No near-duplicate pairs (similarity > 0.8)
- [ ] Response length is appropriate (10-2000 chars for most tasks)
- [ ] Token count per sample is within model context window
- [ ] Dataset has diverse vocabulary (not the same phrasing repeated)
- [ ] Train/eval split created with fixed random seed for reproducibility
- [ ] No PII (names, emails, phone numbers) unless the dataset purpose requires it
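
Two of the checklist items lend themselves to a short sketch: near-duplicate detection and a reproducible train/eval split. The helper names are hypothetical, and the similarity metric is an assumption: the checklist does not say which one to use, so this example uses `difflib.SequenceMatcher.ratio()` from the standard library.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Sequence, Tuple


def find_near_duplicates(
    texts: Sequence[str], threshold: float = 0.8
) -> List[Tuple[int, int]]:
    """Return index pairs whose similarity ratio exceeds the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        if SequenceMatcher(None, a, b).ratio() > threshold:
            pairs.append((i, j))
    return pairs


def split_train_eval(
    samples: Sequence[dict], eval_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[dict], List[dict]]:
    """Shuffle with a fixed seed and return (train, eval) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    eval_size = max(1, int(len(shuffled) * eval_fraction)) if shuffled else 0
    return shuffled[eval_size:], shuffled[:eval_size]


responses = [
    "The capital of France is Paris.",
    "The capital of France is Paris!",
    "Le temps est magnifique aujourd'hui.",
]
print(find_near_duplicates(responses))  # prints [(0, 1)]

train, evaluation = split_train_eval([{"id": i} for i in range(10)], eval_fraction=0.2)
print(len(train), len(evaluation))  # prints 8 2
```

The pairwise comparison is quadratic in the number of samples, so it suits small datasets; larger ones would need a different approach such as hashing or embedding-based search.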

## Token Count Estimation

Quick estimate, assuming roughly four characters per token in English text (not exact):

```
tokens ≈ len(text_in_chars) / 4
```

For a more accurate count, use the model's own tokenizer: the `tiktoken` library for GPT models, or the `transformers` tokenizer for Llama.

Typical limits (these change between model versions, so check the provider's current documentation):
- GPT-3.5-turbo fine-tune: 4,096 tokens per sample
- GPT-4 fine-tune: 4,096 tokens per sample
- Claude: 200,000 context, training data varies
- Llama 3 8B: 8,192 context window
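
The estimate and the limits can be combined into a simple pre-flight check. This is a sketch under stated assumptions: `estimate_tokens` and `fits_context` are hypothetical names, and the optional `count_tokens` callable is a placeholder for a real tokenizer.

```python
import math
from typing import Callable, Optional


def estimate_tokens(
    text: str, count_tokens: Optional[Callable[[str], int]] = None
) -> int:
    """Count tokens with the given tokenizer, or fall back to chars / 4."""
    if count_tokens is not None:
        return count_tokens(text)
    return math.ceil(len(text) / 4)  # rough heuristic, rounded up


def fits_context(
    text: str, limit: int, count_tokens: Optional[Callable[[str], int]] = None
) -> bool:
    """Return True when the (estimated) token count is within the limit."""
    return estimate_tokens(text, count_tokens) <= limit


sample_text = "What is the capital of France? The capital of France is Paris."
print(estimate_tokens(sample_text))   # prints 16
print(fits_context(sample_text, 4096))  # prints True
```

To use a real tokenizer, pass a callable that returns a count, for example a function that wraps `tiktoken` or a `transformers` tokenizer and returns the length of the encoded sequence.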
