finetune-data-curator
Web app for creating, editing, and validating JSONL fine-tuning datasets. Checks format compliance for OpenAI, Anthropic, and Llama formats, detects duplicates, scores quality, and exports clean datasets.
Best use case
finetune-data-curator is best used when you need a repeatable workflow for building, validating, and cleaning fine-tuning datasets rather than a one-off prompt.
Teams using finetune-data-curator should expect more consistent output, faster repeated runs, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/finetune-data-curator/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How finetune-data-curator Compares
| Feature / Agent | finetune-data-curator | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Web app for creating, editing, and validating JSONL fine-tuning datasets. Checks format compliance for OpenAI, Anthropic, and Llama formats, detects duplicates, scores quality, and exports clean datasets.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# finetune-data-curator Skill
## When to Use
Use this skill when you need to:
- Create or curate JSONL datasets for LLM fine-tuning
- Validate dataset format compliance for OpenAI, Anthropic, or Llama training
- Find and remove near-duplicate samples from a dataset
- Score dataset quality and identify issues
- Split a dataset into train/eval subsets
- Convert datasets between OpenAI, Anthropic, and Llama formats
- Export cleaned datasets for training
## Quick Start
```bash
cd finetune-data-curator
cp .env.example .env
# edit .env: set ADMIN_KEY=<secure random string> and DATA_DIR=./data
pnpm install
pnpm dev # server + SPA on :4400
```
Open http://localhost:4400.
## Docker Quick Start
```bash
ADMIN_KEY=mysecretkey docker compose up
# App: http://localhost:4400
```
## Creating a Dataset
Via the web UI: click "New Dataset", choose a format (OpenAI, Anthropic, or Llama), and name it.
Via API:
```bash
curl -X POST http://localhost:4400/api/datasets \
-H "X-Admin-Key: $ADMIN_KEY" \
-H "Content-Type: application/json" \
-d '{"name": "my-dataset", "format": "openai"}'
```
## Importing JSONL
Via the web UI: open a dataset, click Import, paste JSONL or upload a file.
Via API (`ds_abc` stands in for your dataset ID):
```bash
curl -X POST http://localhost:4400/api/datasets/ds_abc/samples \
-H "X-Admin-Key: $ADMIN_KEY" \
-H "Content-Type: application/json" \
-d '[{"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}]}]'
```
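The request above sends a JSON array, while datasets usually sit on disk as JSONL (one JSON object per line). A minimal sketch of that conversion in Python follows. The helper name `jsonl_to_payload` is hypothetical and not part of the tool, and the sketch only builds the request body; it does not call the API.

```python
import json


def jsonl_to_payload(jsonl_text: str) -> str:
    """Turn JSONL text (one JSON object per line) into a JSON array string."""
    samples = []
    # Split on "\n" only: str.splitlines() would also split on characters
    # such as U+2028, which may legally appear inside JSON strings.
    for line_no, line in enumerate(jsonl_text.split("\n"), start=1):
        if not line.strip():
            continue  # tolerate blank lines and a trailing newline
        try:
            samples.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no} is not valid JSON: {exc.msg}") from exc
    return json.dumps(samples, ensure_ascii=False)


if __name__ == "__main__":
    text = (
        '{"messages": [{"role": "user", "content": "Hello"}, '
        '{"role": "assistant", "content": "Hi!"}]}\n'
        "\n"
        '{"messages": [{"role": "user", "content": "Bye"}, '
        '{"role": "assistant", "content": "See you."}]}\n'
    )
    print(jsonl_to_payload(text))
```

The resulting string can be written to a file and passed to the curl command above with `-d @payload.json` in place of the inline body.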
## Running Validation
Via the web UI: open a dataset, click "Run Validation".
Via API:
```bash
curl -X POST http://localhost:4400/api/datasets/ds_abc/validate \
-H "X-Admin-Key: $ADMIN_KEY"
```
Validation checks for format compliance, missing fields, empty content, excessive length, and degenerate samples.
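The source names these check categories but not their exact rules. The sketch below shows what such checks could look like for an OpenAI-style sample, for pre-screening data locally before import. Everything in it is an assumption for illustration: the function name `validate_openai_sample`, the 8000-character limit, and the definition of "degenerate" as "no assistant message".

```python
from typing import Any


def validate_openai_sample(sample: Any, max_chars: int = 8000) -> list[str]:
    """Return a list of human-readable issues; an empty list means no issue found."""
    # Format compliance: the sample must be an object with a "messages" list.
    if not isinstance(sample, dict) or not isinstance(sample.get("messages"), list):
        return ["format: expected an object with a 'messages' list"]

    issues: list[str] = []
    total_chars = 0
    for i, msg in enumerate(sample["messages"]):
        if not isinstance(msg, dict):
            issues.append(f"format: message {i} is not an object")
            continue
        # Missing fields: every message needs a role and content.
        for field in ("role", "content"):
            if field not in msg:
                issues.append(f"missing field: message {i} has no '{field}'")
        content = msg.get("content")
        if isinstance(content, str):
            # Empty content: whitespace-only text counts as empty.
            if not content.strip():
                issues.append(f"empty content: message {i}")
            total_chars += len(content)

    # Excessive length: assumed character budget, not the tool's real limit.
    if total_chars > max_chars:
        issues.append(f"excessive length: {total_chars} chars > {max_chars}")

    # Degenerate sample (assumed definition): nothing for the model to learn from.
    roles = [m.get("role") for m in sample["messages"] if isinstance(m, dict)]
    if "assistant" not in roles:
        issues.append("degenerate: no assistant message")
    return issues


if __name__ == "__main__":
    sample = {"messages": [{"role": "user", "content": "  "}]}
    for issue in validate_openai_sample(sample):
        print(issue)
```

The server's own validation remains the authority; a local pass like this only catches obvious problems earlier.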
## Finding Duplicates
```bash
curl -X POST http://localhost:4400/api/datasets/ds_abc/dedup \
-H "X-Admin-Key: $ADMIN_KEY"
```
Returns pairs of samples whose Jaccard similarity exceeds the configured threshold (default 0.8).
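Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to lowercase word sets and reports pairs that score above a threshold. The tokenization (words matched by `\w+`) and the helper names are assumptions; the source does not say how the tool tokenizes samples or which fields it compares.

```python
import re
from itertools import combinations


def token_set(text: str) -> set[str]:
    """Lowercase the text and collect its distinct word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the two token sets."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def near_duplicate_pairs(
    texts: list[str], threshold: float = 0.8
) -> list[tuple[int, int, float]]:
    """Return (i, j, score) for every pair scoring strictly above the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        score = jaccard(a, b)
        if score > threshold:
            pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    texts = [
        "How do I reset my password?",
        "how do i reset my password",
        "What is the weather?",
    ]
    for i, j, score in near_duplicate_pairs(texts):
        print(i, j, round(score, 3))
```

This compares every pair, so the cost grows quadratically with dataset size. That is fine for a sketch but not for large datasets.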
## Splitting
```bash
curl -X POST http://localhost:4400/api/datasets/ds_abc/split \
-H "X-Admin-Key: $ADMIN_KEY" \
-H "Content-Type: application/json" \
-d '{"ratio": 0.8, "seed": 42}'
```
With `ratio` set to 0.8, this creates `ds_abc.train.jsonl` (80% of samples) and `ds_abc.eval.jsonl` (the remaining 20%).
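A fixed seed makes a split repeatable: the same input, ratio, and seed give the same train and eval subsets. The sketch below shows one common way to do this, a seeded shuffle followed by a cut. It is an illustration only; the source does not describe the tool's shuffling method, so this code will not reproduce the tool's exact split, and `split_samples` is a hypothetical name.

```python
import random


def split_samples(samples: list, ratio: float = 0.8, seed: int = 42) -> tuple[list, list]:
    """Shuffle deterministically, then cut into (train, eval) at the given ratio."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    shuffled = list(samples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, no global state
    cut = round(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, eval_set = split_samples(list(range(10)), ratio=0.8, seed=42)
    print(len(train), len(eval_set))
```

Using a private `random.Random(seed)` instance keeps the split independent of any other random calls in the same program.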
## Exporting
```bash
# Download as OpenAI format
curl "http://localhost:4400/api/datasets/ds_abc/export?format=openai" \
-H "X-Admin-Key: $ADMIN_KEY" \
-o my-dataset.jsonl
# Convert to Anthropic format on export
curl "http://localhost:4400/api/datasets/ds_abc/export?format=anthropic" \
-H "X-Admin-Key: $ADMIN_KEY" \
-o my-dataset-anthropic.jsonl
# Export train split only
curl "http://localhost:4400/api/datasets/ds_abc/export?format=openai&split=train" \
-H "X-Admin-Key: $ADMIN_KEY" \
-o train.jsonl
```
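Converting on export means reshaping each sample. The source does not show the Anthropic schema the tool emits, so the sketch below rests on an assumption: the system prompt moves from a `system`-role message to a top-level `system` field, as in the Anthropic Messages API, and the remaining turns stay under `messages`. The function name `openai_to_anthropic` is hypothetical.

```python
def openai_to_anthropic(sample: dict) -> dict:
    """Move system-role messages to a top-level 'system' field (assumed target shape)."""
    system_parts = []
    turns = []
    for msg in sample["messages"]:
        if msg["role"] == "system":
            system_parts.append(msg["content"])
        else:
            # Keep user/assistant turns in their original order.
            turns.append({"role": msg["role"], "content": msg["content"]})

    converted = {"messages": turns}
    if system_parts:
        # Multiple system messages are joined; this is a choice made for the sketch.
        converted["system"] = "\n\n".join(system_parts)
    return converted


if __name__ == "__main__":
    sample = {
        "messages": [
            {"role": "system", "content": "Be brief."},
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi!"},
        ]
    }
    print(openai_to_anthropic(sample))
```

Compare a few exported lines against this shape before relying on it, since the tool's real output may differ.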
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `4400` | Server listen port |
| `DATA_DIR` | `./data` | Directory for JSONL files and index.json |
| `ADMIN_KEY` | (required) | Value of the `X-Admin-Key` header required for write operations |
| `MAX_UPLOAD_MB` | `50` | Maximum upload file size |