finetune-data-curator

Web app for creating, editing, and validating JSONL fine-tuning datasets. Checks format compliance for OpenAI, Anthropic, and Llama formats, detects duplicates, scores quality, and exports clean datasets.


by heldernoid


Best use case

finetune-data-curator is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using finetune-data-curator should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/finetune-data-curator/SKILL.md --create-dirs "https://raw.githubusercontent.com/heldernoid/agentic-build-templates/main/projects/ai-llm-tools/finetune-data-curator/skills/finetune-data-curator/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/finetune-data-curator/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How finetune-data-curator Compares

Feature / Agent	finetune-data-curator	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

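What does this skill do?

It describes a web app for creating, editing, and validating JSONL fine-tuning datasets. The app checks format compliance for OpenAI, Anthropic, and Llama formats, detects duplicates, scores quality, and exports clean datasets.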

Where can I find the source code?

The source code is on GitHub in the heldernoid/agentic-build-templates repository, under projects/ai-llm-tools/finetune-data-curator.

SKILL.md Source

# finetune-data-curator Skill

## When to Use

Use this skill when you need to:
- Create or curate JSONL datasets for LLM fine-tuning
- Validate dataset format compliance for OpenAI, Anthropic, or Llama training
- Find and remove near-duplicate samples from a dataset
- Score dataset quality and identify issues
- Split a dataset into train/eval subsets
- Convert datasets between OpenAI, Anthropic, and Llama formats
- Export cleaned datasets for training

## Quick Start

```bash
cd finetune-data-curator
cp .env.example .env
# edit .env: set ADMIN_KEY=<secure random string> and DATA_DIR=./data

pnpm install
pnpm dev   # server + SPA on :4400
```

Open http://localhost:4400.
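
The `.env` step above asks for a secure random string for `ADMIN_KEY` but does not say how to produce one. One option, sketched here with Python's standard `secrets` module (the helper name `generate_admin_key` is made up for this example), is:

```python
import secrets


def generate_admin_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random string suitable for pasting into .env."""
    # token_urlsafe draws from the OS cryptographic random source.
    return secrets.token_urlsafe(num_bytes)


# Print a line that can be copied straight into the .env file.
print(f"ADMIN_KEY={generate_admin_key()}")
```

Any other cryptographically secure generator works equally well; the point is to avoid short or guessable values.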

## Docker Quick Start

```bash
ADMIN_KEY=mysecretkey docker compose up
# App: http://localhost:4400
```

## Creating a Dataset

Via the web UI: click "New Dataset", choose a format (OpenAI, Anthropic, or Llama), and give the dataset a name.

Via API:
```bash
curl -X POST http://localhost:4400/api/datasets \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-dataset", "format": "openai"}'
```
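
The same call can be made from a script. The sketch below builds the request with Python's standard library, mirroring the curl example's URL, headers, and body. The function name and `BASE_URL` constant are illustrative, and the response shape is not documented here, so the sketch stops short of parsing it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:4400"  # assumption: default port from Quick Start


def build_create_dataset_request(name: str, fmt: str, admin_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST /api/datasets request."""
    body = json.dumps({"name": name, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/datasets",
        data=body,
        method="POST",
        headers={"X-Admin-Key": admin_key, "Content-Type": "application/json"},
    )


# To send it against a running server:
#   req = build_create_dataset_request("my-dataset", "openai", admin_key)
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode())
```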

## Importing JSONL

Via the web UI: open a dataset, click Import, paste JSONL or upload a file.

Via API (`ds_abc` here and in the examples below stands in for your dataset's ID):
```bash
curl -X POST http://localhost:4400/api/datasets/ds_abc/samples \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '[{"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}]}]'
```
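
The curl example sends a JSON array, whereas datasets on disk are usually JSONL (one object per line). Whether the endpoint also accepts raw JSONL is not stated here, so a small conversion step may be useful. A minimal sketch, with a hypothetical helper name:

```python
import json


def jsonl_to_payload(jsonl_text: str) -> str:
    """Parse JSONL text and return a JSON array string for the request body."""
    samples = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        try:
            samples.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report the offending line so it can be fixed before upload.
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc
    return json.dumps(samples)
```

The returned string can be written to a file and passed to curl with `-d @payload.json`.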

## Running Validation

Via the web UI: open a dataset, click "Run Validation".

Via API:
```bash
curl -X POST http://localhost:4400/api/datasets/ds_abc/validate \
  -H "X-Admin-Key: $ADMIN_KEY"
```

Validation checks: format compliance, missing fields, empty content, excessive length, degenerate samples.
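
To make these checks concrete, here is a rough local approximation for OpenAI-style chat samples. It is not the tool's implementation: the length limit, the accepted roles, and the definition of "degenerate" (no assistant turn) are all assumptions chosen for illustration.

```python
MAX_CHARS = 8000  # hypothetical limit; the tool's real threshold is not documented here
VALID_ROLES = {"system", "user", "assistant"}  # assumed role set


def validate_sample(sample: dict) -> list[str]:
    """Return a list of issue strings for one chat sample (empty if clean)."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' field"]

    issues = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            issues.append(f"message {i}: not an object")
            continue
        role, content = msg.get("role"), msg.get("content")
        if role not in VALID_ROLES:
            issues.append(f"message {i}: invalid or missing role")
        if not isinstance(content, str) or not content.strip():
            issues.append(f"message {i}: empty content")
        elif len(content) > MAX_CHARS:
            issues.append(f"message {i}: content exceeds {MAX_CHARS} characters")

    # Assumed meaning of "degenerate": nothing for the model to learn to produce.
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if "assistant" not in roles:
        issues.append("degenerate: no assistant turn")
    return issues
```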

## Finding Duplicates

```bash
curl -X POST http://localhost:4400/api/datasets/ds_abc/dedup \
  -H "X-Admin-Key: $ADMIN_KEY"
```

Returns pairs of samples whose Jaccard similarity exceeds the configured threshold (default 0.8).
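
Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to lowercase whitespace-separated tokens. That tokenization is an assumption (the tool may use n-grams or another scheme), and the function names are illustrative.

```python
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercase whitespace tokenization (an assumed, simple scheme)."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the token sets of a and b."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def near_duplicates(texts: list[str], threshold: float = 0.8) -> list[tuple[int, int, float]]:
    """Return (i, j, score) for every pair scoring above the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        score = jaccard(a, b)
        if score > threshold:
            pairs.append((i, j, score))
    return pairs
```

Comparing every pair is quadratic in the number of samples, which is fine for small datasets; larger ones typically need an approximate method such as MinHash.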

## Splitting

```bash
curl -X POST http://localhost:4400/api/datasets/ds_abc/split \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"ratio": 0.8, "seed": 42}'
```

With a `ratio` of 0.8, this creates `ds_abc.train.jsonl` (80% of samples) and `ds_abc.eval.jsonl` (the remaining 20%).
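
A seeded split is usually implemented as a deterministic shuffle followed by a cut at the ratio. The following sketch shows that idea. It assumes the tool works roughly this way; it will not reproduce the tool's exact sample assignment for a given seed.

```python
import random


def split_dataset(samples: list, ratio: float = 0.8, seed: int = 42) -> tuple[list, list]:
    """Shuffle with a fixed seed, then cut into (train, eval) at the ratio."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    shuffled = list(samples)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # private RNG: same seed, same order
    cut = round(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]
```

Using a private `random.Random(seed)` instance rather than the module-level generator keeps the split reproducible regardless of other random calls in the program.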

## Exporting

```bash
# Download as OpenAI format
curl "http://localhost:4400/api/datasets/ds_abc/export?format=openai" \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -o my-dataset.jsonl

# Convert to Anthropic format on export
curl "http://localhost:4400/api/datasets/ds_abc/export?format=anthropic" \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -o my-dataset-anthropic.jsonl

# Export train split only
curl "http://localhost:4400/api/datasets/ds_abc/export?format=openai&split=train" \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -o train.jsonl
```
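
This document does not describe how `format=anthropic` maps fields. As a rough illustration only, the sketch below assumes an Anthropic-style record keeps `user`/`assistant` turns under `messages` and carries the system prompt in a top-level `system` string. The function name is hypothetical and the tool's actual conversion may differ.

```python
def openai_to_anthropic(sample: dict) -> dict:
    """Hoist system messages into a top-level 'system' string (assumed layout)."""
    system_parts = []
    turns = []
    for msg in sample["messages"]:
        if msg["role"] == "system":
            system_parts.append(msg["content"])
        else:
            turns.append({"role": msg["role"], "content": msg["content"]})

    converted = {"messages": turns}
    if system_parts:
        # Multiple system messages are joined; this choice is also an assumption.
        converted["system"] = "\n\n".join(system_parts)
    return converted
```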

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `4400` | Server listen port |
| `DATA_DIR` | `./data` | Directory for JSONL files and index.json |
| `ADMIN_KEY` | (required) | Value expected in the `X-Admin-Key` header; required for write operations |
| `MAX_UPLOAD_MB` | `50` | Maximum upload file size |
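
A client script that shares the same `.env` values can mirror the table's defaults. The sketch below is not the server's configuration code; the `Config` class and `load_config` function are illustrative names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    port: int
    data_dir: str
    admin_key: str
    max_upload_mb: int


def load_config(env=None) -> Config:
    """Read settings from a mapping (os.environ by default), applying the table's defaults."""
    env = os.environ if env is None else env
    admin_key = env.get("ADMIN_KEY", "")
    if not admin_key:
        # ADMIN_KEY has no default, so fail early with a clear message.
        raise RuntimeError("ADMIN_KEY is required")
    return Config(
        port=int(env.get("PORT", "4400")),
        data_dir=env.get("DATA_DIR", "./data"),
        admin_key=admin_key,
        max_upload_mb=int(env.get("MAX_UPLOAD_MB", "50")),
    )
```

Passing a plain dict, for example `load_config({"ADMIN_KEY": "k", "PORT": "5000"})`, makes the loader easy to exercise without touching the real environment.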
