finetune-data-curator

Web app for creating, editing, and validating JSONL fine-tuning datasets. Checks format compliance for OpenAI, Anthropic, and Llama formats, detects duplicates, scores quality, and exports clean datasets.

7 stars

Best use case

finetune-data-curator is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using finetune-data-curator should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/finetune-data-curator/SKILL.md --create-dirs "https://raw.githubusercontent.com/heldernoid/agentic-build-templates/main/projects/ai-llm-tools/finetune-data-curator/skills/finetune-data-curator/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/finetune-data-curator/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How finetune-data-curator Compares

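| Feature / Agent | finetune-data-curator | Standard Approach |
|-----------------|-----------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |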

Frequently Asked Questions

What does this skill do?

Web app for creating, editing, and validating JSONL fine-tuning datasets. Checks format compliance for OpenAI, Anthropic, and Llama formats, detects duplicates, scores quality, and exports clean datasets.

Where can I find the source code?

The source code is on GitHub in the heldernoid/agentic-build-templates repository, under projects/ai-llm-tools/finetune-data-curator.

SKILL.md Source

# finetune-data-curator Skill

## When to Use

Use this skill when you need to:
- Create or curate JSONL datasets for LLM fine-tuning
- Validate dataset format compliance for OpenAI, Anthropic, or Llama training
- Find and remove near-duplicate samples from a dataset
- Score dataset quality and identify issues
- Split a dataset into train/eval subsets
- Convert datasets between OpenAI, Anthropic, and Llama formats
- Export cleaned datasets for training

## Quick Start

```bash
cd finetune-data-curator
cp .env.example .env
# edit .env: set ADMIN_KEY=<secure random string> and DATA_DIR=./data

pnpm install
pnpm dev   # server + SPA on :4400
```

Open http://localhost:4400.

## Docker Quick Start

```bash
ADMIN_KEY=mysecretkey docker compose up
# App: http://localhost:4400
```

## Creating a Dataset

Via the web UI: click "New Dataset", choose a format (OpenAI, Anthropic, or Llama), and name it.

Via API:
```bash
curl -X POST http://localhost:4400/api/datasets \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-dataset", "format": "openai"}'
```

## Importing JSONL

Via the web UI: open a dataset, click "Import", then paste JSONL or upload a file.

Via API:
```bash
curl -X POST http://localhost:4400/api/datasets/ds_abc/samples \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '[{"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}]}]'
```
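The API example above sends a JSON array, while a `.jsonl` file holds one object per line. As a minimal sketch of bridging the two, the hypothetical helper below (`jsonl_to_payload` is not part of the tool) parses JSONL text into an array body suitable for the samples endpoint, assuming the endpoint accepts exactly the array shape shown in the curl call.

```python
import json


def jsonl_to_payload(jsonl_text: str) -> str:
    """Parse JSONL (one JSON object per line) into a JSON array string."""
    samples = []
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines rather than failing on them
        try:
            samples.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report the offending line so it can be fixed before upload
            raise ValueError(f"line {lineno}: {exc.msg}") from exc
    return json.dumps(samples)
```

The returned string can be written to a file and passed to curl with `-d @payload.json`.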

## Running Validation

Via the web UI: open a dataset, click "Run Validation".

Via API:
```bash
curl -X POST http://localhost:4400/api/datasets/ds_abc/validate \
  -H "X-Admin-Key: $ADMIN_KEY"
```

Validation checks: format compliance, missing fields, empty content, excessive length, degenerate samples.
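
The checks listed above can be illustrated with a minimal sketch for one OpenAI-format line. The function name `validate_openai_sample`, the allowed roles, the 8000-character limit, and the "no assistant turn" rule for degenerate samples are all assumptions for illustration; the tool's actual rules are not documented here.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption: typical chat roles
MAX_CONTENT_CHARS = 8000  # hypothetical threshold for "excessive length"


def validate_openai_sample(line: str) -> list[str]:
    """Return a list of issue strings for one JSONL line (empty list = OK)."""
    try:
        sample = json.loads(line)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    messages = sample.get("messages") if isinstance(sample, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing field: messages"]

    issues = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            issues.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        content = msg.get("content")
        if role not in ALLOWED_ROLES:
            issues.append(f"message {i}: unknown role {role!r}")
        if not isinstance(content, str) or not content.strip():
            issues.append(f"message {i}: empty content")
        elif len(content) > MAX_CONTENT_CHARS:
            issues.append(f"message {i}: excessive length")
    # Assumed notion of "degenerate": nothing for the model to learn from
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        issues.append("degenerate: no assistant message")
    return issues
```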

## Finding Duplicates

```bash
curl -X POST http://localhost:4400/api/datasets/ds_abc/dedup \
  -H "X-Admin-Key: $ADMIN_KEY"
```

Returns pairs with Jaccard similarity above the configured threshold (default 0.8).
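
For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below shows how pairs above a threshold could be found; the lowercase whitespace tokenization and the function names are assumptions, since the tool's own tokenizer is not described here.

```python
from itertools import combinations


def tokens(text: str) -> set[str]:
    # Assumption: lowercase whitespace tokenization
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the token sets of two texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def near_duplicates(texts: list[str], threshold: float = 0.8):
    """Yield (i, j, score) for every pair scoring above the threshold."""
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        score = jaccard(a, b)
        if score > threshold:
            yield i, j, score
```

This compares every pair, so the cost grows quadratically with dataset size; that is fine for a sketch but may not reflect how the tool scales.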

## Splitting

```bash
curl -X POST http://localhost:4400/api/datasets/ds_abc/split \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"ratio": 0.8, "seed": 42}'
```

Creates `ds_abc.train.jsonl` (80%) and `ds_abc.eval.jsonl` (20%).
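
The role of `ratio` and `seed` can be sketched as a seeded shuffle followed by a cut. This is an assumed implementation (the tool's actual shuffling method is not documented here), and `split_samples` is a hypothetical name.

```python
import random


def split_samples(samples: list, ratio: float = 0.8, seed: int = 42):
    """Shuffle deterministically, then cut at `ratio` into (train, eval)."""
    shuffled = list(samples)               # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed makes the split reproducible, so the eval set stays the same across repeated runs on an unchanged dataset.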

## Exporting

```bash
# Download as OpenAI format
curl "http://localhost:4400/api/datasets/ds_abc/export?format=openai" \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -o my-dataset.jsonl

# Convert to Anthropic format on export
curl "http://localhost:4400/api/datasets/ds_abc/export?format=anthropic" \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -o my-dataset-anthropic.jsonl

# Export train split only
curl "http://localhost:4400/api/datasets/ds_abc/export?format=openai&split=train" \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -o train.jsonl
```
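
To give a rough idea of what converting on export involves, the sketch below maps an OpenAI-style sample to an Anthropic-style one. The target layout (a top-level `system` string plus `messages` containing only the remaining turns) is an assumption modelled on the Anthropic Messages API request shape; the schema the tool actually emits is not documented here, and `openai_to_anthropic` is a hypothetical name.

```python
def openai_to_anthropic(sample: dict) -> dict:
    """Move system messages to a top-level field; keep the other turns."""
    system_parts = []
    messages = []
    for msg in sample["messages"]:
        if msg["role"] == "system":
            system_parts.append(msg["content"])
        else:
            messages.append({"role": msg["role"], "content": msg["content"]})
    converted = {"messages": messages}
    if system_parts:
        # Assumption: multiple system messages are joined with blank lines
        converted["system"] = "\n\n".join(system_parts)
    return converted
```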

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `4400` | Server listen port |
| `DATA_DIR` | `./data` | Directory for JSONL files and index.json |
| `ADMIN_KEY` | (required) | Header value required for write operations |
| `MAX_UPLOAD_MB` | `50` | Maximum upload file size |
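
A client script can reuse the same variables to build its requests. The sketch below is a hypothetical helper, not part of the tool: it applies the defaults from the table, fails early when `ADMIN_KEY` is missing, and assumes the server runs on localhost and that one megabyte means 1024 × 1024 bytes.

```python
import os


def client_settings(env=os.environ) -> dict:
    """Derive the base URL, auth header, and upload limit from the environment."""
    admin_key = env.get("ADMIN_KEY")
    if not admin_key:
        raise RuntimeError("ADMIN_KEY is required")
    port = int(env.get("PORT", "4400"))
    return {
        "base_url": f"http://localhost:{port}",  # assumption: local server
        "headers": {"X-Admin-Key": admin_key},
        # Assumption: MB is interpreted as 1024 * 1024 bytes
        "max_upload_bytes": int(env.get("MAX_UPLOAD_MB", "50")) * 1024 * 1024,
    }
```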
