dataset-publishing

Publish local dataset artifacts to a Hugging Face dataset repo. Use when uploading a JSONL dataset, pushing a filtered dataset variant, syncing a matching .metadata.json sidecar, or renaming a dataset file in the target repo. This skill is about USING the checked-in dataset publish script via CLI — never ad hoc Python.

6 stars

Best use case

dataset-publishing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using dataset-publishing should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/dataset-publishing/SKILL.md --create-dirs "https://raw.githubusercontent.com/ProfSynapse/Synaptic-Tuner/main/.agents/skills/dataset-publishing/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/dataset-publishing/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How dataset-publishing Compares

| Feature / Agent | dataset-publishing | Standard Approach |
|------|---------|---------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Dataset Publishing

Publish a local dataset JSONL to a Hugging Face dataset repo with the skill-owned script:
`python3 scripts/publish_dataset_to_hf.py`

The script accepts two positional arguments, in this order:
- `dataset_path`
- `repo_id`

It also auto-uploads a matching metadata sidecar if one is present. For example, the first file below is the dataset and the second is its sidecar:
- `dataset.jsonl`
- `dataset.metadata.json`
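
The pairing above suggests the sidecar name is formed by swapping the trailing `.jsonl` for `.metadata.json`. The following is a minimal shell sketch of that convention, handy for checking which sidecar file to expect before publishing. `sidecar_for` is a hypothetical helper, not part of the publish script, and the script's actual lookup may differ.

```bash
# Hypothetical helper: derive the expected sidecar path for a dataset file.
# Assumes the sidecar swaps the trailing ".jsonl" for ".metadata.json",
# as the dataset.jsonl / dataset.metadata.json pairing suggests.
sidecar_for() {
  # ${1%.jsonl} strips only a trailing ".jsonl"; other dots are left alone.
  printf '%s.metadata.json\n' "${1%.jsonl}"
}

sidecar_for dataset.jsonl
# prints: dataset.metadata.json
```

Because only the final `.jsonl` suffix is stripped, directory components and any other dots in the filename pass through unchanged.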

## Quick Reference

| Task | Command |
|------|---------|
| Dry-run a dataset upload | `python3 scripts/publish_dataset_to_hf.py DATASET.jsonl namespace/repo --dry-run` |
| Upload dataset + sidecar | `python3 scripts/publish_dataset_to_hf.py DATASET.jsonl namespace/repo` |
| Upload under a new repo filename | `python3 scripts/publish_dataset_to_hf.py DATASET.jsonl namespace/repo --path-in-repo new_name.jsonl` |
| Upload with explicit metadata file | `python3 scripts/publish_dataset_to_hf.py DATASET.jsonl namespace/repo --metadata-path DATASET.metadata.json` |
| Skip metadata sidecar | `python3 scripts/publish_dataset_to_hf.py DATASET.jsonl namespace/repo --no-metadata` |

## Defaults

- Reads `HF_TOKEN` from the environment or repo `.env`
- Creates the target dataset repo if needed
- Uploads the dataset file to `path_in_repo = basename(dataset_path)`
- Auto-detects the `*.metadata.json` sidecar correctly even when the dataset filename contains extra dots (for example, a date stamp such as `03.22.26`)
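
To see how these defaults would apply to a given dataset before running the script, a small preflight check can help. This is only a sketch: `preflight` is a hypothetical helper, it assumes the `.env` file sits in the current directory and stores the token on a line starting with `HF_TOKEN=`, and it mirrors the documented defaults rather than calling the script.

```bash
# Hypothetical preflight: report the documented defaults for a dataset path.
# Not part of the publish script; it never contacts Hugging Face.
preflight() {
  dataset="$1"
  # Documented default: path_in_repo is the basename of the dataset path.
  echo "path_in_repo: $(basename "$dataset")"
  # Documented default: HF_TOKEN comes from the environment or the repo .env.
  # Assumption: .env is in the current directory and uses "HF_TOKEN=..." lines.
  if [ -n "${HF_TOKEN:-}" ]; then
    echo "token: environment"
  elif [ -f .env ] && grep -q '^HF_TOKEN=' .env; then
    echo "token: .env"
  else
    echo "token: missing"
  fi
}
```

Running `preflight Datasets/synthchat/my_filtered_dataset.jsonl` would report `my_filtered_dataset.jsonl` as the default repo filename, which is what `--path-in-repo` overrides.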

## Recommended Workflow

1. Build or filter the dataset locally.
2. Run `--dry-run` first.
3. Run the real upload command.
4. Point the next experiment spec at the uploaded HF dataset file.
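
Steps 2 and 3 can be chained so that the real upload only happens after a clean dry run. The wrapper below is a sketch under two assumptions: that the script exits with a non-zero status when a dry run fails, and that a `PUBLISH_CMD` variable (a hypothetical override used here for testing, not something the script reads) may replace the default command.

```bash
# Hypothetical wrapper around the documented workflow: dry-run first,
# then the real upload only if the dry run exits successfully.
# Assumption: the publish script returns a non-zero status on failure.
publish_checked() {
  dataset="$1"
  repo="$2"
  shift 2
  # PUBLISH_CMD is an assumed override hook; the default is the documented command.
  publish="${PUBLISH_CMD:-python3 scripts/publish_dataset_to_hf.py}"
  # $publish is intentionally unquoted so "python3 path/to/script.py" splits into words.
  if ! $publish "$dataset" "$repo" "$@" --dry-run; then
    echo "dry run failed; skipping upload" >&2
    return 1
  fi
  $publish "$dataset" "$repo" "$@"
}
```

Any extra flags are forwarded to both runs, so `publish_checked DATASET.jsonl namespace/repo --no-metadata` would dry-run and then upload without the sidecar.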

## Common Patterns

**Upload a filtered SFT dataset:**
```bash
python3 scripts/publish_dataset_to_hf.py \
  Datasets/synthchat/my_filtered_dataset.jsonl \
  professorsynapse/claudesidian-synthetic-dataset \
  --dry-run

python3 scripts/publish_dataset_to_hf.py \
  Datasets/synthchat/my_filtered_dataset.jsonl \
  professorsynapse/claudesidian-synthetic-dataset
```

**Rename on upload:**
```bash
python3 scripts/publish_dataset_to_hf.py \
  Datasets/synthchat/my_filtered_dataset.jsonl \
  professorsynapse/claudesidian-synthetic-dataset \
  --path-in-repo nonthinking_tools_sft_filtered_03.22.26.jsonl
```

**Upload without a sidecar:**
```bash
python3 scripts/publish_dataset_to_hf.py \
  Datasets/synthchat/my_filtered_dataset.jsonl \
  professorsynapse/claudesidian-synthetic-dataset \
  --no-metadata
```

## CLI Discipline

- Use the checked-in script, not inline Python.
- Run `--dry-run` before the real upload when testing a new dataset variant.
- Keep dataset filenames descriptive and date-stamped.
- If you create a curated filtered variant, keep the rationale in the `.metadata.json` sidecar.
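
For the date-stamped naming habit, a tiny helper can keep stamps consistent across variants. This is a sketch only: `stamped_name` is hypothetical, and the `MM.DD.YY` stamp is an assumed convention inferred from the example filename `nonthinking_tools_sft_filtered_03.22.26.jsonl`, not something the script enforces.

```bash
# Hypothetical helper: build a descriptive, date-stamped dataset filename.
# The MM.DD.YY stamp mirrors the example filename in this document.
stamped_name() {
  label="$1"
  # Use an explicit stamp if given as the second argument, else today's date.
  stamp="${2:-$(date +%m.%d.%y)}"
  printf '%s_%s.jsonl\n' "$label" "$stamp"
}

stamped_name nonthinking_tools_sft_filtered 03.22.26
# prints: nonthinking_tools_sft_filtered_03.22.26.jsonl
```

The result can be passed straight to `--path-in-repo` when renaming on upload.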

Related Skills

upload-deployment

6
from ProfSynapse/Synaptic-Tuner

Complete reference for model upload and deployment. Covers HuggingFace upload, save strategies (LoRA, merged 16-bit, merged 4-bit), GGUF conversion, model merging, model cards, and the full upload workflow. Use when uploading models, creating GGUF files, merging LoRA adapters, or deploying to HuggingFace. This skill is about USING the upload/deployment tools via CLI — never modifying source code.

synthetic-data-generation

6
from ProfSynapse/Synaptic-Tuner

Complete reference for the SynthChat synthetic dataset generation system. Covers CLI commands (generate, improve, validate), scenario YAML authoring, rubric YAML authoring, settings configuration, evaluation, and full workflow. Use when generating datasets, writing rubrics/scenarios, configuring models/workers, improving dataset quality, or running evaluations. This skill is about USING the system via CLI and YAML — never modifying source code.

research-reporting

6
from ProfSynapse/Synaptic-Tuner

Create structured research notes from experiment runs and analysis artifacts. Use when creating a note at run launch, updating it as training/evaluation/loss stages finish, summarizing a finished run, comparing experiment outcomes, extracting hypotheses from eval/loss artifacts, or proposing next-run actions grounded in `.tracking/experiments/<id>/analysis/` outputs. This skill is about turning repo-native experiment evidence into stable, machine-readable markdown.

fine-tuning

6
from ProfSynapse/Synaptic-Tuner

Complete reference for the fine-tuning pipeline (SFT, KTO, GRPO), cloud HF Jobs workflows, autonomous experiment search, checkpoint evaluation, and LoRA surgery. Covers training CLI flags, YAML configuration, model presets, dataset requirements, LoRA settings, training monitoring, hyperparameter search, and post-training optimization. Use when training models, configuring training runs, choosing hyperparameters, running cloud experiments, inspecting HF jobs, or troubleshooting training issues. This skill is about USING the training system via CLI and YAML — never modifying source code.

evaluation

6
from ProfSynapse/Synaptic-Tuner

Complete reference for the config-first model evaluation system. Covers the Evaluator CLI, assertion-driven YAML scenarios, response views, backend configuration, presets, scoring, LLM-as-judge, model comparison, and HuggingFace integration. Use when evaluating models, writing test prompts, comparing training runs, or interpreting eval results. This skill is about USING the evaluation system via CLI and YAML.

case-studies

6
from ProfSynapse/Synaptic-Tuner

End-to-end case studies showing how to implement the full training pipeline for different skill types. Covers three complete worked examples — tool-calling training, essay-style training, and agentic search (RAG agent) training — demonstrating dataset design, synthetic generation, validation, fine-tuning, evaluation, and iteration. Use when onboarding to the project, understanding how all components fit together, explaining the pipeline to others, or planning a new training capability. This skill is about UNDERSTANDING the system holistically — reference the other skills for specific CLI commands.

hugging-face-datasets

31392
from sickn33/antigravity-awesome-skills

Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside HF MCP server for comprehensive dataset workflows.

Data Management | Claude

hugging-face-dataset-viewer

31392
from sickn33/antigravity-awesome-skills

Query Hugging Face datasets through the Dataset Viewer API for splits, rows, search, filters, and parquet links.

Data Access & Exploration | Claude

arize-dataset

28865
from github/awesome-copilot

INVOKE THIS SKILL when creating, managing, or querying Arize datasets and examples. Covers dataset CRUD, appending examples, exporting data, and file-based dataset creation using the ax CLI.

wechat-auto-publishing-complete

3891
from openclaw/skills

Use this skill to fully reproduce and operate a local end-to-end WeChat Official Account publishing workflow: prepare the environment, validate dependencies, configure non-sensitive placeholders for credentials, gather source material, draft articles, prepare cover and body images, assemble a WeChat-ready Markdown package, publish to draft, optionally submit for formal publication, poll status, archive outputs, and attach scheduling or alerting. Use whenever the user wants a complete, reproducible WeChat Official Account auto-publishing skill with environment setup, templates, runbooks, and execution scaffolding, while keeping all secrets and personal account details outside the skill package. Key real-world findings: freepublish does not always behave like manual platform publishing for homepage visibility, production mode should often default to draft-only, image files must be validated by real format rather than extension alone, and multi-account deployments should use isolated directories.

Devvit Publishing Auditor

3891
from openclaw/skills

A specialized auditor for Reddit Devvit developers to verify app readiness before uploading to the Reddit servers. It ensures compliance with Devvit CLI v0.12.x and Reddit’s publishing standards.

dataset-finder

3891
from openclaw/skills

Use this skill when users need to search for datasets, download data files, or explore data repositories. Triggers include: requests to "find datasets", "search for data", "download dataset from Kaggle", "get data from Hugging Face", "find ML datasets", or mentions of data repositories like Kaggle, UCI ML Repository, Data.gov, or Hugging Face. Also use for previewing dataset statistics, generating data cards, or discovering datasets for machine learning projects. Requires OpenClawCLI installation from clawhub.ai.