dvc-dataset-versioning
Dataset versioning skill using DVC for tracking data changes, managing data pipelines, and ensuring reproducibility.
Best use case
dvc-dataset-versioning is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using dvc-dataset-versioning should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/dvc-dataset-versioning/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
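The manual steps above can be scripted. The following Node.js sketch assumes SKILL.md has already been downloaded to a local path; `installSkill` and its parameters are hypothetical names used for illustration, not something the skill provides.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: copies an already-downloaded SKILL.md into the
// location named in the installation steps and returns the target path.
function installSkill(downloadedSkillPath, projectRoot) {
  const targetDir = path.join(projectRoot, '.claude', 'skills', 'dvc-dataset-versioning');
  const targetFile = path.join(targetDir, 'SKILL.md');
  fs.mkdirSync(targetDir, { recursive: true }); // create missing parent folders
  fs.copyFileSync(downloadedSkillPath, targetFile);
  return targetFile;
}
```

Calling `installSkill('./SKILL.md', process.cwd())` from the project root would place the file; the agent still needs to be restarted afterwards.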
How dvc-dataset-versioning Compares
| Feature / Agent | dvc-dataset-versioning | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Dataset versioning skill using DVC for tracking data changes, managing data pipelines, and ensuring reproducibility.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# dvc-dataset-versioning
## Overview
Dataset versioning skill using DVC (Data Version Control) for tracking data changes, managing data pipelines, and ensuring reproducibility in ML workflows.
## Capabilities
- Dataset version tracking
- Data pipeline definition and execution
- Remote storage management (S3, GCS, Azure, etc.)
- Reproducibility enforcement
- Data lineage tracking
- Experiment comparison with data versions
- Cache management for large datasets
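
As a rough illustration of the version tracking and remote storage capabilities, the sketch below lists a typical DVC command sequence for putting one file under version control and pushing it to a remote. The remote URL `s3://example-bucket/dvc-store` and the commit message are placeholders, and `formatCommands` is a hypothetical helper; the file path and remote name are borrowed from the usage example in this document.

```javascript
// Each entry is an argv array. Nothing is executed here.
const trackAndPushWorkflow = [
  ['git', 'init'],
  ['dvc', 'init'], // creates .dvc/ inside the Git repository
  ['dvc', 'remote', 'add', '-d', 's3-bucket', 's3://example-bucket/dvc-store'], // placeholder URL
  ['dvc', 'add', 'data/train.csv'], // writes data/train.csv.dvc and updates data/.gitignore
  ['git', 'add', 'data/train.csv.dvc', 'data/.gitignore', '.dvc/config'],
  ['git', 'commit', '-m', 'Track training data with DVC'],
  ['dvc', 'push', '-r', 's3-bucket'], // uploads cached data to the remote
];

// Renders the sequence for display, quoting arguments that contain spaces.
function formatCommands(commands) {
  return commands
    .map((argv) => argv.map((arg) => (/\s/.test(arg) ? JSON.stringify(arg) : arg)).join(' '))
    .join('\n');
}
```

Git stores the small `.dvc` pointer file while DVC stores the data itself, which is why the sequence alternates between the two tools.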
## Target Processes
- Data Collection and Validation Pipeline
- ML Model Retraining Pipeline
- Feature Store Implementation
## Tools and Libraries
- DVC
- Git
- Remote storage SDKs (boto3, google-cloud-storage, etc.)
## Input Schema
```json
{
  "type": "object",
  "required": ["action"],
  "properties": {
    "action": {
      "type": "string",
      "enum": ["init", "add", "push", "pull", "diff", "checkout", "run", "repro"],
      "description": "DVC action to perform"
    },
    "paths": {
      "type": "array",
      "items": { "type": "string" },
      "description": "File or directory paths to track"
    },
    "remote": {
      "type": "string",
      "description": "Remote storage name"
    },
    "revision": {
      "type": "string",
      "description": "Git revision for checkout/diff"
    },
    "pipeline": {
      "type": "object",
      "description": "Pipeline stage definition for run action"
    }
  }
}
```
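
To make the schema concrete, here is a minimal sketch of how an input could be checked and then translated into DVC command-line arguments. Both `validateInput` and `buildDvcArgs` are hypothetical helpers, and the mapping is an assumption about how the skill might call DVC, not its documented behavior. The `run` action is deliberately left unmapped: the fields of the `pipeline` object are not specified here, and recent DVC releases use `dvc stage add` in place of `dvc run`. For `checkout`, the sketch also assumes that the Git revision is checked out separately before DVC is invoked.

```javascript
const ACTIONS = ['init', 'add', 'push', 'pull', 'diff', 'checkout', 'run', 'repro'];

// Checks the constraints stated in the input schema; returns a list of problems.
function validateInput(input) {
  if (input === null || typeof input !== 'object' || Array.isArray(input)) {
    return ['input must be an object'];
  }
  const errors = [];
  if (!ACTIONS.includes(input.action)) {
    errors.push(`action must be one of: ${ACTIONS.join(', ')}`);
  }
  if (input.paths !== undefined &&
      !(Array.isArray(input.paths) && input.paths.every((p) => typeof p === 'string'))) {
    errors.push('paths must be an array of strings');
  }
  for (const key of ['remote', 'revision']) {
    if (input[key] !== undefined && typeof input[key] !== 'string') {
      errors.push(`${key} must be a string`);
    }
  }
  const pipeline = input.pipeline;
  if (pipeline !== undefined &&
      (pipeline === null || typeof pipeline !== 'object' || Array.isArray(pipeline))) {
    errors.push('pipeline must be an object');
  }
  return errors;
}

// Assumed mapping from a valid input to arguments for the `dvc` executable.
function buildDvcArgs(input) {
  const paths = input.paths ?? [];
  const remoteFlag = input.remote ? ['-r', input.remote] : [];
  switch (input.action) {
    case 'init': return ['init'];
    case 'add': return ['add', ...paths];
    case 'push': return ['push', ...remoteFlag, ...paths];
    case 'pull': return ['pull', ...remoteFlag, ...paths];
    case 'diff': return input.revision ? ['diff', input.revision] : ['diff'];
    case 'checkout': return ['checkout', ...paths];
    case 'repro': return ['repro', ...paths];
    default: throw new Error(`no CLI mapping sketched for action "${input.action}"`);
  }
}
```

With the input from the usage example, `buildDvcArgs({ action: 'add', paths: ['data/train.csv', 'data/test.csv'] })` yields the arguments for `dvc add data/train.csv data/test.csv`.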
## Output Schema
```json
{
  "type": "object",
  "required": ["status", "action"],
  "properties": {
    "status": {
      "type": "string",
      "enum": ["success", "error"]
    },
    "action": {
      "type": "string"
    },
    "trackedFiles": {
      "type": "array",
      "items": { "type": "string" }
    },
    "changes": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "path": { "type": "string" },
          "status": { "type": "string" },
          "hash": { "type": "string" }
        }
      }
    },
    "remote": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "url": { "type": "string" },
        "syncStatus": { "type": "string" }
      }
    }
  }
}
```
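
For orientation, the sketch below shows what a result matching this schema might look like, together with a small check of the two required fields. All values in `exampleResult` are invented for illustration, including the `'added'` and `'pending'` strings, because the schema does not enumerate allowed values for `changes[].status` or `syncStatus`. `hasRequiredResultFields` is a hypothetical helper.

```javascript
// Invented example values; only the field names come from the output schema.
const exampleResult = {
  status: 'success',
  action: 'add',
  trackedFiles: ['data/train.csv', 'data/test.csv'],
  changes: [{ path: 'data/train.csv', status: 'added', hash: '<content hash>' }],
  remote: { name: 's3-bucket', url: 's3://example-bucket/dvc-store', syncStatus: 'pending' },
};

// Checks only the required fields and the status enum from the schema.
function hasRequiredResultFields(result) {
  return (
    result !== null &&
    typeof result === 'object' &&
    ['success', 'error'].includes(result.status) &&
    typeof result.action === 'string'
  );
}
```

A caller could use such a check to distinguish a well-formed `error` result from a malformed response before reading the optional fields.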
## Usage Example
```javascript
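// Task definition that invokes the skill with the `add` action.
// The variable name is illustrative only.
const versionTrainingDataset = {
  kind: 'skill',
  title: 'Version training dataset',
  skill: {
    name: 'dvc-dataset-versioning',
    context: {
      action: 'add',
      paths: ['data/train.csv', 'data/test.csv'],
      remote: 's3-bucket'
    }
  }
};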
```
Related Skills
data-versioning-manager
Skill for managing data versions and provenance
process-builder
Scaffold new babysitter process definitions following SDK patterns, proper structure, and best practices. Guides the 3-phase workflow from research to implementation.
babysitter
Orchestrate via @babysitter. Use this skill when asked to babysit a run, orchestrate a process or whenever it is called explicitly. (babysit, babysitter, orchestrate, orchestrate a run, workflow, etc.)
yolo
Run Babysitter autonomously with minimal manual interruption.
user-install
Install the user-level Babysitter Codex setup.
team-install
Install the team-pinned Babysitter Codex workspace setup.
retrospect
Summarize or retrospect on a completed Babysitter run.
resume
Resume an existing Babysitter run from Codex.
project-install
Install the Babysitter Codex workspace integration into the current project.
plan
Plan a Babysitter workflow without executing the run.
observe
Observe, inspect, or monitor a Babysitter run.
model
Inspect or change Babysitter model-routing policy by phase.