dvc-dataset-versioning

Dataset versioning skill using DVC for tracking data changes, managing data pipelines, and ensuring reproducibility.

509 stars

Best use case

dvc-dataset-versioning is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using dvc-dataset-versioning can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/dvc-dataset-versioning/SKILL.md --create-dirs "https://raw.githubusercontent.com/a5c-ai/babysitter/main/library/specializations/data-science-ml/skills/dvc-dataset-versioning/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/dvc-dataset-versioning/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How dvc-dataset-versioning Compares

Feature / Agent	dvc-dataset-versioning	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Dataset versioning skill using DVC for tracking data changes, managing data pipelines, and ensuring reproducibility.

Where can I find the source code?

The source code is in the a5c-ai/babysitter repository on GitHub, under library/specializations/data-science-ml/skills/dvc-dataset-versioning.

SKILL.md Source

# dvc-dataset-versioning

## Overview

Dataset versioning skill using DVC (Data Version Control) for tracking data changes, managing data pipelines, and ensuring reproducibility in ML workflows.

## Capabilities

- Dataset version tracking
- Data pipeline definition and execution
- Remote storage management (S3, GCS, Azure, etc.)
- Reproducibility enforcement
- Data lineage tracking
- Experiment comparison with data versions
- Cache management for large datasets

## Target Processes

- Data Collection and Validation Pipeline
- ML Model Retraining Pipeline
- Feature Store Implementation

## Tools and Libraries

- DVC
- Git
- Remote storage SDKs (boto3, google-cloud-storage, etc.)

## Input Schema

```json
{
  "type": "object",
  "required": ["action"],
  "properties": {
    "action": {
      "type": "string",
      "enum": ["init", "add", "push", "pull", "diff", "checkout", "run", "repro"],
      "description": "DVC action to perform"
    },
    "paths": {
      "type": "array",
      "items": { "type": "string" },
      "description": "File or directory paths to track"
    },
    "remote": {
      "type": "string",
      "description": "Remote storage name"
    },
    "revision": {
      "type": "string",
      "description": "Git revision for checkout/diff"
    },
    "pipeline": {
      "type": "object",
      "description": "Pipeline stage definition for run action"
    }
  }
}
```
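
The schema above can be checked before any DVC command runs. The following is a minimal sketch that mirrors the schema's rules by hand; `validateInput` is a hypothetical helper introduced for illustration, not part of the skill, and a real integration might prefer a full JSON Schema validator.

```javascript
// Hypothetical helper: hand-rolled checks mirroring the Input Schema.
const ACTIONS = ['init', 'add', 'push', 'pull', 'diff', 'checkout', 'run', 'repro'];

function validateInput(input) {
  if (input === null || typeof input !== 'object' || Array.isArray(input)) {
    return ['input must be an object'];
  }
  const errors = [];
  // "action" is the only required property and must come from the enum.
  if (!('action' in input)) {
    errors.push('missing required property: action');
  } else if (!ACTIONS.includes(input.action)) {
    errors.push(`action must be one of: ${ACTIONS.join(', ')}`);
  }
  // "paths" is optional, but when present it must be an array of strings.
  if ('paths' in input &&
      !(Array.isArray(input.paths) && input.paths.every((p) => typeof p === 'string'))) {
    errors.push('paths must be an array of strings');
  }
  // "remote" and "revision" are optional strings.
  for (const key of ['remote', 'revision']) {
    if (key in input && typeof input[key] !== 'string') {
      errors.push(`${key} must be a string`);
    }
  }
  // "pipeline" is an optional object; its inner fields are not specified.
  if ('pipeline' in input &&
      (input.pipeline === null || typeof input.pipeline !== 'object' || Array.isArray(input.pipeline))) {
    errors.push('pipeline must be an object');
  }
  return errors;
}

console.log(validateInput({ action: 'add', paths: ['data/train.csv'] })); // prints []
console.log(validateInput({ action: 'commit' })); // one error: action not in the enum
```

Returning a list of messages rather than throwing lets a caller report every problem at once.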

## Output Schema

```json
{
  "type": "object",
  "required": ["status", "action"],
  "properties": {
    "status": {
      "type": "string",
      "enum": ["success", "error"]
    },
    "action": {
      "type": "string"
    },
    "trackedFiles": {
      "type": "array",
      "items": { "type": "string" }
    },
    "changes": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "path": { "type": "string" },
          "status": { "type": "string" },
          "hash": { "type": "string" }
        }
      }
    },
    "remote": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "url": { "type": "string" },
        "syncStatus": { "type": "string" }
      }
    }
  }
}
```
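
As an illustration of how the `changes` array might be populated for a `diff` action, the sketch below flattens a report shaped like the output of `dvc diff --json`. The report shape used here (keys `added`, `deleted`, and `modified`, each holding `{ path, hash }` entries, with modified hashes given as `{ old, new }`) is an assumption, and `toChanges`, `buildDiffOutput`, and the sample data are hypothetical.

```javascript
// Hypothetical helper: flatten a diff report into the Output Schema's "changes" items.
function toChanges(report) {
  const changes = [];
  for (const status of ['added', 'deleted', 'modified']) {
    for (const entry of report[status] ?? []) {
      // Assumed: modified entries carry { old, new } hashes; keep the new one.
      const hash = entry.hash && typeof entry.hash === 'object' ? entry.hash.new : entry.hash;
      changes.push({ path: entry.path, status, hash: String(hash ?? '') });
    }
  }
  return changes;
}

// Hypothetical helper: wrap the changes in an object with the required fields.
function buildDiffOutput(report) {
  return { status: 'success', action: 'diff', changes: toChanges(report) };
}

// Made-up sample report, for illustration only.
const sample = {
  added: [{ path: 'data/test.csv', hash: 'aaa111' }],
  deleted: [],
  modified: [{ path: 'data/train.csv', hash: { old: 'bbb222', new: 'ccc333' } }],
};

console.log(JSON.stringify(buildDiffOutput(sample), null, 2));
```

Only `status` and `action` are required by the schema, so an output for other actions could omit `changes` entirely.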

## Usage Example

```javascript
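// Task definition that invokes the skill to start tracking two dataset files.
const task = {
  kind: 'skill',
  title: 'Version training dataset',
  skill: {
    name: 'dvc-dataset-versioning',
    context: {
      action: 'add', // one of the actions in the Input Schema enum
      paths: ['data/train.csv', 'data/test.csv'],
      remote: 's3-bucket' // remote storage name
    }
  }
};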
```
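
To make the link between a skill context and the DVC command line concrete, here is a rough sketch of one possible mapping. It is an assumption about how such a skill could behave, not the skill's actual implementation; `toDvcCommands` is a hypothetical name. The `run` action is left out because the source does not specify the fields of the `pipeline` object.

```javascript
// Hypothetical helper: translate a skill context into CLI argument lists.
// This mapping is an illustrative assumption, not the skill's own code.
function toDvcCommands(context) {
  const { action, paths = [], remote, revision } = context;
  const remoteArgs = remote ? ['-r', remote] : [];
  switch (action) {
    case 'init':
      return [['dvc', 'init']];
    case 'add':
      // In this sketch, "remote" is not used when adding files.
      return [['dvc', 'add', ...paths]];
    case 'push':
    case 'pull':
      return [['dvc', action, ...remoteArgs, ...paths]];
    case 'diff':
      return [['dvc', 'diff', ...(revision ? [revision] : [])]];
    case 'checkout':
      // Switch Git first so the .dvc files match the revision, then sync the data.
      return [
        ...(revision ? [['git', 'checkout', revision]] : []),
        ['dvc', 'checkout', ...paths],
      ];
    case 'repro':
      return [['dvc', 'repro', ...paths]];
    default:
      throw new Error(`unsupported action in this sketch: ${action}`);
  }
}

const commands = toDvcCommands({
  action: 'add',
  paths: ['data/train.csv', 'data/test.csv'],
  remote: 's3-bucket',
});
console.log(commands.map((argv) => argv.join(' ')).join('\n'));
// prints: dvc add data/train.csv data/test.csv
```

Returning argument arrays instead of a single shell string keeps paths with spaces intact if the commands are later passed to a process spawner.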

Related Skills

data-versioning-manager

509

from a5c-ai/babysitter

Skill for managing data versions and provenance

process-builder

509

from a5c-ai/babysitter

Scaffold new babysitter process definitions following SDK patterns, proper structure, and best practices. Guides the 3-phase workflow from research to implementation.

Workflow & Productivity

babysitter

509

from a5c-ai/babysitter

Orchestrate via @babysitter. Use this skill when asked to babysit a run, orchestrate a process or whenever it is called explicitly. (babysit, babysitter, orchestrate, orchestrate a run, workflow, etc.)