splitting-datasets

Process split datasets into training, validation, and testing sets for ML model development. Use when requesting "split dataset", "train-test split", or "data partitioning". Trigger with relevant phrases based on skill purpose.

Best use case

splitting-datasets is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Process split datasets into training, validation, and testing sets for ML model development. Use when requesting "split dataset", "train-test split", or "data partitioning". Trigger with relevant phrases based on skill purpose.

Teams using splitting-datasets should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/splitting-datasets/SKILL.md --create-dirs "https://raw.githubusercontent.com/foryourhealth111-pixel/Vibe-Skills/main/bundled/skills/splitting-datasets/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/splitting-datasets/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How splitting-datasets Compares

Feature / Agentsplitting-datasetsStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Process split datasets into training, validation, and testing sets for ML model development. Use when requesting "split dataset", "train-test split", or "data partitioning". Trigger with relevant phrases based on skill purpose.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Dataset Splitter

This skill provides automated assistance for dataset splitter tasks.

## Overview

This skill automates the process of dividing a dataset into subsets for training, validating, and testing machine learning models. It ensures proper data preparation and facilitates robust model evaluation.

## How It Works

1. **Analyze Request**: The skill analyzes the user's request to determine the dataset to be split and the desired proportions for each subset.
2. **Generate Code**: Based on the request, the skill generates Python code utilizing standard ML libraries to perform the data splitting.
3. **Execute Splitting**: The code is executed to split the dataset into training, validation, and testing sets according to the specified ratios.

## When to Use This Skill

This skill activates when you need to:
- Prepare a dataset for machine learning model training.
- Create training, validation, and testing sets.
- Partition data to evaluate model performance.

## Examples

### Example 1: Splitting a CSV file

User request: "Split the data in 'my_data.csv' into 70% training, 15% validation, and 15% testing sets."

The skill will:
1. Generate Python code to read the 'my_data.csv' file.
2. Execute the code to split the data according to the specified proportions, creating 'train.csv', 'validation.csv', and 'test.csv' files.

### Example 2: Creating a Train-Test Split

User request: "Create a train-test split of 'large_dataset.csv' with an 80/20 ratio."

The skill will:
1. Generate Python code to load 'large_dataset.csv'.
2. Execute the code to split the dataset into 80% training and 20% testing sets, saving them as 'train.csv' and 'test.csv'.

## Best Practices

- **Data Integrity**: Verify that the splitting process maintains the integrity of the data, ensuring no data loss or corruption.
- **Stratification**: Consider stratification when splitting imbalanced datasets to maintain class distributions in each subset.
- **Randomization**: Ensure the splitting process is randomized to avoid bias in the resulting datasets.

## Integration

This skill can be integrated with other data processing and model training tools within the Claude Code ecosystem to create a complete machine learning workflow.

## Prerequisites

- Appropriate file access permissions
- Required dependencies installed

## Instructions

1. Invoke this skill when the trigger conditions are met
2. Provide necessary context and parameters
3. Review the generated output
4. Apply modifications as needed

## Output

The skill produces structured output relevant to the task.

## Error Handling

- Invalid input: Prompts for correction
- Missing dependencies: Lists required components
- Permission errors: Suggests remediation steps

## Resources

- Project documentation
- Related skills and commands

Related Skills

zinc-database

1174
from foryourhealth111-pixel/Vibe-Skills

Access ZINC (230M+ purchasable compounds). Search by ZINC ID/SMILES, similarity searches, 3D-ready structures for docking, analog discovery, for virtual screening and drug discovery.

zarr-python

1174
from foryourhealth111-pixel/Vibe-Skills

Chunked N-D arrays for cloud storage. Compressed arrays, parallel I/O, S3/GCS integration, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.

yeet

1174
from foryourhealth111-pixel/Vibe-Skills

Use only when the user explicitly asks to stage, commit, push, and open a GitHub pull request in one flow using the GitHub CLI (`gh`).

xlsx

1174
from foryourhealth111-pixel/Vibe-Skills

Spreadsheet toolkit (.xlsx/.csv). Create/edit with formulas/formatting, analyze data, visualization, recalculate formulas, for spreadsheet processing and analysis.

xan

1174
from foryourhealth111-pixel/Vibe-Skills

High-performance CSV processing with xan CLI for large tabular datasets, streaming transformations, and low-memory pipelines.

writing-plans

1174
from foryourhealth111-pixel/Vibe-Skills

Use when you have a spec or requirements for a multi-step task, before touching code

writing-docs

1174
from foryourhealth111-pixel/Vibe-Skills

Guides for writing and editing Remotion documentation. Use when adding docs pages, editing MDX files in packages/docs, or writing documentation content.

windows-hook-debugging

1174
from foryourhealth111-pixel/Vibe-Skills

Windows环境下Claude Code插件Hook执行错误的诊断与修复。当遇到hook error、cannot execute binary file、.sh regex误匹配、WSL/Git Bash冲突时使用。

weights-and-biases

1174
from foryourhealth111-pixel/Vibe-Skills

Track ML experiments with automatic logging, visualize training in real-time, optimize hyperparameters with sweeps, and manage model registry with W&B - collaborative MLOps platform

webthinker-deep-research

1174
from foryourhealth111-pixel/Vibe-Skills

Deep web research for VCO: multi-hop search+browse+extract with an auditable action trace and a structured report (WebThinker-style).

vscode-release-notes-writer

1174
from foryourhealth111-pixel/Vibe-Skills

Guidelines for writing and reviewing Insiders and Stable release notes for Visual Studio Code.

visualization-best-practices

1174
from foryourhealth111-pixel/Vibe-Skills

Visualization Best Practices - Auto-activating skill for Data Analytics. Triggers on: visualization best practices, visualization best practices Part of the Data Analytics skill category.