data-cog-guide

Upload messy CSVs with minimal prompting for deep automated analysis

by wentorai

Best use case

data-cog-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using data-cog-guide should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/data-cog-guide/SKILL.md --create-dirs "https://raw.githubusercontent.com/wentorai/research-plugins/main/skills/analysis/wrangling/data-cog-guide/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/data-cog-guide/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How data-cog-guide Compares

| Feature / Agent | data-cog-guide | Standard Approach |
|-----------------|----------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

It takes messy, poorly documented CSV files and, with minimal prompting, infers their structure, cleans anomalies, and produces an analytical report.

Where can I find the source code?

The source code is on GitHub in the wentorai/research-plugins repository, under skills/analysis/wrangling/data-cog-guide/ (the same path used in the installation command).

SKILL.md Source

# Data Cog Guide

An intelligent data analysis assistant that accepts messy, poorly documented CSV files and automatically infers structure, cleans anomalies, and produces deep analytical reports with minimal user prompting. Designed for researchers who need quick insights from unfamiliar or inherited datasets without spending hours on manual data preparation.

## Overview

Researchers frequently receive datasets from collaborators, public repositories, or legacy systems that lack documentation, use inconsistent formatting, and vary in quality. Traditional analysis requires significant upfront effort to understand and prepare such data. Data Cog automates this process by applying heuristic inference, pattern recognition, and iterative cleaning to produce analysis-ready data along with a comprehensive profile report.

The skill implements a "zero-configuration" philosophy: provide the CSV file path and an optional research question, and it handles encoding detection, delimiter inference, type casting, missingness assessment, and initial exploratory statistics automatically.

## Automated Ingestion Pipeline

### Smart Loading

```python
import csv

import chardet
import pandas as pd


def smart_load_csv(filepath: str) -> tuple[pd.DataFrame, dict]:
    """
    Intelligently load a CSV file, auto-detecting encoding,
    delimiter, header row, and comment lines.
    """
    # Step 1: Detect encoding from the first 100 kB of the file
    with open(filepath, 'rb') as f:
        raw = f.read(100000)
    # chardet returns None when it cannot guess; fall back to UTF-8
    encoding = chardet.detect(raw)['encoding'] or 'utf-8'

    # Step 2: Detect header row (skip leading comment and blank lines)
    skip_rows = 0
    with open(filepath, 'r', encoding=encoding, errors='replace') as f:
        for line in f:
            if line.startswith(('#', '//')) or line.strip() == '':
                skip_rows += 1
            else:
                break

    # Step 3: Detect delimiter from a sample that starts at the header,
    # so that comment lines cannot mislead the sniffer
    with open(filepath, 'r', encoding=encoding, errors='replace') as f:
        for _ in range(skip_rows):
            f.readline()
        sample = f.read(8192)
    try:
        delimiter = csv.Sniffer().sniff(sample).delimiter
    except csv.Error:
        delimiter = ','  # sniffing failed; assume a plain comma

    # Step 4: Load with inferred parameters
    # encoding_errors requires pandas >= 1.3 and mirrors errors='replace' above
    df = pd.read_csv(
        filepath, encoding=encoding, encoding_errors='replace',
        delimiter=delimiter, skiprows=skip_rows, low_memory=False
    )

    metadata = {
        'encoding': encoding,
        'delimiter': repr(delimiter),
        'skipped_rows': skip_rows,
        'shape': df.shape
    }
    return df, metadata
```

### Automatic Type Inference

```python
def auto_cast_columns(df: pd.DataFrame) -> pd.DataFrame:
    """
    Automatically cast columns to their most appropriate types.
    Handles dates, numerics stored as strings, booleans, and categories.
    Modifies df in place and returns it.
    """
    bool_map = {'true': True, 'false': False, 'yes': True, 'no': False,
                '1': True, '0': False, 'y': True, 'n': False}

    for col in df.columns:
        non_null = df[col].dropna()
        if non_null.empty:
            continue  # nothing to infer from an all-missing column

        # Try numeric conversion; values that fail to parse become NaN
        numeric = pd.to_numeric(df[col], errors='coerce')
        if numeric.notna().mean() > 0.85:
            df[col] = numeric
            continue

        # Try datetime conversion (pandas infers the format on its own;
        # infer_datetime_format is deprecated since pandas 2.0)
        parsed_dates = pd.to_datetime(df[col], errors='coerce')
        if parsed_dates.notna().mean() > 0.85:
            df[col] = parsed_dates
            continue

        # Try boolean detection. Columns holding only 1/0 are already
        # caught by the numeric branch above and stay numeric.
        if set(non_null.astype(str).str.lower().unique()).issubset(bool_map):
            df[col] = df[col].astype(str).str.lower().map(bool_map)
            continue

        # Convert low-cardinality strings to category
        n_unique = df[col].nunique()
        if n_unique / len(df) < 0.05 and n_unique < 50:
            df[col] = df[col].astype('category')

    return df
```

## Deep Automated Profiling

### Profile Report Generation

The profiling stage produces a structured report covering:

1. **Schema overview**: Column names, inferred types, semantic roles (ID, feature, target, timestamp).
2. **Univariate statistics**: Mean, median, mode, std, skewness, kurtosis for numeric columns; frequency tables for categoricals.
3. **Missing data matrix**: Heatmap-style report of missingness patterns across all columns.
4. **Correlation analysis**: Pairwise Pearson, Spearman, and Cramér's V correlations.
5. **Distribution flags**: Columns that are heavily skewed, zero-inflated, or constant.
6. **Duplicate detection**: Exact row duplicates and near-duplicate clusters.

| Metric | Numeric Columns | Categorical Columns |
|--------|----------------|-------------------|
| Central tendency | Mean, median, mode | Mode, frequency |
| Dispersion | Std, IQR, range, CV | Unique count, entropy |
| Shape | Skewness, kurtosis | Imbalance ratio |
| Quality | Missing %, zero %, outlier % | Missing %, rare labels % |
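
Most of the metrics in the table map onto a few pandas calls. The sketch below is one possible way to compute them, not the skill's actual implementation. The function names `profile_numeric` and `profile_categorical`, the 1.5 * IQR outlier rule, and the 1% rare-label cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def profile_numeric(s: pd.Series) -> dict:
    """Summarize one numeric column (illustrative metric choices)."""
    valid = s.dropna()
    q1, q3 = valid.quantile(0.25), valid.quantile(0.75)
    iqr = q3 - q1
    # Assumption: flag outliers with the common 1.5 * IQR fence rule.
    outliers = (valid < q1 - 1.5 * iqr) | (valid > q3 + 1.5 * iqr)
    return {
        "mean": valid.mean(),
        "median": valid.median(),
        "std": valid.std(),
        "iqr": iqr,
        "skewness": valid.skew(),
        "kurtosis": valid.kurt(),
        "missing_pct": 100 * s.isna().mean(),
        "zero_pct": 100 * (valid == 0).mean(),
        "outlier_pct": 100 * outliers.mean(),
    }


def profile_categorical(s: pd.Series, rare_cutoff: float = 0.01) -> dict:
    """Summarize one categorical column (illustrative metric choices)."""
    freqs = s.value_counts(normalize=True)  # NaN is excluded by default
    # Shannon entropy in bits over the observed label frequencies.
    entropy = float(-(freqs * np.log2(freqs)).sum())
    return {
        "mode": freqs.index[0] if len(freqs) else None,
        "unique_count": int(s.nunique()),
        "entropy_bits": entropy,
        "imbalance_ratio": float(freqs.max() / freqs.min()) if len(freqs) else None,
        "missing_pct": 100 * s.isna().mean(),
        # Share of non-missing rows whose label frequency is below rare_cutoff.
        "rare_label_pct": 100 * float(freqs[freqs < rare_cutoff].sum()),
    }


# Toy data; column names are illustrative only.
df = pd.DataFrame({
    "income": [10.0, 12.0, 11.0, 0.0, None, 500.0],
    "group": ["a", "a", "a", "b", None, "a"],
})
numeric_report = profile_numeric(df["income"])
categorical_report = profile_categorical(df["group"])
```

Missing values are excluded before computing the statistics and reported separately as `missing_pct`, so a sparse column does not distort its own mean or skewness.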

## Interactive Analysis Workflow

### Minimal-Prompt Usage Pattern

The recommended workflow takes at most three inputs:

1. **File path**: The CSV to analyze.
2. **Research question** (optional): A one-sentence description of what you want to learn.
3. **Output format**: "summary", "full_report", or "cleaned_csv".

```
User: Analyze /data/survey_results_2025.csv
      Question: What factors predict participant satisfaction?
      Output: full_report

Data Cog will:
  1. Load and profile the dataset (auto-detect everything)
  2. Clean and transform (handle missing data, encode categoricals)
  3. Run correlation analysis focused on satisfaction-related columns
  4. Generate regression models predicting satisfaction
  5. Produce a structured report with findings and visualizations
```
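
Step 3 of the walkthrough above could look roughly like the following. This is a minimal sketch under stated assumptions: the helper name `rank_predictors`, the choice of Spearman rank correlation, and every column name in the toy survey are hypothetical and not taken from the skill itself.

```python
import pandas as pd


def rank_predictors(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank numeric columns by absolute Spearman correlation with target."""
    numeric = df.select_dtypes(include="number")
    # Spearman correlation is Pearson correlation computed on ranks.
    corr = numeric.rank().corr()[target].drop(target)
    # Order by strength of association regardless of sign.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Hypothetical survey data; column names are illustrative only.
survey = pd.DataFrame({
    "satisfaction": [1, 2, 3, 4, 5, 6],
    "support_rating": [1, 2, 3, 4, 5, 6],      # rises with satisfaction
    "wait_minutes": [30, 25, 26, 15, 10, 5],   # mostly falls with it
    "respondent_id": [4, 1, 6, 2, 5, 3],       # unrelated identifier
})
ranking = rank_predictors(survey, "satisfaction")
```

A ranking like this only shortlists candidate predictors. It says nothing about causation, and the regression in step 4 is still needed to account for overlap between features.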

### Iterative Refinement

After the initial automated analysis, you can refine by asking targeted follow-up questions:

- "Focus only on respondents from Group A"
- "Exclude the first 50 rows (pilot data)"
- "Treat column X as ordinal with levels: low < medium < high"
- "Run the same analysis but with log-transformed income"
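
Each of these follow-ups corresponds to a small pandas operation. The sketch below shows one plausible translation. The DataFrame and its column names (`group`, `level`, `income`) are hypothetical, and the pilot-row cutoff is reduced from 50 to 2 to fit the toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical data; all column names are illustrative only.
df = pd.DataFrame({
    "group": ["A", "B", "A", "A", "B", "A"],
    "level": ["low", "high", "medium", "low", "medium", "high"],
    "income": [0.0, 1200.0, 3500.0, 800.0, 15000.0, 2200.0],
})

# "Focus only on respondents from Group A"
group_a = df[df["group"] == "A"]

# "Exclude the first 50 rows (pilot data)"; 2 rows here to fit the toy data
without_pilot = df.iloc[2:]

# "Treat column X as ordinal with levels: low < medium < high"
level_type = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
df["level"] = df["level"].astype(level_type)

# "Run the same analysis but with log-transformed income"
# log1p computes log(1 + x), so zero incomes stay finite.
df["log_income"] = np.log1p(df["income"])
```

Declaring the ordering explicitly matters: without `ordered=True`, comparisons such as `df["level"] > "low"` raise an error, and sorting falls back to alphabetical order.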

## Best Practices

- Always review the auto-generated profile before trusting downstream results.
- Verify that automatic type inference made sensible choices, especially for ambiguous columns.
- Provide a research question when possible to guide feature selection and analysis focus.
- Save the cleaning audit log alongside your results for reproducibility.
- For datasets over 1 million rows, consider sampling for the initial profile to save time.
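
The last two practices can be sketched in a few lines. The code below is an illustration only: the helper names `profiling_sample` and `audit_entry`, the 100,000-row sample size, and the audit log fields are assumptions, since the skill does not specify a log format.

```python
import json

import pandas as pd

MAX_PROFILE_ROWS = 1_000_000  # threshold taken from the guideline above


def profiling_sample(df: pd.DataFrame, n: int = 100_000, seed: int = 0) -> pd.DataFrame:
    """Return a reproducible random sample for large frames, else df itself."""
    if len(df) > MAX_PROFILE_ROWS:
        return df.sample(n=n, random_state=seed)
    return df


def audit_entry(step: str, before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Record what one cleaning step changed, in a JSON-friendly form."""
    return {
        "step": step,
        "rows_before": len(before),
        "rows_after": len(after),
        "dtypes_changed": sorted(
            col for col in after.columns
            if col in before.columns and before[col].dtype != after[col].dtype
        ),
    }


# Toy data; column names are illustrative only.
raw = pd.DataFrame({"age": ["31", "42", "42"], "city": ["Oslo", "Rome", "Rome"]})
cleaned = raw.drop_duplicates().assign(age=lambda d: pd.to_numeric(d["age"]))
log = [audit_entry("drop duplicates, cast age", raw, cleaned)]
log_json = json.dumps(log, indent=2)  # write this next to the results
```

Fixing `random_state` keeps the sampled profile reproducible, which is what makes it meaningful to store alongside the audit log.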
