data-validation-reporter

Generate interactive validation reports with quality scoring, missing data analysis, and type checking. Combines Pandas validation, Plotly visualization, and YAML configuration for comprehensive data quality reporting.

Best use case

data-validation-reporter is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using data-validation-reporter can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/data-validation-reporter/SKILL.md --create-dirs "https://raw.githubusercontent.com/vamseeachanta/workspace-hub/main/_archive/skills/workspace-hub/data-validation-reporter/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/data-validation-reporter/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How data-validation-reporter Compares

| Feature / Agent | data-validation-reporter | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Generate interactive validation reports with quality scoring, missing data analysis, and type checking. Combines Pandas validation, Plotly visualization, and YAML configuration for comprehensive data quality reporting.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Data Validation Reporter Skill

## Overview

This skill provides a complete data validation and reporting workflow:
- **Data validation** with configurable quality rules
- **Interactive Plotly reports** with 4-panel dashboards
- **YAML configuration** for validation parameters
- **Quality scoring** (0-100 scale)
- **Missing data analysis** with visualizations
- **Type checking** with automated detection

## Pattern Analysis

**Discovered from commit**: `47b64945` (digitalmodel)
**Original file**: `src/data_procurement/validators/data_validator.py`
**Reusability score**: 80/100

**Patterns used**:
- plotly_viz (interactive dashboards)
- pandas_processing (DataFrame validation)
- data_validation (quality scoring)
- yaml_config (configuration loading)
- logging (structured logging)

## Core Capabilities

### 1. Data Validation
```python
validator = DataValidator(config_path="config/validation.yaml")
results = validator.validate_dataframe(
    df=data,
    required_fields=["id", "value", "timestamp"],
    unique_field="id"
)
```

**Validation checks**:
- Empty DataFrame detection
- Required field verification
- Missing data analysis (per-column percentages)
- Duplicate detection
- Data type validation
- Numeric field validation
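
A few of the checks listed above can be sketched directly with pandas. This is a minimal illustration, not the skill's actual implementation: `basic_checks` and the keys of the dictionary it returns are hypothetical names chosen for this example.

```python
import pandas as pd


def basic_checks(df: pd.DataFrame, required_fields: list[str], unique_field: str) -> dict:
    """Hypothetical helper covering empty-frame, required-field,
    missing-data, and duplicate checks."""
    if df.empty:
        return {"empty": True, "missing_required": list(required_fields),
                "missing_pct": {}, "duplicates": 0}

    # Required fields that are absent from the DataFrame's columns.
    missing_required = [f for f in required_fields if f not in df.columns]

    # Share of null cells per column, expressed as a percentage.
    missing_pct = (df.isna().mean() * 100).round(1).to_dict()

    # Rows whose unique_field value repeats an earlier row.
    duplicates = 0
    if unique_field in df.columns:
        duplicates = int(df[unique_field].duplicated().sum())

    return {"empty": False, "missing_required": missing_required,
            "missing_pct": missing_pct, "duplicates": duplicates}


# Made-up sample data for illustration only.
sample = pd.DataFrame({"id": [1, 2, 2, 4], "value": [10.0, None, 30.0, 40.0]})
report = basic_checks(sample, required_fields=["id", "value", "timestamp"], unique_field="id")
print(report)
```

Here the sample frame lacks the `timestamp` column, has one null in `value`, and repeats one `id`, so each check reports exactly one finding.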

### 2. Quality Scoring Algorithm

**Score calculation** (0-100 scale):
- Base score: 100
- Missing required fields: -20
- High missing data (>50%): -30
- Moderate missing data (>20%): -15
- Duplicate records: -2 per duplicate (max -20)
- Type issues: -5 per issue (max -15)

**Status thresholds**:
- ✅ PASS: score ≥ 60
- ❌ FAIL: score < 60
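
The penalty table above leaves some details open, so the following is only one possible reading, written as a sketch. It assumes that the missing-required-fields penalty is flat (not per field), that the two missing-data penalties are mutually exclusive and driven by the worst column, and that percentages are on a 0-100 scale. The function names and parameters are hypothetical.

```python
def quality_score(missing_required: int, worst_missing_pct: float,
                  duplicates: int, type_issues: int) -> float:
    """Apply the documented penalties to a base score of 100 (assumed reading)."""
    score = 100.0
    if missing_required > 0:
        score -= 20                       # flat penalty (assumption)
    if worst_missing_pct > 50:
        score -= 30                       # high missing data
    elif worst_missing_pct > 20:
        score -= 15                       # moderate missing data
    score -= min(2 * duplicates, 20)      # -2 per duplicate, capped at -20
    score -= min(5 * type_issues, 15)     # -5 per type issue, capped at -15
    return score


def status(score: float, threshold: float = 60.0) -> str:
    """PASS when the score meets the threshold, FAIL otherwise."""
    return "PASS" if score >= threshold else "FAIL"


example = quality_score(missing_required=1, worst_missing_pct=25.0,
                        duplicates=3, type_issues=1)
print(example, status(example))
```

Under these assumptions the penalties sum to at most 85, so the score never drops below 15.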

### 3. Interactive Reporting

**4-Panel Plotly Dashboard**:
1. **Quality Score Gauge** - Color-coded indicator (green/yellow/red)
2. **Missing Data Chart** - Bar chart showing missing % per column
3. **Type Issues Chart** - Bar chart of validation errors
4. **Summary Table** - Key metrics overview

**Features**:
- Responsive design
- Interactive hover tooltips
- Zoom and pan controls
- Export to PNG/SVG
- CDN-based Plotly.js (the HTML report bundles no local JavaScript; viewing it requires network access)

### 4. YAML Configuration

```yaml
# config/validation.yaml
validation:
  required_fields:
    - id
    - timestamp
    - value

  unique_fields:
    - id

  numeric_fields:
    - year_built
    - length_m
    - displacement_tonnes

  thresholds:
    max_missing_pct: 0.2  # 20%
    min_quality_score: 60
    max_duplicates: 0
```
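
One way the `thresholds` block might be consumed is to map it onto a small typed object that rejects out-of-range values early. The sketch below assumes the YAML has already been parsed into a dictionary (for example with `yaml.safe_load`); `Thresholds` and `thresholds_from_config` are hypothetical names, not part of the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container mirroring the thresholds block of the config."""
    max_missing_pct: float = 0.2      # fraction of cells, so 0.2 means 20%
    min_quality_score: float = 60
    max_duplicates: int = 0

    def __post_init__(self) -> None:
        if not 0.0 <= self.max_missing_pct <= 1.0:
            raise ValueError("max_missing_pct must be a fraction between 0 and 1")
        if not 0 <= self.min_quality_score <= 100:
            raise ValueError("min_quality_score must be between 0 and 100")
        if self.max_duplicates < 0:
            raise ValueError("max_duplicates must not be negative")


def thresholds_from_config(config: dict) -> Thresholds:
    """Read validation.thresholds; keys that are left out fall back to the defaults."""
    raw = config.get("validation", {}).get("thresholds", {})
    return Thresholds(**raw)  # an unknown key raises TypeError, which surfaces typos


# Dictionary equivalent of the thresholds section shown above.
config = {"validation": {"thresholds": {
    "max_missing_pct": 0.2, "min_quality_score": 60, "max_duplicates": 0}}}
thresholds = thresholds_from_config(config)
print(thresholds)
```

Note that the config expresses the missing-data limit as a fraction (0.2), so any code comparing it against percentages needs to convert one side.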

## Usage

### Basic Validation

```python
from data_validator import DataValidator
import pandas as pd

# Initialize with config
validator = DataValidator(config_path="config/validation.yaml")

# Load data
df = pd.read_csv("data/input.csv")

# Validate
results = validator.validate_dataframe(
    df=df,
    required_fields=["id", "name", "value"],
    unique_field="id"
)

# Check results
if results['valid']:
    print(f"✅ PASS - Quality Score: {results['quality_score']:.1f}/100")
else:
    print(f"❌ FAIL - Issues: {len(results['issues'])}")
    for issue in results['issues']:
        print(f"  - {issue}")
```

### Generate Interactive Report

```python
from pathlib import Path

# Generate HTML report
validator.generate_interactive_report(
    validation_results=results,
    output_path=Path("reports/validation_report.html")
)

print("📊 Interactive report saved to reports/validation_report.html")
```

### Text Report

```python
# Generate text summary
text_report = validator.generate_report(results)
print(text_report)
```

## Files Included

```
data-validation-reporter/
├── SKILL.md                    # This file
├── validator_template.py       # Validator class template
├── config_template.yaml        # YAML configuration template
├── example_usage.py            # Example implementation
└── README.md                   # Quick reference
```

## Integration

### Add to Existing Project

1. **Copy validator template**:
```bash
mkdir -p src/validators  # create the target directory if it does not exist yet
cp validator_template.py src/validators/data_validator.py
```

2. **Create configuration**:
```bash
cp config_template.yaml config/validation.yaml
# Edit config/validation.yaml with your validation rules
```

3. **Install dependencies**:
```bash
uv pip install pandas plotly pyyaml
```

4. **Use in pipeline**:
```python
from pathlib import Path

from src.validators.data_validator import DataValidator

# df is the pandas DataFrame produced by an earlier pipeline step

validator = DataValidator(config_path="config/validation.yaml")
results = validator.validate_dataframe(df)
validator.generate_interactive_report(results, Path("reports/output.html"))
```

## Customization

### Extend Validation Rules

```python
from typing import List

import pandas as pd

from data_validator import DataValidator


class CustomValidator(DataValidator):
    def _check_business_rules(self, df: pd.DataFrame) -> List[str]:
        """Add custom business logic validation."""
        issues = []

        # Example: Check date ranges
        if 'start_date' in df.columns and 'end_date' in df.columns:
            invalid_dates = (df['end_date'] < df['start_date']).sum()
            if invalid_dates > 0:
                issues.append(f'{invalid_dates} records with end_date before start_date')

        return issues
```

### Custom Visualizations

```python
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Add 5th panel to dashboard
fig = make_subplots(
    rows=3, cols=2,
    specs=[
        [{'type': 'indicator'}, {'type': 'bar'}],
        [{'type': 'bar'}, {'type': 'table'}],
        [{'type': 'scatter', 'colspan': 2}, None]  # New panel
    ]
)

# Add custom plot (assumes df holds one row per validation run,
# with 'date' and 'quality_score' columns)
fig.add_trace(
    go.Scatter(x=df['date'], y=df['quality_score'], name='Quality Trend'),
    row=3, col=1
)
```

## Performance

**Benchmarks** (tested on 100,000 row dataset):
- Validation: ~2.5 seconds
- Report generation: ~1.2 seconds
- Total: ~3.7 seconds

**Memory usage**: ~150MB for 100k rows

**Scalability**:
- Tested up to 1M rows
- Linear scaling for validation
- Report generation optimized with sampling for large datasets

## Best Practices

1. **Configuration Management**:
   - Store validation rules in YAML (version controlled)
   - Use environment-specific configs (dev/staging/prod)
   - Document validation thresholds

2. **Logging**:
   - Enable DEBUG level during development
   - Use INFO level in production
   - Log all validation failures

3. **Reporting**:
   - Generate reports for all production data loads
   - Archive reports with timestamps
   - Include reports in data lineage

4. **Quality Gates**:
   - Set minimum quality score thresholds
   - Block pipelines on validation failures
   - Alert on quality degradation
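
The quality-gate practice can be sketched as a small function that raises when validation fails, so a pipeline step stops instead of loading bad data. It assumes the results dictionary exposes the `valid`, `quality_score`, and `issues` keys used in the Basic Validation example; `QualityGateError` and `enforce_quality_gate` are hypothetical names.

```python
class QualityGateError(RuntimeError):
    """Raised when a dataset does not meet the minimum quality bar."""


def enforce_quality_gate(results: dict, min_quality_score: float = 60.0) -> float:
    """Return the score if the gate passes, otherwise raise QualityGateError."""
    score = float(results["quality_score"])
    if not results.get("valid", False) or score < min_quality_score:
        issues = "; ".join(results.get("issues", [])) or "no details reported"
        raise QualityGateError(
            f"validation failed (score {score:.1f}, minimum {min_quality_score:.1f}): {issues}"
        )
    return score


# Made-up results for illustration.
passing = {"valid": True, "quality_score": 85.0, "issues": []}
failing = {"valid": False, "quality_score": 45.0, "issues": ["3 duplicate ids"]}

print(enforce_quality_gate(passing))
try:
    enforce_quality_gate(failing)
except QualityGateError as exc:
    print(f"Pipeline blocked: {exc}")
```

Raising an exception rather than returning a flag means an orchestrator that treats uncaught errors as task failures will block downstream steps without extra wiring.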

## Dependencies

```txt
pandas>=1.5.0
plotly>=5.14.0
pyyaml>=6.0
```

## Related Skills

- **csv-data-loader** - Load and preprocess CSV data
- **plotly-dashboard** - Advanced dashboard creation
- **data-quality-monitor** - Continuous quality monitoring

## Examples

See `example_usage.py` for complete working examples:
- Basic validation workflow
- Custom validation rules
- Batch validation (multiple files)
- Quality trend analysis
- Integration with data pipelines

## Change Log

**v1.0.0** (2026-01-07)
- Initial skill creation from production code
- 4-panel Plotly dashboard
- YAML configuration support
- Quality scoring algorithm
- Missing data and type validation

## License

Part of workspace-hub skill library. See root LICENSE.

## Support

For issues or enhancements, see workspace-hub issue tracker.

Related Skills

worldenergydata-source-readiness

5
from vamseeachanta/workspace-hub

Route agents to the canonical worldenergydata source-readiness skill and summary script. Use when asked for worldenergydata data completeness, data locations, latest known data dates, scheduler freshness, source-readiness status, or acceptance-criteria inputs across the repo ecosystem.

sodir-data-extractor

5
from vamseeachanta/workspace-hub

Extract and process Norwegian Petroleum Directorate field and production data from SODIR

metocean-data-fetcher

5
from vamseeachanta/workspace-hub

Fetch real-time and historical metocean data from NDBC, CO-OPS, Open-Meteo, ERDDAP, and MET Norway. Use for buoy data retrieval, tidal observations, marine forecasts, and multi-source data fusion.

energy-data-visualizer

5
from vamseeachanta/workspace-hub

Interactive visualization for oil and gas production data analysis using Plotly dashboards

bsee-data-extractor

5
from vamseeachanta/workspace-hub

Extract and process BSEE (Bureau of Safety and Environmental Enforcement) data including production, WAR (Well Activity Reports), and APD (Application for Permit to Drill) data. Use for querying production data, well activities, drilling permits, completions, and workovers by API number, block, lease, or field with automatic data normalization and caching.

gtm-demo-validation-cache-regression-repair

5
from vamseeachanta/workspace-hub

Diagnose and repair GTM demo validation failures caused by legacy cache files missing intermediate chart data, especially in nested digitalmodel demo scripts using --from-cache.

tax-return-data-capture-and-archival

5
from vamseeachanta/workspace-hub

Capture structured tax return summaries as YAML for year-over-year comparison, with fallback to manual PDF download and relocation when automation fails

repo-separation-for-sensitive-data

5
from vamseeachanta/workspace-hub

Architecture pattern for splitting confidential data and reusable algorithms across repos

plan-gated-issue-validation-workflow

5
from vamseeachanta/workspace-hub

Systematic validation pattern for plan-approved GitHub issues with pre-existing deliverables

metadata-only-wiki-sweep-workflow

5
from vamseeachanta/workspace-hub

Disciplined inventory process for cataloging documents by filename/path without content claims, using parent-centric grouping to prevent stub proliferation

metadata-only-inventory-sweep

5
from vamseeachanta/workspace-hub

Execute constrained file inventory sweeps with metadata-only stubs and validation, useful for staged documentation work on large file sets

handle-blocked-financial-sites-data-export

5
from vamseeachanta/workspace-hub

Workflow for extracting data from blocked financial sites when browser automation is restricted