evidently-drift-detector

Evidently AI skill for data drift detection, model performance monitoring, target drift analysis, and automated reporting for ML systems in production.

509 stars

Best use case

evidently-drift-detector is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Evidently AI skill for data drift detection, model performance monitoring, target drift analysis, and automated reporting for ML systems in production.

Teams using evidently-drift-detector should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/evidently-drift-detector/SKILL.md --create-dirs "https://raw.githubusercontent.com/a5c-ai/babysitter/main/library/specializations/data-science-ml/skills/evidently-drift-detector/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/evidently-drift-detector/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How evidently-drift-detector Compares

Feature / Agentevidently-drift-detectorStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Evidently AI skill for data drift detection, model performance monitoring, target drift analysis, and automated reporting for ML systems in production.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Evidently Drift Detector

Detect data drift, monitor model performance, and generate automated reports using Evidently AI.

## Overview

This skill provides comprehensive capabilities for ML monitoring using Evidently AI. It enables detection of data drift, concept drift, target drift, and model performance degradation in production ML systems.

## Capabilities

### Data Drift Detection
- Feature-level drift detection
- Dataset-level drift analysis
- Multiple drift detection methods (KS, PSI, Wasserstein, etc.)
- Distribution visualization
- Drift magnitude quantification

### Model Performance Monitoring
- Classification metrics tracking
- Regression metrics tracking
- Performance degradation detection
- Slice-based analysis
- Error analysis

### Target Drift Analysis
- Target distribution changes
- Label drift detection
- Prediction drift monitoring
- Class balance monitoring

### Automated Reporting
- HTML report generation
- JSON metrics export
- Dashboard integration
- Custom metric creation
- Test suite execution

### Production Monitoring
- Real-time monitoring integration
- Alerting threshold configuration
- Time-series drift tracking
- Batch comparison analysis

## Prerequisites

### Installation
```bash
pip install evidently>=0.4.0
```

### Optional Dependencies
```bash
# For Spark support
pip install evidently[spark]

# For specific visualizations
pip install plotly nbformat
```

## Usage Patterns

### Basic Data Drift Report
```python
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Define column mapping
column_mapping = ColumnMapping(
    target='target',
    prediction='prediction',
    numerical_features=['feature_1', 'feature_2', 'feature_3'],
    categorical_features=['category_1', 'category_2']
)

# Create drift report
report = Report(metrics=[
    DataDriftPreset()
])

# Run report comparing reference and current data
report.run(
    reference_data=reference_df,
    current_data=current_df,
    column_mapping=column_mapping
)

# Save report
report.save_html("drift_report.html")

# Get metrics as dictionary
metrics_dict = report.as_dict()
```

### Classification Performance Report
```python
from evidently.metric_preset import ClassificationPreset

report = Report(metrics=[
    ClassificationPreset()
])

report.run(
    reference_data=reference_df,
    current_data=current_df,
    column_mapping=column_mapping
)

# Access specific metrics
results = report.as_dict()
accuracy = results['metrics'][0]['result']['current']['accuracy']
```

### Regression Performance Report
```python
from evidently.metric_preset import RegressionPreset

report = Report(metrics=[
    RegressionPreset()
])

report.run(
    reference_data=reference_df,
    current_data=current_df,
    column_mapping=column_mapping
)
```

### Test Suite for Automated Checks
```python
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset, DataQualityTestPreset

# Create test suite
test_suite = TestSuite(tests=[
    DataDriftTestPreset(),
    DataQualityTestPreset()
])

# Run tests
test_suite.run(
    reference_data=reference_df,
    current_data=current_df,
    column_mapping=column_mapping
)

# Check results
if test_suite.as_dict()['summary']['all_passed']:
    print("All tests passed!")
else:
    failed_tests = [t for t in test_suite.as_dict()['tests'] if t['status'] == 'FAIL']
    print(f"Failed tests: {len(failed_tests)}")
```

### Individual Drift Metrics
```python
from evidently.metrics import (
    DatasetDriftMetric,
    ColumnDriftMetric,
    DataDriftTable,
    TargetByFeaturesTable
)

# Detailed drift analysis
report = Report(metrics=[
    DatasetDriftMetric(),
    ColumnDriftMetric(column_name='feature_1'),
    ColumnDriftMetric(column_name='feature_2'),
    DataDriftTable(),
    TargetByFeaturesTable()
])

report.run(
    reference_data=reference_df,
    current_data=current_df,
    column_mapping=column_mapping
)
```

### Custom Drift Thresholds
```python
from evidently.metrics import DatasetDriftMetric
from evidently.options import DataDriftOptions

# Custom options
options = DataDriftOptions(
    drift_share=0.5,  # Share of drifted features to flag dataset drift
    stattest='psi',   # Statistical test
    stattest_threshold=0.1  # PSI threshold
)

report = Report(metrics=[
    DatasetDriftMetric(options=options)
])
```

### Time-Series Monitoring
```python
import pandas as pd
from datetime import datetime, timedelta

def monitor_over_time(reference_df, production_data_stream, window_days=7):
    """Monitor drift over time windows."""
    results = []

    for window_start in production_data_stream:
        window_end = window_start + timedelta(days=window_days)
        current_window = production_data_stream.query(
            f"timestamp >= '{window_start}' and timestamp < '{window_end}'"
        )

        report = Report(metrics=[DataDriftPreset()])
        report.run(reference_data=reference_df, current_data=current_window)

        metrics = report.as_dict()
        results.append({
            'window_start': window_start,
            'drift_detected': metrics['metrics'][0]['result']['dataset_drift'],
            'drift_share': metrics['metrics'][0]['result']['drift_share']
        })

    return pd.DataFrame(results)
```

## Integration with Babysitter SDK

### Task Definition Example
```javascript
const driftDetectionTask = defineTask({
  name: 'evidently-drift-detection',
  description: 'Detect data drift between reference and current data',

  inputs: {
    referenceDataPath: { type: 'string', required: true },
    currentDataPath: { type: 'string', required: true },
    targetColumn: { type: 'string' },
    predictionColumn: { type: 'string' },
    numericalFeatures: { type: 'array' },
    categoricalFeatures: { type: 'array' },
    driftThreshold: { type: 'number', default: 0.5 }
  },

  outputs: {
    driftDetected: { type: 'boolean' },
    driftShare: { type: 'number' },
    driftedFeatures: { type: 'array' },
    reportPath: { type: 'string' }
  },

  async run(inputs, taskCtx) {
    return {
      kind: 'skill',
      title: 'Detect data drift',
      skill: {
        name: 'evidently-drift-detector',
        context: {
          operation: 'detect_drift',
          referenceDataPath: inputs.referenceDataPath,
          currentDataPath: inputs.currentDataPath,
          targetColumn: inputs.targetColumn,
          predictionColumn: inputs.predictionColumn,
          numericalFeatures: inputs.numericalFeatures,
          categoricalFeatures: inputs.categoricalFeatures,
          driftThreshold: inputs.driftThreshold
        }
      },
      io: {
        inputJsonPath: `tasks/${taskCtx.effectId}/input.json`,
        outputJsonPath: `tasks/${taskCtx.effectId}/result.json`
      }
    };
  }
});
```

## Available Presets

### Metric Presets
| Preset | Use Case |
|--------|----------|
| `DataDriftPreset` | Feature drift detection |
| `DataQualityPreset` | Data quality checks |
| `ClassificationPreset` | Classification model performance |
| `RegressionPreset` | Regression model performance |
| `TargetDriftPreset` | Target variable drift |
| `TextOverviewPreset` | Text data analysis |

### Test Presets
| Preset | Use Case |
|--------|----------|
| `DataDriftTestPreset` | Automated drift tests |
| `DataQualityTestPreset` | Data quality validation |
| `DataStabilityTestPreset` | Data stability checks |
| `NoTargetPerformanceTestPreset` | Proxy performance tests |
| `RegressionTestPreset` | Regression performance tests |
| `MulticlassClassificationTestPreset` | Multiclass tests |
| `BinaryClassificationTestPreset` | Binary classification tests |

## Statistical Tests Available

| Test | Method | Best For |
|------|--------|----------|
| `ks` | Kolmogorov-Smirnov | Numerical, general |
| `psi` | Population Stability Index | Production monitoring |
| `wasserstein` | Wasserstein distance | Distribution comparison |
| `jensenshannon` | Jensen-Shannon divergence | Probability distributions |
| `chisquare` | Chi-square | Categorical features |
| `z` | Z-test | Large samples, normal |
| `kl_div` | KL divergence | Information theory |

## ML Pipeline Integration

### Retraining Trigger
```python
def check_retraining_needed(reference_df, current_df, column_mapping, threshold=0.3):
    """Determine if model retraining is needed based on drift."""
    from evidently.test_suite import TestSuite
    from evidently.tests import TestShareOfDriftedColumns

    test_suite = TestSuite(tests=[
        TestShareOfDriftedColumns(lt=threshold)
    ])

    test_suite.run(
        reference_data=reference_df,
        current_data=current_df,
        column_mapping=column_mapping
    )

    results = test_suite.as_dict()
    retraining_needed = not results['summary']['all_passed']

    return {
        'retraining_needed': retraining_needed,
        'drift_share': results['tests'][0]['result']['current'],
        'threshold': threshold
    }
```

### Performance Degradation Alert
```python
def check_performance_degradation(reference_df, current_df, column_mapping, min_accuracy=0.85):
    """Alert if classification accuracy drops below threshold."""
    from evidently.tests import TestAccuracyScore

    test_suite = TestSuite(tests=[
        TestAccuracyScore(gte=min_accuracy)
    ])

    test_suite.run(
        reference_data=reference_df,
        current_data=current_df,
        column_mapping=column_mapping
    )

    results = test_suite.as_dict()
    return {
        'degradation_detected': not results['summary']['all_passed'],
        'current_accuracy': results['tests'][0]['result']['current'],
        'threshold': min_accuracy
    }
```

## Best Practices

1. **Establish Baselines**: Use production data as reference, not training data
2. **Choose Appropriate Tests**: Match statistical tests to data types
3. **Set Meaningful Thresholds**: Balance sensitivity vs. alert fatigue
4. **Monitor Feature Importance**: Focus on high-impact features
5. **Time-Based Comparison**: Compare similar time periods
6. **Document Decisions**: Record why certain drift is acceptable

## References

- [Evidently AI Documentation](https://docs.evidentlyai.com/)
- [Evidently GitHub](https://github.com/evidentlyai/evidently)
- [ML Monitoring Course](https://evidentlyai.com/ml-in-production-course)
- [Evidently Blog](https://www.evidentlyai.com/blog)

Related Skills

homoglyph-detector

509
from a5c-ai/babysitter

Byte-level Unicode homoglyph detection for identifying invisible character substitutions in code

geant4-detector-simulator

509
from a5c-ai/babysitter

Geant4 detector simulation skill for particle transport, detector geometry, and physics process modeling

structural-variant-detector

509
from a5c-ai/babysitter

Structural variant detection skill for identifying CNVs, inversions, translocations, and complex rearrangements

fusion-gene-detector

509
from a5c-ai/babysitter

Gene fusion detection skill for oncology applications with multiple caller integration

memory-leak-detector

509
from a5c-ai/babysitter

Detect memory leaks in desktop applications through heap analysis and object tracking

fairlearn-bias-detector

509
from a5c-ai/babysitter

Fairness assessment skill using Fairlearn for bias detection, mitigation, and compliance reporting.

code-smell-detector

509
from a5c-ai/babysitter

Automated detection of code smells and anti-patterns to identify refactoring opportunities

terminal-capability-detector

509
from a5c-ai/babysitter

Detect terminal capabilities including color support, TTY status, size, and Unicode support for adaptive CLI output.

prompt-injection-detector

509
from a5c-ai/babysitter

Prompt injection detection and prevention for secure LLM applications

anti-drift

509
from a5c-ai/babysitter

Hierarchical coordination and drift detection with frequent checkpoints, shared memory coherence validation, role specialization enforcement, and short task cycles.

process-builder

509
from a5c-ai/babysitter

Scaffold new babysitter process definitions following SDK patterns, proper structure, and best practices. Guides the 3-phase workflow from research to implementation.

Workflow & Productivity

babysitter

509
from a5c-ai/babysitter

Orchestrate via @babysitter. Use this skill when asked to babysit a run, orchestrate a process or whenever it is called explicitly. (babysit, babysitter, orchestrate, orchestrate a run, workflow, etc.)