detecting-data-anomalies
Identify anomalies and outliers in datasets using machine learning algorithms. Use when analyzing data for unusual patterns, outliers, or unexpected deviations from normal behavior. Trigger with phrases like "detect anomalies", "find outliers", or "identify unusual patterns".
Best use case
detecting-data-anomalies is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using detecting-data-anomalies should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/detecting-data-anomalies/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How detecting-data-anomalies Compares
| Feature / Agent | detecting-data-anomalies | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It identifies anomalies and outliers in datasets using machine learning algorithms. Use it when analyzing data for unusual patterns, outliers, or unexpected deviations from normal behavior. Trigger it with phrases like "detect anomalies", "find outliers", or "identify unusual patterns".
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
ChatGPT vs Claude for Agent Skills
Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.
SKILL.md Source
# Detecting Data Anomalies
## Overview
Identify anomalies and outliers in datasets using statistical and machine learning algorithms including Isolation Forest, One-Class SVM, Local Outlier Factor, and autoencoders. This skill handles the full detection pipeline from data ingestion and feature scaling through algorithm selection, threshold tuning, and result interpretation with anomaly scoring.
## Prerequisites
- Python 3.9+ with scikit-learn >= 1.3 (`pip install scikit-learn`)
- pandas and NumPy for data manipulation (`pip install pandas numpy`)
- matplotlib or seaborn for anomaly visualizations (`pip install matplotlib seaborn`)
- Dataset in CSV, JSON, Parquet, or database-queryable format
- Minimum 500 data points for statistical significance (1000+ recommended)
- Optional: PyTorch or TensorFlow for autoencoder-based detection on complex patterns
## Instructions
1. Load the dataset using the Read tool and verify schema, column types, and row count
2. Profile feature distributions using descriptive statistics to understand baseline behavior
3. Handle missing values via imputation (median for numeric, mode for categorical) or row exclusion
4. Apply StandardScaler or MinMaxScaler to numeric features to normalize magnitude differences
5. Select the detection algorithm based on data characteristics:
   - **Isolation Forest**: high-dimensional data, no assumptions on distribution
   - **One-Class SVM**: well-defined normal class with clear decision boundary
   - **Local Outlier Factor**: density-varying data with local anomaly patterns
   - **Autoencoder**: complex temporal or image data with non-linear relationships
6. Set the contamination parameter to the expected anomaly proportion (start with 0.01-0.05); for One-Class SVM, set `nu` instead, since it has no `contamination` parameter
7. Fit the model on the training partition and generate anomaly scores for each data point
8. Apply the decision threshold to classify points as normal (1) or anomalous (-1), following the scikit-learn convention
9. Analyze flagged anomalies for common characteristics, temporal clusters, or feature correlations
10. Generate a summary report with detection counts, score distributions, and visualization plots
See `${CLAUDE_SKILL_DIR}/references/implementation.md` for the detailed implementation guide.
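
As a rough illustration of steps 3 through 8, the sketch below chains median imputation, standard scaling, and Isolation Forest in one scikit-learn pipeline. The function name `detect_anomalies`, the synthetic data, and the 2% contamination are assumptions made for this example, not part of the skill. For brevity it fits and scores the same array rather than using a separate training partition.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def detect_anomalies(X, contamination=0.02, random_state=0):
    """Impute, scale, and score rows with Isolation Forest.

    Returns (labels, scores). Labels are 1 for inliers and -1 for
    anomalies; lower scores mean more anomalous.
    """
    pipeline = make_pipeline(
        SimpleImputer(strategy="median"),   # step 3: fill missing numeric values
        StandardScaler(),                   # step 4: normalize feature magnitudes
        IsolationForest(contamination=contamination, random_state=random_state),
    )
    labels = pipeline.fit_predict(X)          # steps 7-8: fit, then threshold
    scores = pipeline.decision_function(X)    # negative values fall below the threshold
    return labels, scores


# Synthetic stand-in data (hypothetical): 980 typical rows plus 20 extreme rows.
rng = np.random.default_rng(0)
typical = rng.normal(0.0, 1.0, size=(980, 3))
extreme = rng.uniform(6.0, 12.0, size=(20, 3)) * rng.choice([-1.0, 1.0], size=(20, 3))
X = np.vstack([typical, extreme])
X[5, 1] = np.nan  # simulate one missing value

labels, scores = detect_anomalies(X)
print("flagged:", int((labels == -1).sum()), "of", len(labels))
```

Putting the imputer and scaler inside the pipeline means new data is transformed with the statistics learned at fit time, which helps when the model is later applied to a held-out partition.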
## Output
- Anomaly detection summary: total points, anomaly count, contamination rate
- Per-record anomaly scores with classification labels
- Algorithm configuration: model type, contamination, distance metric, threshold
- Feature importance ranking showing which dimensions drive anomaly flags
- Visualization: scatter plot of anomaly scores, distribution histogram, t-SNE cluster plot
- CSV export of flagged records with anomaly scores and contributing features
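
One possible shape for the summary and CSV export is sketched below with pandas. The helper `build_report`, the column names, and the five sample records are hypothetical. Feature importance and plots are left out of this sketch.

```python
import numpy as np
import pandas as pd


def build_report(records, labels, scores):
    """Attach scores and flags to each record; return (summary, flagged rows)."""
    report = records.copy()
    report["anomaly_score"] = np.asarray(scores, dtype=float)
    report["is_anomaly"] = np.asarray(labels) == -1  # scikit-learn marks outliers as -1
    # Most anomalous first: lower decision scores are more anomalous.
    flagged = report[report["is_anomaly"]].sort_values("anomaly_score")
    summary = {
        "total_points": len(report),
        "anomaly_count": int(report["is_anomaly"].sum()),
        "anomaly_rate": float(report["is_anomaly"].mean()),
    }
    return summary, flagged


# Hypothetical records with labels and scores from an already fitted detector.
records = pd.DataFrame(
    {"record_id": [101, 102, 103, 104, 105], "value": [1.0, 1.2, 9.5, 0.9, -7.3]}
)
summary, flagged = build_report(
    records, labels=[1, 1, -1, 1, -1], scores=[0.12, 0.08, -0.05, 0.10, -0.20]
)
csv_text = flagged.to_csv(index=False)  # pass a file path instead to write to disk
print(summary)
```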
## Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| Insufficient data volume | Fewer than 100 data points for model fitting | Collect additional data or switch to simple statistical methods (z-score, IQR) |
| High false positive rate | Contamination parameter set too high or features not scaled | Lower contamination to 0.01; verify StandardScaler applied; refine feature selection |
| Algorithm OOM on large dataset | Isolation Forest or LOF exceeds available memory | Subsample data for training; use `max_samples` parameter; switch to streaming approach |
| Feature scaling mismatch | Mixed numeric and categorical features without proper encoding | One-hot encode categoricals separately; scale numeric features independently |
| No ground truth for validation | Unlabeled dataset prevents accuracy measurement | Use domain expert review on top-N anomalies; implement feedback loop to refine threshold |
See `${CLAUDE_SKILL_DIR}/references/errors.md` for the full error reference.
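
For the small-data fallback named in the table, a minimal NumPy sketch of the z-score and IQR rules might look like this. The function names, the threshold of 3, the multiplier of 1.5, and the sample values are conventional choices assumed for illustration.

```python
import numpy as np


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    x = np.asarray(values, dtype=float)
    std = x.std()
    if std == 0:
        return np.zeros(x.shape, dtype=bool)  # constant data has no outliers
    return np.abs((x - x.mean()) / std) > threshold


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


# Hypothetical readings: 19 values near 10 and one spike at the end.
values = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10,
          12, 9, 10, 11, 10, 9, 11, 10, 12, 50]
z_flags = zscore_outliers(values)
iqr_flags = iqr_outliers(values)
print(np.flatnonzero(z_flags), np.flatnonzero(iqr_flags))
```

With very small samples the z-score rule can miss even an obvious spike, because the spike inflates the standard deviation. With a population standard deviation, no point in a sample of n values can have a z-score above sqrt(n - 1), so a threshold of 3 cannot fire when n is 10 or fewer. The IQR rule is less sensitive to this.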
## Examples
**Scenario 1: Network Intrusion Detection** -- Apply Isolation Forest to 50K network flow records with features: packet count, byte volume, duration, protocol type. Expected contamination: 2%. Target: flag port-scan and DDoS patterns with precision above 0.85.
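
When a labeled sample is available, the precision target in Scenario 1 could be checked along these lines. The helper `flag_metrics` and the eight hand-made labels are hypothetical; the only assumption about the detector is that it marks anomalies as -1, as scikit-learn detectors do.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score


def flag_metrics(y_true, labels):
    """Precision and recall of anomaly flags against ground truth.

    y_true uses 1 for a true attack and 0 for benign traffic;
    labels uses -1 for flagged rows and 1 for normal rows.
    """
    y_pred = (np.asarray(labels) == -1).astype(int)
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    return precision, recall


# Hand-made example: 3 flagged rows, 2 of which are true attacks.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
labels = [1, 1, -1, -1, -1, 1, 1, 1]
precision, recall = flag_metrics(y_true, labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```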
**Scenario 2: Manufacturing Quality Control** -- Run LOF on sensor readings (temperature, vibration, pressure) from 10K production cycles. Detect equipment degradation anomalies. Visualize flagged cycles on a time-series plot with normal operating bands.
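
A minimal sketch of the LOF step in Scenario 2, using synthetic sensor readings in place of real production data. The means, spreads, `n_neighbors=20`, and 1% contamination are all assumed values, and the time-series plot is omitted.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

# Synthetic readings (hypothetical): columns are temperature, vibration, pressure.
rng = np.random.default_rng(42)
healthy = rng.normal(loc=[70.0, 0.5, 30.0], scale=[2.0, 0.05, 1.0], size=(500, 3))
degraded = rng.normal(loc=[95.0, 1.2, 42.0], scale=[2.0, 0.05, 1.0], size=(5, 3))
readings = np.vstack([healthy, degraded])  # the last 5 cycles are the degraded ones

# LOF is distance-based, so features are scaled before fitting.
scaled = StandardScaler().fit_transform(readings)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(scaled)            # -1 marks flagged cycles
lof_scores = -lof.negative_outlier_factor_  # higher means more anomalous
flagged_cycles = np.flatnonzero(labels == -1)
print("flagged cycle indices:", flagged_cycles)
```

In its default mode `LocalOutlierFactor` only scores the data it was fitted on. Scoring new cycles later requires constructing it with `novelty=True` and calling `predict` on the new data.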
**Scenario 3: Financial Transaction Monitoring** -- Train an autoencoder on 100K legitimate transactions. Reconstruct test transactions and flag those with reconstruction error above the 99th percentile. Report flagged transactions with amount, merchant category, and time-of-day features.
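
The thresholding logic in Scenario 3 can be sketched without a deep learning framework. The example below substitutes PCA for the autoencoder as a linear stand-in for the encode and decode steps; it is not an autoencoder, but the reconstruction-error threshold works the same way. The data is synthetic, and taking the 99th percentile from the training errors (rather than the test errors) is an assumption of this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
mixing = rng.normal(size=(2, 5))  # hypothetical structure: 5 features driven by 2 factors


def make_legitimate(n):
    """Synthetic 'legitimate' rows lying close to a 2-D subspace of 5-D space."""
    return rng.normal(size=(n, 2)) @ mixing + 0.05 * rng.normal(size=(n, 5))


train = make_legitimate(2000)
scaler = StandardScaler().fit(train)
pca = PCA(n_components=2).fit(scaler.transform(train))  # stand-in for the autoencoder


def reconstruction_error(X):
    """Mean squared error between scaled rows and their PCA reconstruction."""
    Z = scaler.transform(X)
    reconstructed = pca.inverse_transform(pca.transform(Z))
    return np.mean((Z - reconstructed) ** 2, axis=1)


# Threshold at the 99th percentile of errors on legitimate training data.
threshold = np.percentile(reconstruction_error(train), 99)

# Test set: 100 legitimate rows followed by 5 rows that ignore the structure.
test = np.vstack([make_legitimate(100), 3.0 * rng.normal(size=(5, 5))])
flags = reconstruction_error(test) > threshold
print("flagged test rows:", np.flatnonzero(flags))
```

Replacing `pca` with a trained autoencoder would change only `reconstruction_error`; the percentile threshold and the flagging line would stay as they are.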
## Resources
- [scikit-learn Anomaly Detection](https://scikit-learn.org/stable/modules/outlier_detection.html) -- Isolation Forest, LOF, One-Class SVM
- [PyOD Library](https://pyod.readthedocs.io/) -- 40+ outlier detection algorithms with unified API
- Autoencoder anomaly detection: Keras/PyTorch reconstruction-error approach
- Feature scaling: StandardScaler, RobustScaler, MinMaxScaler selection guide
- Evaluation without labels: silhouette analysis, domain expert review protocols
Related Skills
generating-test-data
Generate realistic test data including edge cases and boundary conditions. Use when creating realistic fixtures or edge case test data. Trigger with phrases like "generate test data", "create fixtures", or "setup test database".
managing-database-tests
Database testing including fixtures, transactions, and rollback management. Use when performing specialized testing. Trigger with phrases like "test the database", "run database tests", or "validate data integrity".
detecting-sql-injection-vulnerabilities
Detect and analyze SQL injection vulnerabilities in application code and database queries. Use when you need to scan code for SQL injection risks, review query construction, validate input sanitization, or implement secure query patterns. Trigger with phrases like "detect SQL injection", "scan for SQLi vulnerabilities", "review database queries", or "check SQL security".
encrypting-and-decrypting-data
Validate encryption implementations and cryptographic practices. Use when reviewing data security measures. Trigger with 'check encryption', 'validate crypto', or 'review security keys'.
scanning-for-data-privacy-issues
Scan for data privacy issues and sensitive information exposure. Use when reviewing data handling practices. Trigger with 'scan privacy issues', 'check sensitive data', or 'validate data protection'.
windsurf-data-handling
Control what code and data Windsurf AI can access and process in your workspace. Use when handling sensitive data, implementing data exclusion patterns, or ensuring compliance with privacy regulations in Windsurf environments. Trigger with phrases like "windsurf data privacy", "windsurf PII", "windsurf GDPR", "windsurf compliance", "codeium data", "windsurf telemetry".
webflow-data-handling
Implement Webflow data handling — CMS content delivery patterns, PII redaction in form submissions, GDPR/CCPA compliance for ecommerce data, and data retention policies. Trigger with phrases like "webflow data", "webflow PII", "webflow GDPR", "webflow data retention", "webflow privacy", "webflow CCPA", "webflow forms data".
vercel-data-handling
Implement data handling, PII protection, and GDPR/CCPA compliance for Vercel deployments. Use when handling sensitive data in serverless functions, implementing data redaction, or ensuring privacy compliance on Vercel. Trigger with phrases like "vercel data", "vercel PII", "vercel GDPR", "vercel data retention", "vercel privacy", "vercel compliance".
veeva-data-handling
Veeva Vault data handling for enterprise operations. Use when implementing advanced Veeva Vault patterns. Trigger: "veeva data handling".
vastai-data-handling
Manage training data and model artifacts securely on Vast.ai GPU instances. Use when transferring data to instances, managing checkpoints, or implementing secure data lifecycle on rented hardware. Trigger with phrases like "vastai data", "vastai upload data", "vastai checkpoints", "vastai data security", "vastai artifacts".
twinmind-data-handling
Handle TwinMind meeting data with GDPR compliance: transcript storage, memory vault management, data export, and deletion policies. Use when implementing data handling or managing TwinMind meeting AI operations. Trigger with phrases like "twinmind data handling".
supabase-data-handling
Implement GDPR/CCPA compliance with Supabase: RLS for data isolation, user deletion via auth.admin.deleteUser(), data export via SQL, PII column management, backup/restore workflows, and retention policies. Use when handling sensitive data, implementing right-to-deletion, configuring data retention, or auditing PII in Supabase database columns. Trigger: "supabase GDPR", "supabase data handling", "supabase PII", "supabase compliance", "supabase data retention", "supabase delete user", "supabase data export".