data-orchestrator
Coordinates data pipeline tasks (ETL, analytics, feature engineering). Use when implementing data ingestion, transformations, quality checks, or analytics. Applies data-quality-standard.md (95% minimum).
Best use case
data-orchestrator is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using data-orchestrator can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/data-orchestrator/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How data-orchestrator Compares
| Feature / Agent | data-orchestrator | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Coordinates data pipeline tasks (ETL, analytics, feature engineering). Use when implementing data ingestion, transformations, quality checks, or analytics. Applies data-quality-standard.md (95% minimum).
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Data Orchestrator Skill
## Role
Acts as CTO-Data, managing all data processing, analytics, and pipeline tasks.
## Responsibilities
1. **Data Pipeline Management**
- ETL/ELT processes
- Data validation
- Quality assurance
- Pipeline monitoring
2. **Analytics Coordination**
- Feature engineering
- Model integration
- Report generation
- Metric calculation
3. **Data Governance**
- Schema management
- Data lineage tracking
- Privacy compliance
- Access control
4. **Context Maintenance**
```
ai-state/active/data/
├── pipelines.json # Pipeline definitions
├── features.json # Feature registry
├── quality.json # Data quality metrics
└── tasks/ # Active data tasks
```
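SKILL.md leaves the contents of these files unspecified; as a minimal sketch (the `record_quality` helper and the per-pipeline list layout below are assumptions, not part of the skill), the orchestrator could persist quality metrics to `quality.json` like this:
```python
import json
from pathlib import Path

STATE_DIR = Path("ai-state/active/data")  # matches the tree above

def record_quality(pipeline: str, score: float) -> None:
    """Append a quality score for a pipeline to quality.json (assumed layout)."""
    path = STATE_DIR / "quality.json"
    metrics = json.loads(path.read_text()) if path.exists() else {}
    metrics.setdefault(pipeline, []).append(score)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2))

record_quality("daily_aggregation", 98.5)
```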
## Skill Coordination
### Available Data Skills
- `etl-skill` - Extract, transform, load operations
- `feature-engineering-skill` - Feature creation
- `analytics-skill` - Analysis and reporting
- `quality-skill` - Data quality checks
- `pipeline-skill` - Pipeline orchestration
### Context Package to Skills
```yaml
context:
  task_id: "task-003-pipeline"
  pipelines:
    existing: ["daily_aggregation", "customer_segmentation"]
    schedule: "0 2 * * *"
  features:
    current: ["revenue_30d", "churn_risk"]
    dependencies: ["transactions", "customers"]
  standards:
    - "data-quality-standard.md"
    - "feature-engineering.md"
  test_requirements:
    quality: ["completeness", "accuracy", "timeliness"]
```
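One way to assemble that package in code, as a sketch only (the `build_context` helper and its exact parameters are illustrative, not defined by the skill):
```python
def build_context(task_id: str, pipelines: dict, features: dict,
                  standards: list, quality_checks: list) -> dict:
    # Mirrors the YAML keys above; the skill receives a plain mapping.
    return {
        "context": {
            "task_id": task_id,
            "pipelines": pipelines,
            "features": features,
            "standards": standards,
            "test_requirements": {"quality": quality_checks},
        }
    }

ctx = build_context(
    task_id="task-003-pipeline",
    pipelines={"existing": ["daily_aggregation"], "schedule": "0 2 * * *"},
    features={"current": ["revenue_30d"], "dependencies": ["transactions"]},
    standards=["data-quality-standard.md"],
    quality_checks=["completeness", "accuracy", "timeliness"],
)
```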
## Task Processing Flow
1. **Receive Task**
- Identify data sources
- Check dependencies
- Validate requirements
2. **Prepare Context**
- Current pipeline state
- Feature definitions
- Quality metrics
3. **Assign to Skill**
- Choose data skill
- Set parameters
- Define outputs
4. **Monitor Execution**
- Track pipeline progress
- Monitor resource usage
- Check quality gates
5. **Validate Results**
- Data quality checks
- Output validation
- Performance metrics
- Lineage tracking
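The five steps above could look roughly like the following; the helper names, the quality-gate check, and the task/skill shapes are assumptions made for illustration:
```python
def process_task(task: dict, skills: dict) -> dict:
    # 1. Receive task: identify sources and validate requirements
    if not task.get("sources"):
        raise ValueError("task must declare its data sources")
    # 2. Prepare context: current state the skill needs
    context = {"task_id": task["id"], "sources": task["sources"]}
    # 3. Assign to a data skill (etl-skill, quality-skill, ...)
    run_skill = skills[task["skill"]]
    result = run_skill(context)
    # 4. Monitor execution: enforce the 95% quality gate
    if result.get("quality_score", 0) < 95:
        raise RuntimeError(f"quality gate failed for {task['id']}")
    # 5. Validate results; the caller records lineage and metrics
    return result
```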
## Data-Specific Standards
### Pipeline Checklist
- [ ] Input validation
- [ ] Error handling
- [ ] Checkpoint/recovery
- [ ] Monitoring enabled
- [ ] Documentation updated
- [ ] Performance optimized
### Quality Checklist
- [ ] Completeness checks
- [ ] Accuracy validation
- [ ] Consistency rules
- [ ] Timeliness metrics
- [ ] Uniqueness constraints
- [ ] Validity ranges
### Feature Engineering Checklist
- [ ] Business logic documented
- [ ] Dependencies tracked
- [ ] Version controlled
- [ ] Performance tested
- [ ] Edge cases handled
- [ ] Monitoring added
## Integration Points
### With Backend Orchestrator
- Data model alignment
- API data contracts
- Database optimization
- Cache strategies
### With Frontend Orchestrator
- Dashboard data requirements
- Real-time vs batch
- Data freshness SLAs
- Visualization formats
### With Human-Docs
Updates documentation with:
- Pipeline changes
- Feature definitions
- Data dictionaries
- Quality reports
## Event Communication
### Listening For
```json
{
"event": "data.source.updated",
"source": "transactions",
"schema_change": true,
"impact": ["daily_pipeline", "revenue_features"]
}
```
### Broadcasting
```json
{
"event": "data.pipeline.completed",
"pipeline": "daily_aggregation",
"records_processed": 50000,
"duration": "5m 32s",
"quality_score": 98.5
}
```
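A handler tying the two events together might look like this; `rerun_pipeline` and the stats it returns are assumed interfaces, not part of the skill definition:
```python
def handle_source_updated(event: dict, rerun_pipeline) -> list:
    """React to data.source.updated and emit data.pipeline.completed events."""
    broadcasts = []
    if event.get("event") != "data.source.updated":
        return broadcasts
    # Schema changes invalidate the downstream pipelines listed under "impact"
    if event.get("schema_change"):
        for pipeline in event.get("impact", []):
            stats = rerun_pipeline(pipeline)  # assumed to return records/duration/quality
            broadcasts.append({
                "event": "data.pipeline.completed",
                "pipeline": pipeline,
                "records_processed": stats["records"],
                "duration": stats["duration"],
                "quality_score": stats["quality_score"],
            })
    return broadcasts
```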
## Test Requirements
### Every Data Task Must Include
1. **Unit Tests** - Transformation logic
2. **Integration Tests** - Pipeline flow
3. **Data Quality Tests** - Accuracy, completeness
4. **Performance Tests** - Processing speed
5. **Edge Case Tests** - Null, empty, invalid data
6. **Regression Tests** - Output consistency
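As a small illustration of categories 1, 3, and 5 (the `revenue_per_customer` transformation is a toy example, not part of the skill):
```python
import math

def revenue_per_customer(total_revenue: float, customer_count: int) -> float:
    """Toy transformation used only to illustrate the test categories."""
    if customer_count <= 0:
        return 0.0
    return total_revenue / customer_count

def test_unit_transformation_logic():
    assert revenue_per_customer(100.0, 4) == 25.0

def test_edge_case_zero_customers():
    # Edge case: empty/invalid input must not raise
    assert revenue_per_customer(100.0, 0) == 0.0

def test_data_quality_validity_range():
    # Values should stay non-negative and finite
    value = revenue_per_customer(1e9, 3)
    assert value >= 0 and math.isfinite(value)
```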
## Success Metrics
- Pipeline success rate > 99%
- Data quality score > 95%
- Processing time < SLA
- Zero data loss
- Feature coverage > 90%
## Common Patterns
### ETL Pattern
```python
class ETLOrchestrator:
    def run_pipeline(self, task):
        # 1. Extract from sources
        # 2. Validate input data
        # 3. Transform data
        # 4. Quality checks
        # 5. Load to destination
        # 6. Update lineage
        ...  # implementation elided in this pattern sketch
```
### Feature Pattern
```python
class FeatureOrchestrator:
    def create_feature(self, task):
        # 1. Define feature logic
        # 2. Identify dependencies
        # 3. Implement calculation
        # 4. Add to feature store
        # 5. Create monitoring
        ...  # implementation elided in this pattern sketch
```
## Data Processing Guidelines
### Batch Processing
- Use for large volumes
- Schedule during off-peak
- Implement checkpointing
- Monitor resource usage
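A minimal checkpointing sketch for batch runs, assuming partitions sort lexicographically; the checkpoint path and file layout are assumptions:
```python
import json
from pathlib import Path

CHECKPOINT = Path("ai-state/active/data/checkpoint.json")  # assumed location

def save_checkpoint(partition: str) -> None:
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps({"last_partition": partition}))

def run_batch(partitions: list, process) -> None:
    last = json.loads(CHECKPOINT.read_text())["last_partition"] if CHECKPOINT.exists() else None
    for partition in sorted(partitions):
        if last is not None and partition <= last:
            continue  # already completed before a restart; skip on recovery
        process(partition)
        save_checkpoint(partition)
```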
### Stream Processing
- Use for real-time needs
- Implement windowing
- Handle late arrivals
- Maintain state
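A toy tumbling-window aggregation showing one way to handle late arrivals; the window size, lateness tolerance, and dead-letter suggestion are illustrative choices, not requirements:
```python
from collections import defaultdict

def tumbling_counts(events, window_s: int = 60, allowed_lateness_s: int = 30) -> dict:
    """Count (event_time, arrival_time) pairs per tumbling window,
    dropping records that arrive after the window's lateness tolerance."""
    counts = defaultdict(int)
    for event_time, arrival_time in events:
        window_start = (event_time // window_s) * window_s
        if arrival_time > window_start + window_s + allowed_lateness_s:
            continue  # too late; in practice, route to a dead-letter store
        counts[window_start] += 1
    return dict(counts)

assert tumbling_counts([(5, 10), (65, 70), (5, 200)]) == {0: 1, 60: 1}
```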
### Data Quality Rules
1. **Completeness** - No missing required fields
2. **Accuracy** - Values within expected ranges
3. **Consistency** - Cross-dataset alignment
4. **Timeliness** - Data freshness requirements
5. **Uniqueness** - No unwanted duplicates
6. **Validity** - Format and type correctness
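A rough scoring sketch covering completeness and validity only; the 0-100 scale, pass/fail weighting, and helper names are assumptions chosen to line up with the 95% gate mentioned above:
```python
def check_record(row: dict, required: list, ranges: dict) -> bool:
    # Completeness: no missing required fields
    if any(row.get(field) is None for field in required):
        return False
    # Validity/accuracy: present values fall inside expected ranges
    return all(lo <= row[col] <= hi for col, (lo, hi) in ranges.items() if row.get(col) is not None)

def quality_score(rows: list, required: list, ranges: dict) -> float:
    """Share of passing records on a 0-100 scale, compared against the 95% gate."""
    return 100.0 * sum(check_record(r, required, ranges) for r in rows) / max(len(rows), 1)

score = quality_score(
    rows=[{"id": 1, "revenue": 120.0}, {"id": 2, "revenue": None}],
    required=["id", "revenue"],
    ranges={"revenue": (0.0, 1e9)},
)
assert score == 50.0
```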
## Anti-Patterns to Avoid
❌ Processing without validation
❌ No error recovery mechanism
❌ Missing data lineage
❌ Hardcoded transformations
❌ No monitoring/alerting
❌ Manual intervention required
Related Skills
azure-storage-file-datalake-py
Azure Data Lake Storage Gen2 SDK for Python. Use for hierarchical file systems, big data analytics, and file/directory operations. Triggers: "data lake", "DataLakeServiceClient", "FileSystemClient", "ADLS Gen2", "hierarchical namespace".
azure-data-tables-py
Azure Tables SDK for Python (Storage and Cosmos DB). Use for NoSQL key-value storage, entity CRUD, and batch operations. Triggers: "table storage", "TableServiceClient", "TableClient", "entities", "PartitionKey", "RowKey".
azure-data-tables-java
Build table storage applications with Azure Tables SDK for Java. Use when working with Azure Table Storage or Cosmos DB Table API for NoSQL key-value data, schemaless storage, or structured data at scale.
fixing-metadata
Ship correct, complete metadata.
native-data-fetching
Use when implementing or debugging ANY network request, API call, or data fetching. Covers fetch API, axios, React Query, SWR, error handling, caching strategies, offline support.
writing-data
Use this skill when you need to structure data in `srs/data` for the Next.js app
engineering-nba-data
Extracts, transforms, and analyzes NBA statistics using the nba_api Python library. Use when working with NBA player stats, team data, game logs, shot charts, league statistics, or any NBA-related data engineering tasks. Supports both stats.nba.com endpoints and static player/team lookups.
datafusion-query-advisor
Reviews SQL queries and DataFrame operations for optimization opportunities including predicate pushdown, partition pruning, column projection, and join ordering. Activates when users write DataFusion queries or experience slow query performance.
data-lake-architect
Provides architectural guidance for data lake design including partitioning strategies, storage layout, schema design, and lakehouse patterns. Activates when users discuss data lake architecture, partitioning, or large-scale data organization.
data-substrate-analysis
Analyze fundamental data primitives, type systems, and state management patterns in a codebase. Use when (1) evaluating typing strategies (Pydantic vs TypedDict vs loose dicts), (2) assessing immutability and mutation patterns, (3) understanding serialization approaches, (4) documenting state shape and lifecycle, or (5) comparing data modeling approaches across frameworks.
cascade-orchestrator
Creates sophisticated workflow cascades coordinating multiple micro-skills with sequential pipelines, parallel execution, conditional branching, and Codex sandbox iteration. Enhanced with multi-model routing (Gemini/Codex), ruv-swarm coordination, memory persistence, and audit-pipeline patterns for production workflows.
cc-devflow-orchestrator
CC-DevFlow workflow router and agent recommender. Use when starting requirements, running flow commands, or asking about devflow processes.