data-pipeline
Orchestrate marketing data collection, transformation, aggregation, and reporting workflows across platforms
Best use case
data-pipeline is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
It is a strong fit for teams already working in Codex.
Teams using data-pipeline should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/data-pipeline/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How data-pipeline Compares
| Feature / Agent | data-pipeline | Standard Approach |
|---|---|---|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Orchestrate marketing data collection, transformation, aggregation, and reporting workflows across platforms
Which AI agents support this skill?
This skill is designed for Codex.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
Cursor vs Codex for AI Workflows
Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.
AI Agents for Marketing
Discover AI agents for marketing workflows, from SEO and content production to campaign research, outreach, and analytics.
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
SKILL.md Source
# data-pipeline
Orchestrate marketing data collection, transformation, and reporting workflows.
## Triggers
Alternate expressions and non-obvious activations (primary phrases are matched automatically from the skill description):
- "ETL [source] to [dest]" → data pipeline creation shorthand
- "ELT" → extract-load-transform pipeline
- "dbt" / "Airflow" / "Spark" → tool-specific pipeline requests
## Purpose
This skill manages marketing data workflows by:
- Collecting data from multiple marketing platforms
- Transforming raw data into actionable metrics
- Aggregating cross-channel performance
- Generating automated reports
- Maintaining data quality and consistency
## Behavior
When triggered, this skill:
1. **Identifies data sources**:
   - List connected platforms
   - Check API credentials/access
   - Determine data freshness requirements
2. **Collects raw data**:
   - Pull metrics from each platform
   - Handle pagination and rate limits
   - Store raw data snapshots
3. **Transforms data**:
   - Normalize naming conventions
   - Calculate derived metrics
   - Apply attribution models
   - Aggregate across channels
4. **Validates data**:
   - Check for anomalies
   - Validate against thresholds
   - Flag data quality issues
5. **Stores and reports**:
   - Update data warehouse/storage
   - Generate summary reports
   - Trigger alerts if needed
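The collection step mentions pagination and rate limits without showing the loop. A minimal sketch of one way to do it, assuming a hypothetical `fetch_page(cursor)` callable that returns `(records, next_cursor)` and raises a hypothetical `RateLimitError` when throttled. Neither name comes from the skill or from any platform SDK.

```python
import time
from typing import Callable, Iterator, Optional, Tuple


class RateLimitError(Exception):
    """Hypothetical error raised by fetch_page when the platform throttles us."""


Page = Tuple[list, Optional[str]]  # (records, next_cursor); None marks the last page


def collect_all(
    fetch_page: Callable[[Optional[str]], Page],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield every record, following cursors and backing off when throttled."""
    cursor: Optional[str] = None
    while True:
        for attempt in range(max_retries + 1):
            try:
                records, cursor = fetch_page(cursor)
                break
            except RateLimitError:
                if attempt == max_retries:
                    raise  # give up after the final retry
                sleep(base_delay * 2 ** attempt)  # exponential backoff
        yield from records
        if cursor is None:
            return
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting. A real connector would likely also honor any retry hint the platform returns.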
## Supported Platforms
### Advertising Platforms
```yaml
advertising:
  google_ads:
    metrics:
      - impressions
      - clicks
      - cost
      - conversions
      - conversion_value
    dimensions:
      - campaign
      - ad_group
      - keyword
      - device
    refresh_frequency: 4h
  meta_ads:
    metrics:
      - impressions
      - reach
      - clicks
      - spend
      - conversions
    dimensions:
      - campaign
      - ad_set
      - ad
      - placement
    refresh_frequency: 4h
  linkedin_ads:
    metrics:
      - impressions
      - clicks
      - cost
      - leads
      - conversions
    dimensions:
      - campaign
      - creative
      - audience
    refresh_frequency: daily
```
### Analytics Platforms
```yaml
analytics:
  google_analytics:
    metrics:
      - sessions
      - users
      - pageviews
      - bounce_rate
      - conversions
      - revenue
    dimensions:
      - source_medium
      - campaign
      - landing_page
      - device
    refresh_frequency: 4h
  mixpanel:
    metrics:
      - events
      - unique_users
      - retention
      - funnel_conversion
    dimensions:
      - event_name
      - user_properties
    refresh_frequency: real-time
  amplitude:
    metrics:
      - events
      - users
      - retention
      - conversion
    dimensions:
      - event_type
      - user_segment
    refresh_frequency: real-time
```
### Email Platforms
```yaml
email:
  mailchimp:
    metrics:
      - sends
      - opens
      - clicks
      - bounces
      - unsubscribes
    dimensions:
      - campaign
      - list
      - segment
    refresh_frequency: 1h
  hubspot:
    metrics:
      - sends
      - opens
      - clicks
      - contacts_created
      - deals_influenced
    dimensions:
      - campaign
      - email_type
      - lifecycle_stage
    refresh_frequency: 1h
  sendgrid:
    metrics:
      - delivered
      - opens
      - clicks
      - bounces
      - spam_reports
    refresh_frequency: real-time
```
### Social Platforms
```yaml
social:
  instagram:
    metrics:
      - reach
      - impressions
      - engagement
      - followers
      - saves
      - shares
    dimensions:
      - post_type
      - content_category
    refresh_frequency: daily
  linkedin:
    metrics:
      - impressions
      - engagement
      - followers
      - clicks
    dimensions:
      - post_type
      - content_category
    refresh_frequency: daily
  twitter:
    metrics:
      - impressions
      - engagements
      - followers
      - retweets
      - likes
    refresh_frequency: 4h
```
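The `refresh_frequency` values above take three shapes: an hour count (`4h`, `1h`), `daily`, and `real-time`. A small sketch of how a scheduler might turn them into intervals and decide whether a source is due. Treating `real-time` as a zero interval is an assumption; the skill does not say how it handles that value, and the function names are illustrative.

```python
from datetime import datetime, timedelta


def parse_refresh_frequency(value: str) -> timedelta:
    """Convert values such as '4h', '1h', 'daily', or 'real-time' to an interval."""
    value = value.strip().lower()
    if value == "real-time":
        return timedelta(0)  # assumption: always eligible for refresh
    if value == "daily":
        return timedelta(days=1)
    if value.endswith("h") and value[:-1].isdigit():
        return timedelta(hours=int(value[:-1]))
    raise ValueError(f"unrecognized refresh_frequency: {value!r}")


def is_due(last_pulled: datetime, frequency: str, now: datetime) -> bool:
    """Return True when the source's refresh interval has elapsed."""
    return now - last_pulled >= parse_refresh_frequency(frequency)
```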
## Data Transformation
### Metric Calculations
```yaml
derived_metrics:
  ctr:
    formula: clicks / impressions
    format: percentage
    description: Click-through rate
  cpc:
    formula: cost / clicks
    format: currency
    description: Cost per click
  cpm:
    formula: (cost / impressions) * 1000
    format: currency
    description: Cost per thousand impressions
  cpa:
    formula: cost / conversions
    format: currency
    description: Cost per acquisition
  roas:
    formula: revenue / cost
    format: ratio
    description: Return on ad spend
  conversion_rate:
    formula: conversions / clicks
    format: percentage
    description: Conversion rate
  engagement_rate:
    formula: engagements / impressions
    format: percentage
    description: Engagement rate
```
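The formulas above all divide by a count that can legitimately be zero (a campaign with no clicks, for instance). A minimal sketch of applying them with a zero guard, assuming each input row is a plain dict keyed by the raw metric names. Returning `None` for an undefined ratio is an illustrative choice, not something the skill specifies.

```python
from typing import Optional


def safe_div(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def derive_metrics(row: dict) -> dict:
    """Compute the derived metrics from the formulas listed above."""
    cost = row.get("cost", 0)
    clicks = row.get("clicks", 0)
    impressions = row.get("impressions", 0)
    conversions = row.get("conversions", 0)
    revenue = row.get("revenue", 0)
    engagements = row.get("engagements", 0)
    cost_per_impression = safe_div(cost, impressions)
    return {
        "ctr": safe_div(clicks, impressions),
        "cpc": safe_div(cost, clicks),
        "cpm": None if cost_per_impression is None else cost_per_impression * 1000,
        "cpa": safe_div(cost, conversions),
        "roas": safe_div(revenue, cost),
        "conversion_rate": safe_div(conversions, clicks),
        "engagement_rate": safe_div(engagements, impressions),
    }
```

Fed a spend of 125,432, revenue of 342,100, and 3,421 conversions, this gives a ROAS of about 2.73 and a CPA of about 36.67.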
### Attribution Models
```yaml
attribution_models:
  last_click:
    description: 100% credit to last touchpoint
    use_case: Bottom-funnel optimization
  first_click:
    description: 100% credit to first touchpoint
    use_case: Top-funnel optimization
  linear:
    description: Equal credit across touchpoints
    use_case: Multi-touch awareness
  time_decay:
    description: More credit to recent touchpoints
    use_case: Typical purchase journey
  position_based:
    description: 40% first, 40% last, 20% middle
    use_case: Balanced attribution
  data_driven:
    description: ML-based credit assignment
    use_case: Advanced optimization
```
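To make the rule-based models concrete, here is a sketch that splits one conversion's credit across an ordered list of touchpoints. It covers `last_click`, `first_click`, `linear`, and `position_based` only. `time_decay` and `data_driven` are left out because the skill does not define a decay rate or a model. The handling of one- and two-touch journeys under `position_based` is an assumption, and `assign_credit` is an illustrative name.

```python
from typing import Dict, List


def assign_credit(touchpoints: List[str], model: str) -> Dict[str, float]:
    """Split one conversion's credit across an ordered list of touchpoints."""
    n = len(touchpoints)
    if n == 0:
        return {}
    if model == "last_click":
        weights = [0.0] * (n - 1) + [1.0]
    elif model == "first_click":
        weights = [1.0] + [0.0] * (n - 1)
    elif model == "linear":
        weights = [1.0 / n] * n
    elif model == "position_based":
        if n == 1:
            weights = [1.0]
        elif n == 2:
            weights = [0.5, 0.5]  # assumption: no middle touches, so split evenly
        else:
            middle = 0.2 / (n - 2)  # 20% shared by all middle touches
            weights = [0.4] + [middle] * (n - 2) + [0.4]
    else:
        raise ValueError(f"unsupported model: {model}")
    credit: Dict[str, float] = {}
    for channel, weight in zip(touchpoints, weights):
        # A channel that appears more than once accumulates its weights.
        credit[channel] = credit.get(channel, 0.0) + weight
    return credit
```

Whatever the model, the weights for a journey sum to 1, which is what the `attribution_sum: 100%` quality check relies on.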
## Pipeline Configuration
```yaml
pipeline_config:
  name: marketing-data-pipeline
  schedule: "0 */4 * * *" # Every 4 hours
  sources:
    - name: google_ads
      credentials: .aiwg/marketing/config/google-ads-creds.json
      date_range: last_30_days
    - name: google_analytics
      credentials: .aiwg/marketing/config/ga4-creds.json
      property_id: "123456789"
    - name: meta_ads
      credentials: .aiwg/marketing/config/meta-creds.json
      ad_account_id: "act_123456"
  transformations:
    - name: normalize_naming
      rules:
        - source: google_ads
          campaign_pattern: "^GA_"
        - source: meta_ads
          campaign_pattern: "^META_"
    - name: calculate_metrics
      metrics: [ctr, cpc, cpa, roas]
    - name: apply_attribution
      model: position_based
      lookback_window: 30
  output:
    - type: json
      path: .aiwg/marketing/data/
    - type: csv
      path: .aiwg/marketing/reports/
    - type: dashboard
      tool: internal
  alerts:
    - name: spend_anomaly
      condition: daily_spend > avg_spend * 1.5
      notify: [marketing-team]
    - name: conversion_drop
      condition: daily_conversions < avg_conversions * 0.5
      notify: [marketing-team, analytics]
```
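The two `alerts` conditions compare today's value with an average. The config does not say which window that average covers, so the sketch below assumes it is the mean of whatever history the caller passes in. `evaluate_alerts` is an illustrative name, not part of the skill.

```python
from statistics import mean
from typing import List, Sequence


def evaluate_alerts(
    daily_spend: float,
    spend_history: Sequence[float],
    daily_conversions: float,
    conversion_history: Sequence[float],
) -> List[str]:
    """Return the names of the alerts whose conditions hold."""
    fired = []
    # spend_anomaly: daily_spend > avg_spend * 1.5
    if spend_history and daily_spend > mean(spend_history) * 1.5:
        fired.append("spend_anomaly")
    # conversion_drop: daily_conversions < avg_conversions * 0.5
    if conversion_history and daily_conversions < mean(conversion_history) * 0.5:
        fired.append("conversion_drop")
    return fired
```

With an empty history neither alert fires, which avoids a false alarm on the first run.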
## Data Quality Checks
```yaml
quality_checks:
  completeness:
    - all_platforms_reporting: true
    - date_gaps: none_allowed
    - metric_nulls: <5%
  consistency:
    - cross_platform_totals: ±5% variance
    - historical_trend: ±20% from avg
    - attribution_sum: 100%
  freshness:
    - max_age: 24h
    - preferred_age: 4h
    - alert_threshold: 12h
  anomaly_detection:
    - z_score_threshold: 3
    - min_data_points: 14
    - metrics_to_monitor:
        - spend
        - conversions
        - ctr
        - cpc
```
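Two of these checks are easy to misread, so a short sketch may help: the z-score anomaly test (threshold 3, at least 14 data points) and the cross-platform variance test (±5%). Both functions are illustrative. Using the sample standard deviation and dividing the variance by the larger of the two totals are assumptions; the skill does not state either choice.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def z_score(history: Sequence[float], latest: float, min_points: int = 14) -> Optional[float]:
    """Z-score of latest against history, or None when it cannot be computed."""
    if len(history) < min_points:
        return None  # not enough data to judge
    spread = stdev(history)  # sample standard deviation (assumption)
    if spread == 0:
        return None  # flat history: z-score is undefined
    return (latest - mean(history)) / spread


def is_anomaly(history: Sequence[float], latest: float,
               threshold: float = 3.0, min_points: int = 14) -> bool:
    """Flag the latest value when its absolute z-score exceeds the threshold."""
    z = z_score(history, latest, min_points)
    return z is not None and abs(z) > threshold


def relative_variance(a: float, b: float) -> float:
    """Absolute difference as a fraction of the larger total (assumption)."""
    largest = max(a, b)
    return abs(a - b) / largest if largest else 0.0
```

For totals of 3,421 and 3,695 conversions, `relative_variance` returns about 0.074, which is above the 0.05 tolerance and would produce a consistency warning.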
## Pipeline Report Format
```markdown
# Marketing Data Pipeline Report
**Run ID**: PIPE-2025-12-08-1400
**Status**: Completed with Warnings
**Duration**: 4m 32s
**Date Range**: 2025-11-08 to 2025-12-08
## Data Collection Summary
| Source | Status | Records | Freshness |
|--------|--------|---------|-----------|
| Google Ads | ✅ Success | 45,231 | 2h ago |
| Meta Ads | ✅ Success | 32,156 | 3h ago |
| Google Analytics | ✅ Success | 128,459 | 1h ago |
| Mailchimp | ⚠️ Partial | 5,234 | 6h ago |
| Instagram | ✅ Success | 1,847 | 4h ago |
## Data Quality
| Check | Status | Details |
|-------|--------|---------|
| Completeness | ✅ Pass | All platforms reporting |
| Consistency | ⚠️ Warning | GA vs Ads conversion ±8% |
| Freshness | ✅ Pass | All data <12h old |
| Anomaly | ✅ Pass | No anomalies detected |
## Aggregated Metrics
### Overall Performance (Last 30 Days)
| Metric | Value | vs Prior Period | vs Target |
|--------|-------|-----------------|-----------|
| Spend | $125,432 | +12% | On target |
| Impressions | 8.2M | +18% | +5% |
| Clicks | 156,234 | +15% | +8% |
| Conversions | 3,421 | +8% | -2% |
| Revenue | $342,100 | +22% | +12% |
### By Channel
| Channel | Spend | Conv | CPA | ROAS |
|---------|-------|------|-----|------|
| Paid Search | $45,230 | 1,234 | $36.65 | 3.2x |
| Paid Social | $38,450 | 987 | $38.96 | 2.8x |
| Email | $5,200 | 543 | $9.58 | 8.5x |
| Organic Social | $0 | 321 | - | - |
| Display | $36,552 | 336 | $108.79 | 1.2x |
### Attribution Report
| Channel | Conversion Share |
|---------|------------------|
| Paid Search | 42% |
| Email | 24% |
| Paid Social | 18% |
| Organic | 12% |
| Direct | 4% |
## Alerts & Issues
### ⚠️ Warning: Mailchimp Data Delay
- **Issue**: Email metrics 6h stale (threshold: 4h)
- **Impact**: Email performance may be underreported
- **Action**: Retry scheduled for next run
### ⚠️ Warning: Cross-Platform Variance
- **Issue**: GA conversions vs Ad platform conversions ±8%
- **Expected**: ±5% variance
- **Cause**: Likely attribution window differences
- **Action**: Review attribution settings
## Output Files
- Raw data: `.aiwg/marketing/data/raw/2025-12-08/`
- Transformed: `.aiwg/marketing/data/transformed/2025-12-08.json`
- Report: `.aiwg/marketing/reports/daily-2025-12-08.csv`
- Dashboard updated: Yes
## Next Scheduled Run
**Time**: 2025-12-08 18:00 UTC
**Expected Duration**: ~5 minutes
```
## Usage Examples
### Full Data Refresh
```
User: "Refresh marketing analytics"
Skill executes:
1. Connect to all platforms
2. Pull latest data
3. Transform and aggregate
4. Generate report
Output:
"Data Pipeline Complete
Sources Updated:
✅ Google Ads (45K records)
✅ Meta Ads (32K records)
✅ Google Analytics (128K records)
✅ Email platforms (5K records)
Key Metrics (Last 30 Days):
- Total Spend: $125,432 (+12%)
- Conversions: 3,421 (+8%)
- ROAS: 2.73x
Data Quality: 2 warnings
- Mailchimp data stale (6h)
- Cross-platform variance 8%
Report: .aiwg/marketing/reports/daily-2025-12-08.md"
```
### Specific Channel Data
```
User: "Pull email campaign metrics"
Skill executes:
1. Connect to email platforms
2. Pull campaign data
3. Calculate email-specific metrics
Output:
"Email Data Updated
Campaigns: 12 active
Total Sends: 245,000
Open Rate: 24.5% (industry avg: 21%)
Click Rate: 3.2% (industry avg: 2.5%)
Unsubscribe: 0.3%
Top Performer: Holiday Sale Email
- Opens: 32%
- Clicks: 5.1%
- Revenue: $45,230
Data saved: .aiwg/marketing/data/email/2025-12-08.json"
```
### Data Quality Check
```
User: "Check marketing data quality"
Skill validates:
- Completeness
- Consistency
- Freshness
- Anomalies
Output:
"Data Quality Report
✅ Completeness: All sources reporting
⚠️ Consistency: 8% variance in conversions
✅ Freshness: All data <12h old
✅ Anomalies: None detected
Issue Details:
- GA reports 3,421 conversions
- Ad platforms report 3,695 conversions
- Delta: 274 (7.4% of the ad-platform total, 8.0% of the GA total)
- Likely cause: Attribution windows
Recommendation: Align attribution windows across platforms"
```
## Integration
This skill uses:
- `project-awareness`: Identify connected platforms
- `artifact-metadata`: Track pipeline runs
## Agent Orchestration
```yaml
agents:
  data_collection:
    agent: data-analyst
    focus: Platform connections and data extraction
  analysis:
    agent: marketing-analyst
    focus: Metric interpretation and insights
  reporting:
    agent: reporting-specialist
    focus: Report generation and visualization
```
## Configuration
### Platform Credentials
```yaml
credentials_config:
  storage: .aiwg/marketing/config/
  encryption: required
  rotation: 90_days
  platforms:
    google_ads:
      type: oauth2
      refresh_token: encrypted
    meta_ads:
      type: access_token
      expiry_check: true
    mailchimp:
      type: api_key
      scoped: marketing
```
### Scheduling
```yaml
schedule_config:
  full_refresh:
    cron: "0 */4 * * *"
    description: Every 4 hours
  daily_report:
    cron: "0 8 * * *"
    description: Daily at 8 AM
  weekly_summary:
    cron: "0 9 * * 1"
    description: Monday at 9 AM
```
## Output Locations
- Raw data: `.aiwg/marketing/data/raw/`
- Transformed data: `.aiwg/marketing/data/transformed/`
- Reports: `.aiwg/marketing/reports/`
- Pipeline logs: `.aiwg/marketing/logs/pipeline/`
## References
- Platform configs: .aiwg/marketing/config/
- Attribution models: docs/attribution-models.md
- Data dictionary: .aiwg/marketing/data/dictionary.md
Related Skills
Metadata Tagging
opustags and ffmpeg patterns for applying metadata to audio and video files
validate-metadata
Validate AIWG extension definitions against the metadata schema and report errors with field names, line numbers, and remediation hints
pipeline-status
Show status overview of all LLM inference pipelines in the current project
pipeline-design
Interactive LLM inference pipeline design — elicits requirements, recommends pattern, scaffolds production-ready artifacts
artifact-metadata
Manage artifact metadata, versioning, ownership, and review history across the SDLC lifecycle
aiwg-orchestrate
Route structured artifact work to AIWG workflows via MCP with zero parent context cost
venv-manager
Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.
pytest-runner
Execute Python tests with pytest, supporting fixtures, markers, coverage, and parallel execution. Use for Python test automation.
vitest-runner
Execute JavaScript/TypeScript tests with Vitest, supporting coverage, watch mode, and parallel execution. Use for JS/TS test automation.
eslint-checker
Run ESLint for JavaScript/TypeScript code quality and style enforcement. Use for static analysis and auto-fixing.
repo-analyzer
Analyze GitHub repositories for structure, documentation, dependencies, and contribution patterns. Use for codebase understanding and health assessment.
pr-reviewer
Review GitHub pull requests for code quality, security, and best practices. Use for automated PR feedback and approval workflows.