data-pipeline-engineer

Expert data engineer for ETL/ELT pipelines, streaming, data warehousing. Activate on: data pipeline, ETL, ELT, data warehouse, Spark, Kafka, Airflow, dbt, data modeling, star schema, streaming data, batch processing, data quality. NOT for: API design (use api-architect), ML training (use ML skills), dashboards (use design skills).

85 stars

by curiositech


Best use case

data-pipeline-engineer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using data-pipeline-engineer should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/data-pipeline-engineer/SKILL.md --create-dirs "https://raw.githubusercontent.com/curiositech/some_claude_skills/main/.claude/skills/data-pipeline-engineer/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/data-pipeline-engineer/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How data-pipeline-engineer Compares

Feature / Agent	data-pipeline-engineer	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

Where can I find the source code?

The source code is in the curiositech/some_claude_skills repository on GitHub, under .claude/skills/data-pipeline-engineer/.

SKILL.md Source

# Data Pipeline Engineer

Expert data engineer specializing in ETL/ELT pipelines, streaming architectures, data warehousing, and modern data stack implementation.

## Quick Start

1. **Identify sources** - data formats, volumes, freshness requirements
2. **Choose architecture** - Medallion (Bronze/Silver/Gold), Lambda, or Kappa
3. **Design layers** - staging → intermediate → marts (dbt pattern)
4. **Add quality gates** - Great Expectations or dbt tests at each layer
5. **Orchestrate** - Airflow DAGs with sensors and retries
6. **Monitor** - lineage, freshness, anomaly detection

## Core Capabilities

| Capability | Technologies | Key Patterns |
|------------|--------------|--------------|
| **Batch Processing** | Spark, dbt, Databricks | Incremental, partitioning, Delta/Iceberg |
| **Stream Processing** | Kafka, Flink, Spark Streaming | Watermarks, exactly-once, windowing |
| **Orchestration** | Airflow, Dagster, Prefect | DAG design, sensors, task groups |
| **Data Modeling** | dbt, SQL | Kimball, Data Vault, SCD |
| **Data Quality** | Great Expectations, dbt tests | Validation suites, freshness |

## Architecture Patterns

### Medallion Architecture (Recommended)
```
BRONZE (Raw)     → Exact source copy, schema-on-read, partitioned by ingestion
      ↓ Cleaning, Deduplication
SILVER (Cleansed) → Validated, standardized, business logic applied
      ↓ Aggregation, Enrichment
GOLD (Business)   → Dimensional models, aggregates, ready for BI/ML
```
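
As a rough illustration of the three layers, the sketch below pushes a few rows through Bronze, Silver, and Gold using plain Python structures. The order records, column names, and the `to_silver`/`to_gold` helpers are hypothetical; a real pipeline would typically use Spark or dbt against Delta/Iceberg tables rather than in-memory lists.

```python
from collections import defaultdict

# Hypothetical raw order events, kept exactly as received (Bronze).
BRONZE = [
    {"order_id": "1", "amount": "10.50", "country": "us"},
    {"order_id": "1", "amount": "10.50", "country": "us"},  # duplicate delivery
    {"order_id": "2", "amount": "oops", "country": "DE"},   # invalid amount
    {"order_id": "3", "amount": "4.00", "country": "de"},
]


def to_silver(bronze_rows):
    """Clean and deduplicate: cast types, standardize codes, skip invalid rows."""
    seen, silver = set(), []
    for row in bronze_rows:
        if row["order_id"] in seen:
            continue  # deduplicate on the business key
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would quarantine this row instead
        seen.add(row["order_id"])
        silver.append(
            {
                "order_id": int(row["order_id"]),
                "amount": amount,
                "country": row["country"].upper(),
            }
        )
    return silver


def to_gold(silver_rows):
    """Aggregate cleansed rows into a BI-ready revenue-per-country table."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


gold = to_gold(to_silver(BRONZE))
```

Because Bronze is never modified, Silver and Gold can be rebuilt from it at any time if the cleaning rules change.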

### Lambda vs Kappa
- **Lambda**: Batch + Stream layers → merged serving layer (complex but complete)
- **Kappa**: Stream-only with replay → simpler but requires robust streaming
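
The difference between the two can be sketched with an append-only log of events. In this toy model (the `LOG` data and function names are assumptions for illustration), Lambda merges a batch view with a speed view, while Kappa has a single code path and reprocesses by replaying the log from an earlier offset.

```python
from collections import Counter

# Hypothetical page-view events as (offset, page) pairs in an append-only log.
LOG = [(0, "home"), (1, "docs"), (2, "home"), (3, "pricing"), (4, "home")]


def count_pages(events):
    """The business logic: count views per page."""
    return Counter(page for _, page in events)


def lambda_serving_view(log, batch_cutoff):
    """Lambda: a batch layer covers offsets below the cutoff, a speed layer
    covers the rest, and the serving layer merges both results."""
    batch = count_pages(e for e in log if e[0] < batch_cutoff)
    speed = count_pages(e for e in log if e[0] >= batch_cutoff)
    return batch + speed


def kappa_view(log, from_offset=0):
    """Kappa: one streaming path; reprocessing is a replay from an offset."""
    return count_pages(e for e in log if e[0] >= from_offset)
```

Both approaches produce the same counts here; the trade-off is that Lambda maintains two implementations of the logic, while Kappa depends on the log retaining enough history to replay.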

## Reference Examples

Full implementation examples in `./references/`:

| File | Description |
|------|-------------|
| `dbt-project-structure.md` | Complete dbt layout with staging, intermediate, marts |
| `airflow-dag.py` | Production DAG with sensors, task groups, quality checks |
| `spark-streaming.py` | Kafka-to-Delta processor with windowing |
| `great-expectations-suite.json` | Comprehensive data quality expectation suite |

## Anti-Patterns (10 Critical Mistakes)

### 1. Full Table Refreshes
**Symptom**: Truncate and rebuild entire tables every run
**Fix**: Use incremental models with `is_incremental()`, partition by date
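
The idea behind an incremental model can be sketched in plain Python: find the latest timestamp already loaded (the high-water mark) and process only newer rows. This is an illustration of the pattern, not dbt's implementation; the `incremental_load` helper, the dict standing in for the target table, and the column names are all assumptions.

```python
def incremental_load(source_rows, target, key="id", updated_col="updated_at"):
    """Upsert only source rows newer than the target's high-water mark.

    `target` is a dict keyed by `key`, standing in for the destination table.
    Returns the number of rows processed in this run.
    """
    # High-water mark: latest timestamp already loaded (None on the first run).
    high_water = max((row[updated_col] for row in target.values()), default=None)
    new_rows = [
        row
        for row in source_rows
        if high_water is None or row[updated_col] > high_water
    ]
    for row in new_rows:
        target[row[key]] = row  # upsert by key instead of truncate-and-rebuild
    return len(new_rows)


# Hypothetical source snapshots; ISO-8601 date strings compare correctly as text.
run_1 = [
    {"id": 1, "updated_at": "2024-01-01", "status": "new"},
    {"id": 2, "updated_at": "2024-01-02", "status": "new"},
]
run_2 = run_1 + [{"id": 1, "updated_at": "2024-01-03", "status": "shipped"}]

warehouse = {}
first = incremental_load(run_1, warehouse)   # first run loads everything
second = incremental_load(run_2, warehouse)  # second run loads only the change
```

The strict `>` comparison can miss rows that share the high-water timestamp, so production versions of this pattern often re-read a short lookback window and rely on the upsert to absorb the overlap.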

### 2. Tight Coupling to Source Schemas
**Symptom**: Pipeline breaks when upstream adds/removes columns
**Fix**: Explicit source contracts, select only needed columns in staging
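
A source contract can be as simple as a declared set of columns and types that staging enforces. The sketch below is a minimal, assumed example (`CUSTOMER_CONTRACT` and `stage_row` are hypothetical names): extra upstream columns are ignored, while a missing or mistyped required column fails loudly.

```python
# Hypothetical contract: the only columns staging is allowed to depend on.
CUSTOMER_CONTRACT = {"customer_id": int, "email": str}


def stage_row(raw, contract=CUSTOMER_CONTRACT):
    """Project a raw source row onto the contract, validating as we go."""
    missing = sorted(contract.keys() - raw.keys())
    if missing:
        raise ValueError(f"source contract violated, missing columns: {missing}")
    staged = {}
    for column, expected_type in contract.items():
        value = raw[column]
        if not isinstance(value, expected_type):
            raise TypeError(
                f"column {column!r} expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        staged[column] = value  # select only the needed columns
    return staged


# An added upstream column does not break staging or leak downstream.
staged = stage_row(
    {"customer_id": 7, "email": "a@example.com", "new_upstream_col": "x"}
)
```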

### 3. Monolithic DAGs
**Symptom**: One 200-task DAG running 8 hours
**Fix**: Domain-specific DAGs, ExternalTaskSensor for dependencies

### 4. No Data Quality Gates
**Symptom**: Bad data reaches production before detection
**Fix**: Great Expectations or dbt tests at each layer, block on failures
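
The "block on failures" part is the key behavior: a failed check should stop the batch from moving to the next layer. The sketch below shows that gate in plain Python rather than with Great Expectations or dbt; the check helpers, `run_gate`, and `QualityGateError` are hypothetical names.

```python
class QualityGateError(Exception):
    """Raised when a batch fails one or more quality checks."""


def not_null(column):
    """Build a check that passes when no row has a null in `column`."""
    return lambda rows: all(row.get(column) is not None for row in rows)


def unique(column):
    """Build a check that passes when `column` has no duplicate values."""
    def check(rows):
        values = [row.get(column) for row in rows]
        return len(values) == len(set(values))
    return check


def run_gate(rows, checks):
    """Return rows unchanged if every check passes, otherwise raise."""
    failed = [name for name, check in checks.items() if not check(rows)]
    if failed:
        raise QualityGateError(f"failed checks: {', '.join(failed)}")
    return rows


# Hypothetical checks guarding promotion into the Silver layer.
SILVER_CHECKS = {
    "order_id_not_null": not_null("order_id"),
    "order_id_unique": unique("order_id"),
}
```

Calling `run_gate(batch, SILVER_CHECKS)` before the write step means a bad batch raises and the downstream load never runs.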

### 5. Processing Before Archiving
**Symptom**: Raw data transformed without preserving original
**Fix**: Always land raw in Bronze first, make transformations reproducible

### 6. Hardcoded Dates in Queries
**Symptom**: Manual updates needed for date filters
**Fix**: Use Airflow templating (e.g., the `{{ ds }}` logical-date macro) or dynamic date functions
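
One way to picture the fix is a query builder that takes the run's logical date as a parameter instead of embedding a literal. The sketch below is framework-independent; the `events` table, the `event_date` column, and `build_daily_query` are assumptions. In Airflow, the date would typically come from the rendered `ds` value.

```python
from datetime import date, timedelta


def build_daily_query(logical_date: date):
    """Return (sql, params) selecting the one-day partition for `logical_date`.

    The half-open range [start, end) avoids double counting at midnight.
    """
    sql = (
        "SELECT * FROM events "
        "WHERE event_date >= %(start)s AND event_date < %(end)s"
    )
    params = {
        "start": logical_date.isoformat(),
        "end": (logical_date + timedelta(days=1)).isoformat(),
    }
    return sql, params


# Backfilling any day is the same call with a different date.
sql, params = build_daily_query(date(2024, 2, 29))
```

Passing the dates as bound parameters rather than formatting them into the SQL string also keeps the query text identical across runs.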

### 7. Missing Watermarks in Streaming
**Symptom**: Unbounded state growth, OOM in long-running jobs
**Fix**: Add `withWatermark()` on the event-time column so lateness is bounded and state for old windows can be dropped
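
To show why a watermark bounds state, here is a simplified per-event model of tumbling-window counts in plain Python. It is not Spark's implementation (Structured Streaming advances the watermark per micro-batch, not per event), and `WindowedCounter` and its parameters are hypothetical. The watermark is the maximum event time seen minus the allowed lateness; windows that end at or before it are finalized and evicted, and events for them are dropped.

```python
from collections import defaultdict


class WindowedCounter:
    """Count events per tumbling window, evicting state behind the watermark."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        self.state = defaultdict(int)  # window start -> running count
        self.emitted = {}              # finalized windows
        self.dropped = 0               # events that arrived too late

    def watermark(self):
        return self.max_event_time - self.allowed_lateness

    def process(self, event_time):
        window_start = event_time - event_time % self.window_size
        if window_start + self.window_size <= self.watermark():
            self.dropped += 1  # window already finalized; ignore the event
            return
        self.state[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Finalize every window that now lies entirely behind the watermark.
        closed = [
            start for start in self.state
            if start + self.window_size <= self.watermark()
        ]
        for start in closed:
            self.emitted[start] = self.state.pop(start)


counter = WindowedCounter(window_size=10, allowed_lateness=5)
for event_time in [1, 4, 12, 3, 27, 8]:
    counter.process(event_time)
```

In this run the event at time 3 is late but still accepted, the event at time 8 arrives after its window was finalized and is dropped, and only the window starting at 20 remains in state. Without the eviction step, `state` would keep every window ever seen.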

### 8. No Retry/Backoff Strategy
**Symptom**: Transient failures cause DAG failures
**Fix**: `retries=3`, `retry_exponential_backoff=True`, `max_retry_delay`
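
The behavior those settings aim for can be sketched as a small retry helper: retry a bounded number of times, double the delay after each failure, and cap the delay. This is a generic illustration, not Airflow's implementation (its actual delay calculation differs in detail); `retry_with_backoff` and its parameters are hypothetical.

```python
import time


def retry_with_backoff(func, retries=3, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call `func`, retrying on any exception with capped exponential backoff.

    Makes at most `retries + 1` attempts and re-raises the last error.
    `sleep` is injectable so the delays can be observed without waiting.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= retries:
                raise  # out of retries: surface the failure
            sleep(min(base_delay * 2 ** attempt, max_delay))
            attempt += 1


# Usage with a hypothetical flaky task that succeeds on its third attempt.
calls = {"count": 0}


def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


observed_delays = []
result = retry_with_backoff(flaky_task, sleep=observed_delays.append)
```

Catching every exception is a simplification; in practice it is usually better to retry only errors known to be transient.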

### 9. Undocumented Data Lineage
**Symptom**: No one knows where data comes from or who uses it
**Fix**: dbt docs, data catalog integration, column-level lineage

### 10. Testing Only in Production
**Symptom**: Bugs discovered by stakeholders, not engineers
**Fix**: dbt `--target dev`, sample datasets, CI/CD for models

## Quality Checklist

**Pipeline Design:**
- [ ] Incremental processing where possible
- [ ] Idempotent transformations (re-runnable safely)
- [ ] Partitioning strategy defined and documented
- [ ] Backfill procedures documented
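
Idempotency in particular is easy to get wrong, so a small contrast may help. In the sketch below (hypothetical helpers, with a dict standing in for a partitioned table), overwriting a whole partition gives the same result no matter how many times a run is repeated, while a blind append duplicates rows on every retry or backfill.

```python
def overwrite_partition(table, partition_key, rows):
    """Idempotent write: replace the whole partition for this run."""
    table[partition_key] = list(rows)


def append_rows(table, partition_key, rows):
    """Non-idempotent write: every re-run adds the same rows again."""
    table.setdefault(partition_key, []).extend(rows)


daily_rows = [{"order_id": 1}, {"order_id": 2}]

safe_table, unsafe_table = {}, {}
for _ in range(2):  # simulate the same run executing twice
    overwrite_partition(safe_table, "2024-01-01", daily_rows)
    append_rows(unsafe_table, "2024-01-01", daily_rows)
```

After two executions, `safe_table` still holds two rows for the partition and `unsafe_table` holds four.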

**Data Quality:**
- [ ] Tests at Bronze layer (schema, nulls, ranges)
- [ ] Tests at Silver layer (business rules, referential integrity)
- [ ] Tests at Gold layer (aggregation checks, trend monitoring)
- [ ] Anomaly detection for volumes and distributions

**Orchestration:**
- [ ] Retry and alerting configured
- [ ] SLAs defined and monitored
- [ ] Cross-DAG dependencies use sensors
- [ ] `max_active_runs` set so overlapping DAG runs cannot conflict

**Operations:**
- [ ] Data lineage documented
- [ ] Runbooks for common failures
- [ ] Monitoring dashboards for pipeline health
- [ ] On-call procedures defined

## Validation Script

Run `./scripts/validate-pipeline.sh` to check:
- dbt project structure and conventions
- Airflow DAG best practices
- Spark job configurations
- Data quality setup

## External Resources

- [dbt Best Practices](https://docs.getdbt.com/guides/best-practices)
- [Airflow Best Practices](https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html)
- [Great Expectations Docs](https://docs.greatexpectations.io/)
- [Delta Lake Guide](https://docs.delta.io/latest/index.html)
- [Kafka Streams](https://kafka.apache.org/documentation/streams/)

Related Skills

vr-avatar-engineer

from curiositech/some_claude_skills

Expert in photorealistic and stylized VR avatar systems for Apple Vision Pro, Meta Quest, and cross-platform metaverse. Specializes in facial tracking (52+ blend shapes), subsurface scattering, Persona-style generation, Photon networking, and real-time LOD. Activate on 'VR avatar', 'Vision Pro Persona', 'Meta avatar', 'facial tracking', 'blend shapes', 'avatar networking', 'photorealistic avatar'. NOT for 2D profile pictures (use image generation), non-VR game characters (use game engine tools), static 3D models (use modeling tools), or motion capture hardware setup.

voice-audio-engineer

from curiositech/some_claude_skills

Expert in voice synthesis, TTS, voice cloning, podcast production, speech processing, and voice UI design via ElevenLabs integration. Specializes in vocal clarity, loudness standards (LUFS), de-essing, dialogue mixing, and voice transformation. Activate on 'TTS', 'text-to-speech', 'voice clone', 'voice synthesis', 'ElevenLabs', 'podcast', 'voice recording', 'speech-to-speech', 'voice UI', 'audiobook', 'dialogue'. NOT for spatial audio (use sound-engineer), music production (use DAW tools), game audio middleware (use sound-engineer), sound effects generation (use sound-engineer with ElevenLabs SFX), or live concert audio.

sound-engineer

from curiositech/some_claude_skills

Expert in spatial audio, procedural sound design, game audio middleware, and app UX sound design. Specializes in HRTF/Ambisonics, Wwise/FMOD integration, UI sound design, and adaptive music systems. Activate on 'spatial audio', 'HRTF', 'binaural', 'Wwise', 'FMOD', 'procedural sound', 'footstep system', 'adaptive music', 'UI sounds', 'notification audio', 'sonic branding'. NOT for music composition/production (use DAW), audio post-production for film (linear media), voice cloning/TTS (use voice-audio-engineer), podcast editing (use standard audio editors), or hardware design.

site-reliability-engineer

from curiositech/some_claude_skills

Docusaurus build health validation and deployment safety for Claude Skills showcase. Pre-commit MDX validation (Liquid syntax, angle brackets, prop mismatches), pre-build link checking, post-build health reports. Activate on 'build errors', 'commit hooks', 'deployment safety', 'site health', 'MDX validation'. NOT for general DevOps (use deployment-engineer), Kubernetes/cloud infrastructure (use kubernetes-architect), runtime monitoring (use observability-engineer), or non-Docusaurus projects.

prompt-engineer

from curiositech/some_claude_skills

Expert prompt optimization for LLMs and AI systems. Use PROACTIVELY when building AI features, improving agent performance, or crafting system prompts. Masters prompt patterns and techniques.

github-actions-pipeline-builder

from curiositech/some_claude_skills

Build production CI/CD pipelines with GitHub Actions. Implements matrix builds, caching, deployments, testing, security scanning. Use for automated testing, deployments, release workflows. Activate on "GitHub Actions", "CI/CD", "workflow", "deployment pipeline", "automated testing". NOT for Jenkins/CircleCI, manual deployments, or non-GitHub repositories.

geospatial-data-pipeline

from curiositech/some_claude_skills

Process, analyze, and visualize geospatial data at scale. Handles drone imagery, GPS tracks, GeoJSON optimization, coordinate transformations, and tile generation. Use for mapping apps, drone data processing, location-based services. Activate on "geospatial", "GIS", "PostGIS", "GeoJSON", "map tiles", "coordinate systems". NOT for simple address validation, basic distance calculations, or static map embeds.

data-viz-2025

from curiositech/some_claude_skills

State-of-the-art data visualization for React/Next.js/TypeScript with Tailwind CSS. Creates compelling, tested, and accessible visualizations following Tufte principles and NYT Graphics standards. Activate on "data viz", "chart", "graph", "visualization", "dashboard", "plot", "Recharts", "Nivo", "D3". NOT for static images, print graphics, or basic HTML tables.

computer-vision-pipeline

from curiositech/some_claude_skills

Build production computer vision pipelines for object detection, tracking, and video analysis. Handles drone footage, wildlife monitoring, and real-time detection. Supports YOLO, Detectron2, TensorFlow, PyTorch. Use for archaeological surveys, conservation, security. Activate on "object detection", "video analysis", "YOLO", "tracking", "drone footage". NOT for simple image filters, photo editing, or face recognition APIs.

ai-engineer

from curiositech/some_claude_skills

Build production-ready LLM applications, advanced RAG systems, and intelligent agents. Implements vector search, multimodal AI, agent orchestration, and enterprise AI integrations. Use PROACTIVELY for LLM features, chatbots, AI agents, or AI-powered applications.

skill-coach

from curiositech/some_claude_skills

Guides creation of high-quality Agent Skills with domain expertise, anti-pattern detection, and progressive disclosure best practices. Use when creating skills, reviewing existing skills, or when users mention improving skill quality, encoding expertise, or avoiding common AI tooling mistakes. Activate on keywords: create skill, review skill, skill quality, skill best practices, skill anti-patterns. NOT for general coding advice or non-skill Claude Code features.

3d-cv-labeling-2026

from curiositech/some_claude_skills

Expert in 3D computer vision labeling tools, workflows, and AI-assisted annotation for LiDAR, point clouds, and sensor fusion. Covers SAM4D/Point-SAM, human-in-the-loop architectures, and vertical-specific training strategies. Activate on '3D labeling', 'point cloud annotation', 'LiDAR labeling', 'SAM 3D', 'SAM4D', 'sensor fusion annotation', '3D bounding box', 'semantic segmentation point cloud'. NOT for 2D image labeling (use clip-aware-embeddings), general ML training (use ml-engineer), video annotation without 3D (use computer-vision-pipeline), or VLM prompt engineering (use prompt-engineer).