data-engineering-data-pipeline

You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.

30 stars

byZidong-IA

View on GitHub Installation ↓

Best use case

data-engineering-data-pipeline is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.

Teams using data-engineering-data-pipeline should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/data-engineering-data-pipeline/SKILL.md --create-dirs "https://raw.githubusercontent.com/Zidong-IA/BIBLIOTECA/main/skills/skills/ai-ml/data-engineering-data-pipeline/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/data-engineering-data-pipeline/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How data-engineering-data-pipeline Compares

Feature / Agent	data-engineering-data-pipeline	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Data Pipeline Architecture

You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.

## Use this skill when

- Working on data pipeline architecture tasks or workflows
- Needing guidance, best practices, or checklists for data pipeline architecture

## Do not use this skill when

- The task is unrelated to data pipeline architecture
- You need a different domain or tool outside this scope

## Requirements

$ARGUMENTS

## Core Capabilities

- Design ETL/ELT, Lambda, Kappa, and Lakehouse architectures
- Implement batch and streaming data ingestion
- Build workflow orchestration with Airflow/Prefect
- Transform data using dbt and Spark
- Manage Delta Lake/Iceberg storage with ACID transactions
- Implement data quality frameworks (Great Expectations, dbt tests)
- Monitor pipelines with CloudWatch/Prometheus/Grafana
- Optimize costs through partitioning, lifecycle policies, and compute optimization

## Instructions

### 1. Architecture Design
- Assess: sources, volume, latency requirements, targets
- Select pattern: ETL (transform before load), ELT (load then transform), Lambda (batch + speed layers), Kappa (stream-only), Lakehouse (unified)
- Design flow: sources → ingestion → processing → storage → serving
- Add observability touchpoints

### 2. Ingestion Implementation
**Batch**
- Incremental loading with watermark columns
- Retry logic with exponential backoff
- Schema validation and dead letter queue for invalid records
- Metadata tracking (_extracted_at, _source)

**Streaming**
- Kafka consumers with exactly-once semantics
- Manual offset commits within transactions
- Windowing for time-based aggregations
- Error handling and replay capability

### 3. Orchestration
**Airflow**
- Task groups for logical organization
- XCom for inter-task communication
- SLA monitoring and email alerts
- Incremental execution with execution_date
- Retry with exponential backoff

**Prefect**
- Task caching for idempotency
- Parallel execution with .submit()
- Artifacts for visibility
- Automatic retries with configurable delays

### 4. Transformation with dbt
- Staging layer: incremental materialization, deduplication, late-arriving data handling
- Marts layer: dimensional models, aggregations, business logic
- Tests: unique, not_null, relationships, accepted_values, custom data quality tests
- Sources: freshness checks, loaded_at_field tracking
- Incremental strategy: merge or delete+insert

### 5. Data Quality Framework
**Great Expectations**
- Table-level: row count, column count
- Column-level: uniqueness, nullability, type validation, value sets, ranges
- Checkpoints for validation execution
- Data docs for documentation
- Failure notifications

**dbt Tests**
- Schema tests in YAML
- Custom data quality tests with dbt-expectations
- Test results tracked in metadata

### 6. Storage Strategy
**Delta Lake**
- ACID transactions with append/overwrite/merge modes
- Upsert with predicate-based matching
- Time travel for historical queries
- Optimize: compact small files, Z-order clustering
- Vacuum to remove old files

**Apache Iceberg**
- Partitioning and sort order optimization
- MERGE INTO for upserts
- Snapshot isolation and time travel
- File compaction with binpack strategy
- Snapshot expiration for cleanup

### 7. Monitoring & Cost Optimization
**Monitoring**
- Track: records processed/failed, data size, execution time, success/failure rates
- CloudWatch metrics and custom namespaces
- SNS alerts for critical/warning/info events
- Data freshness checks
- Performance trend analysis

**Cost Optimization**
- Partitioning: date/entity-based, avoid over-partitioning (keep >1GB)
- File sizes: 512MB-1GB for Parquet
- Lifecycle policies: hot (Standard) → warm (IA) → cold (Glacier)
- Compute: spot instances for batch, on-demand for streaming, serverless for adhoc
- Query optimization: partition pruning, clustering, predicate pushdown

## Example: Minimal Batch Pipeline

```python
# Batch ingestion with validation
from batch_ingestion import BatchDataIngester
from storage.delta_lake_manager import DeltaLakeManager
from data_quality.expectations_suite import DataQualityFramework

ingester = BatchDataIngester(config={})

# Extract with incremental loading
df = ingester.extract_from_database(
    connection_string='postgresql://host:5432/db',
    query='SELECT * FROM orders',
    watermark_column='updated_at',
    last_watermark=last_run_timestamp
)

# Validate
schema = {'required_fields': ['id', 'user_id'], 'dtypes': {'id': 'int64'}}
df = ingester.validate_and_clean(df, schema)

# Data quality checks
dq = DataQualityFramework()
result = dq.validate_dataframe(df, suite_name='orders_suite', data_asset_name='orders')

# Write to Delta Lake
delta_mgr = DeltaLakeManager(storage_path='s3://lake')
delta_mgr.create_or_update_table(
    df=df,
    table_name='orders',
    partition_columns=['order_date'],
    mode='append'
)

# Save failed records
ingester.save_dead_letter_queue('s3://lake/dlq/orders')
```

## Output Deliverables

### 1. Architecture Documentation
- Architecture diagram with data flow
- Technology stack with justification
- Scalability analysis and growth patterns
- Failure modes and recovery strategies

### 2. Implementation Code
- Ingestion: batch/streaming with error handling
- Transformation: dbt models (staging → marts) or Spark jobs
- Orchestration: Airflow/Prefect DAGs with dependencies
- Storage: Delta/Iceberg table management
- Data quality: Great Expectations suites and dbt tests

### 3. Configuration Files
- Orchestration: DAG definitions, schedules, retry policies
- dbt: models, sources, tests, project config
- Infrastructure: Docker Compose, K8s manifests, Terraform
- Environment: dev/staging/prod configs

### 4. Monitoring & Observability
- Metrics: execution time, records processed, quality scores
- Alerts: failures, performance degradation, data freshness
- Dashboards: Grafana/CloudWatch for pipeline health
- Logging: structured logs with correlation IDs

### 5. Operations Guide
- Deployment procedures and rollback strategy
- Troubleshooting guide for common issues
- Scaling guide for increased volume
- Cost optimization strategies and savings
- Disaster recovery and backup procedures

## Success Criteria
- Pipeline meets defined SLA (latency, throughput)
- Data quality checks pass with >99% success rate
- Automatic retry and alerting on failures
- Comprehensive monitoring shows health and performance
- Documentation enables team maintenance
- Cost optimization reduces infrastructure costs by 30-50%
- Schema evolution without downtime
- End-to-end data lineage tracked

Related Skills

sqlmap-database-pentesting

from Zidong-IA/BIBLIOTECA

This skill should be used when the user asks to "automate SQL injection testing," "enumerate database structure," "extract database credentials using sqlmap," "dump tables and columns...

protocol-reverse-engineering

from Zidong-IA/BIBLIOTECA

Master network protocol reverse engineering including packet analysis, protocol dissection, and custom protocol documentation. Use when analyzing network traffic, understanding proprietary protocol...

gdpr-data-handling

from Zidong-IA/BIBLIOTECA

Implement GDPR-compliant data handling with consent management, data subject rights, and privacy by design. Use when building systems that process EU personal data, implementing privacy controls, o...

data-storytelling

from Zidong-IA/BIBLIOTECA

Transform data into compelling narratives using visualization, context, and persuasive structure. Use when presenting analytics to stakeholders, creating data reports, or building executive present...

deployment-pipeline-design

from Zidong-IA/BIBLIOTECA

Design multi-stage CI/CD pipelines with approval gates, security checks, and deployment orchestration. Use when architecting deployment workflows, setting up continuous delivery, or implementing Gi...

database

from Zidong-IA/BIBLIOTECA

Database development and operations workflow covering SQL, NoSQL, database design, migrations, optimization, and data engineering.

database-optimizer

from Zidong-IA/BIBLIOTECA

Expert database optimizer specializing in modern performance tuning, query optimization, and scalable architectures.

database-migrations-sql-migrations

from Zidong-IA/BIBLIOTECA

SQL database migrations with zero-downtime strategies for PostgreSQL, MySQL, and SQL Server. Focus on data integrity and rollback plans.

database-migrations-migration-observability

from Zidong-IA/BIBLIOTECA

Migration monitoring, CDC, and observability infrastructure

database-cloud-optimization-cost-optimize

from Zidong-IA/BIBLIOTECA

You are a cloud cost optimization expert specializing in reducing infrastructure expenses while maintaining performance and reliability. Analyze cloud spending, identify savings opportunities, and ...

database-architect

from Zidong-IA/BIBLIOTECA

Expert database architect specializing in data layer design from scratch, technology selection, schema modeling, and scalable database architectures.

database-admin

from Zidong-IA/BIBLIOTECA

Expert database administrator specializing in modern cloud databases, automation, and reliability engineering.