data-engineering
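Data engineering. Airflow, Dagster, Kafka Streams, Flink, dbt, data pipelines, stream processing, data quality. Route here when the user mentions data pipelines, ETL, stream processing, or data quality.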
Best use case
data-engineering is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using data-engineering should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/data-engineering/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How data-engineering Compares
| Feature / Agent | data-engineering | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
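Data engineering. Airflow, Dagster, Kafka Streams, Flink, dbt, data pipelines, stream processing, data quality. Route here when the user mentions data pipelines, ETL, stream processing, or data quality.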
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Data Engineering Domain

```
Orchestration: Airflow (scheduling) | Dagster (assets) | Prefect (modern flows)
Stream processing: Kafka Streams (embedded) | Flink (cluster) | Spark Streaming
Quality: Great Expectations | dbt tests | Soda Core
```

---

## Pipeline Orchestration

| Feature | Airflow | Dagster | Prefect |
|------|---------|---------|---------|
| Core model | DAG+Task | Asset+Op | Flow+Task |
| Asset management | None | Native | None |
| Local development | Complex | Simple | Simple |

Airflow: `@task` decorator + automatic XCom | `.expand()` dynamic mapping | `retries=3, retry_exponential_backoff=True`

Dagster: `@asset(group_name, deps)` + `ConfigurableResource` + `DailyPartitionsDefinition` + `@asset_check`

Prefect: `@flow` + `@task(retries=3, cache_key_fn=task_input_hash)` + `ConcurrentTaskRunner`

### Orchestration Checklist

Idempotency (UPSERT / partition overwrite) | Incremental loads (`WHERE updated_at > last_run`) | Event-driven triggers | Cross-DAG dependencies | Data lineage (`ref()` / Asset deps)

---

## Stream Processing

| Feature | Kafka Streams | Flink | Spark Streaming |
|------|---------------|-------|-----------------|
| Deployment | Embedded in the JVM | Standalone cluster | Standalone cluster |
| State | RocksDB | RocksDB / in-memory | In-memory |
| Windows | Rich | Richest | Basic |

Kafka Streams: `StreamsBuilder` → `stream/filter/map` → `groupByKey().aggregate()` → `to()` | Joins (Stream-Stream / Stream-Table) | `EXACTLY_ONCE_V2`

Flink: Tumbling/Sliding/Session windows | `aggregate(AggregateFunction)` | ValueState/ListState + TTL | `enableCheckpointing(60000)` + Watermark (`forBoundedOutOfOrderness`) | Data skew → scatter hot keys with a random prefix

### Stream Processing Checklist

Choice of time semantics | Watermark tolerance for out-of-order events | State TTL to prevent state bloat | Checkpoint interval | End-to-end exactly-once | Backpressure monitoring

---

## Data Quality

Dimensions: `completeness → accuracy → consistency → timeliness → validity`

| Tool | Strengths | Best for |
|------|------|------|
| Great Expectations | Rich Expectations, Data Docs | Python ecosystem, complex validation |
| dbt | SQL-native, lineage tracking | Warehouse transformation tests |
| Soda Core | Concise YAML | Quick validation, CI/CD |

GE: `gx.get_context()` → data source → `row_count_between`/`not_be_null`/`be_unique`/`be_between` → Checkpoints

dbt: `unique`/`not_null`/`accepted_values`/`relationships` + `dbt_expectations` + `--store-failures`

Soda: `row_count > N` / `missing_count(col) = 0` / `freshness(ts) < 1d`

### Quality Checklist

Layered validation (source → transform → target) | Completeness + accuracy + consistency | Timeliness thresholds | Weighted scoring | Alerting (Slack/PagerDuty)

## Trigger Words

data pipeline, Airflow, Dagster, Prefect, ETL, stream processing, Kafka Streams, Flink, data quality, dbt, data lineage
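The orchestration checklist in the SKILL.md above names idempotency via UPSERT and incremental loads via `WHERE updated_at > last_run`. The sketch below is not part of the skill; it illustrates how the two ideas can combine, using only the standard-library `sqlite3` module (SQLite 3.24 or newer is assumed for `ON CONFLICT`). The `source` and `target` tables, their columns, and the function names are hypothetical.

```python
import sqlite3


def incremental_upsert(conn, last_run):
    """Copy rows changed since last_run from source to target.

    Safe to re-run: the UPSERT overwrites existing ids instead of duplicating them.
    Returns the new watermark (the newest updated_at that was processed).
    """
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM source WHERE updated_at > ?",
        (last_run,),
    ).fetchall()
    conn.executemany(
        """
        INSERT INTO target (id, amount, updated_at) VALUES (?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            amount = excluded.amount,
            updated_at = excluded.updated_at
        """,
        rows,
    )
    conn.commit()
    # Keep the old watermark when nothing changed.
    return max((r[2] for r in rows), default=last_run)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE source (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    )
    conn.execute(
        "CREATE TABLE target (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO source VALUES (?, ?, ?)",
        [(1, 10.0, "2024-01-01T00:00:00"), (2, 20.0, "2024-01-02T00:00:00")],
    )
    epoch = "1970-01-01T00:00:00"
    watermark = incremental_upsert(conn, epoch)
    # Replaying the same run must not create duplicate rows.
    incremental_upsert(conn, epoch)
    # A source row changes after the first run; only it is picked up next time.
    conn.execute(
        "UPDATE source SET amount = 15.0, updated_at = '2024-01-03T00:00:00' WHERE id = 1"
    )
    watermark = incremental_upsert(conn, watermark)
    rows = conn.execute("SELECT id, amount FROM target ORDER BY id").fetchall()
    return rows, watermark
```

Calling `demo()` returns the final contents of `target` together with the last watermark. ISO-8601 strings are used for timestamps so that plain string comparison orders them correctly.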
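The stream processing notes in the SKILL.md above mention tumbling windows and a bounded-out-of-orderness watermark. The following is a plain-Python simulation of that idea, not Flink or Kafka Streams API code. It assumes non-negative integer event times and simplifies the watermark to "largest event time seen minus the allowed lateness"; the function and parameter names are hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Count events per (window, key) using event time.

    events: iterable of (event_time, key) in arrival order.
    Returns (results, late): results is a list of (window_start, key, count)
    in firing order; late holds events whose window had already fired.
    """
    open_windows = defaultdict(lambda: defaultdict(int))  # start -> key -> count
    results, late = [], []
    watermark = float("-inf")

    def fire_ready():
        # A window fires once the watermark reaches its end.
        ready = sorted(s for s in open_windows if s + window_size <= watermark)
        for start in ready:
            for key, count in sorted(open_windows.pop(start).items()):
                results.append((start, key, count))

    for event_time, key in events:
        start = event_time - event_time % window_size
        if start + window_size <= watermark:
            late.append((event_time, key))  # too late: window already closed
            continue
        open_windows[start][key] += 1
        # The watermark trails the newest event time by the allowed lateness.
        watermark = max(watermark, event_time - max_out_of_orderness)
        fire_ready()

    # End of input: advance the watermark to infinity to flush open windows.
    watermark = float("inf")
    fire_ready()
    return results, late


def demo():
    events = [(1, "a"), (4, "a"), (12, "b"), (3, "a"), (25, "a"), (8, "a")]
    return tumbling_window_counts(events, window_size=10, max_out_of_orderness=5)
```

In `demo()`, the event at time 3 arrives out of order but is still counted because the watermark has only reached 7. The event at time 8 arrives after the watermark has reached 20, so its window `[0, 10)` has already fired and it is reported as late. A larger `max_out_of_orderness` trades latency for fewer dropped events, which is the tuning decision the checklist refers to.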
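The data quality notes in the SKILL.md above list checks such as `row_count > N`, `missing_count(col) = 0`, and `freshness(ts) < 1d`, plus weighted scoring. The sketch below expresses those checks in plain Python rather than Great Expectations, dbt, or Soda syntax. The row shape (`id`, `email`, `updated_at`), the check names, and the weights are all hypothetical.

```python
from datetime import datetime, timedelta


def run_checks(rows, now, min_rows=1, max_age=timedelta(days=1)):
    """Return a dict of check name -> bool for a list of row dicts."""
    ids = [r["id"] for r in rows]
    newest = max((r["updated_at"] for r in rows), default=None)
    return {
        # Volume: the dataset has at least min_rows rows.
        "volume": len(rows) >= min_rows,
        # Completeness: no row is missing an email.
        "completeness": all(r.get("email") is not None for r in rows),
        # Uniqueness: ids are not repeated.
        "uniqueness": len(ids) == len(set(ids)),
        # Timeliness: the newest row is younger than max_age.
        "timeliness": newest is not None and now - newest < max_age,
    }


def weighted_score(checks, weights):
    """Share of total weight carried by the checks that passed (0.0 to 1.0)."""
    total = sum(weights.values())
    passed = sum(weights[name] for name, ok in checks.items() if ok)
    return passed / total


def demo():
    now = datetime(2024, 1, 2, 12, 0, 0)
    rows = [
        {"id": 1, "email": "a@example.com", "updated_at": datetime(2024, 1, 2, 9, 0)},
        {"id": 2, "email": None, "updated_at": datetime(2024, 1, 1, 18, 0)},
        {"id": 3, "email": "c@example.com", "updated_at": datetime(2024, 1, 2, 11, 0)},
    ]
    weights = {"volume": 1, "completeness": 3, "uniqueness": 3, "timeliness": 3}
    checks = run_checks(rows, now)
    return checks, weighted_score(checks, weights)
```

In `demo()`, only the completeness check fails, because one row has no email. A score like this can feed an alerting threshold, although which weights are appropriate depends on the dataset.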
Related Skills
name: parse-error
this is not frontmatter
multi-script
too many scripts
missing-description
No description provided.
invalid-tools
invalid tool name
clash-skill
second duplicate
review
Review skill. Read ~/.claude/skills/gstack/review/checklist.md before acting.
office-hours
Office hours skill. Uses ~/.claude/skills/gstack/bin/gstack-config.
codex
Should be skipped for codex host.
gstack
Root gstack skill. Uses ~/.claude/skills/gstack/bin helpers.
verify-security
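Security verification gate. Automatically scans code for security vulnerabilities, detects dangerous patterns, and ensures security decisions are documented. Use when the user mentions security scanning, vulnerability detection, security audits, code security, OWASP, injection detection, or sensitive-information leaks. Triggers automatically when a new module is created, on security-related changes, on attack/defense tasks, and when a refactor completes.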
verify-quality
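Code quality verification gate. Checks quality metrics such as complexity, duplicated code, naming conventions, and function length. Use when the user mentions code quality, complexity checks, code smells, refactoring suggestions, lint checks, or coding standards. Triggers automatically for complex modules and when a refactor completes.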
verify-module
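Module integrity verification gate. Scans the directory structure, detects missing documentation, and verifies that code and documentation are in sync. Use when the user mentions module verification, documentation checks, structural integrity, README checks, or DESIGN checks. Triggers automatically when a new module is completed.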