data-engineering

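Data engineering: Airflow, Dagster, Kafka Streams, Flink, dbt, data pipelines, stream processing, and data quality. Route here when the user mentions data pipelines, ETL, stream processing, or data quality.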

155 stars

Best use case

data-engineering is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using data-engineering should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/data-engineering/SKILL.md --create-dirs "https://raw.githubusercontent.com/telagod/code-abyss/main/skills/domains/data-engineering/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/data-engineering/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How data-engineering Compares

| Feature / Agent | data-engineering | Standard Approach |
|-----------------|------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

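It covers data engineering: Airflow, Dagster, Kafka Streams, Flink, dbt, data pipelines, stream processing, and data quality. The agent routes here when the user mentions data pipelines, ETL, stream processing, or data quality.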

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Data Engineering Domain

```
Orchestration: Airflow (scheduling) | Dagster (assets) | Prefect (modern flows)
Stream processing: Kafka Streams (embedded) | Flink (cluster) | Spark Streaming
Quality: Great Expectations | dbt tests | Soda Core
```

---

## Pipeline Orchestration

| Feature | Airflow | Dagster | Prefect |
|---------|---------|---------|---------|
| Core model | DAG + Task | Asset + Op | Flow + Task |
| Asset management | None | Native | None |
| Local development | Complex | Simple | Simple |

Airflow: `@task` decorator with automatic XCom | `.expand()` dynamic task mapping | `retries=3, retry_exponential_backoff=True`
Dagster: `@asset(group_name, deps)` + `ConfigurableResource` + `DailyPartitionsDefinition` + `@asset_check`
Prefect: `@flow` + `@task(retries=3, cache_key_fn=task_input_hash)` + `ConcurrentTaskRunner`
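
The retry settings above (`retries=3` with exponential backoff) can be illustrated in plain Python. This is a minimal sketch of the general idea, not the scheduling logic of Airflow or Prefect. The helper name `run_with_retries` and its `base_delay` and `sleep` parameters are hypothetical, and real orchestrators typically add jitter and a maximum delay.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn; on failure, retry up to `retries` times, doubling the delay each time."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= retries:
                # Retry budget exhausted: surface the last error to the caller.
                raise
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

Passing a custom `sleep` callable (for example `list.append`) makes the delay schedule observable in tests without actually waiting.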

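### Orchestration Checklist

Idempotency (UPSERT / partition overwrite) | Incremental loads (`WHERE updated_at > last_run`) | Event-driven triggers | Cross-DAG dependencies | Data lineage (`ref()` / asset deps)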
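
The first two checklist items, idempotency via UPSERT and incremental loads via `WHERE updated_at > last_run`, can be sketched together with the standard-library `sqlite3` module. The `users` table, its columns, and the `incremental_upsert` helper are assumptions made for illustration. The sketch also assumes ISO-8601 timestamp strings (which compare correctly as text) and SQLite 3.24 or newer for `ON CONFLICT ... DO UPDATE`.

```python
import sqlite3


def incremental_upsert(src: sqlite3.Connection, dst: sqlite3.Connection, last_run: str) -> int:
    """Copy rows changed since `last_run` from src to dst; re-running gives the same result."""
    # Incremental: read only rows modified after the previous successful run.
    rows = src.execute(
        "SELECT id, name, updated_at FROM users WHERE updated_at > ?",
        (last_run,),
    ).fetchall()
    # Idempotent: insert new ids, overwrite existing ones instead of duplicating them.
    dst.executemany(
        """
        INSERT INTO users (id, name, updated_at) VALUES (?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            name = excluded.name,
            updated_at = excluded.updated_at
        """,
        rows,
    )
    dst.commit()
    return len(rows)
```

Because the write is keyed on `id`, a retried or duplicated run leaves the target table unchanged, which is what makes orchestrator-level retries safe.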

---

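## Stream Processing

| Feature | Kafka Streams | Flink | Spark Streaming |
|---------|---------------|-------|-----------------|
| Deployment | Embedded in the JVM application | Standalone cluster | Standalone cluster |
| State | RocksDB | RocksDB / in-memory | In-memory |
| Windowing | Rich | Richest | Basic |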

Kafka Streams: `StreamsBuilder` → `stream/filter/map` → `groupByKey().aggregate()` → `to()` | Joins (Stream-Stream / Stream-Table) | `EXACTLY_ONCE_V2`
Flink: Tumbling/Sliding/Session windows | `aggregate(AggregateFunction)` | ValueState/ListState + TTL | `enableCheckpointing(60000)` + Watermark (`forBoundedOutOfOrderness`) | Data skew → spread hot keys with a random prefix (salting)
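
The random-prefix fix for data skew can be sketched as a two-stage aggregation in plain Python. This illustrates the technique only and is not Flink code. The function name `salted_count`, the `#` separator, and the `buckets` parameter are hypothetical.

```python
import random
from collections import Counter
from typing import Iterable, Optional, Tuple


def salted_count(
    keys: Iterable[str], buckets: int = 4, seed: Optional[int] = None
) -> Tuple[Counter, Counter]:
    """Count keys in two stages so that one hot key is spread over several groups."""
    rng = random.Random(seed)
    # Stage 1: prefix each key with a random salt; a hot key now lands in up to
    # `buckets` partial groups instead of one overloaded group.
    partial = Counter(f"{rng.randrange(buckets)}#{key}" for key in keys)
    # Stage 2: strip the salt and merge the much smaller partial counts.
    final: Counter = Counter()
    for salted_key, count in partial.items():
        final[salted_key.split("#", 1)[1]] += count
    return partial, final
```

In a distributed job, stage 1 would be keyed by the salted key and stage 2 by the original key. The second stage only sees at most `buckets` records per key, so it is cheap even for the hot key.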

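### Stream Processing Checklist

Time semantics (event time vs. processing time) | Watermark tolerance for out-of-order events | State TTL to prevent unbounded growth | Checkpoint interval | End-to-end exactly-once | Backpressure monitoring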
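
To make the first two items concrete, here is a small plain-Python sketch of event-time tumbling windows with a bounded-out-of-orderness watermark. It is a simplified model of the concept rather than Flink's implementation: the watermark trails the largest timestamp seen so far by a fixed tolerance, a window is emitted once the watermark passes its end, and events for already-closed windows are dropped. The name `tumbling_counts` and its parameters are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # (event_time_ms, key)


def tumbling_counts(
    events: Iterable[Event], window_ms: int, max_out_of_order_ms: int
) -> Tuple[List[Tuple[int, Dict[str, int]]], List[Event]]:
    """Count events per key in event-time tumbling windows.

    Returns (emitted windows as (window_start, counts), dropped late events).
    """
    open_windows: Dict[int, Dict[str, int]] = {}
    emitted: List[Tuple[int, Dict[str, int]]] = []
    dropped: List[Event] = []
    watermark = float("-inf")
    for ts, key in events:  # events arrive in processing order, not event-time order
        start = ts - ts % window_ms
        if start + window_ms <= watermark:
            dropped.append((ts, key))  # its window has already been emitted
            continue
        open_windows.setdefault(start, defaultdict(int))[key] += 1
        # The watermark trails the largest event time seen by the tolerance.
        watermark = max(watermark, ts - max_out_of_order_ms)
        for closed in sorted(s for s in open_windows if s + window_ms <= watermark):
            emitted.append((closed, dict(open_windows.pop(closed))))
    # End of input: flush whatever is still open.
    for remaining in sorted(open_windows):
        emitted.append((remaining, dict(open_windows.pop(remaining))))
    return emitted, dropped
```

A larger tolerance accepts more late events but delays every result by the same amount and keeps more window state open, which is the trade-off behind the watermark and state TTL items.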

---

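## Data Quality

Dimensions: `completeness → accuracy → consistency → timeliness → validity`

| Tool | Strengths | Best for |
|------|-----------|----------|
| Great Expectations | Rich set of Expectations, Data Docs | Python ecosystem, complex validation |
| dbt | SQL-native, lineage tracking | Testing warehouse transformations |
| Soda Core | Concise YAML | Quick validation, CI/CD |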

GE: `gx.get_context()` → data source → `row_count_between` / `not_be_null` / `be_unique` / `be_between` (shorthand for the matching `expect_*` expectations) → Checkpoints
dbt: `unique` / `not_null` / `accepted_values` / `relationships` + `dbt_expectations` + `--store-failures`
Soda: `row_count > N` / `missing_count(col) = 0` / `freshness(ts) < 1d`
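
The four checks named above (row count, not-null, unique, value range) are simple predicates over a dataset. The sketch below expresses them in plain Python to show what each one asserts. It does not use the Great Expectations, dbt, or Soda APIs, and every function name in it is hypothetical.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def row_count_between(rows: List[Row], lo: int, hi: int) -> bool:
    return lo <= len(rows) <= hi


def not_null(rows: List[Row], col: str) -> bool:
    return all(row.get(col) is not None for row in rows)


def unique(rows: List[Row], col: str) -> bool:
    values = [row.get(col) for row in rows]
    return len(values) == len(set(values))


def between(rows: List[Row], col: str, lo: float, hi: float) -> bool:
    # Nulls are skipped here; pair this check with not_null to reject them.
    return all(lo <= row[col] <= hi for row in rows if row.get(col) is not None)


def run_checks(
    rows: List[Row], checks: Dict[str, Callable[[List[Row]], bool]]
) -> Dict[str, bool]:
    """Run every named check and report pass/fail per check."""
    return {name: check(rows) for name, check in checks.items()}
```

Keeping each check as a named predicate mirrors how the tools report results: one pass/fail entry per rule, which can then feed alerting or scoring.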

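### Quality Checklist

Layered validation (source → transform → target) | Completeness + accuracy + consistency | Timeliness thresholds | Weighted scoring | Alerting (Slack/PagerDuty)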
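
Weighted scoring can be as simple as a weighted average of per-dimension pass rates compared against an alert threshold. The sketch below assumes pass rates in the range 0 to 1. The dimension names, the weights, and the 0.9 default threshold are illustrative assumptions, not values from the skill.

```python
from typing import Dict


def weighted_score(pass_rates: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension pass rates (each expected in [0, 1])."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(pass_rates[dim] * w for dim, w in weights.items()) / total_weight


def should_alert(score: float, threshold: float = 0.9) -> bool:
    """True when the score falls below the threshold and someone should be notified."""
    return score < threshold
```

For example, with pass rates `{"completeness": 1.0, "accuracy": 0.5, "timeliness": 0.0}` and weights `{"completeness": 2, "accuracy": 1, "timeliness": 1}`, the score is (2.0 + 0.5 + 0.0) / 4 = 0.625, which would trigger an alert at the default threshold.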

## Trigger Words

data pipeline, Airflow, Dagster, Prefect, ETL, stream processing, Kafka Streams, Flink, data quality, dbt, data lineage
