etl-generator

大数据 ETL 流程生成器 - 根据源表 DDL 生成标准化 ETL 加工 SQL(HiveSQL/MySQL)

3,891 stars

Best use case

etl-generator is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

大数据 ETL 流程生成器 - 根据源表 DDL 生成标准化 ETL 加工 SQL(HiveSQL/MySQL)

Teams using etl-generator should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/etl-generator/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/alexmayanjun-collab/etl-generator/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/etl-generator/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How etl-generator Compares

Feature / Agentetl-generatorStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

大数据 ETL 流程生成器 - 根据源表 DDL 生成标准化 ETL 加工 SQL(HiveSQL/MySQL)

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# ETL 流程生成器 - 大数据专家版

根据源表 DDL 自动生成标准化的 ETL 加工 SQL,支持 HiveSQL、MySQL、ODPS。

## 🎯 角色定位

**大数据专家**(20 年经验)
- 精通 HiveSQL、MySQL、Shell、Python
- 严格遵守大数据 ETL 加工规范
- 注意字段类型转换、时区处理、数据质量

---

## 🔧 核心功能

### 1️⃣ 表名规范
- **源表:** `ods_[表名]_di`
- **目标表:** `dwd_[表名]_di`

### 2️⃣ 字段类型转换
- `_at` 或 `_time` 结尾的 TIMESTAMP 字段 → STRING(时区转换)
- `_date` 结尾的字段 → STRING(不转换)
- 其他字段 → 保持原类型

### 3️⃣ 时区转换
```sql
DATE_FORMAT(FROM_UTC_TIMESTAMP(created_at, "${timezone}"), "yyyy-MM-dd HH:mm:ss.SSS") AS created_at
DATE_FORMAT(FROM_UTC_TIMESTAMP(updated_at, "${timezone}"), "yyyy-MM-dd HH:mm:ss.SSS") AS updated_at
```

### 4️⃣ 分区字段
```sql
DATE_FORMAT(FROM_UTC_TIMESTAMP(created_at, "${timezone}"), "yyyy-MM-dd") AS ds
```

### 5️⃣ 增量数据处理
- 使用 `ods_data_base_di` 表
- 支持 INSERT/UPDATE/DELETE 操作
- 通过 `_operation_` 和 `_after_image_` 识别

### 6️⃣ 去重逻辑
```sql
ROW_NUMBER() OVER(PARTITION BY id ORDER BY updated_at DESC) as rn
WHERE rn = 1
```

### 7️⃣ 字段排除 rn
```sql
SELECT `(rn)?+.+` FROM (...)
```

---

## 📋 使用方式

### 方式 1:命令行

```bash
# 从文件读取 DDL
python3 skills/etl-generator/etl_generator.py source_table.ddl > etl_sql.sql

# 从标准输入读取
cat source_table.ddl | python3 skills/etl-generator/etl_generator.py > etl_sql.sql
```

### 方式 2:直接调用

```python
from etl_generator import parse_table_ddl, generate_target_table_ddl, generate_etl_sql

ddl = """
CREATE TABLE IF NOT EXISTS ods_delivery_attempt_di(
  id STRING COMMENT '主键',
  pno STRING COMMENT '运单号',
  client_id STRING COMMENT '客户 ID',
  returned BIGINT COMMENT '是否退货件',
  delivery_date STRING COMMENT '派送日期',
  marker_id BIGINT COMMENT '标记原因',
  store_id STRING COMMENT '网点 ID',
  created_at TIMESTAMP COMMENT '创建时间',
  updated_at TIMESTAMP COMMENT '更新时间'
) 
PARTITIONED BY (ds STRING) 
STORED AS ALIORC  
TBLPROPERTIES ("columnar.nested.type"="true", "comment"="有效尝试派送详情") 
LIFECYCLE 36500;
"""

table_name, fields, table_comment = parse_table_ddl(ddl)
target_ddl = generate_target_table_ddl(table_name, fields, table_comment)
etl_sql = generate_etl_sql(table_name, fields, table_comment)
```

---

## 📝 输出示例

### 输入(源表 DDL)

```sql
CREATE TABLE IF NOT EXISTS ods_sap_store_cash_pay_info_di(
  id STRING COMMENT "主键",
  store_id STRING COMMENT "网点编号",
  business_date STRING COMMENT "业务日期",
  sap_state BIGINT COMMENT "0:待处理 1:待发送 2:不需要发送 3:已发送 4:异常",
  created_at TIMESTAMP COMMENT "创建时间",
  updated_at TIMESTAMP COMMENT "更新时间'
) 
PARTITIONED BY (ds STRING) 
STORED AS ALIORC 
TBLPROPERTIES ("columnar.nested.type"="true", "comment"="SAP 门店现金支付信息") 
LIFECYCLE 36500;
```

### 输出(目标表 DDL + ETL SQL)

```sql
-- 目标表 DDL
CREATE TABLE IF NOT EXISTS dwd_sap_store_cash_pay_info_di(
  id STRING COMMENT '主键',
  store_id STRING COMMENT '网点编号',
  business_date STRING COMMENT '业务日期',
  sap_state BIGINT COMMENT '0:待处理 1:待发送 2:不需要发送 3:已发送 4:异常',
  created_at STRING COMMENT '创建时间',
  updated_at STRING COMMENT '更新时间'
) 
PARTITIONED BY (ds STRING) 
STORED AS ALIORC  
TBLPROPERTIES ("columnar.nested.type"="true", "comment"="SAP 门店现金支付信息") 
LIFECYCLE 36500;

-- ETL 加工 SQL
WITH ods_data AS (
  SELECT
    id,
    store_id,
    business_date,
    sap_state,
    DATE_FORMAT(FROM_UTC_TIMESTAMP(created_at, "${timezone}"), "yyyy-MM-dd HH:mm:ss.SSS") AS created_at,
    DATE_FORMAT(FROM_UTC_TIMESTAMP(updated_at, "${timezone}"), "yyyy-MM-dd HH:mm:ss.SSS") AS updated_at,
    DATE_FORMAT(FROM_UTC_TIMESTAMP(created_at, "${timezone}"), "yyyy-MM-dd") AS ds
  FROM ods_sap_store_cash_pay_info_di
  WHERE ds >= "${y-m-d}"
  UNION ALL
  SELECT
    get_json_object(values, "$.id") as id,
    get_json_object(values, "$.store_id") as store_id,
    get_json_object(values, "$.business_date") as business_date,
    get_json_object(values, "$.sap_state") as sap_state,
    DATE_FORMAT(FROM_UTC_TIMESTAMP(get_json_object(values, "$.created_at"), "${timezone}"), "yyyy-MM-dd HH:mm:ss.SSS") AS created_at,
    DATE_FORMAT(FROM_UTC_TIMESTAMP(get_json_object(values, "$.updated_at"), "${timezone}"), "yyyy-MM-dd HH:mm:ss.SSS") AS updated_at,
    DATE_FORMAT(FROM_UTC_TIMESTAMP(get_json_object(values, "$.created_at"), "${timezone}"), "yyyy-MM-dd") AS ds
  FROM ods_data_base_di 
  WHERE (
    (_after_image_ = "Y" AND _operation_ IN ("INSERT", "UPDATE"))
    OR (_operation_ = "DELETE" AND _before_image_ = "Y")
    OR _id_ IS NULL
  )
  AND ds >= "${y-m-d}"
  AND table_name = "sap_store_cash_pay_info"
  AND db_name = "source_db"
)
INSERT OVERWRITE TABLE dwd_sap_store_cash_pay_info_di PARTITION(ds)
SELECT `(rn)?+.+` FROM (
  SELECT 
    *,
    ROW_NUMBER() OVER(PARTITION BY id ORDER BY updated_at DESC) as rn 
  FROM (
    SELECT * FROM dwd_sap_store_cash_pay_info_di WHERE ds IN (
      SELECT DISTINCT ds FROM ods_data
    )
    UNION ALL
    SELECT * FROM ods_data
  ) a
) t1
WHERE rn = 1;
```

---

## 🧪 数据质量检查

自动生成以下检查 SQL:

1. **主键空值检查**
2. **退货件比例检查**(如果有 returned 字段)
3. **数据量对比**(源表 vs 目标表)

---

## 📋 字段映射说明

自动生成字段映射文档:

```sql
-- ============================================
-- 字段映射说明
-- ============================================
-- 源表字段 (7 个): id, store_id, business_date, sap_state, created_at, updated_at
-- 目标表字段 (7 个): id, store_id, business_date, sap_state, created_at, updated_at
-- 分区字段:ds
-- 
-- 字段转换规则:
-- created_at: TIMESTAMP → STRING, 时区转换
-- updated_at: TIMESTAMP → STRING, 时区转换
-- business_date: 直接映射
-- ============================================
```

---

## ⚙️ 配置参数

| 参数 | 说明 | 默认值 |
|------|------|--------|
| `${timezone}` | 时区 | UTC |
| `${y-m-d}` | 业务日期 | ${yyyymmdd-1} |
| `${bizdate}` | 业务日期(质量检查) | ${yyyymmdd-1} |

---

## 📁 文件结构

```
skills/etl-generator/
├── SKILL.md              # 技能说明
├── etl_generator.py      # 核心脚本
├── README.md             # 使用文档
└── examples/             # 示例 DDL
    └── delivery_attempt.ddl
```

---

## 🔧 高级用法

### 1. 批量生成

```bash
# 批量处理多个表
for ddl in ddl/*.ddl; do
  python3 skills/etl-generator/etl_generator.py $ddl > etl/$(basename $ddl .ddl)_etl.sql
done
```

### 2. 自定义模板

修改 `etl_generator.py` 中的模板函数,适配特定业务场景。

### 3. 集成 DataWorks

```bash
# 生成 DataWorks 节点配置
python3 skills/etl-generator/etl_generator.py source.ddl | \
  python3 skills/etl-generator/dataworks_adapter.py > node_config.yaml
```

---

## ⚠️ 注意事项

### 1. 字段顺序
- 确保输入输出的字段顺序个数一致
- 使用 `(rn)?+.+` 排除 rn 字段

### 2. 时区处理
- 所有时间字段必须做时区转换
- `_date` 结尾的字段不转换

### 3. 表名规范
- 源表:`ods_[表名]_di`
- 目标表:`dwd_[表名]_di`
- WITH 引用使用原表名

### 4. 增量数据
- 使用 `ods_data_base_di` 表
- 正确配置 `table_name` 和 `db_name`

---

## 📊 版本历史

### v2.0 (2026-03-06)
- ✅ 优化时区转换逻辑
- ✅ 支持增量数据处理
- ✅ 自动生成数据质量检查
- ✅ 自动生成字段映射说明
- ✅ 字段排除 rn 字段

### v1.0 (2026-02-28)
- ✅ 基础 ETL 生成功能
- ✅ 字段类型转换
- ✅ 目标表 DDL 生成

---

**维护者:** 汉克 (Hank)  
**更新时间:** 2026-03-06

Related Skills

Invoice Generator

3891
from openclaw/skills

Creates professional invoices in markdown and HTML

Workflow & Productivity

Incident Postmortem Generator

3891
from openclaw/skills

Generate blameless incident postmortems from raw notes, Slack threads, or bullet points.

DevOps & Infrastructure

Partnership Agreement Generator

3891
from openclaw/skills

Generate comprehensive partnership agreements, joint venture frameworks, and strategic alliance documents for B2B relationships.

Legal Documents & Agreements

Employee Onboarding Generator

3891
from openclaw/skills

Build a structured 90-day onboarding plan for any role. Covers pre-boarding, Day 1, Week 1, 30/60/90-day milestones, buddy assignments, and success metrics.

Workflow & Productivity

Employee Handbook Generator

3891
from openclaw/skills

Build a complete, customized employee handbook for your company. Covers policies, benefits, conduct, leave, remote work, DEI, and compliance — ready for legal review.

Content & Documentation

IT Disaster Recovery Plan Generator

3891
from openclaw/skills

Build production-ready disaster recovery plans that actually get followed when things break.

DevOps & Infrastructure

Compliance Audit Generator

3891
from openclaw/skills

Run internal compliance audits against major frameworks without hiring a consultant.

Security

API Documentation Generator

3891
from openclaw/skills

Generate production-ready API documentation from endpoint descriptions. Outputs OpenAPI 3.0, markdown reference docs, and SDK quickstart guides.

Coding & Development

Annual Report Generator

3891
from openclaw/skills

Build a complete annual business report from raw data. Covers financial performance, operational metrics, strategic highlights, and forward-looking guidance.

Workflow & Productivity

daily-report-generator

3891
from openclaw/skills

Automatically generate daily/weekly work reports from git commits, calendar events, and task lists. Use when you need to quickly create professional work reports without manual effort.

Workflow & Productivity

hr-policy-generator

3891
from openclaw/skills

Comprehensive HR policy development covering attendance, time-off, overtime, remote work, and compliance. Generates structured policy documents, legal checklists, exception handling frameworks, and employee communication plans tailored to company size, work arrangement, and jurisdiction.

Workflow & Productivity

hr-policy-generator-cn

3891
from openclaw/skills

综合性 HR 政策设计工具,覆盖考勤、休假、加班、远程办公及合规要求。根据公司规模、办公模式、适用法律等输入,生成完整的政策文档、法律合规清单、例外处理机制及员工沟通方案。

Workflow & Productivity