airflow-etl

Generate Apache Airflow ETL pipelines for government websites and document sources. Explores websites to find downloadable documents, verifies commercial-use licenses, and creates complete Airflow DAG assets with daily scheduling. Use when the user wants to create ETL pipelines, scrape government documents, or automate document collection workflows.

by diegosouzapw

Best use case

airflow-etl is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using airflow-etl can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/airflow-etl/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/airflow-etl/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub
2. Place it in .claude/skills/airflow-etl/SKILL.md inside your project
3. Restart your AI agent; it will auto-discover the skill

How airflow-etl Compares

| Feature / Agent | airflow-etl | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

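What does this skill do?

It generates Apache Airflow ETL pipelines for government websites and document sources: it explores a website for downloadable documents, verifies commercial-use licensing, and creates Airflow DAG assets with daily scheduling.

Where can I find the source code?

The source code is on GitHub in the diegosouzapw/awesome-omni-skill repository, under skills/data-ai/airflow-etl.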

SKILL.md Source

# Airflow ETL Pipeline Generator

Generate production-ready Apache Airflow ETL pipelines that automatically discover, download, and transform documents from government websites and other data sources into structured markdown files.

## Workflow

### Phase 1: Website Exploration and Discovery

1. **Initial Analysis**:
   - Use WebFetch to explore the provided website URL
   - Identify document sections (downloads, archives, publications, meetings, etc.)
   - Look for API endpoints, RSS feeds, or structured data sources
   - Note pagination patterns and document organization

2. **License Verification**:
   - Search for license information (Creative Commons, Open Government License, etc.)
   - Look for terms of use or copyright notices
   - Check for explicit commercial use permissions
   - If unclear, ask user about license status

3. **Document Inventory**:
   - Identify document types (PDF, DOC, DOCX, etc.)
   - Understand the URL patterns for documents
   - Determine how to detect new documents
   - Note any metadata available (dates, categories, titles)

4. **User Confirmation**:
   - Present findings in a clear summary
   - Show example document URLs
   - Describe the discovered structure
   - Ask user to confirm this is the correct data source
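
The document inventory step can be sketched as a small link extractor. This is a minimal sketch that uses the standard library's `html.parser` instead of BeautifulSoup or Scrapy to stay dependency-free; `extract_document_links` and `DOCUMENT_EXTENSIONS` are hypothetical names, not part of the skill.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of extensions, taken from the document types listed above.
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".docx")


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_document_links(
    html: str,
    base_url: str,
    extensions: tuple[str, ...] = DOCUMENT_EXTENSIONS,
) -> list[str]:
    """Return absolute, de-duplicated document URLs in page order."""
    collector = _LinkCollector()
    collector.feed(html)
    seen: set[str] = set()
    links: list[str] = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)
        # Match on the path only, so a query string does not hide the extension.
        if urlparse(url).path.lower().endswith(extensions) and url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

The returned URLs are the kind of examples that could be shown to the user in the confirmation step.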

### Phase 2: Generate Airflow Pipeline Assets

Create a complete, production-ready Airflow project structure:

```
airflow_pipelines/
├── dags/
│   └── [source_name]_etl_dag.py
├── operators/
│   ├── __init__.py
│   ├── document_scraper.py
│   └── document_converter.py
├── utils/
│   ├── __init__.py
│   ├── license_checker.py
│   └── file_manager.py
├── config/
│   └── [source_name]_config.yaml
├── requirements.txt
└── README.md
```

#### File Generation Requirements:

**1. DAG File** (`dags/[source_name]_etl_dag.py`):
- Daily schedule (adjustable)
- Clear task dependencies
- Error handling and retries
- Sensor for checking new documents
- Download task
- Conversion task to markdown
- File organization task
- Use Airflow best practices (XComs, task groups, dynamic task generation)
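
One possible shape for such a DAG file is sketched below, assuming Airflow 2.4 or later (for the `schedule` argument) and the TaskFlow API. The task bodies are placeholders, the DAG id and function names are hypothetical, and discovery is shown as a plain task rather than a sensor. Airflow is imported inside `build_dag()` so the callables can be unit-tested without an Airflow installation.

```python
from datetime import datetime, timedelta

# Hypothetical defaults; the retry count mirrors retry_attempts in the sample config.
DEFAULT_ARGS = {"retries": 3, "retry_delay": timedelta(minutes=5)}


def discover_new_documents() -> list[str]:
    """Placeholder: return URLs of documents not processed yet."""
    return []


def download_documents(urls: list[str]) -> list[str]:
    """Placeholder: download each URL and return local file paths."""
    return [f"/tmp/downloads/{index}.bin" for index, _ in enumerate(urls)]


def convert_to_markdown(paths: list[str]) -> list[str]:
    """Placeholder: convert each file and return the markdown paths."""
    return [path.rsplit(".", 1)[0] + ".md" for path in paths]


def organize_files(paths: list[str]) -> int:
    """Placeholder: move files into the final layout and return the count."""
    return len(paths)


def build_dag():
    """Wire the callables into a daily DAG (Airflow is imported lazily)."""
    from airflow.decorators import dag, task

    @dag(
        dag_id="example_source_etl",  # hypothetical name
        schedule="0 0 * * *",  # daily at midnight
        start_date=datetime(2024, 1, 1),
        catchup=False,
        default_args=DEFAULT_ARGS,
    )
    def example_source_etl():
        # Return values are passed between tasks through XComs.
        urls = task(discover_new_documents)()
        downloaded = task(download_documents)(urls)
        converted = task(convert_to_markdown)(downloaded)
        task(organize_files)(converted)

    return example_source_etl()
```

In a deployed DAG file, the result of `build_dag()` would be assigned to a module-level name (for example `etl_dag = build_dag()`) so the scheduler can discover it.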

**2. Document Scraper** (`operators/document_scraper.py`):
- BeautifulSoup or Scrapy for web scraping
- Request handling with retries
- Respect robots.txt
- User-agent configuration
- Rate limiting
- Checksum/hash tracking to avoid re-downloading
- State management for incremental updates
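
Three of these requirements (robots.txt, rate limiting, and hash tracking) can be sketched with the standard library alone. This is an illustration under assumptions, not the skill's actual scraper: `robots_allows`, `RateLimiter`, and `is_new` are hypothetical names, and the in-memory set stands in for whatever persistent state store the pipeline uses.

```python
import hashlib
import time
from typing import Optional
from urllib.robotparser import RobotFileParser

USER_AGENT = "ETL Pipeline Bot"  # same value as in the sample config


def robots_allows(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check a URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RateLimiter:
    """Space successive calls to wait() at least 1/rate seconds apart."""

    def __init__(self, requests_per_second: float,
                 clock=time.monotonic, sleep=time.sleep) -> None:
        self._min_interval = 1.0 / requests_per_second
        self._clock = clock  # injectable so tests need not really sleep
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def is_new(content: bytes, seen_hashes: set[str]) -> bool:
    """Record the SHA-256 of content and report whether it was unseen."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
```

A scraper loop would call `robots_allows` before each request, `RateLimiter.wait()` immediately before sending it, and `is_new` on the response body to skip documents that were already downloaded.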

**3. Document Converter** (`operators/document_converter.py`):
- Support for PDF, DOC, and DOCX conversion to markdown
- Use libraries like pypandoc, pdfplumber, or python-docx (these read DOCX and PDF; legacy DOC files need to be converted to DOCX first)
- Preserve document structure (headings, lists, tables)
- Extract metadata
- Handle encoding issues
- Clean and normalize output
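
The encoding and cleanup steps can be sketched independently of the conversion library. This is a minimal sketch with hypothetical names (`decode_text`, `normalize_markdown`) and an assumed fallback order of encodings; a real converter would apply it to the text returned by the chosen library.

```python
import re
import unicodedata

# Assumed fallback order; utf-8-sig also strips a leading byte order mark.
FALLBACK_ENCODINGS = ("utf-8-sig", "cp1252")


def decode_text(raw: bytes) -> str:
    """Decode bytes, trying each assumed encoding before falling back."""
    for encoding in FALLBACK_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("latin-1")


def normalize_markdown(text: str) -> str:
    """Normalize Unicode form, line endings, and blank-line runs."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Note: stripping trailing spaces also removes markdown hard line breaks.
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip("\n") + "\n"
```

The sketch treats all lines alike, so content inside fenced code blocks is normalized too; a converter that must preserve code verbatim would need to skip those regions.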

**4. License Checker** (`utils/license_checker.py`):
- Validate license information
- Check for commercial use permission
- Log license status
- Skip non-compliant documents
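
A license check of this kind could start as a keyword heuristic over the site's license notice. The patterns below are illustrative assumptions, not a complete or authoritative list, and `check_license` and `should_process` are hypothetical names. Anything the heuristic does not recognize is reported as unknown so the question can go back to the user.

```python
import re
from enum import Enum


class LicenseStatus(Enum):
    COMMERCIAL_OK = "commercial_ok"
    NON_COMMERCIAL = "non_commercial"
    UNKNOWN = "unknown"


# Hypothetical patterns. The restrictive pattern is checked first because
# "CC BY" is a prefix of "CC BY-NC".
_NON_COMMERCIAL = re.compile(
    r"\bCC[ -]BY[ -]NC\b|non-?commercial use only", re.IGNORECASE
)
_COMMERCIAL_OK = re.compile(
    r"\bCC0\b|\bCC[ -]BY(?:[ -]SA)?\b|Open Government Licen[cs]e|public domain",
    re.IGNORECASE,
)


def check_license(notice: str) -> LicenseStatus:
    """Classify a license notice using the keyword patterns above."""
    if _NON_COMMERCIAL.search(notice):
        return LicenseStatus.NON_COMMERCIAL
    if _COMMERCIAL_OK.search(notice):
        return LicenseStatus.COMMERCIAL_OK
    return LicenseStatus.UNKNOWN


def should_process(notice: str) -> bool:
    """Only documents with a recognized commercial-use license pass."""
    return check_license(notice) is LicenseStatus.COMMERCIAL_OK
```

Treating unknown notices as non-compliant keeps the pipeline conservative: a document is skipped unless its license is positively recognized.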

**5. File Manager** (`utils/file_manager.py`):
- Create meaningful directory structure
- Organize by date, category, or document type
- Generate consistent filenames
- Handle duplicates
- Maintain index of processed documents
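
The naming and duplicate-handling rules could look like the sketch below, which assumes the `year/month/category` layout from the sample config. `slugify`, `build_target_path`, and `with_unique_name` are hypothetical helpers, and the `taken` set stands in for the index of processed documents.

```python
import re
import unicodedata
from datetime import date
from pathlib import PurePosixPath


def slugify(text: str) -> str:
    """Lowercase ASCII slug, e.g. 'Sitzung Nr. 12' becomes 'sitzung-nr-12'."""
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-") or "untitled"


def build_target_path(base_path: str, published: date, category: str,
                      title: str, suffix: str = ".md") -> PurePosixPath:
    """Map a document to base/year/month/category/date_title.md."""
    filename = f"{published:%Y-%m-%d}_{slugify(title)}{suffix}"
    return (
        PurePosixPath(base_path)
        / f"{published:%Y}"
        / f"{published:%m}"
        / slugify(category)
        / filename
    )


def with_unique_name(path: PurePosixPath, taken: set[str]) -> PurePosixPath:
    """Append -2, -3, ... until the path is unused, then reserve it."""
    candidate, counter = path, 1
    while str(candidate) in taken:
        counter += 1
        candidate = path.with_name(f"{path.stem}-{counter}{path.suffix}")
    taken.add(str(candidate))
    return candidate
```

Deriving the path only from the document's date, category, and title means a re-run maps the same document to the same location.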

**6. Configuration** (`config/[source_name]_config.yaml`):
```yaml
source:
  name: "Source Name"
  url: "https://example.com"
  document_section: "/documents"

schedule:
  interval: "0 0 * * *"  # Daily at midnight

storage:
  base_path: "/data/documents"
  structure: "year/month/category"

scraping:
  rate_limit: 1  # requests per second
  user_agent: "ETL Pipeline Bot"
  retry_attempts: 3

conversion:
  format: "markdown"
  preserve_structure: true
  extract_metadata: true
```
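
Once loaded, a configuration like this can be validated before the pipeline uses it. The sketch below assumes the file has already been parsed into a dictionary (for example with `yaml.safe_load` from pyyaml); `PipelineConfig`, `ScrapingConfig`, and `parse_config` are hypothetical names, and only a subset of the keys is checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapingConfig:
    rate_limit: float  # requests per second
    user_agent: str
    retry_attempts: int


@dataclass(frozen=True)
class PipelineConfig:
    name: str
    url: str
    document_section: str
    schedule_interval: str
    base_path: str
    scraping: ScrapingConfig


def parse_config(raw: dict) -> PipelineConfig:
    """Validate a parsed config mapping and return a typed object."""
    try:
        source, scraping = raw["source"], raw["scraping"]
        config = PipelineConfig(
            name=source["name"],
            url=source["url"],
            document_section=source["document_section"],
            schedule_interval=raw["schedule"]["interval"],
            base_path=raw["storage"]["base_path"],
            scraping=ScrapingConfig(
                rate_limit=float(scraping["rate_limit"]),
                user_agent=scraping["user_agent"],
                retry_attempts=int(scraping["retry_attempts"]),
            ),
        )
    except KeyError as exc:
        raise ValueError(f"missing config key: {exc}") from exc
    if not config.url.startswith(("http://", "https://")):
        raise ValueError(f"source.url must be an http(s) URL: {config.url!r}")
    if config.scraping.rate_limit <= 0:
        raise ValueError("scraping.rate_limit must be positive")
    return config
```

Failing fast on a missing or malformed key surfaces configuration mistakes at load time instead of in the middle of a scheduled run.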

**7. Requirements** (`requirements.txt`):
```
apache-airflow>=2.7.0
beautifulsoup4>=4.12.0
requests>=2.31.0
pypandoc>=1.12
pdfplumber>=0.10.0
python-docx>=1.0.0
pyyaml>=6.0
lxml>=4.9.0
```

**8. Documentation** (`README.md`):
- Pipeline overview
- Setup instructions
- Configuration guide
- Airflow connection requirements
- Monitoring and troubleshooting
- Example usage

### Phase 3: Implementation Notes

**Important Considerations**:
- Include comprehensive error handling
- Log all operations for debugging
- Add data quality checks
- Implement idempotency (safe to re-run)
- Use Airflow Variables or Connections for sensitive config instead of hard-coding it
- Add email/Slack alerts for failures
- Document the directory structure created
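
Idempotent output can be approached by writing files atomically and skipping unchanged content. This is one possible sketch using only the standard library; `write_atomically` is a hypothetical helper, and the rename is atomic only when the temporary file and the target are on the same filesystem.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(target: Path, content: str) -> bool:
    """Write content to target; return False if it is already up to date."""
    target.parent.mkdir(parents=True, exist_ok=True)
    data = content.encode("utf-8")
    if target.exists() and target.read_bytes() == data:
        return False  # re-run with identical input: nothing to do

    # Write to a temporary file in the same directory, then rename it into
    # place, so readers never observe a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_name, target)
    except Exception:
        Path(tmp_name).unlink(missing_ok=True)
        raise
    return True
```

A task that fails halfway and is retried then leaves either the previous file or the complete new one, never a partial write.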

**Code Quality**:
- Follow PEP 8 style guidelines
- Add docstrings to all functions
- Include type hints
- Write modular, reusable code
- Add comments for complex logic

**Testing Recommendations** (optional):
- Suggest basic unit tests for utilities
- Recommend integration testing approach
- Provide example test cases
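
An example test case for a utility might look like the following. The function under test, `make_filename`, is a hypothetical stand-in defined inline so the sketch is self-contained; in a generated project it would be imported from `utils/file_manager.py`.

```python
import unittest


def make_filename(title: str, extension: str = ".md") -> str:
    """Hypothetical utility under test: turn a title into a filename."""
    words = "".join(ch.lower() if ch.isalnum() else " " for ch in title).split()
    return "-".join(words) + extension


class MakeFilenameTests(unittest.TestCase):
    def test_punctuation_and_case_are_normalized(self):
        self.assertEqual(make_filename("Annual Report: 2023"), "annual-report-2023.md")

    def test_same_title_gives_same_name(self):
        # Deterministic names are what make re-runs safe.
        self.assertEqual(make_filename("Agenda"), make_filename("Agenda"))

    def test_custom_extension(self):
        self.assertEqual(make_filename("Agenda", ".txt"), "agenda.txt")
```

Saved as a test module, this can be run with `python -m unittest`.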

### Phase 4: Delivery

1. Generate all files using the Write tool
2. Provide a summary of the created assets
3. Explain how to deploy to Airflow:
   - Copy the files into the Airflow home directory (DAG files go in its `dags/` folder)
   - Install the requirements
   - Enable the DAG in the Airflow UI
   - Configure connections if needed
4. Suggest next steps (testing, scheduling, monitoring)

## Examples

### Example 1: German Bundestag Documents
```
User: "Create an ETL pipeline for https://www.bundestag.de/digitales to collect committee meeting documents"

Skill Response:
- Explores the digital committee section
- Finds document sections (agendas, protocols, reports)
- Checks copyright notice
- Confirms findings with user
- Generates complete Airflow pipeline
- Creates scraper for committee documents
- Sets up markdown conversion
- Organizes by committee and date
```

### Example 2: EU Open Data Portal
```
User: "Build an Airflow pipeline for EU legislation documents from data.europa.eu"

Skill Response:
- Discovers API endpoints
- Verifies open data license
- Generates API-based scraper
- Creates pipeline with API operators
- Includes rate limiting
- Organizes by document type and year
```

## Key Success Criteria

- Pipeline runs successfully in Airflow
- Documents are correctly downloaded
- Markdown conversion preserves structure
- File organization is logical and scalable
- License compliance is enforced
- New documents are detected automatically
- Pipeline is idempotent and fault-tolerant

## Tips for Users

- Provide the main URL of the data source
- Mention any specific document types needed
- Specify preferred organization structure
- Note any special requirements (date ranges, categories)
- Test with a small sample before full deployment
