airflow-etl
Generate Apache Airflow ETL pipelines for government websites and document sources. Explores websites to find downloadable documents, verifies commercial use licenses, and creates complete Airflow DAG assets with daily scheduling. Use when user wants to create ETL pipelines, scrape government documents, or automate document collection workflows.
Best use case
airflow-etl is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using airflow-etl should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/airflow-etl/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How airflow-etl Compares
| Feature / Agent | airflow-etl | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Generate Apache Airflow ETL pipelines for government websites and document sources. Explores websites to find downloadable documents, verifies commercial use licenses, and creates complete Airflow DAG assets with daily scheduling. Use when user wants to create ETL pipelines, scrape government documents, or automate document collection workflows.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Airflow ETL Pipeline Generator

Generate production-ready Apache Airflow ETL pipelines that automatically discover, download, and transform documents from government websites and other data sources into structured markdown files.

## Workflow

### Phase 1: Website Exploration and Discovery

1. **Initial Analysis**:
   - Use WebFetch to explore the provided website URL
   - Identify document sections (downloads, archives, publications, meetings, etc.)
   - Look for API endpoints, RSS feeds, or structured data sources
   - Note pagination patterns and document organization

2. **License Verification**:
   - Search for license information (Creative Commons, Open Government License, etc.)
   - Look for terms of use or copyright notices
   - Check for explicit commercial use permissions
   - If unclear, ask user about license status

3. **Document Inventory**:
   - Identify document types (PDF, DOC, DOCX, etc.)
   - Understand the URL patterns for documents
   - Determine how to detect new documents
   - Note any metadata available (dates, categories, titles)

4. **User Confirmation**:
   - Present findings in a clear summary
   - Show example document URLs
   - Describe the discovered structure
   - Ask user to confirm this is the correct data source

### Phase 2: Generate Airflow Pipeline Assets

Create a complete, production-ready Airflow project structure:

```
airflow_pipelines/
├── dags/
│   └── [source_name]_etl_dag.py
├── operators/
│   ├── __init__.py
│   ├── document_scraper.py
│   └── document_converter.py
├── utils/
│   ├── __init__.py
│   ├── license_checker.py
│   └── file_manager.py
├── config/
│   └── [source_name]_config.yaml
├── requirements.txt
└── README.md
```

#### File Generation Requirements:

**1. DAG File** (`dags/[source_name]_etl_dag.py`):
- Daily schedule (adjustable)
- Clear task dependencies
- Error handling and retries
- Sensor for checking new documents
- Download task
- Conversion task to markdown
- File organization task
- Use Airflow best practices (XComs, task groups, dynamic task generation)

**2. Document Scraper** (`operators/document_scraper.py`):
- BeautifulSoup or Scrapy for web scraping
- Request handling with retries
- Respect robots.txt
- User-agent configuration
- Rate limiting
- Checksum/hash tracking to avoid re-downloading
- State management for incremental updates

**3. Document Converter** (`operators/document_converter.py`):
- Support for PDF, DOC, DOCX conversion to markdown
- Use libraries like pypandoc, pdfplumber, or python-docx
- Preserve document structure (headings, lists, tables)
- Extract metadata
- Handle encoding issues
- Clean and normalize output

**4. License Checker** (`utils/license_checker.py`):
- Validate license information
- Check for commercial use permission
- Log license status
- Skip non-compliant documents

**5. File Manager** (`utils/file_manager.py`):
- Create meaningful directory structure
- Organize by date, category, or document type
- Generate consistent filenames
- Handle duplicates
- Maintain index of processed documents

**6. Configuration** (`config/[source_name]_config.yaml`):

```yaml
source:
  name: "Source Name"
  url: "https://example.com"
  document_section: "/documents"

schedule:
  interval: "0 0 * * *"  # Daily at midnight

storage:
  base_path: "/data/documents"
  structure: "year/month/category"

scraping:
  rate_limit: 1  # requests per second
  user_agent: "ETL Pipeline Bot"
  retry_attempts: 3

conversion:
  format: "markdown"
  preserve_structure: true
  extract_metadata: true
```

**7. Requirements** (`requirements.txt`):

```
apache-airflow>=2.7.0
beautifulsoup4>=4.12.0
requests>=2.31.0
pypandoc>=1.12
pdfplumber>=0.10.0
python-docx>=1.0.0
pyyaml>=6.0
lxml>=4.9.0
```

**8. Documentation** (`README.md`):
- Pipeline overview
- Setup instructions
- Configuration guide
- Airflow connection requirements
- Monitoring and troubleshooting
- Example usage

### Phase 3: Implementation Notes

**Important Considerations**:
- Include comprehensive error handling
- Log all operations for debugging
- Add data quality checks
- Implement idempotency (safe to re-run)
- Use Airflow variables for sensitive config
- Add email/Slack alerts for failures
- Document the directory structure created

**Code Quality**:
- Follow PEP 8 style guidelines
- Add docstrings to all functions
- Include type hints
- Write modular, reusable code
- Add comments for complex logic

**Testing Recommendations** (optional):
- Suggest basic unit tests for utilities
- Recommend integration testing approach
- Provide example test cases

### Phase 4: Delivery

1. Generate all files using Write tool
2. Provide summary of created assets
3. Explain how to deploy to Airflow:
   - Copy files to Airflow home directory
   - Install requirements
   - Enable the DAG in Airflow UI
   - Configure connections if needed
4. Suggest next steps (testing, scheduling, monitoring)

## Examples

### Example 1: German Bundestag Documents

```
User: "Create an ETL pipeline for https://www.bundestag.de/digitales to collect committee meeting documents"

Skill Response:
- Explores the digital committee section
- Finds document sections (agendas, protocols, reports)
- Checks copyright notice
- Confirms findings with user
- Generates complete Airflow pipeline
- Creates scraper for committee documents
- Sets up markdown conversion
- Organizes by committee and date
```

### Example 2: EU Open Data Portal

```
User: "Build an Airflow pipeline for EU legislation documents from data.europa.eu"

Skill Response:
- Discovers API endpoints
- Verifies open data license
- Generates API-based scraper
- Creates pipeline with API operators
- Includes rate limiting
- Organizes by document type and year
```

## Key Success Criteria

- Pipeline runs successfully in Airflow
- Documents are correctly downloaded
- Markdown conversion preserves structure
- File organization is logical and scalable
- License compliance is enforced
- New documents are detected automatically
- Pipeline is idempotent and fault-tolerant

## Tips for Users

- Provide the main URL of the data source
- Mention any specific document types needed
- Specify preferred organization structure
- Note any special requirements (date ranges, categories)
- Test with a small sample before full deployment
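
## Illustrative Sketch: Checksum Tracking and File Naming

The checksum tracking, processed-document index, and consistent filenames described for the Document Scraper and File Manager could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code the skill actually generates. The names `DocumentIndex` and `build_target_path`, the JSON index layout, and the `year/month/category` path scheme (borrowed from the sample configuration) are all assumptions made for the example.

```python
import hashlib
import json
import re
from datetime import date
from pathlib import Path


class DocumentIndex:
    """Hypothetical JSON-backed index mapping document URLs to SHA-256 digests."""

    def __init__(self, index_path: Path) -> None:
        self.index_path = index_path
        self.entries: dict[str, str] = {}
        # Load state from a previous run so repeated runs stay incremental.
        if index_path.exists():
            self.entries = json.loads(index_path.read_text(encoding="utf-8"))

    @staticmethod
    def checksum(content: bytes) -> str:
        """Return the hex SHA-256 digest of the downloaded bytes."""
        return hashlib.sha256(content).hexdigest()

    def is_new(self, url: str, content: bytes) -> bool:
        """True if the URL is unseen or its content changed since the last run."""
        return self.entries.get(url) != self.checksum(content)

    def record(self, url: str, content: bytes) -> None:
        """Store the digest and persist the index to disk."""
        self.entries[url] = self.checksum(content)
        self.index_path.parent.mkdir(parents=True, exist_ok=True)
        self.index_path.write_text(
            json.dumps(self.entries, indent=2, sort_keys=True), encoding="utf-8"
        )


def build_target_path(base_path: Path, published: date, category: str, title: str) -> Path:
    """Build a year/month/category/slug.md path (assumed layout) for a document."""
    # Lowercase the title and collapse anything non-alphanumeric into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return base_path / f"{published:%Y}" / f"{published:%m}" / category / f"{slug}.md"
```

In a pipeline, a download task might call `is_new` before converting a document and `record` only after the markdown file has been written, so that re-running a failed task does not skip or duplicate work. Keying the index by URL is one possible choice; keying by digest instead would also catch the same file published under two URLs.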
Related Skills
apache-airflow-orchestration
Complete guide for Apache Airflow orchestration including DAGs, operators, sensors, XComs, task dependencies, dynamic workflows, and production deployment
airflow-workflows
Apache Airflow DAG design, operators, and scheduling best practices.
airflow-expert
Expert-level Apache Airflow orchestration, DAGs, operators, sensors, XComs, task dependencies, and scheduling
airflow-dag
Create Apache Airflow DAGs for construction data pipelines. Orchestrate ETL, validation, and reporting workflows.
airflow-dag-patterns
Build production Apache Airflow DAGs with best practices for operators, sensors, testing, and deployment. Use when creating data pipelines, orchestrating workflows, or scheduling batch jobs.
airflow-3x-migration
Comprehensive guide and patterns for migrating Apache Airflow 2.x workflows to Airflow 3.x, covering import changes, deprecated features, and new paradigms like Asset scheduling and TaskFlow API.
ahu-airflow
Fan Selection & Airflow Analysis Agent
airflow
Python DAG workflow orchestration using Apache Airflow for data pipelines, ETL processes, and scheduled task automation
bgo
Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.
large-data-with-dask
Specific optimization strategies for Python scripts working with larger-than-memory datasets via Dask.
langsmith-fetch
Debug LangChain and LangGraph agents by fetching execution traces from LangSmith Studio. Use when debugging agent behavior, investigating errors, analyzing tool calls, checking memory operations, or examining agent performance. Automatically fetches recent traces and analyzes execution patterns. Requires langsmith-fetch CLI installed.
langchain-tool-calling
How chat models call tools - includes bind_tools, tool choice strategies, parallel tool calling, and tool message handling