web-scraping

Web scraping best practices for AI coding agents. Covers tmux session management for long-running scrapes, Crawl4AI integration, parallel pipeline orchestration, resume-friendly architecture, and rate limit handling. Use this skill when building scrapers, running data extraction jobs, or managing lead generation pipelines.

16 stars

by diegosouzapw


Best use case

web-scraping is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using web-scraping should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/web-scraping/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/development/web-scraping/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/web-scraping/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How web-scraping Compares

Feature / Agent	web-scraping	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

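What does this skill do?

It documents operational practices for web scrapers run by AI coding agents: tmux session management for long-running jobs, Crawl4AI integration, parallel pipeline orchestration, resume-friendly architecture, and rate limit handling.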

Where can I find the source code?

The source code is on GitHub in the diegosouzapw/awesome-omni-skill repository, at skills/development/web-scraping/SKILL.md.

SKILL.md Source

# Web Scraping Skill for AI Agents

This skill provides best practices and operational patterns for building and running web scrapers. It covers session management, parallel execution, error recovery, and integration with AI-powered extraction tools.

## When to Use This Skill

- **Building a web scraper** — structuring code for reliability and resumability
- **Running long scrapes** — protecting processes with tmux
- **Lead generation pipelines** — orchestrating multi-stage data extraction
- **Data enrichment** — email finding, validation, and waterfall patterns
- **AI-powered extraction** — using Crawl4AI or LLM-based parsing
- **Remote server scraping** — SSH + tmux for location-independent execution

## Core Principles

### 1. Always Protect Long-Running Processes

Any script that runs longer than 5 minutes MUST run inside a tmux session.

```bash
# Standard pattern — always follow this
tmux new -s descriptive-name
python my_scraper.py --resume
# Ctrl+B, D to detach
```

For AI agents executing scripts autonomously, use detached mode:

```bash
# Launch detached (agent can continue other work immediately)
tmux new -d -s job-name 'python my_scraper.py --resume'

# Check status without attaching
tmux capture-pane -t job-name -p | tail -10

# Check if session is still alive
tmux has-session -t job-name 2>/dev/null && echo "running" || echo "done"
```
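For an agent that drives tmux from Python rather than from a shell, the same three commands can be wrapped with `subprocess`. This is a minimal sketch, not part of the original skill: the helper names `launch_detached`, `tail_output`, and `is_running` are hypothetical, and the `run` parameter exists only so the tmux call can be swapped out in tests.

```python
import subprocess


def launch_detached(name, command, run=subprocess.run):
    """Start `command` in a detached tmux session called `name`."""
    # `new-session` is the long form of `tmux new`
    run(["tmux", "new-session", "-d", "-s", name, command], check=True)


def tail_output(name, lines=10, run=subprocess.run):
    """Return the last `lines` lines of the pane, trailing blank lines dropped."""
    result = run(
        ["tmux", "capture-pane", "-t", name, "-p"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.rstrip("\n").splitlines()[-lines:]


def is_running(name, run=subprocess.run):
    """True while the session exists; tmux exits non-zero once it is gone."""
    result = run(["tmux", "has-session", "-t", name],
                 capture_output=True, text=True)
    return result.returncode == 0
```

A call such as `launch_detached("job-name", "python my_scraper.py --resume")` mirrors the shell command above; because the session closes when the command exits, `is_running` turning false is the completion signal.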

### 2. Always Build Resume Support

Every scraper should survive interruption. Implement checkpointing:

```python
import json
import os

PROGRESS_FILE = "output/progress.json"

def load_progress():
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return json.load(f)
    return {"completed": [], "last_index": 0}

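def save_progress(progress):
    # Atomic write: write to a temp file, then swap it into place.
    # os.replace overwrites an existing file on every platform;
    # os.rename raises on Windows when the target already exists.
    os.makedirs(os.path.dirname(PROGRESS_FILE), exist_ok=True)
    tmp = PROGRESS_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(progress, f)
    os.replace(tmp, PROGRESS_FILE)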
```
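To show how these two helpers might be wired into a `--resume` run, here is a minimal sketch. The `run_scrape` function, its `handle` callback, the explicit `path` argument, and the `checkpoint_every` parameter are illustrative assumptions rather than part of the skill. The checkpoint stores the index of the next item to process, so an interrupted item is retried on the next run.

```python
import json
import os


def load_progress(path):
    """Return the saved checkpoint, or a fresh one if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"completed": [], "last_index": 0}


def save_progress(progress, path):
    """Write the checkpoint atomically: temp file first, then swap it in."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(progress, f)
    os.replace(tmp, path)


def run_scrape(items, handle, path, resume=True, checkpoint_every=1):
    """Process `items` in order, skipping everything a prior run finished."""
    fresh = {"completed": [], "last_index": 0}
    progress = load_progress(path) if resume else fresh
    for i in range(progress["last_index"], len(items)):
        handle(items[i])                    # may raise; file keeps the last save
        progress["completed"].append(items[i])
        progress["last_index"] = i + 1      # index of the NEXT item to process
        if (i + 1) % checkpoint_every == 0:
            save_progress(progress, path)
    save_progress(progress, path)
    return progress
```

A larger `checkpoint_every` reduces disk writes at the cost of redoing up to that many items after a crash.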

### 3. Always Log to Files

Never rely solely on terminal output:

```bash
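# Combine terminal output AND file logging.
# -u turns off Python's output buffering when piped;
# tee -a appends, so a resumed run keeps the earlier log lines.
python -u scraper.py --resume 2>&1 | tee -a output/scrape.log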
```

In Python, use dual logging:

```python
import logging
import os

os.makedirs("output", exist_ok=True)  # FileHandler fails if the directory is missing

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.FileHandler("output/scrape.log"),
        logging.StreamHandler(),  # Also print to terminal
    ],
)
```

### 4. Respect Rate Limits

Always add delays between requests. When scraping directories or registries:

```python
import asyncio
import logging
import random

# fetch(), process(), and save_progress() are assumed to be defined elsewhere.
async def scrape_with_rate_limit(urls, delay_range=(1, 3)):
    for i, url in enumerate(urls):
        result = await fetch(url)
        process(result)

        # Random delay to avoid detection
        delay = random.uniform(*delay_range)
        await asyncio.sleep(delay)

        # Checkpoint every 100 URLs; last_index is the next URL to fetch
        if (i + 1) % 100 == 0:
            logging.info(f"Progress: {i + 1}/{len(urls)}")
            save_progress({"last_index": i + 1})
```
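Fixed delays do not help once a server starts answering with HTTP 429 or 503. A common complement, sketched below as an assumption rather than something the skill prescribes, is exponential backoff with jitter that honors a numeric `Retry-After` header when one is present. The names `backoff_delay` and `fetch_with_backoff` are hypothetical, and `fetch` is a placeholder for any coroutine returning `(status, headers, body)`.

```python
import asyncio
import random


def backoff_delay(attempt, base=1.0, cap=60.0, retry_after=None):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Retry-After was an HTTP date; fall back to backoff
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)  # "full jitter": anywhere up to the ceiling


async def fetch_with_backoff(fetch, url, max_retries=5, sleep=asyncio.sleep):
    """Call `fetch(url)` until it stops returning 429/503 or retries run out."""
    for attempt in range(max_retries + 1):
        status, headers, body = await fetch(url)
        if status not in (429, 503):
            return status, body
        if attempt < max_retries:
            wait = backoff_delay(attempt, retry_after=headers.get("Retry-After"))
            await sleep(wait)
    return status, body  # still rate limited after the final attempt
```

The caller should treat a returned 429 or 503 as a failed URL and log it for a later retry rather than looping forever.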

### 5. Name Sessions Descriptively

```bash
# Bad — you'll forget what's running
tmux new

# Good — instantly identifiable
tmux new -s email-enrichment-batch-3
tmux new -s directory-scrape-q1-2026
tmux new -s validation-run-feb06
```

## Scraping Patterns

### Pattern 1: Simple Sequential Scrape

For straightforward data collection from a single source:

```bash
tmux new -s my-scrape
python scraper.py --source listings --resume --output data/results.csv
# Ctrl+B, D
```

### Pattern 2: Parallel Pipeline

For multi-stage pipelines where stages can overlap:

```bash
# Stage 1: Discovery (2 hours)
tmux new -d -s discovery 'python scrape_directory.py --resume'

# Stage 2: Enrichment (runs on completed records from stage 1)
tmux new -d -s enrich 'python enrich_contacts.py --watch-input data/raw.csv'

# Stage 3: Validation
tmux new -d -s validate 'python validate_emails.py --watch-input data/enriched.csv'

# Monitor all stages
tmux ls
```
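The pipeline above relies on later stages picking up rows while earlier stages are still writing. How `--watch-input` works is not shown in the skill, so the following is only one possible sketch: it remembers a byte offset and consumes only complete lines, so a half-written row at the end of the file is left for the next poll. `read_new_rows` is a hypothetical helper.

```python
import csv
import io


def read_new_rows(path, offset=0):
    """Return (rows, new_offset) for complete CSV lines appended after `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    # Keep only data up to the last newline; a trailing partial row waits.
    end = chunk.rfind(b"\n") + 1
    if end == 0:
        return [], offset
    text = chunk[:end].decode("utf-8")
    rows = list(csv.reader(io.StringIO(text, newline="")))
    return rows, offset + end
```

A watching stage would call this in a loop, sleeping between polls and persisting the offset in its own checkpoint. The sketch assumes no quoted field contains a newline; files with multi-line fields need a stricter completeness check.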

### Pattern 3: Multi-Source Aggregation

Scrape multiple sources in parallel, deduplicate later:

```bash
tmux new -d -s source-directory-a 'python scrape_source_a.py --resume'
tmux new -d -s source-directory-b 'python scrape_source_b.py --resume'
tmux new -d -s source-registry   'python scrape_registry.py --resume'

# When all finish, merge and deduplicate
python merge_sources.py --dedup --output data/combined.csv
```
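`merge_sources.py` is not shown in the skill. As a rough illustration of the merge-and-deduplicate step, the sketch below concatenates several CSV files and keeps the first row seen for each normalized key. The function name `merge_dedup` and the choice of `email` as the key column are assumptions.

```python
import csv


def merge_dedup(input_paths, output_path, key="email"):
    """Merge CSVs, keeping the first row per case-insensitive `key` value."""
    seen = set()
    merged = []
    fieldnames = []
    for path in input_paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)   # union of columns, order preserved
            for row in reader:
                k = (row.get(key) or "").strip().lower()
                if not k or k in seen:
                    continue                  # skip keyless rows and duplicates
                seen.add(k)
                merged.append(row)
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(merged)
    return len(merged)
```

Because the first occurrence wins, the order of `input_paths` decides which source's version of a record survives; list the most trusted source first.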

### Pattern 4: Crawl4AI Integration

For AI-powered extraction from JavaScript-rendered pages:

```python
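import asyncio
import os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def crawl_and_extract(urls, output_dir="output/pages"):
    os.makedirs(output_dir, exist_ok=True)
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,  # always fetch fresh, skip the local cache
        page_timeout=30000,           # milliseconds
    )
    async with AsyncWebCrawler() as crawler:
        for i, url in enumerate(urls):
            result = await crawler.arun(url=url, config=config)
            if result.success:
                with open(f"{output_dir}/{i}.md", "w", encoding="utf-8") as f:
                    f.write(str(result.markdown))
            print(f"[{i+1}/{len(urls)}] {url} - {'OK' if result.success else 'FAIL'}")
            await asyncio.sleep(2)  # fixed pause between pages

if __name__ == "__main__":
    # One URL per line; the path is hard-coded here for brevity
    with open("data/urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    asyncio.run(crawl_and_extract(urls))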
```

Always run Crawl4AI jobs in tmux:

```bash
tmux new -s crawl-sites
python crawl_extract.py --input data/urls.txt --resume
# Ctrl+B, D
```

### Pattern 5: Email Enrichment Waterfall

Try multiple methods in sequence, each operating on leftovers from the previous:

```bash
# Phase 1: Catch-all detection
tmux new -s enrich-catchall
python email_waterfall.py --phase catchall
# Ctrl+B, D

# Phase 2: API-based finders (after catchall completes)
tmux new -s enrich-finders
python email_waterfall.py --phase finders
# Ctrl+B, D

# Phase 3: Pattern permutation + validation
tmux new -s enrich-permutations
python email_waterfall.py --phase permutations
# Ctrl+B, D
```
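The waterfall logic itself, with each phase seeing only what the previous phases failed to resolve, can be sketched in a few lines. `email_waterfall.py` is not shown in the skill, so `run_waterfall` and the finder callables below are hypothetical stand-ins for the catch-all, API, and permutation phases. Records are assumed to be hashable, for example domain or contact ID strings.

```python
def run_waterfall(records, phases):
    """Run `phases` in order; each phase only sees records still unresolved.

    `phases` is a list of (name, finder) pairs, where finder(record) returns
    an email string or None. Returns (found, leftovers).
    """
    found = {}
    remaining = list(records)
    for name, finder in phases:
        leftovers = []
        for record in remaining:
            email = finder(record)
            if email:
                found[record] = (email, name)   # remember which phase succeeded
            else:
                leftovers.append(record)
        remaining = leftovers
    return found, remaining
```

Ordering the phases from cheapest to most expensive means paid API calls are only spent on records the free methods could not resolve.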

### Pattern 6: Friday Night Deploy

Launch a full pipeline before the weekend:

```bash
# Friday evening
tmux new -d -s weekend-pipeline 'python full_pipeline.py --source all --enrich --validate --export'

# Saturday (check from phone via SSH)
ssh work-machine
tmux capture-pane -t weekend-pipeline -p | tail -5
# "Processing record 8,432 / 12,257..."

# Monday morning
tmux attach -t weekend-pipeline
# "Pipeline complete. 12,257 records. 3,847 validated emails."
```

## tmux Quick Reference

### Session Management

```
tmux new -s NAME              Create named session
tmux new -d -s NAME 'CMD'     Create detached session running CMD
tmux attach -t NAME           Reattach to session
tmux ls                       List all sessions
tmux kill-session -t NAME     Kill a session
```

### Inside a tmux Session

```
Ctrl+B, D                     Detach (session keeps running)
Ctrl+B, [                     Copy/scroll mode (q to exit)
Ctrl+B, "                     Split pane into top and bottom
Ctrl+B, %                     Split pane into left and right
Ctrl+B, arrow keys            Switch between panes
Ctrl+B, c                     New window
Ctrl+B, n / p                 Next / previous window
```

### Remote Monitoring (Without Attaching)

```bash
# Read last N lines of output
tmux capture-pane -t NAME -p | tail -20

# Check if session is alive
tmux has-session -t NAME 2>/dev/null && echo "running"

# Send Ctrl+C for graceful shutdown
tmux send-keys -t NAME C-c

# Send arbitrary input
tmux send-keys -t NAME 'q' Enter
```

## Agent-Specific Guidelines

When an AI agent is running scraping tasks:

1. **Always use `tmux new -d`** — the `-d` flag starts detached so the agent can continue working
2. **Check back periodically** — use `tmux capture-pane -t NAME -p | tail -10` to read output without attaching
3. **Run independent stages in parallel** — each in its own tmux session
4. **Never run long processes in the foreground** — always wrap in tmux
5. **Verify completion** — check `tmux has-session` or look for output files before proceeding to the next stage
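
Guideline 5 can be made concrete with a small polling helper. This is a sketch under assumptions: `wait_for_stage` is a hypothetical name, and the `session_alive` and `sleep` parameters are injectable so the helper can be exercised without tmux. By default it shells out to `tmux has-session`.

```python
import os
import subprocess
import time


def tmux_session_alive(name):
    """True if `tmux has-session -t name` exits with status 0."""
    return subprocess.run(
        ["tmux", "has-session", "-t", name],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0


def wait_for_stage(name, output_path, poll_seconds=60, timeout=None,
                   session_alive=tmux_session_alive, sleep=time.sleep):
    """Block until the session ends, then report whether output exists.

    Returns True only if the session is gone AND `output_path` is a
    non-empty file; returns False on timeout or missing output.
    """
    waited = 0
    while session_alive(name):
        if timeout is not None and waited >= timeout:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    return os.path.isfile(output_path) and os.path.getsize(output_path) > 0
```

Checking both conditions matters: a session that has disappeared only says the process exited, not that it succeeded.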

## Scraper Architecture Checklist

When building a new scraper, ensure it has:

- [ ] `--resume` flag that loads progress from a checkpoint file
- [ ] Atomic checkpoint writes (write temp file, then rename)
- [ ] File-based logging (not just stdout)
- [ ] Configurable rate limiting (delay between requests)
- [ ] Graceful Ctrl+C handling (save progress before exit)
- [ ] Output to structured format (CSV or JSON)
- [ ] Error logging with failed URLs for retry
- [ ] User-agent rotation (if needed)
- [ ] Proxy support (if needed)
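
Two checklist items, graceful Ctrl+C handling and a retry list of failed URLs, tend to be skipped. The sketch below shows one way to combine them; `scrape_all`, `fetch`, and the file paths are placeholders, not part of the skill. `tmux send-keys -t NAME C-c` reaches the process as SIGINT, which Python surfaces as `KeyboardInterrupt`.

```python
import json
import os


def scrape_all(urls, fetch, progress_path, failed_path, start=0):
    """Fetch urls[start:], logging failures and checkpointing on Ctrl+C.

    Returns the index of the next URL to process.
    """
    i = start
    try:
        for i in range(start, len(urls)):
            try:
                fetch(urls[i])
            except Exception as exc:          # one bad URL must not stop the run
                with open(failed_path, "a", encoding="utf-8") as f:
                    f.write(f"{urls[i]}\t{exc}\n")
        i = len(urls)                         # loop finished: nothing left to do
    except KeyboardInterrupt:
        pass                                  # interrupted URL is retried on resume
    finally:
        # Atomic checkpoint write: temp file, then swap into place
        tmp = progress_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"last_index": i}, f)
        os.replace(tmp, progress_path)
    return i
```

`KeyboardInterrupt` is not a subclass of `Exception`, so the inner handler lets it through to the outer one, and the checkpoint records the interrupted URL as the next one to fetch.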

## Common Gotchas

| Problem | Solution |
|---------|----------|
| Session is still attached from another terminal | `tmux attach -t NAME -d` (detaches the other client) |
| Session gone after reboot | Sessions don't survive restarts — use `--resume` |
| SSH drops mid-scrape | This is why you use tmux — just SSH back and reattach |
| Copy text from tmux scroll | `Ctrl+B, [` → navigate → `Space` to start selection → `Enter` to copy → `Ctrl+B, ]` to paste (vi copy-mode keys; with the default emacs keys use `Ctrl+Space` and `Alt+W`) |
| Need to stop scraper gracefully | `tmux send-keys -t NAME C-c` from outside |

## Further Reading

- Full guide: [tmux for Web Scraping](https://topoffunnel.com/resources/tmux-web-scraping)
- Crawl4AI docs: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- tmux cheat sheet: `man tmux`
