web-scraper

Full pipeline for building web scraping systems with agent team collaboration. Use this skill for requests like 'build a web scraper', 'develop a crawler', 'collect site data', 'web crawling system', 'build a scraper', 'data collection automation', 'site parsing', 'web data extraction', etc. Also supports target-analysis-only mode for specific site analysis. Note: real-time streaming data processing (Kafka/Flink), browser automation testing (Selenium testing), and website performance monitoring are outside the scope of this skill.

495 stars

by revfactory

Best use case

web-scraper is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using web-scraper should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/web-scraper/SKILL.md --create-dirs "https://raw.githubusercontent.com/revfactory/harness-100/main/en/37-web-scraper/.claude/skills/web-scraper/skill.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/web-scraper/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How web-scraper Compares

Feature / Agent	web-scraper	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

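What does this skill do?

It coordinates a team of five agents that build a web scraping system end to end: target analysis, crawler design, parsing, storage, and monitoring.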

Where can I find the source code?

The source code is on GitHub in the revfactory/harness-100 repository, under en/37-web-scraper/.

SKILL.md Source

# Web Scraper — Web Scraping System Construction Pipeline

An agent team collaborates to deliver target analysis, crawler design, parsing, storage, and monitoring for web scraping systems.

## Execution Mode

**Agent Team** — Five agents communicate directly via SendMessage and perform cross-validation.

## Agent Composition

| Agent | File | Role | Type |
|-------|------|------|------|
| target-analyst | `.claude/agents/target-analyst.md` | Target site analysis, risk assessment | general-purpose |
| crawler-developer | `.claude/agents/crawler-developer.md` | Crawler architecture and implementation | general-purpose |
| parser-engineer | `.claude/agents/parser-engineer.md` | Parsing logic design and implementation | general-purpose |
| data-manager | `.claude/agents/data-manager.md` | Data storage, validation, export | general-purpose |
| monitor-operator | `.claude/agents/monitor-operator.md` | Monitoring, alerting, scheduling | general-purpose |

## Workflow

### Phase 1: Preparation (performed directly by the orchestrator)

1. Extract the following from user input:
    - **Target site URL**: Website to scrape
    - **Target data**: What data to extract
    - **Purpose**: Intended use of collected data (analysis/monitoring/archiving)
    - **Scale**: Expected data volume, collection frequency
    - **Constraints** (optional): Tech stack limitations, budget, legal requirements
2. Create the `_workspace/` directory at the project root
3. Organize the input and save it to `_workspace/00_input.md`
4. Create the `_workspace/src/` directory
5. If pre-existing files are available, copy them to `_workspace/` and skip the corresponding task
6. **Determine the execution mode** based on the scope of the request (see "Execution Modes by Request Scope" below)
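
Steps 2 to 4 above amount to a small piece of filesystem setup. The following is a minimal sketch, assuming the orchestrator can run Python; the helper name `prepare_workspace` and the `fields` dictionary layout are illustrative assumptions, not something the skill prescribes.

```python
from pathlib import Path


def prepare_workspace(project_root: str, fields: dict[str, str]) -> Path:
    """Create _workspace/ and _workspace/src/, then write 00_input.md.

    Hypothetical helper: the function name and the `fields` layout are
    assumptions made for illustration only.
    """
    workspace = Path(project_root) / "_workspace"
    # parents=True also creates _workspace/ itself; exist_ok keeps reruns safe.
    (workspace / "src").mkdir(parents=True, exist_ok=True)

    # One bullet per extracted field (target URL, target data, purpose, ...).
    lines = ["# Scraping request", ""]
    for name, value in fields.items():
        lines.append(f"- **{name}**: {value}")

    input_file = workspace / "00_input.md"
    input_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return input_file
```

Calling `prepare_workspace(".", {"Target site URL": "https://example.com", "Target data": "product prices"})` would create the directories and write the input summary without touching files that already exist elsewhere in the workspace.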

### Phase 2: Team Assembly and Execution

Assemble the team and assign tasks. Inter-task dependencies are as follows:

| Order | Task | Owner | Dependencies | Deliverable |
|-------|------|-------|-------------|-------------|
| 1 | Target site analysis | analyst | None | `_workspace/01_target_analysis.md` |
| 2a | Crawler design and implementation | crawler | Task 1 | `_workspace/02_crawler_design.md` + `_workspace/src/` |
| 2b | Parsing logic design and implementation | parser | Task 1 | `_workspace/03_parser_logic.md` + `_workspace/src/` |
| 3 | Data storage design | data-mgr | Task 2b | `_workspace/04_data_storage.md` + `_workspace/src/` |
| 4 | Monitoring configuration | monitor | Tasks 2a, 2b, 3 | `_workspace/05_monitor_config.md` + `_workspace/src/` |

Tasks 2a (crawler) and 2b (parser) run **in parallel** since both depend only on Task 1 (analysis).
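
The dependency table can be turned into an execution order mechanically. Below is a rough sketch using Python's standard `graphlib` module (Python 3.9+); the `DEPENDENCIES` mapping is transcribed from the table, while the function name `execution_batches` is an assumption for illustration. Grouping into batches is a simplification: in practice Task 3 could start as soon as Task 2b finishes, even if Task 2a is still running.

```python
from graphlib import TopologicalSorter

# Task id -> ids it depends on, transcribed from the dependency table.
DEPENDENCIES = {
    "1": set(),
    "2a": {"1"},
    "2b": {"1"},
    "3": {"2b"},
    "4": {"2a", "2b", "3"},
}


def execution_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; tasks within one batch can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Every task whose dependencies are all done is ready at this point.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(execution_batches(DEPENDENCIES))
# [['1'], ['2a', '2b'], ['3'], ['4']]
```

The second batch contains both `2a` and `2b`, which matches the parallel execution described above.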

**Inter-agent communication flow:**
- analyst, on completion, sends URL patterns, anti-bot info, and rate limits to crawler, and sends data points and DOM structure to parser
- crawler, on completion, sends the raw data format to parser and crawler health checkpoints to monitor
- parser, on completion, sends the data schema to data-mgr and parsing metrics to monitor
- data-mgr, on completion, sends data quality metrics to monitor
- monitor integrates all components to finalize the operations configuration

### Phase 3: Integration and Final Deliverables

1. Verify all files in `_workspace/` and `_workspace/src/`
2. Validate cross-deliverable consistency (analysis > crawler > parser > storage > monitoring)
3. Present the final summary and execution instructions to the user
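
Step 1 can be partly automated. The sketch below assumes full-pipeline mode and only checks that each design document listed in the Phase 2 table exists and is non-empty; the function name `missing_deliverables` is hypothetical, and the consistency validation in step 2 still requires reading the documents.

```python
from pathlib import Path

# Deliverable file names as listed in the Phase 2 task table.
EXPECTED_DELIVERABLES = [
    "01_target_analysis.md",
    "02_crawler_design.md",
    "03_parser_logic.md",
    "04_data_storage.md",
    "05_monitor_config.md",
]


def missing_deliverables(workspace: str) -> list[str]:
    """Return expected deliverables that are absent or empty in `workspace`."""
    root = Path(workspace)
    missing = []
    for name in EXPECTED_DELIVERABLES:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty list from `missing_deliverables("_workspace")` means all five documents are present; any names returned can be listed in the final summary.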

## Execution Modes by Request Scope

| User Request Pattern | Execution Mode | Agents Deployed |
|---------------------|---------------|----------------|
| "Build a full scraping system" | **Full pipeline** | All 5 agents |
| "Analyze target site only" | **Analysis mode** | target-analyst only |
| "Build crawler only" | **Crawler mode** | target-analyst + crawler-developer |
| "Design parser only" | **Parser mode** | target-analyst + parser-engineer |
| "Monitor existing scraper" | **Monitor mode** | monitor-operator only |

**Reusing existing files**: If the user provides existing analysis results or crawler code, copy to `_workspace/` and skip the corresponding agent.
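
The mode table and the reuse rule can be combined into a single lookup. This is a sketch only: the short mode keys (`"full"`, `"analysis"`, and so on) and the function name `agents_to_deploy` are assumptions, while the agent names and deliverable file names come from the tables in this document.

```python
# Mode key (hypothetical short name) -> agents deployed, per the table above.
AGENTS_BY_MODE = {
    "full": [
        "target-analyst",
        "crawler-developer",
        "parser-engineer",
        "data-manager",
        "monitor-operator",
    ],
    "analysis": ["target-analyst"],
    "crawler": ["target-analyst", "crawler-developer"],
    "parser": ["target-analyst", "parser-engineer"],
    "monitor": ["monitor-operator"],
}

# Agent -> design document it produces, per the Phase 2 task table.
DELIVERABLE_BY_AGENT = {
    "target-analyst": "01_target_analysis.md",
    "crawler-developer": "02_crawler_design.md",
    "parser-engineer": "03_parser_logic.md",
    "data-manager": "04_data_storage.md",
    "monitor-operator": "05_monitor_config.md",
}


def agents_to_deploy(mode: str, existing_files: set[str]) -> list[str]:
    """List agents for `mode`, skipping those whose deliverable already exists."""
    return [
        agent
        for agent in AGENTS_BY_MODE[mode]
        if DELIVERABLE_BY_AGENT[agent] not in existing_files
    ]


print(agents_to_deploy("crawler", {"01_target_analysis.md"}))
# ['crawler-developer']
```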

## Data Transfer Protocol

| Strategy | Method | Purpose |
|----------|--------|---------|
| File-based | `_workspace/` directory | Design documents |
| Code-based | `_workspace/src/` | Executable scraping code |
| Message-based | SendMessage | Key information transfer, feedback |

## Error Handling

| Error Type | Strategy |
|-----------|----------|
| Target site inaccessible | Analyze via cached/archived versions; explore alternative URLs |
| robots.txt blocks all crawling | Check for public API; propose API-based approach |
| Anti-bot blocks all requests | Escalate difficulty; propose headless browser or API alternatives |
| Dynamic rendering failure | Switch to Playwright; increase timeouts |
| Agent failure | Retry once; if still failing, proceed without that deliverable |
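
To decide whether the "robots.txt blocks all crawling" row applies, the rules can be evaluated with Python's standard `urllib.robotparser`. The sketch below assumes the robots.txt body has already been downloaded as text; the helper name `crawl_allowed`, the user agent string, and the example URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser


def crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Two hypothetical robots.txt bodies for illustration.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
PARTIAL = "User-agent: *\nDisallow: /admin/\n"

print(crawl_allowed(BLOCK_ALL, "my-scraper", "https://example.com/products"))  # False
print(crawl_allowed(PARTIAL, "my-scraper", "https://example.com/products"))    # True
```

When the check returns `False` for every target URL, the strategy in the table applies: look for a public API instead of crawling.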

## Test Scenarios

### Normal Flow
**Prompt**: "Build a scraper to collect product prices from this e-commerce site daily"
**Expected result**:
- Analysis: Site structure, pagination, anti-bot mechanisms, robots.txt compliance plan
- Crawler: Async httpx-based crawler with rate limiting and retry logic
- Parser: CSS selector-based price/title/URL extraction with validation
- Storage: SQLite with upsert deduplication, CSV daily export
- Monitoring: Cron schedule, parsing success rate alerts, site change detection
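
As an illustration of the storage bullet, here is one possible shape for "SQLite with upsert deduplication, CSV daily export" using only the standard library. The table name, column names, and function names are assumptions, and the `ON CONFLICT` upsert syntax requires SQLite 3.24 or newer; the code the data-manager agent actually generates may differ.

```python
import csv
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) products table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS products (
               url TEXT PRIMARY KEY,
               title TEXT NOT NULL,
               price REAL NOT NULL,
               scraped_on TEXT NOT NULL
           )"""
    )
    return conn


def upsert_product(conn, url: str, title: str, price: float, scraped_on: str) -> None:
    """Insert a product, or update it in place if the URL is already stored."""
    conn.execute(
        """INSERT INTO products (url, title, price, scraped_on)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(url) DO UPDATE SET
               title = excluded.title,
               price = excluded.price,
               scraped_on = excluded.scraped_on""",
        (url, title, price, scraped_on),
    )
    conn.commit()


def export_csv(conn, scraped_on: str, out_path: str) -> int:
    """Write all rows scraped on the given date to a CSV file; return row count."""
    rows = conn.execute(
        "SELECT url, title, price, scraped_on FROM products "
        "WHERE scraped_on = ? ORDER BY url",
        (scraped_on,),
    ).fetchall()
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "title", "price", "scraped_on"])
        writer.writerows(rows)
    return len(rows)
```

Using the product URL as the primary key means a daily re-scrape overwrites the previous price rather than adding a duplicate row. If price history matters, a composite key of URL and date would be the alternative design.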

### Analysis-Only Flow
**Prompt**: "Analyze whether this site can be scraped"
**Expected result**:
- target-analyst performs full analysis and risk assessment
- Other agents are not deployed

### Error Flow
**Prompt**: "Scrape data from this SPA with Cloudflare protection"
**Expected result**:
- target-analyst identifies Cloudflare challenge and SPA rendering
- crawler-developer uses Playwright with appropriate wait strategies
- parser-engineer handles dynamic DOM with robust selectors
- monitor-operator sets up change detection for frequently updated selectors
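
One coarse way to approximate the change detection mentioned in the last bullet is to fingerprint the page's tag and class structure and compare fingerprints between runs. This is a standard-library heuristic sketched for illustration, not the method used by the selector-generator skill; the function names are hypothetical, and it assumes the rendered HTML is already available as a string.

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collect 'tag.class1.class2' tokens for every start tag, ignoring text."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: list[str] = []

    def handle_starttag(self, tag, attrs):
        # A class attribute without a value is reported as None by the parser.
        classes = dict(attrs).get("class") or ""
        class_part = ".".join(sorted(classes.split()))
        self.tokens.append(f"{tag}.{class_part}")


def structure_fingerprint(html: str) -> str:
    """Hash the sequence of tags and classes, so text changes are ignored."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    joined = "|".join(collector.tokens)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def structure_changed(old_html: str, new_html: str) -> bool:
    """Return True when the tag/class structure differs between two snapshots."""
    return structure_fingerprint(old_html) != structure_fingerprint(new_html)
```

A price update leaves the fingerprint unchanged, while a renamed class (for example `price` becoming `cost`) changes it, which is the kind of event that would break a CSS selector.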


## Agent Extension Skills

| Skill | Path | Enhanced Agent | Role |
|-------|------|---------------|------|
| selector-generator | `.claude/skills/selector-generator/skill.md` | parser-engineer | CSS/XPath selector generation, robustness scoring, change detection |
| anti-bot-analyzer | `.claude/skills/anti-bot-analyzer/skill.md` | target-analyst, crawler-developer | Anti-bot defense layer analysis, rate limit detection, legal risk assessment |
