web-scraper

Web scraping inteligente multi-estrategia. Extrai dados estruturados de paginas web (tabelas, listas, precos). Paginacao, monitoramento e export CSV/JSON.

31,392 stars

bysickn33

View on GitHub Installation ↓

Best use case

web-scraper is best used when you need a repeatable AI agent workflow instead of a one-off prompt. It is especially useful for teams working in multi. Web scraping inteligente multi-estrategia. Extrai dados estruturados de paginas web (tabelas, listas, precos). Paginacao, monitoramento e export CSV/JSON.

Web scraping inteligente multi-estrategia. Extrai dados estruturados de paginas web (tabelas, listas, precos). Paginacao, monitoramento e export CSV/JSON.

Users should expect a more consistent workflow output, faster repeated execution, and less time spent rewriting prompts from scratch.

Practical example

Example input

Use the "web-scraper" skill to help with this workflow task. Context: Web scraping inteligente multi-estrategia. Extrai dados estruturados de paginas web (tabelas, listas, precos). Paginacao, monitoramento e export CSV/JSON.

Example output

A structured workflow result with clearer steps, more consistent formatting, and an output that is easier to reuse in the next run.

When to use this skill

Use this skill when you want a reusable workflow rather than writing the same prompt again and again.

When not to use this skill

Do not use this when you only need a one-off answer and do not need a reusable workflow.
Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/web-scraper/SKILL.md --create-dirs "https://raw.githubusercontent.com/sickn33/antigravity-awesome-skills/main/plugins/antigravity-awesome-skills-claude/skills/web-scraper/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/web-scraper/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How web-scraper Compares

Feature / Agent	web-scraper	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Web scraping inteligente multi-estrategia. Extrai dados estruturados de paginas web (tabelas, listas, precos). Paginacao, monitoramento e export CSV/JSON.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

AI Agents for Marketing

Discover AI agents for marketing workflows, from SEO and content production to campaign research, outreach, and analytics.

AI Agents for Startups

Explore AI agent skills for startup validation, product research, growth experiments, documentation, and fast execution with small teams.

AI Agents for Coding

Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.

SKILL.md Source

# Web Scraper

## Overview

Web scraping inteligente multi-estrategia. Extrai dados estruturados de paginas web (tabelas, listas, precos). Paginacao, monitoramento e export CSV/JSON.

## When to Use This Skill

- When the user mentions "scraper" or related topics
- When the user mentions "scraping" or related topics
- When the user mentions "extrair dados web" or related topics
- When the user mentions "web scraping" or related topics
- When the user mentions "raspar dados" or related topics
- When the user mentions "coletar dados site" or related topics

## Do Not Use This Skill When

- The task is unrelated to web scraper
- A simpler, more specific tool can handle the request
- The user needs general-purpose assistance without domain expertise

## How It Works

Execute phases in strict order. Each phase feeds the next.

```
1. CLARIFY  ->  2. RECON  ->  3. STRATEGY  ->  4. EXTRACT  ->  5. TRANSFORM  ->  6. VALIDATE  ->  7. FORMAT
```

Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.

**Fast path**: If user provides URL + clear data target + the request is simple
(single page, one data type), compress Phases 1-3 into a single action:
fetch, classify, and extract in one WebFetch call. Still validate and format.

---

## Capabilities

- **Multi-strategy**: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
- **Extraction modes**: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
- **Output formats**: Markdown tables (default), JSON, CSV
- **Pagination**: auto-detect and follow (page numbers, infinite scroll, load-more)
- **Multi-URL**: extract same structure across sources with comparison and diff
- **Validation**: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
- **Auto-escalation**: WebFetch fails silently -> automatic Browser fallback
- **Data transforms**: cleaning, normalization, deduplication, enrichment
- **Differential mode**: detect changes between scraping runs

## Web Scraper

Multi-strategy web data extraction with intelligent approach selection,
automatic fallback escalation, data transformation, and structured output.

## Phase 1: Clarify

Establish extraction parameters before touching any URL.

## Required Parameters

| Parameter     | Resolve                              | Default        |
|:--------------|:-------------------------------------|:---------------|
| Target URL(s) | Which page(s) to scrape?             | *(required)*   |
| Data Target   | What specific data to extract?       | *(required)*   |
| Output Format | Markdown table, JSON, CSV, or text?  | Markdown table |
| Scope         | Single page, paginated, or multi-URL?| Single page    |

## Optional Parameters

| Parameter     | Resolve                                | Default      |
|:--------------|:---------------------------------------|:-------------|
| Pagination    | Follow pagination? Max pages?          | No, 1 page   |
| Max Items     | Maximum number of items to collect?    | Unlimited    |
| Filters       | Data to exclude or include?            | None         |
| Sort Order    | How to sort results?                   | Source order  |
| Save Path     | Save to file? Which path?              | Display only |
| Language      | Respond in which language?             | User's lang  |
| Diff Mode     | Compare with previous run?             | No           |

## Clarification Rules

- If user provides a URL and clear data target, proceed directly to Phase 2.
  Do NOT ask unnecessary questions.
- If request is ambiguous (e.g. "scrape this site"), ask ONLY:
  "What specific data do you want me to extract from this page?"
- Default to Markdown table output. Mention alternatives only if relevant.
- Accept requests in any language. Always respond in the user's language.
- If user says "everything" or "all data", perform recon first, then present
  what's available and let user choose.

## Discovery Mode

When user has a topic but no specific URL:
1. Use WebSearch to find the most relevant pages
2. Present top 3-5 URLs with descriptions
3. Let user choose which to scrape, or scrape all
4. Proceed to Phase 2 with selected URL(s)

Example: "find and extract pricing data for CRM tools"
-> WebSearch("CRM tools pricing comparison 2026")
-> Present top results -> User selects -> Extract

---

## Phase 2: Reconnaissance

Analyze the target page before extraction.

## Step 2.1: Initial Fetch

Use WebFetch to retrieve and analyze the page structure:

```
WebFetch(
  url = TARGET_URL,
  prompt = "Analyze this page structure and report:
    1. Page type: article, product listing, search results, data table,
       directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
    2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
       accordion/collapsible sections, tabs
    3. Approximate number of distinct data items visible
    4. JavaScript rendering indicators: empty containers, loading spinners,
       SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
    5. Pagination: next/prev links, page numbers, load-more buttons,
       infinite scroll indicators, total results count
    6. Data density: how much structured, extractable data exists
    7. List the main data fields/columns available for extraction
    8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
    9. Available download links: CSV, Excel, PDF, API endpoints"
)
```

## Step 2.2: Evaluate Fetch Quality

| Signal                                      | Interpretation                    | Action                    |
|:--------------------------------------------|:----------------------------------|:--------------------------|
| Rich content with data clearly visible      | Static page                       | Strategy A (WebFetch)     |
| Empty containers, "loading...", minimal text | JS-rendered                       | Strategy B (Browser)      |
| Login wall, CAPTCHA, 403/401 response       | Blocked                           | Report to user            |
| Content present but poorly structured       | Needs precision                   | Strategy B (Browser)      |
| JSON or XML response body                   | API endpoint                      | Strategy C (Bash/curl)    |
| Download links for CSV/Excel available      | Direct data file                  | Strategy C (download)     |

## Step 2.3: Content Classification

Classify into an extraction mode:

| Mode       | Indicators                                 | Examples                          |
|:-----------|:-------------------------------------------|:----------------------------------|
| `table`    | HTML `<table>`, grid layout with headers   | Price comparison, statistics, specs|
| `list`     | Repeated similar elements, card grids      | Search results, product listings  |
| `article`  | Long-form text with headings/paragraphs    | Blog post, news article, docs     |
| `product`  | Product name, price, specs, images, rating | E-commerce product page           |
| `contact`  | Names, emails, phones, addresses, roles    | Team page, staff directory        |
| `faq`      | Question-answer pairs, accordions          | FAQ page, help center             |
| `pricing`  | Plan names, prices, features, tiers        | SaaS pricing page                 |
| `events`   | Dates, locations, titles, descriptions     | Event listings, conferences       |
| `jobs`     | Titles, companies, locations, salaries     | Job boards, career pages          |
| `custom`   | User specified CSS selectors or fields     | Anything not matching above       |

Record: **page type**, **extraction mode**, **JS rendering needed (yes/no)**,
**available fields**, **structured data present (JSON-LD etc.)**.

If user asked for "everything", present the available fields and let them choose.

---

## Phase 3: Strategy Selection

Choose the extraction approach based on recon results.

## Decision Tree

```
Structured data (JSON-LD, microdata) has what we need?
 |
 +-- YES --> STRATEGY E: Extract structured data directly
 |
 +-- NO: Content fully visible in WebFetch?
      |
      +-- YES: Need precise element targeting?
      |    |
      |    +-- NO  --> STRATEGY A: WebFetch + AI extraction
      |    +-- YES --> STRATEGY B: Browser automation
      |
      +-- NO: JavaScript rendering detected?
           |
           +-- YES --> STRATEGY B: Browser automation
           +-- NO:  API/JSON/XML endpoint or download link?
                |
                +-- YES --> STRATEGY C: Bash (curl + jq)
                +-- NO  --> Report access issue to user
```

## Strategy A: Webfetch With Ai Extraction

**Best for**: Static pages, articles, simple tables, well-structured HTML.

Use WebFetch with a targeted extraction prompt tailored to the mode:

```
WebFetch(
  url = URL,
  prompt = "Extract [DATA_TARGET] from this page.
    Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
    Rules:
    - If a value is missing or unclear, use 'N/A'
    - Do not include navigation, ads, footers, or unrelated content
    - Preserve original values exactly (numbers, currencies, dates)
    - Include ALL matching items, not just the first few
    - For each item, also extract the URL/link if available"
)
```

**Auto-escalation**: If WebFetch returns suspiciously few items (less than
50% of expected from recon), or mostly empty fields, automatically escalate
to Strategy B without asking user. Log the escalation in notes.

## Strategy B: Browser Automation

**Best for**: JS-rendered pages, SPAs, interactive content, lazy-loaded data.

Sequence:
1. Get tab context: `tabs_context_mcp(createIfEmpty=true)` -> get tabId
2. Navigate to URL: `navigate(url=TARGET_URL, tabId=TAB)`
3. Wait for content to load: `computer(action="wait", duration=3, tabId=TAB)`
4. Check for cookie/consent banners: `find(query="cookie consent or accept button", tabId=TAB)`
   - If found, dismiss it (prefer privacy-preserving option)
5. Read page structure: `read_page(tabId=TAB)` or `get_page_text(tabId=TAB)`
6. Locate target elements: `find(query="[DESCRIPTION]", tabId=TAB)`
7. Extract with JavaScript for precise data via `javascript_tool`

```javascript
// Table extraction
const rows = document.querySelectorAll('TABLE_SELECTOR tr');
const data = Array.from(rows).map(row => {
  const cells = row.querySelectorAll('td, th');
  return Array.from(cells).map(c => c.textContent.trim());
});
JSON.stringify(data);
```

```javascript
// List/card extraction
const items = document.querySelectorAll('ITEM_SELECTOR');
const data = Array.from(items).map(item => ({
  field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,
  field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,
  link: item.querySelector('a')?.href || null,
}));
JSON.stringify(data);
```

8. For lazy-loaded content, scroll and re-extract:
   `computer(action="scroll", scroll_direction="down", tabId=TAB)`
   then `computer(action="wait", duration=2, tabId=TAB)`

## Strategy C: Bash (Curl + Jq)

**Best for**: REST APIs, JSON endpoints, XML feeds, CSV/Excel downloads.

```bash

## Json Api

curl -s "API_URL" | jq '[.items[] | {field1: .key1, field2: .key2}]'

## Csv Download

curl -s "CSV_URL" -o /tmp/scraped_data.csv

## Xml Parsing

curl -s "XML_URL" | python3 -c "
import xml.etree.ElementTree as ET, json, sys
tree = ET.parse(sys.stdin)

## ... Parse And Output Json

"
```

## Strategy D: Hybrid

When a single strategy is insufficient, combine:
1. WebSearch to discover relevant URLs
2. WebFetch for initial content assessment
3. Browser automation for JS-heavy sections
4. Bash for post-processing (jq, python for data cleaning)

## Strategy E: Structured Data Extraction

When JSON-LD, microdata, or OpenGraph is present:
1. Use Browser `javascript_tool` to extract structured data:
```javascript
const scripts = document.querySelectorAll('script[type="application/ld+json"]');
const data = Array.from(scripts).map(s => {
  try { return JSON.parse(s.textContent); } catch { return null; }
}).filter(Boolean);
JSON.stringify(data);
```
2. This often provides cleaner, more reliable data than DOM scraping
3. Fall back to DOM extraction only for fields not in structured data

## Pagination Handling

When pagination is detected and user wants multiple pages:

**Page-number pagination (any strategy):**
1. Extract data from current page
2. Identify URL pattern (e.g. `?page=N`, `/page/N`, `&offset=N`)
3. Iterate through pages up to user's max (default: 5 pages)
4. Show progress: "Extracting page 2/5..."
5. Concatenate all results, deduplicate if needed

**Infinite scroll (Browser only):**
1. Extract currently visible data
2. Record item count
3. Scroll down: `computer(action="scroll", scroll_direction="down", tabId=TAB)`
4. Wait: `computer(action="wait", duration=2, tabId=TAB)`
5. Extract newly loaded data
6. Compare count - if no new items after 2 scrolls, stop
7. Repeat until no new content or max iterations (default: 5)

**"Load More" button (Browser only):**
1. Extract currently visible data
2. Find button: `find(query="load more button", tabId=TAB)`
3. Click it: `computer(action="left_click", ref=REF, tabId=TAB)`
4. Wait and extract new content
5. Repeat until button disappears or max iterations reached

---

## Phase 4: Extract

Execute the selected strategy using mode-specific patterns.
See [references/extraction-patterns.md](references/extraction-patterns.md)
for CSS selectors and JavaScript snippets.

## Table Mode

WebFetch prompt:
```
"Extract ALL rows from the table(s) on this page.
Return as a markdown table with exact column headers.
Include every row - do not truncate or summarize.
Preserve numeric precision, currencies, and units."
```

## List Mode

WebFetch prompt:
```
"Extract each [ITEM_TYPE] from this page.
For each item, extract: [FIELD_LIST].
Return as a JSON array of objects with these keys: [KEY_LIST].
Include ALL items, not just the first few. Include link/URL for each item if available."
```

## Article Mode

WebFetch prompt:
```
"Extract article metadata:
- title, author, date, tags/categories, word count estimate
- Key factual data points, statistics, and named entities
Return as structured markdown. Summarize the content; do not reproduce full text."
```

## Product Mode

WebFetch prompt:
```
"Extract product data with these exact fields:
- name, brand, price, currency, originalPrice (if discounted),
  availability, description (first 200 chars), rating, reviewCount,
  specifications (as key-value pairs), productUrl, imageUrl
Return as JSON. Use null for missing fields."
```

Also check for JSON-LD `Product` schema (Strategy E) first.

## Contact Mode

WebFetch prompt:
```
"Extract contact information for each person/entity:
- name, title, role, email, phone, address, organization, website, linkedinUrl
Return as a markdown table. Only extract real contacts visible on the page."
```

## Faq Mode

WebFetch prompt:
```
"Extract all question-answer pairs from this page.
For each FAQ item extract:
- question: the exact question text
- answer: the answer text (first 300 chars if long)
- category: the section/category if grouped
Return as a JSON array of objects."
```

## Pricing Mode

WebFetch prompt:
```
"Extract all pricing plans/tiers from this page.
For each plan extract:
- planName, monthlyPrice, annualPrice, currency
- features (array of included features)
- limitations (array of limits or excluded features)
- ctaText (call-to-action button text)
- highlighted (true if marked as recommended/popular)
Return as JSON. Use null for missing fields."
```

## Events Mode

WebFetch prompt:
```
"Extract all events/sessions from this page.
For each event extract:
- title, date, time, endTime, location, description (first 200 chars)
- speakers (array of names), category, registrationUrl
Return as JSON. Use null for missing fields."
```

## Jobs Mode

WebFetch prompt:
```
"Extract all job listings from this page.
For each job extract:
- title, company, location, salary, salaryRange, type (full-time/part-time/contract)
- postedDate, description (first 200 chars), applyUrl, tags
Return as JSON. Use null for missing fields."
```

## Custom Mode

When user provides specific selectors or field descriptions:
- Use Browser automation with `javascript_tool` and user's CSS selectors
- Or use WebFetch with a prompt built from user's field descriptions
- Always confirm extracted schema with user before proceeding to multi-URL

## Multi-Url Extraction

When extracting from multiple URLs:
1. Extract from the **first URL** to establish the data schema
2. Show user the first results and confirm the schema is correct
3. Extract from remaining URLs using the same schema
4. Add a `source` column/field to every record with the origin URL
5. Combine all results into a single output
6. Show progress: "Extracting 3/7 URLs..."

---

## Phase 5: Transform

Clean, normalize, and enrich extracted data before validation.
See [references/data-transforms.md](references/data-transforms.md) for patterns.

## Automatic Transforms (Always Apply)

| Transform              | Action                                               |
|:-----------------------|:-----------------------------------------------------|
| Whitespace cleanup     | Trim, collapse multiple spaces, remove `\n` in cells |
| HTML entity decode     | `&amp;` -> `&`, `&lt;` -> `<`, `&#39;` -> `'`       |
| Unicode normalization  | NFKC normalization for consistent characters          |
| Empty string to null   | `""` -> `null` (for JSON), `""` -> `N/A` (for tables)|

## Conditional Transforms (Apply When Relevant)

| Transform             | When                         | Action                                  |
|:----------------------|:-----------------------------|:----------------------------------------|
| Price normalization   | Product/pricing modes        | Extract numeric value + currency symbol |
| Date normalization    | Any dates found              | Normalize to ISO-8601 (YYYY-MM-DD)      |
| URL resolution        | Relative URLs extracted      | Convert to absolute URLs                |
| Phone normalization   | Contact mode                 | Standardize to E.164 format if possible |
| Deduplication         | Multi-page or multi-URL      | Remove exact duplicate rows             |
| Sorting               | User requested or natural    | Sort by user-specified field            |

## Data Enrichment (Only When Useful)

| Enrichment             | When                         | Action                                |
|:-----------------------|:-----------------------------|:--------------------------------------|
| Currency conversion    | User asks for single currency| Note original + convert (approximate) |
| Domain extraction      | URLs in data                 | Add domain column from full URLs      |
| Word count             | Article mode                 | Count words in extracted text         |
| Relative dates         | Dates present                | Add "X days ago" column if useful     |

## Deduplication Strategy

When combining data from multiple pages or URLs:
1. Exact match: rows with identical values in all fields -> keep first
2. Near match: rows with same key fields (name+source) but different details
   -> keep most complete (fewer nulls), flag in notes
3. Report: "Removed N duplicate rows" in delivery notes

---

## Phase 6: Validate

Verify extraction quality before delivering results.

## Validation Checks

| Check                | Action                                              |
|:---------------------|:----------------------------------------------------|
| Item count           | Compare extracted count to expected count from recon |
| Empty fields         | Count N/A or null values per field                   |
| Data type consistency| Numbers should be numeric, dates parseable           |
| Duplicates           | Flag exact duplicate rows (post-dedup)               |
| Encoding             | Check for HTML entities, garbled characters           |
| Completeness         | All user-requested fields present in output          |
| Truncation           | Verify data wasn't cut off (check last items)        |
| Outliers             | Flag values that seem anomalous (e.g. $0.00 price)  |

## Confidence Rating

Assign to every extraction:

| Rating     | Criteria                                                        |
|:-----------|:----------------------------------------------------------------|
| **HIGH**   | All fields populated, count matches expected, no anomalies      |
| **MEDIUM** | Minor gaps (<10% empty fields) or count slightly differs        |
| **LOW**    | Significant gaps (>10% empty), structural issues, partial data  |

Always report confidence with specifics:
> Confidence: **HIGH** - 47 items extracted, all 6 fields populated,
> matches expected count from page analysis.

## Auto-Recovery (Try Before Reporting Issues)

| Issue              | Auto-Recovery Action                                  |
|:-------------------|:------------------------------------------------------|
| Missing data       | Re-attempt with Browser if WebFetch was used          |
| Encoding problems  | Apply HTML entity decode + unicode normalization      |
| Incomplete results | Check for pagination or lazy-loading, fetch more      |
| Count mismatch     | Scroll/paginate to find remaining items               |
| All fields empty   | Page likely JS-rendered, switch to Browser strategy   |
| Partial fields     | Try JSON-LD extraction as supplement                  |

Log all recovery attempts in delivery notes.
Inform user of any irrecoverable gaps with specific details.

---

## Phase 7: Format And Deliver

Structure results according to user preference.
See [references/output-templates.md](references/output-templates.md)
for complete formatting templates.

## Delivery Envelope

ALWAYS wrap results with this metadata header:

```markdown

## Extraction Results

**Source:** [Page Title](http://example.com)
**Date:** YYYY-MM-DD HH:MM UTC
**Items:** N records (M fields each)
**Confidence:** HIGH | MEDIUM | LOW
**Strategy:** A (WebFetch) | B (Browser) | C (API) | E (Structured Data)
**Format:** Markdown Table | JSON | CSV

---

[DATA HERE]

---

**Notes:**
- [Any gaps, issues, or observations]
- [Transforms applied: deduplication, normalization, etc.]
- [Pages scraped if paginated: "Pages 1-5 of 12"]
- [Auto-escalation if it occurred: "Escalated from WebFetch to Browser"]
```

## Markdown Table Rules

- Left-align text columns (`:---`), right-align numbers (`---:`)
- Consistent column widths for readability
- Include summary row for numeric data when useful (totals, averages)
- Maximum 10 columns per table; split wider data into multiple tables
  or suggest JSON format
- Truncate long cell values to 60 chars with `...` indicator
- Use `N/A` for missing values, never leave cells empty
- For multi-page results, show combined table (not per-page)

## Json Rules

- Use camelCase for keys (e.g. `productName`, `unitPrice`)
- Wrap in metadata envelope:
  ```json
  {
    "metadata": {
      "source": "URL",
      "title": "Page Title",
      "extractedAt": "ISO-8601",
      "itemCount": 47,
      "fieldCount": 6,
      "confidence": "HIGH",
      "strategy": "A",
      "transforms": ["deduplication", "priceNormalization"],
      "notes": []
    },
    "data": [ ... ]
  }
  ```
- Pretty-print with 2-space indentation
- Numbers as numbers (not strings), booleans as booleans
- null for missing values (not empty strings)

## Csv Rules

- First row is always headers
- Quote any field containing commas, quotes, or newlines
- UTF-8 encoding with BOM for Excel compatibility
- Use `,` as delimiter (standard)
- Include metadata as comments: `# Source: URL`

## File Output

When user requests file save:
- Markdown: `.md` extension
- JSON: `.json` extension
- CSV: `.csv` extension
- Confirm path before writing
- Report full file path and item count after saving

## Multi-Url Comparison Format

When comparing data across multiple sources:
- Add `Source` as the first column/field
- Use short identifiers for sources (domain name or user label)
- Group by source or interleave based on user preference
- Highlight differences if user asks for comparison
- Include summary: "Best price: $X at store-b.com"

## Differential Output

When user requests change detection (diff mode):
- Compare current extraction with previous run
- Mark new items with `[NEW]`
- Mark removed items with `[REMOVED]`
- Mark changed values with `[WAS: old_value]`
- Include summary: "Changes since last run: +5 new, -2 removed, 3 modified"

---

## Rate Limiting

- Maximum 1 request per 2 seconds for sequential page fetches
- For multi-URL jobs, process sequentially with pauses
- If a site returns 429 (Too Many Requests), stop and report to user

## Access Respect

- If a page blocks access (403, CAPTCHA, login wall), report to user
- Do NOT attempt to bypass bot detection, CAPTCHAs, or access controls
- Do NOT scrape behind authentication unless user explicitly provides access
- Respect robots.txt directives when known

## Copyright

- Do NOT reproduce large blocks of copyrighted article text
- For articles: extract factual data, statistics, and structured info;
  summarize narrative content
- Always include source attribution (http://example.com) in output

## Data Scope

- Extract ONLY what the user explicitly requested
- Warn user before collecting potentially sensitive data at scale
  (emails, phone numbers, personal information)
- Do not store or transmit extracted data beyond what the user sees

## Failure Protocol

When extraction fails or is blocked:
1. Explain the specific reason (JS rendering, bot detection, login, etc.)
2. Suggest alternatives (different URL, API if available, manual approach)
3. Never retry aggressively or escalate access attempts

---

## Quick Reference: Mode Cheat Sheet

| User Says...                         | Mode      | Strategy  | Output Default   |
|:-------------------------------------|:----------|:----------|:-----------------|
| "extract the table"                  | table     | A or B    | Markdown table   |
| "get all products/prices"            | product   | E then A  | Markdown table   |
| "scrape the listings"                | list      | A or B    | Markdown table   |
| "extract contact info / team page"   | contact   | A         | Markdown table   |
| "get the article data"               | article   | A         | Markdown text    |
| "extract the FAQ"                    | faq       | A or B    | JSON             |
| "get pricing plans"                  | pricing   | A or B    | Markdown table   |
| "scrape job listings"                | jobs      | A or B    | Markdown table   |
| "get event schedule"                 | events    | A or B    | Markdown table   |
| "find and extract [topic]"           | discovery | WebSearch | Markdown table   |
| "compare prices across sites"        | multi-URL | A or B    | Comparison table |
| "what changed since last time"       | diff      | any       | Diff format      |

---

## References

- **Extraction patterns**: [references/extraction-patterns.md](references/extraction-patterns.md)
  CSS selectors, JavaScript snippets, JSON-LD parsing, domain tips.

- **Output templates**: [references/output-templates.md](references/output-templates.md)
  Markdown, JSON, CSV templates with complete examples.

- **Data transforms**: [references/data-transforms.md](references/data-transforms.md)
  Cleaning, normalization, deduplication, enrichment patterns.

## Best Practices

- Provide clear, specific context about your project and requirements
- Review all suggestions before applying them to production code
- Combine with other complementary skills for comprehensive analysis

## Common Pitfalls

- Using this skill for tasks outside its domain expertise
- Applying recommendations without understanding your specific context
- Not providing enough project context for accurate analysis

## Limitations
- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.

Related Skills

x-twitter-scraper

31392

from sickn33/antigravity-awesome-skills

X (Twitter) data platform skill — tweet search, user lookup, follower extraction, engagement metrics, giveaway draws, monitoring, webhooks, 19 extraction tools, MCP server.

apify-ultimate-scraper

31392

from sickn33/antigravity-awesome-skills

AI-driven data extraction from 55+ Actors across all major platforms. This skill automatically selects the best Actor for your task.

nextjs-best-practices

31392

from sickn33/antigravity-awesome-skills

Next.js App Router principles. Server Components, data fetching, routing patterns.

network-101

31392

from sickn33/antigravity-awesome-skills

Configure and test common network services (HTTP, HTTPS, SNMP, SMB) for penetration testing lab environments. Enable hands-on practice with service enumeration, log analysis, and security testing against properly configured target systems.

neon-postgres

31392

from sickn33/antigravity-awesome-skills

Expert patterns for Neon serverless Postgres, branching, connection pooling, and Prisma/Drizzle integration

nanobanana-ppt-skills

31392

from sickn33/antigravity-awesome-skills

AI-powered PPT generation with document analysis and styled images

multi-agent-patterns

31392

from sickn33/antigravity-awesome-skills

This skill should be used when the user asks to "design multi-agent system", "implement supervisor pattern", "create swarm architecture", "coordinate multiple agents", or mentions multi-agent patterns, context isolation, agent handoffs, sub-agents, or parallel agent execution.

monorepo-management

31392

from sickn33/antigravity-awesome-skills

Build efficient, scalable monorepos that enable code sharing, consistent tooling, and atomic changes across multiple packages and applications.

monetization

31392

from sickn33/antigravity-awesome-skills

Estrategia e implementacao de monetizacao para produtos digitais - Stripe, subscriptions, pricing experiments, freemium, upgrade flows, churn prevention, revenue optimization e modelos de negocio SaaS.

modern-javascript-patterns

31392

from sickn33/antigravity-awesome-skills

Comprehensive guide for mastering modern JavaScript (ES6+) features, functional programming patterns, and best practices for writing clean, maintainable, and performant code.

microservices-patterns

31392

from sickn33/antigravity-awesome-skills

Master microservices architecture patterns including service boundaries, inter-service communication, data management, and resilience patterns for building distributed systems.

mcp-builder

31392

from sickn33/antigravity-awesome-skills

Create MCP (Model Context Protocol) servers that enable LLMs to interact with external services through well-designed tools. The quality of an MCP server is measured by how well it enables LLMs to accomplish real-world tasks.