web-scraper-skill
Use this skill to scrape, crawl, or extract data from websites using Apify or Firecrawl APIs. Trigger whenever the user wants to: scrape a URL, crawl a website, extract structured data from web pages, run an Apify Actor, batch scrape multiple URLs, search and scrape the web, map a site's URLs, collect product/price/review data, or build any web data pipeline. If the user says things like "scrape this site", "get data from this URL", "crawl this website", "run an Apify actor", "use Firecrawl", "extract content from a page", "pull data from the web", or mentions any web data extraction task — always use this skill. Also use it when the user wants to choose between Apify and Firecrawl.
Best use case
web-scraper-skill is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using web-scraper-skill can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/web-scraper-skill/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How web-scraper-skill Compares
| Feature / Agent | web-scraper-skill | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Marketing
Discover AI agents for marketing workflows, from SEO and content production to campaign research, outreach, and analytics.
AI Agents for Startups
Explore AI agent skills for startup validation, product research, growth experiments, documentation, and fast execution with small teams.
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
SKILL.md Source
# Web Scraper Skill (Apify + Firecrawl)
This skill helps Openclaw scrape and extract data from websites using two powerful APIs:
- **Firecrawl** — best for scraping individual pages, crawling entire sites, and getting LLM-ready content (markdown)
- **Apify** — best for specialized scrapers (social media, Google Maps, e-commerce, etc.) via pre-built Actors
---
## Quick Decision Guide: Apify vs Firecrawl
| Use Case | Recommended Tool |
|---|---|
| Scrape a single page into markdown/JSON | **Firecrawl** `/scrape` |
| Crawl an entire website (follow links) | **Firecrawl** `/crawl` |
| Map all URLs on a site | **Firecrawl** `/map` |
| Search web + scrape results | **Firecrawl** `/search` |
| Scrape Instagram / TikTok / Twitter | **Apify** (social actors) |
| Scrape Google Maps / reviews | **Apify** (compass/crawler-google-places) |
| Scrape Amazon products | **Apify** (apify/amazon-scraper) |
| Scrape Google Search results | **Apify** (apify/google-search-scraper) |
| Custom actor / any Apify Store actor | **Apify** |
---
## Authentication
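Both APIs require an API key. Firecrawl takes it in the `Authorization` header; Apify accepts it either in the `Authorization` header or as a `token` query parameter. Always ask the user for their key if it has not been provided.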
**Firecrawl:** `Authorization: Bearer fc-YOUR_API_KEY`
**Apify:** `Authorization: Bearer YOUR_APIFY_TOKEN` (or `?token=YOUR_TOKEN` in URL)
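
As a minimal sketch of the header formats above, the helpers below read the keys from environment variables rather than hardcoding them. The variable names `FIRECRAWL_API_KEY` and `APIFY_TOKEN` and the function names are assumptions made for illustration, not names defined by either API.

```python
import os


def firecrawl_headers(env=os.environ):
    """Build Firecrawl headers from FIRECRAWL_API_KEY (assumed variable name)."""
    key = env.get("FIRECRAWL_API_KEY")
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set; ask the user for a key")
    return {"Authorization": f"Bearer {key}", "Content-Type": "application/json"}


def apify_headers(env=os.environ):
    """Build Apify headers from APIFY_TOKEN (assumed variable name)."""
    token = env.get("APIFY_TOKEN")
    if not token:
        raise RuntimeError("APIFY_TOKEN is not set; ask the user for a token")
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
```

Passing a plain dict as `env` makes the helpers easy to exercise without touching the real environment.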
---
## Firecrawl API Reference
**Base URL:** `https://api.firecrawl.dev/v2`
### 1. Scrape a Single Page
```http
POST /v2/scrape
Authorization: Bearer fc-YOUR_API_KEY
Content-Type: application/json

{
  "url": "https://example.com",
  "formats": ["markdown"],   // Options: markdown, html, rawHtml, links, screenshot, json
  "onlyMainContent": true,   // Strips nav/footer/ads
  "waitFor": 0,              // ms to wait before scraping (for JS-heavy pages)
  "timeout": 30000,          // ms
  "blockAds": true,
  "proxy": "auto"            // "auto", "basic", or "stealth"
}
```
**Response:** `{ "success": true, "data": { "markdown": "...", "metadata": {...} } }`
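
The request above could be issued from Python roughly as follows. This is a sketch using only the standard library; the function names are hypothetical, and the `error` field read on failure is an assumption about the failure payload rather than something documented here.

```python
import json
import urllib.request

FIRECRAWL_BASE = "https://api.firecrawl.dev/v2"


def build_scrape_request(url, api_key, formats=("markdown",), only_main_content=True):
    """Build (but do not send) a POST /v2/scrape request."""
    payload = {
        "url": url,
        "formats": list(formats),
        "onlyMainContent": only_main_content,
    }
    return urllib.request.Request(
        f"{FIRECRAWL_BASE}/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


def extract_markdown(response_body):
    """Return data.markdown from a parsed scrape response, or raise on failure."""
    if not response_body.get("success"):
        # "error" is an assumed field name; fall back to a generic message.
        raise RuntimeError(f"scrape failed: {response_body.get('error', 'unknown error')}")
    return response_body["data"]["markdown"]


def scrape(url, api_key, timeout=60):
    """Send the request and return the page as markdown (performs network I/O)."""
    request = build_scrape_request(url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_markdown(json.load(resp))
```

Keeping request construction separate from the network call lets the payload be inspected or logged before any credits are spent.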
### 2. Crawl an Entire Website
Crawling is asynchronous: the request starts a job, and you then poll for results.
```http
POST /v2/crawl

{
  "url": "https://docs.example.com",
  "limit": 50,                 // Max pages
  "maxDiscoveryDepth": 3,      // v2 name; v1 called this maxDepth
  "allowExternalLinks": false,
  "scrapeOptions": {
    "formats": ["markdown"],
    "onlyMainContent": true
  }
}
```
**Response:** `{ "success": true, "id": "crawl-job-id" }`
**Poll status:**
```http
GET /v2/crawl/{crawl-job-id}
```
**Response:** `{ "status": "completed", "total": 50, "data": [...] }`
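
The start-then-poll pattern can be sketched as a small loop. The helper below is illustrative only: `fetch_status` is a hypothetical callable that performs `GET /v2/crawl/{id}` and returns the parsed JSON, and any status other than `completed` or `failed` is assumed to mean the job is still running.

```python
import time


def wait_for_crawl(fetch_status, job_id, interval=5.0, max_polls=120, sleep=time.sleep):
    """Poll a crawl job until it completes and return its list of pages.

    fetch_status: callable taking the job id and returning the parsed
    status response (hypothetical; supply your own HTTP call).
    """
    for _ in range(max_polls):
        status = fetch_status(job_id)
        state = status.get("status")
        if state == "completed":
            return status.get("data", [])
        if state == "failed":
            raise RuntimeError(f"crawl {job_id} failed")
        # Any other state is treated as "still running" (assumption).
        sleep(interval)
    raise TimeoutError(f"crawl {job_id} did not finish after {max_polls} polls")
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any HTTP client and makes it testable without waiting in real time.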
### 3. Map a Website's URLs
```http
POST /v2/map
{ "url": "https://example.com" }
```
**Response:** `{ "success": true, "links": [{ "url": "...", "title": "..." }] }`
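
A mapped site is usually filtered before scraping. The sketch below pulls plain URL strings out of a response shaped like the one above; the function name and the `prefix` filter are illustrative assumptions, and bare-string entries are tolerated in case the shape differs.

```python
def urls_from_map(response, prefix=None):
    """Return URL strings from a parsed /map response, optionally filtered by prefix."""
    if not response.get("success"):
        raise RuntimeError("map request failed")
    urls = []
    for link in response.get("links", []):
        # Accept either {"url": ..., "title": ...} objects or bare strings.
        url = link if isinstance(link, str) else link.get("url")
        if url and (prefix is None or url.startswith(prefix)):
            urls.append(url)
    return urls
```

For example, passing `prefix="https://example.com/docs/"` keeps only documentation pages as candidates for a later scrape.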
### 4. Search + Scrape in One Call
```http
POST /v2/search

{
  "query": "best web scraping tools 2025",
  "limit": 5,
  "scrapeOptions": { "formats": ["markdown"] }
}
```
**Response:** `{ "success": true, "data": { "web": [{ "url": "...", "title": "...", "markdown": "..." }] } }`
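
When consuming search results, it may help to normalize them into one flat list. The sketch below assumes results arrive either as a list directly under `data` or grouped under `data.web`; which shape applies may depend on the API version, so both are handled. The function name is hypothetical.

```python
def search_hits(response):
    """Flatten a parsed /search response into a list of url/title/markdown dicts."""
    data = response.get("data", [])
    if isinstance(data, dict):
        # Grouped shape: web results live under the "web" key (assumption).
        data = data.get("web", [])
    return [
        {
            "url": hit.get("url"),
            "title": hit.get("title"),
            "markdown": hit.get("markdown", ""),
        }
        for hit in data
    ]
```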
### 5. Batch Scrape Multiple URLs
```http
POST /v2/batch/scrape

{
  "urls": ["https://a.com", "https://b.com"],
  "formats": ["markdown"]
}
```
Returns a job ID; poll with `GET /v2/batch/scrape/{id}`
---
## Apify API Reference
**Base URL:** `https://api.apify.com/v2`
**Auth:** Pass token as query param `?token=YOUR_TOKEN` or in Authorization header.
### Core Workflow
Apify runs "Actors" (pre-built scrapers). The flow is:
1. **Start a run** → get a `runId` and `defaultDatasetId`
2. **Poll status** until the run reaches a terminal state, ideally `SUCCEEDED`
3. **Fetch results** from the dataset
### 1. Run an Actor (Async)
```http
POST /v2/acts/{actorId}/runs?token=YOUR_TOKEN
Content-Type: application/json
{ ...actor-specific input... }
```
**Response:**
```json
{
  "data": {
    "id": "RUN_ID",
    "status": "RUNNING",
    "defaultDatasetId": "DATASET_ID"
  }
}
```
Common Actor IDs (in API URL paths, write the slash as a tilde, e.g. `apify~web-scraper`):
- `apify/web-scraper` — generic JS scraper
- `apify/google-search-scraper` — Google SERPs
- `compass/crawler-google-places` — Google Maps
- `apify/instagram-scraper` — Instagram
- `clockworks/free-tiktok-scraper` — TikTok
- `apify/amazon-scraper` — Amazon products
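
Actor IDs are listed in `username/actor-name` form, but inside an API URL path the separator is a tilde (`username~actor-name`). A small sketch of that conversion, with hypothetical helper names:

```python
APIFY_BASE = "https://api.apify.com/v2"


def actor_path_id(actor_id):
    """Convert 'username/actor-name' to the 'username~actor-name' URL form."""
    return actor_id.strip().replace("/", "~")


def actor_runs_url(actor_id):
    """Return the URL used to start a run of the given Actor."""
    return f"{APIFY_BASE}/acts/{actor_path_id(actor_id)}/runs"
```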
### 2. Poll Run Status
```http
GET /v2/acts/{actorId}/runs/{runId}?token=YOUR_TOKEN
```
Poll until `status` reaches a terminal value: `SUCCEEDED`, `FAILED`, `ABORTED`, or `TIMED-OUT`. Recommended interval: 5 seconds.
### 3. Fetch Results
```http
GET /v2/datasets/{datasetId}/items?token=YOUR_TOKEN&format=json
```
Optional params: `format` (json/csv/xlsx/xml), `limit`, `offset`
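
Putting the three steps together, one possible shape for the start, poll, and fetch flow is sketched below. The `http` argument is a hypothetical callable `http(method, url, body)` that attaches the token, sends the request, and returns parsed JSON; it is injected so the flow itself stays independent of any HTTP library.

```python
import time

APIFY_BASE = "https://api.apify.com/v2"
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def run_actor(http, actor_id, actor_input, interval=5.0, max_polls=120, sleep=time.sleep):
    """Start an Actor run, poll until it ends, and return the dataset items."""
    actor = actor_id.replace("/", "~")  # URL paths use '~' instead of '/'
    run = http("POST", f"{APIFY_BASE}/acts/{actor}/runs", actor_input)["data"]
    for _ in range(max_polls):
        if run["status"] in TERMINAL_STATUSES:
            break
        sleep(interval)
        run = http("GET", f"{APIFY_BASE}/acts/{actor}/runs/{run['id']}", None)["data"]
    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"run {run['id']} did not succeed (status: {run['status']})")
    dataset_id = run["defaultDatasetId"]
    return http("GET", f"{APIFY_BASE}/datasets/{dataset_id}/items?format=json", None)
```

A run that is still in progress after `max_polls` attempts is reported as an error here; a longer-lived job would need a larger limit or a different waiting strategy.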
### 4. Run Synchronously (≤5 minutes)
For short runs, use the sync endpoint — it waits and returns dataset items directly:
```http
POST /v2/acts/{actorId}/run-sync-get-dataset-items?token=YOUR_TOKEN
Content-Type: application/json
{ ...actor input... }
```
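
For the synchronous endpoint, a standard-library sketch might look like this. It sends the token in the `Authorization` header rather than the query string, and the client timeout of 330 seconds is an assumption chosen to sit slightly above the 5-minute server-side limit. Function names are hypothetical.

```python
import json
import urllib.request

APIFY_BASE = "https://api.apify.com/v2"


def build_sync_run_request(actor_id, token, actor_input):
    """Build (but do not send) a run-sync-get-dataset-items request."""
    actor = actor_id.replace("/", "~")  # URL paths use '~' instead of '/'
    return urllib.request.Request(
        f"{APIFY_BASE}/acts/{actor}/run-sync-get-dataset-items",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


def run_sync(actor_id, token, actor_input, timeout=330):
    """Run the Actor and return its dataset items (performs network I/O)."""
    request = build_sync_run_request(actor_id, token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```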
### Common Actor Inputs
**Google Search Scraper:**
```json
{ "queries": "web scraping tools", "maxPagesPerQuery": 1, "resultsPerPage": 10 }
```
**Google Maps Scraper:**
```json
{ "searchStringsArray": ["restaurants in Mumbai"], "maxCrawledPlaces": 20 }
```
**Web Scraper (generic):**
```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "pageFunction": "async function pageFunction(context) { const $ = context.jQuery; return { title: $('title').text() }; }",
  "maxPagesPerCrawl": 10
}
```
---
## Output Handling
- **Firecrawl** returns data directly in the response (or via polling for crawl/batch).
- **Apify** stores results in a dataset; retrieve with `GET /v2/datasets/{id}/items`.
- Both support JSON output. Firecrawl also provides clean markdown ideal for LLMs.
- Apify also supports CSV, XLSX, XML output formats.
---
## Code Templates
See `references/code-templates.md` for ready-to-run Python and JavaScript code for both APIs.
---
## Error Handling
- **Firecrawl 402** → out of credits; user needs to upgrade plan
- **Firecrawl 429** → rate limited; add delays between requests
- **Apify FAILED run** → check run logs via `GET /v2/acts/{id}/runs/{runId}/log`
- Always wrap API calls in try/catch and check `success: false` in Firecrawl responses
- Firecrawl crawls respect `robots.txt` by default
- For JS-heavy pages, increase `waitFor` (Firecrawl) or use Playwright/Puppeteer actors (Apify)
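
The 402 and 429 cases above call for different reactions: stop immediately versus wait and retry. A minimal sketch of that split, assuming a hypothetical `send` callable that performs one request and returns `(status_code, parsed_body)`; the exponential delays are an illustrative choice, not a documented requirement.

```python
import time


class OutOfCredits(Exception):
    """Raised on HTTP 402; retrying will not help until the plan is upgraded."""


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry with exponential backoff while rate limited (429)."""
    for attempt in range(max_attempts):
        status, body = send()
        if status == 402:
            raise OutOfCredits("Firecrawl returned 402: out of credits")
        if status == 429:
            if attempt + 1 < max_attempts:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
                continue
            raise RuntimeError(f"still rate limited after {max_attempts} attempts")
        if status >= 400 or body.get("success") is False:
            raise RuntimeError(f"request failed with status {status}: {body}")
        return body
```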
---
## Best Practices
1. **Start small** — test with 1 URL or a small `limit` before scaling
2. **Use `onlyMainContent: true`** in Firecrawl to remove nav/footer noise
3. **Choose async for large jobs** — don't use sync endpoints for crawls with 50+ pages
4. **Store API keys securely** — never hardcode them; use environment variables
5. **Check rate limits** — Firecrawl: varies by plan; Apify: 250k requests/min global
6. **Prefer Firecrawl for LLM pipelines** — markdown output is clean and ready for RAG/AI
7. **Prefer Apify for social/structured data** — specialized actors handle anti-bot better
Related Skills
news-hot-scraper
This skill should be used when users need to scrape hot news topics from Chinese platforms (微博、知乎、B站、抖音、今日头条、腾讯新闻、澎湃新闻), generate summaries, and cite sources. It supports both API-based and direct scraping methods, and offers both extractive and abstractive summarization techniques.
TinyScraper
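A simple static website mirroring crawler. Given a URL, it downloads the HTML, JS, CSS, and static assets under the entire domain to local storage for offline browsing.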
social-media-content-scraper-pro
Social Media Content Bulk Scraper, extract articles/posts from WeChat, Instagram, TikTok, YouTube, export to Markdown/HTML with full metadata. $0.005 USDT per use.
YouTube Channel Scraper
A browser-based YouTube channel discovery and scraping tool.
Twitter/X Profile Scraper
A browser-based Twitter/X profile discovery and scraping tool.
TikTok Profile Scraper
A browser-based TikTok profile discovery and scraping tool.
Instagram Profile Scraper
A browser-based Instagram profile discovery and scraping tool.
Facebook Page & Group Scraper
> Part of **[ScrapeClaw](https://www.scrapeclaw.cc/)** — a suite of production-ready, agentic social media scrapers for Instagram, YouTube, X/Twitter, and Facebook built with Python & Playwright, no API keys required.
grok-scraper
Execute queries to Grok AI via Playwright browser automation without requiring an X API KEY. Use when the user wants to "ask Grok", search X for real-time info, or specifically requests to use Grok for free without API billing.
mrscraper
Run AI-powered, unblockable web scraping, data extraction with natural language via the MrScraper API
scraper
Structured extraction and cleanup for public, user-authorized web pages. Use when the user wants to collect, clean, summarize, or transform content from accessible pages into reusable text or data. Do not use to bypass logins, paywalls, captchas, robots restrictions, or access controls. Local-only output.
sg-property-scraper
Search Singapore property rental and sale listings with flexible filters. Use when asked to search Singapore properties, find rental or sale listings, check property prices near MRT stations, or compare commute times. Supports filtering by listing type (rent/sale), property type (HDB/Condo/Landed), bedrooms, bathrooms, price range, size, TOP year, MRT station codes, distance to MRT, room type, availability, and commute time to a destination. Outputs JSON to stdout.