web-scraper-skill

Use this skill to scrape, crawl, or extract data from websites using Apify or Firecrawl APIs. Trigger whenever the user wants to: scrape a URL, crawl a website, extract structured data from web pages, run an Apify Actor, batch scrape multiple URLs, search and scrape the web, map a site's URLs, collect product/price/review data, or build any web data pipeline. If the user says things like "scrape this site", "get data from this URL", "crawl this website", "run an Apify actor", "use Firecrawl", "extract content from a page", "pull data from the web", or mentions any web data extraction task — always use this skill. Also use it when the user wants to choose between Apify and Firecrawl.

3,891 stars

by openclaw

Best use case

web-scraper-skill is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using web-scraper-skill should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/web-scraper-skill/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/abhishekj9621/web-scraper-skill/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/web-scraper-skill/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How web-scraper-skill Compares

Feature / Agent	web-scraper-skill	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

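What does this skill do?

It scrapes, crawls, and extracts data from websites through the Firecrawl and Apify APIs, and helps choose between the two for a given task.

Where can I find the source code?

The source is on GitHub in the openclaw/skills repository; the raw SKILL.md URL appears in the Installation section above.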

SKILL.md Source

# Web Scraper Skill (Apify + Firecrawl)

This skill helps Openclaw scrape and extract data from websites using two powerful APIs:
- **Firecrawl** — best for scraping individual pages, crawling entire sites, and getting LLM-ready content (markdown)
- **Apify** — best for specialized scrapers (social media, Google Maps, e-commerce, etc.) via pre-built Actors

---

## Quick Decision Guide: Apify vs Firecrawl

| Use Case | Recommended Tool |
|---|---|
| Scrape a single page into markdown/JSON | **Firecrawl** `/scrape` |
| Crawl an entire website (follow links) | **Firecrawl** `/crawl` |
| Map all URLs on a site | **Firecrawl** `/map` |
| Search web + scrape results | **Firecrawl** `/search` |
| Scrape Instagram / TikTok / Twitter | **Apify** (social actors) |
| Scrape Google Maps / reviews | **Apify** (compass/crawler-google-places) |
| Scrape Amazon products | **Apify** (apify/amazon-scraper) |
| Scrape Google Search results | **Apify** (apify/google-search-scraper) |
| Custom actor / any Apify Store actor | **Apify** |

---

## Authentication

Both APIs require an API key, normally sent in the `Authorization` header. Always ask the user for their key if it has not been provided.

- **Firecrawl:** `Authorization: Bearer fc-YOUR_API_KEY`
- **Apify:** `Authorization: Bearer YOUR_APIFY_TOKEN` (or `?token=YOUR_TOKEN` as a query parameter)
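
A minimal Python sketch of building these headers while keeping keys out of source code. The environment variable names `FIRECRAWL_API_KEY` and `APIFY_TOKEN` and both function names are assumptions made for illustration, not names the skill defines.

```python
import os


def firecrawl_headers(api_key=None):
    """Build Firecrawl headers; fall back to the (assumed) FIRECRAWL_API_KEY variable."""
    key = api_key or os.environ.get("FIRECRAWL_API_KEY")
    if not key:
        raise ValueError("Firecrawl API key missing; ask the user for it")
    return {"Authorization": f"Bearer {key}", "Content-Type": "application/json"}


def apify_headers(token=None):
    """Build Apify headers; fall back to the (assumed) APIFY_TOKEN variable."""
    key = token or os.environ.get("APIFY_TOKEN")
    if not key:
        raise ValueError("Apify token missing; ask the user for it")
    return {"Authorization": f"Bearer {key}", "Content-Type": "application/json"}
```

Sending the token in a header rather than the query string keeps it out of URLs that may end up in logs.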

---

## Firecrawl API Reference

**Base URL:** `https://api.firecrawl.dev/v2`

### 1. Scrape a Single Page
```http
POST /v2/scrape
Authorization: Bearer fc-YOUR_API_KEY
Content-Type: application/json

{
  "url": "https://example.com",
  "formats": ["markdown"],          // Options: markdown, html, rawHtml, links, screenshot, json
  "onlyMainContent": true,          // Strips nav/footer/ads
  "waitFor": 0,                     // ms to wait before scraping (for JS-heavy pages)
  "timeout": 30000,                 // ms
  "blockAds": true,
  "proxy": "auto"                   // "auto", "basic", or "stealth"
}
```
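**Response:** `{ "success": true, "data": { "markdown": "...", "metadata": {...} } }`

The `//` comments in the request bodies in this reference are annotations only; remove them before sending, because JSON does not allow comments.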
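
The scrape call above could be issued from Python roughly as follows, using only the standard library. This is a sketch under stated assumptions: the function names are hypothetical, and the `error` field read on failure is an assumed shape for unsuccessful responses.

```python
import json
import urllib.request

FIRECRAWL_BASE = "https://api.firecrawl.dev/v2"


def build_scrape_request(url, api_key, formats=("markdown",), only_main_content=True):
    """Build (but do not send) a POST /v2/scrape request."""
    payload = {
        "url": url,
        "formats": list(formats),
        "onlyMainContent": only_main_content,
    }
    return urllib.request.Request(
        f"{FIRECRAWL_BASE}/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_markdown(body):
    """Return the markdown from a parsed scrape response, or raise if it failed."""
    if not body.get("success"):
        # "error" is an assumed field name for the failure message.
        raise RuntimeError(f"Firecrawl scrape failed: {body.get('error', 'unknown error')}")
    return body["data"]["markdown"]


def scrape(url, api_key):
    """Send the request and return the page as markdown (performs a network call)."""
    with urllib.request.urlopen(build_scrape_request(url, api_key), timeout=60) as resp:
        return extract_markdown(json.load(resp))
```

Separating request construction from sending makes the payload easy to inspect before any credits are spent.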

### 2. Crawl an Entire Website
Crawling is asynchronous: the request starts a job, and you then poll for results.

```http
POST /v2/crawl
{
  "url": "https://docs.example.com",
  "limit": 50,                      // Max pages
  "maxDiscoveryDepth": 3,           // Named maxDepth in v1
  "allowExternalLinks": false,
  "scrapeOptions": {
    "formats": ["markdown"],
    "onlyMainContent": true
  }
}
```
**Response:** `{ "success": true, "id": "crawl-job-id" }`

**Poll status:**
```http
GET /v2/crawl/{crawl-job-id}
```
**Response:** `{ "status": "completed", "total": 50, "data": [...] }`
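
The start-then-poll pattern can be sketched as a small loop. Here `fetch_status` is a hypothetical callable that performs `GET /v2/crawl/{id}` and returns the parsed JSON; treating `"failed"` as a terminal status and the two-second interval are assumptions, not values this reference specifies.

```python
import time


def wait_for_crawl(fetch_status, job_id, interval_s=2.0, max_wait_s=600.0, sleep=time.sleep):
    """Poll fetch_status(job_id) until the crawl completes; return the scraped pages."""
    waited = 0.0
    while True:
        body = fetch_status(job_id)
        status = body.get("status")
        if status == "completed":
            return body.get("data", [])
        if status == "failed":  # assumed name of the failure status
            raise RuntimeError(f"crawl {job_id} failed")
        if waited >= max_wait_s:
            raise TimeoutError(f"crawl {job_id} not finished after {max_wait_s}s")
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter lets the loop be exercised in tests without real delays.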

### 3. Map a Website's URLs
```http
POST /v2/map
{ "url": "https://example.com" }
```
**Response:** `{ "success": true, "links": [{ "url": "...", "title": "..." }] }`

### 4. Search + Scrape in One Call
```http
POST /v2/search
{
  "query": "best web scraping tools 2025",
  "limit": 5,
  "scrapeOptions": { "formats": ["markdown"] }
}
```
**Response:** `{ "success": true, "data": { "web": [{ "url": "...", "title": "...", "markdown": "..." }] } }`

### 5. Batch Scrape Multiple URLs
```http
POST /v2/batch/scrape
{
  "urls": ["https://a.com", "https://b.com"],
  "formats": ["markdown"]
}
```
Returns a job ID; poll with `GET /v2/batch/scrape/{id}`

---

## Apify API Reference

**Base URL:** `https://api.apify.com/v2`
**Auth:** Pass token as query param `?token=YOUR_TOKEN` or in Authorization header.

### Core Workflow
Apify runs "Actors" (pre-built scrapers). The flow is:
1. **Start a run** → get a `runId` and `defaultDatasetId`
2. **Poll status** until `SUCCEEDED`
3. **Fetch results** from the dataset

### 1. Run an Actor (Async)
```http
POST /v2/acts/{actorId}/runs?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor-specific input... }
```
**Response:**
```json
{
  "data": {
    "id": "RUN_ID",
    "status": "RUNNING",
    "defaultDatasetId": "DATASET_ID"
  }
}
```

Common Actor IDs (in URL paths, write the slash as a tilde, e.g. `apify~web-scraper`):
- `apify/web-scraper` — generic JS scraper
- `apify/google-search-scraper` — Google SERPs
- `compass/crawler-google-places` — Google Maps
- `apify/instagram-scraper` — Instagram
- `clockworks/free-tiktok-scraper` — TikTok
- `apify/amazon-scraper` — Amazon products

### 2. Poll Run Status
```http
GET /v2/acts/{actorId}/runs/{runId}?token=YOUR_TOKEN
```
Poll until `status` reaches a terminal state: `SUCCEEDED`, `FAILED`, `ABORTED`, or `TIMED-OUT`. Recommended interval: 5 seconds.

### 3. Fetch Results
```http
GET /v2/datasets/{datasetId}/items?token=YOUR_TOKEN&format=json
```
Optional params: `format` (json/csv/xlsx/xml), `limit`, `offset`
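
Steps 1 to 3 can be chained as in the sketch below. The `http` argument is a hypothetical transport callable, `http(method, path, body=None)`, that adds authentication and returns parsed JSON; it is injected so the flow can be tested without network access.

```python
import time

# Run states after which polling should stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def run_actor_and_fetch(http, actor_id, actor_input, interval_s=5.0, sleep=time.sleep):
    """Start an Actor run, poll until it finishes, then return the dataset items."""
    actor = actor_id.replace("/", "~")  # URL paths use owner~name
    run = http("POST", f"/v2/acts/{actor}/runs", actor_input)["data"]
    while run["status"] not in TERMINAL_STATES:
        sleep(interval_s)
        run = http("GET", f"/v2/acts/{actor}/runs/{run['id']}")["data"]
    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"run {run['id']} ended with status {run['status']}")
    # The items endpoint returns a plain JSON array, not a {"data": ...} wrapper.
    return http("GET", f"/v2/datasets/{run['defaultDatasetId']}/items?format=json")
```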

### 4. Run Synchronously (≤5 minutes)
For short runs, use the sync endpoint — it waits and returns dataset items directly:
```http
POST /v2/acts/{actorId}/run-sync-get-dataset-items?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor input... }
```

### Common Actor Inputs

**Google Search Scraper:**
```json
{ "queries": "web scraping tools", "maxPagesPerQuery": 1, "resultsPerPage": 10 }
```

**Google Maps Scraper:**
```json
{ "searchStringsArray": ["restaurants in Mumbai"], "maxCrawledPlaces": 20 }
```

**Web Scraper (generic):**
```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "pageFunction": "async function pageFunction(context) { const $ = context.jQuery; return { title: $('title').text() }; }",
  "maxPagesPerCrawl": 10
}
```

---

## Output Handling

- **Firecrawl** returns data directly in the response (or via polling for crawl/batch).
- **Apify** stores results in a dataset; retrieve with `GET /v2/datasets/{id}/items`.
- Both support JSON output. Firecrawl also provides clean markdown ideal for LLMs.
- Apify also supports CSV, XLSX, XML output formats.
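
For large Apify datasets, the `limit` and `offset` parameters allow results to be read in pages. A minimal sketch, where `fetch_page(offset, limit)` is a hypothetical callable wrapping `GET /v2/datasets/{id}/items` and the page size of 1000 is an arbitrary choice:

```python
def iter_dataset_items(fetch_page, page_size=1000):
    """Yield every item, requesting pages until one comes back short."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            break  # a short (or empty) page means the dataset is exhausted
        offset += len(page)
```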

---

## Code Templates

See `references/code-templates.md` for ready-to-run Python and JavaScript code for both APIs.

---

## Error Handling

- **Firecrawl 402** → out of credits; user needs to upgrade plan
- **Firecrawl 429** → rate limited; add delays between requests
- **Apify FAILED run** → check run logs via `GET /v2/acts/{id}/runs/{runId}/log`
- Always wrap API calls in try/catch and check `success: false` in Firecrawl responses
- Firecrawl crawls respect `robots.txt` by default
- For JS-heavy pages, increase `waitFor` (Firecrawl) or use Playwright/Puppeteer actors (Apify)
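
The 402 and 429 cases above call for different handling: the first should fail fast, the second should be retried with a delay. A hedged sketch, where `send` is a hypothetical callable returning `(status_code, parsed_body)` and the exponential backoff schedule is an assumption rather than a documented recommendation:

```python
import time


class OutOfCreditsError(RuntimeError):
    """Raised on HTTP 402; retrying cannot help until the plan is upgraded."""


def call_with_retry(send, max_attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Call send(), retrying only on HTTP 429 with exponentially growing delays."""
    for attempt in range(max_attempts):
        status, body = send()
        if status == 402:
            raise OutOfCreditsError("Firecrawl credits exhausted")
        if status == 429:
            if attempt < max_attempts - 1:
                sleep(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s, ...
            continue
        if status >= 400 or body.get("success") is False:
            raise RuntimeError(f"request failed with status {status}: {body}")
        return body
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```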

---

## Best Practices

1. **Start small** — test with 1 URL or a small `limit` before scaling
2. **Use `onlyMainContent: true`** in Firecrawl to remove nav/footer noise
3. **Choose async for large jobs** — don't use sync endpoints for crawls with 50+ pages
4. **Store API keys securely** — never hardcode them; use environment variables
5. **Check rate limits** — Firecrawl: varies by plan; Apify: 250k requests/min global
6. **Prefer Firecrawl for LLM pipelines** — markdown output is clean and ready for RAG/AI
7. **Prefer Apify for social/structured data** — specialized actors handle anti-bot better
