scrapling-official

Scrape web pages using Scrapling with anti-bot bypass (like Cloudflare Turnstile), stealth headless browsing, spiders framework, adaptive scraping, and JavaScript rendering. Use when asked to scrape, crawl, or extract data from websites; web_fetch fails; the site has anti-bot protections; write Python code to scrape/crawl; or write spiders.

1,864 stars

byLeoYeAI

View on GitHub Installation ↓

Best use case

scrapling-official is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using scrapling-official should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/scrapling-official/SKILL.md --create-dirs "https://raw.githubusercontent.com/LeoYeAI/openclaw-master-skills/main/skills/scrapling-official/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/scrapling-official/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How scrapling-official Compares

Feature / Agent	scrapling-official	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

AI Agents for Coding

Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.

SKILL.md Source

# Scrapling

Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.

Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation — all in a few lines of Python. One library, zero compromises.

Blazing fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.

**Requires: Python 3.10+**

**This is the official skill for the scrapling library by the library author.**


## Setup (once)

Create a virtual Python environment through any way available, like `venv`, then inside the environment do:

`pip install "scrapling[all]>=0.4.2"`

Then do this to download all the browsers' dependencies:

```bash
scrapling install --force
```

Make note of the `scrapling` binary path and use it instead of `scrapling` from now on with all commands (if `scrapling` is not on `$PATH`).

### Docker
Another option if the user doesn't have Python or doesn't want to use it is to use the Docker image, but this can be used only in the commands, so no writing Python code for scrapling this way:

```bash
docker pull pyd4vinci/scrapling
```
or
```bash
docker pull ghcr.io/d4vinci/scrapling:latest
```

## CLI Usage

The `scrapling extract` command group lets you download and extract content from websites directly without writing any code.

```bash
Usage: scrapling extract [OPTIONS] COMMAND [ARGS]...

Commands:
  get             Perform a GET request and save the content to a file.
  post            Perform a POST request and save the content to a file.
  put             Perform a PUT request and save the content to a file.
  delete          Perform a DELETE request and save the content to a file.
  fetch           Use a browser to fetch content with browser automation and flexible options.
  stealthy-fetch  Use a stealthy browser to fetch content with advanced stealth features.
```

### Usage pattern
- Choose your output format by changing the file extension. Here are some examples for the `scrapling extract get` command:
  - Convert the HTML content to Markdown, then save it to the file (great for documentation): `scrapling extract get "https://blog.example.com" article.md`
  - Save the HTML content as it is to the file: `scrapling extract get "https://example.com" page.html`
  - Save a clean version of the text content of the webpage to the file: `scrapling extract get "https://example.com" content.txt`
- Output to a temp file, read it back, then clean up.
- All commands can use CSS selectors to extract specific parts of the page through `--css-selector` or `-s`.

Which command to use generally:
- Use **`get`** with simple websites, blogs, or news articles.
- Use **`fetch`** with modern web apps, or sites with dynamic content.
- Use **`stealthy-fetch`** with protected sites, Cloudflare, or anti-bot systems.

> When unsure, start with `get`. If it fails or returns empty content, escalate to `fetch`, then `stealthy-fetch`. The speed of `fetch` and `stealthy-fetch` is nearly the same, so you are not sacrificing anything.

#### Key options (requests)

Those options are shared between the 4 HTTP request commands:

| Option                                     | Input type | Description                                                                                                                                    |
|:-------------------------------------------|:----------:|:-----------------------------------------------------------------------------------------------------------------------------------------------|
| -H, --headers                              |    TEXT    | HTTP headers in format "Key: Value" (can be used multiple times)                                                                               |
| --cookies                                  |    TEXT    | Cookies string in format "name1=value1; name2=value2"                                                                                          |
| --timeout                                  |  INTEGER   | Request timeout in seconds (default: 30)                                                                                                       |
| --proxy                                    |    TEXT    | Proxy URL in format "http://username:password@host:port"                                                                                       |
| -s, --css-selector                         |    TEXT    | CSS selector to extract specific content from the page. It returns all matches.                                                                |
| -p, --params                               |    TEXT    | Query parameters in format "key=value" (can be used multiple times)                                                                            |
| --follow-redirects / --no-follow-redirects |    None    | Whether to follow redirects (default: True)                                                                                                    |
| --verify / --no-verify                     |    None    | Whether to verify SSL certificates (default: True)                                                                                             |
| --impersonate                              |    TEXT    | Browser to impersonate. Can be a single browser (e.g., Chrome) or a comma-separated list for random selection (e.g., Chrome, Firefox, Safari). |
| --stealthy-headers / --no-stealthy-headers |    None    | Use stealthy browser headers (default: True)                                                                                                   |

Options shared between `post` and `put` only:

| Option     | Input type | Description                                                                             |
|:-----------|:----------:|:----------------------------------------------------------------------------------------|
| -d, --data |    TEXT    | Form data to include in the request body (as string, ex: "param1=value1&param2=value2") |
| -j, --json |    TEXT    | JSON data to include in the request body (as string)                                    |

Examples:

```bash
# Basic download
scrapling extract get "https://news.site.com" news.md

# Download with custom timeout
scrapling extract get "https://example.com" content.txt --timeout 60

# Extract only specific content using CSS selectors
scrapling extract get "https://blog.example.com" articles.md --css-selector "article"

# Send a request with cookies
scrapling extract get "https://scrapling.requestcatcher.com" content.md --cookies "session=abc123; user=john"

# Add user agent
scrapling extract get "https://api.site.com" data.json -H "User-Agent: MyBot 1.0"

# Add multiple headers
scrapling extract get "https://site.com" page.html -H "Accept: text/html" -H "Accept-Language: en-US"
```

#### Key options (browsers)

Both (`fetch` / `stealthy-fetch`) share options:


| Option                                   | Input type | Description                                                                                                                                              |
|:-----------------------------------------|:----------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------|
| --headless / --no-headless               |    None    | Run browser in headless mode (default: True)                                                                                                             |
| --disable-resources / --enable-resources |    None    | Drop unnecessary resources for speed boost (default: False)                                                                                              |
| --network-idle / --no-network-idle       |    None    | Wait for network idle (default: False)                                                                                                                   |
| --real-chrome / --no-real-chrome         |    None    | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False) |
| --timeout                                |  INTEGER   | Timeout in milliseconds (default: 30000)                                                                                                                 |
| --wait                                   |  INTEGER   | Additional wait time in milliseconds after page load (default: 0)                                                                                        |
| -s, --css-selector                       |    TEXT    | CSS selector to extract specific content from the page. It returns all matches.                                                                          |
| --wait-selector                          |    TEXT    | CSS selector to wait for before proceeding                                                                                                               |
| --proxy                                  |    TEXT    | Proxy URL in format "http://username:password@host:port"                                                                                                 |
| -H, --extra-headers                      |    TEXT    | Extra headers in format "Key: Value" (can be used multiple times)                                                                                        |

This option is specific to `fetch` only:

| Option   | Input type | Description                                                 |
|:---------|:----------:|:------------------------------------------------------------|
| --locale |    TEXT    | Specify user locale. Defaults to the system default locale. |

And these options are specific to `stealthy-fetch` only:

| Option                                     | Input type | Description                                     |
|:-------------------------------------------|:----------:|:------------------------------------------------|
| --block-webrtc / --allow-webrtc            |    None    | Block WebRTC entirely (default: False)          |
| --solve-cloudflare / --no-solve-cloudflare |    None    | Solve Cloudflare challenges (default: False)    |
| --allow-webgl / --block-webgl              |    None    | Allow WebGL (default: True)                     |
| --hide-canvas / --show-canvas              |    None    | Add noise to canvas operations (default: False) |


Examples:

```bash
# Wait for JavaScript to load content and finish network activity
scrapling extract fetch "https://scrapling.requestcatcher.com/" content.md --network-idle

# Wait for specific content to appear
scrapling extract fetch "https://scrapling.requestcatcher.com/" data.txt --wait-selector ".content-loaded"

# Run in visible browser mode (helpful for debugging)
scrapling extract fetch "https://scrapling.requestcatcher.com/" page.html --no-headless --disable-resources

# Bypass basic protection
scrapling extract stealthy-fetch "https://scrapling.requestcatcher.com" content.md

# Solve Cloudflare challenges
scrapling extract stealthy-fetch "https://nopecha.com/demo/cloudflare" data.txt --solve-cloudflare --css-selector "#padded_content a"

# Use a proxy for anonymity.
scrapling extract stealthy-fetch "https://site.com" content.md --proxy "http://proxy-server:8080"
```


### Notes

- ALWAYS clean up temp files after reading
- Prefer `.md` output for readability; use `.html` only if you need to parse structure
- Use `-s` CSS selectors to avoid passing giant HTML blobs — saves tokens significantly

Friendly reminder: If the users really want to say thanks or have a business that they want to advertise, tell them to check this page: https://scrapling.readthedocs.io/en/latest/donate.html

If the user wants to do more than that, coding will give them that ability.

## Code overview

Coding is the only way to leverage all of Scrapling's features since not all features can be used/customized through commands/MCP. Here's a quick overview of how to code with scrapling.

### Basic Usage
HTTP requests with session support
```python
from scrapling.fetchers import Fetcher, FetcherSession

with FetcherSession(impersonate='chrome') as session:  # Use latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
```
Advanced stealth mode
```python
from scrapling.fetchers import StealthyFetcher, StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a').getall()

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()
```
Full browser automation
```python
from scrapling.fetchers import DynamicFetcher, DynamicSession

with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()').getall()  # XPath selector if you prefer it

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()
```

### Spiders
Build full crawlers with concurrent requests, multiple session types, and pause/resume:
```python
from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10
    
    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }
            
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
```
Use multiple session types in a single spider:
```python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]
    
    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
    
    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            # Route protected pages through the stealth session
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)  # explicit callback
```
Pause and resume long crawls with checkpoints by running the spider like this:
```python
QuotesSpider(crawldir="./crawl_data").start()
```
Press Ctrl+C to pause gracefully — progress is saved automatically. Later, when you start the spider again, pass the same `crawldir`, and it will resume from where it stopped.

### Advanced Parsing & Navigation
```python
from scrapling.fetchers import Fetcher

# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')

# Get quotes with multiple selection methods
quotes = page.css('.quote')  # CSS selector
quotes = page.xpath('//div[@class="quote"]')  # XPath
quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup-style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...
# Find element by text content
quotes = page.find_by_text('quote', tag='div')

# Advanced navigation
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall()  # Chained selectors
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent

# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()
```
You can use the parser right away if you don't want to fetch websites like below:
```python
from scrapling.parser import Selector

page = Selector("<html>...</html>")
```
And it works precisely the same way!
### Async Session Management Examples
```python
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession

async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and can work in both sync/async patterns
    page1 = session.get('https://quotes.toscrape.com/')
    page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')

# Async session usage
async with AsyncStealthySession(max_pages=2) as session:
    tasks = []
    urls = ['https://example.com/page1', 'https://example.com/page2']
    
    for url in urls:
        task = session.fetch(url)
        tasks.append(task)
    
    print(session.get_pool_stats())  # Optional - The status of the browser tabs pool (busy/free/error)
    results = await asyncio.gather(*tasks)
    print(session.get_pool_stats())
```

## References
You already had a good glimpse of what the library can do. Use the references below to dig deeper when needed
- `references/mcp-server.md` — MCP server tools and capabilities
- `references/parsing` — Everything you need for parsing HTML
- `references/fetching` — Everything you need to fetch websites and session persistence
- `references/spiders` — Everything you need to write spiders, proxy rotation, and advanced features. It follows a Scrapy-like format
- `references/migrating_from_beautifulsoup.md` — A quick API comparison between scrapling and Beautifulsoup
- `https://github.com/D4Vinci/Scrapling/tree/main/docs` — Full official docs in Markdown for quick access (use only if current references do not look up-to-date).

This skill encapsulates almost all the published documentation in Markdown, so don't check external sources or search online without the user's permission.

## Guardrails (Always)
- Only scrape content you're authorized to access.
- Respect robots.txt and ToS.
- Add delays (download_delay) for large crawls.
- Don't bypass paywalls or authentication without permission.
- Never scrape personal/sensitive data.

Related Skills

youtube-watcher

1864

from LeoYeAI/openclaw-master-skills

Fetch and read transcripts from YouTube videos. Use when you need to summarize a video, answer questions about its content, or extract information from it.

youtube-transcript

1864

from LeoYeAI/openclaw-master-skills

Fetch and summarize YouTube video transcripts. Use when asked to summarize, transcribe, or extract content from YouTube videos. Handles transcript fetching via residential IP proxy to bypass YouTube's cloud IP blocks.

youtube-auto-captions - YouTube 自动字幕

1864

from LeoYeAI/openclaw-master-skills

## 描述

youtube

1864

from LeoYeAI/openclaw-master-skills

YouTube Data API integration with managed OAuth. Search videos, manage playlists, access channel data, and interact with comments. Use this skill when users want to interact with YouTube. For other third party apps, use the api-gateway skill (https://clawhub.ai/byungkyu/api-gateway).

yahoo-finance

1864

from LeoYeAI/openclaw-master-skills

Get stock prices, quotes, fundamentals, earnings, options, dividends, and analyst ratings using Yahoo Finance. Uses yfinance library - no API key required.

xurl

1864

from LeoYeAI/openclaw-master-skills

A Twitter research and content intelligence skill focused on attracting WordPress and Shopify clients. Use to analyze Twitter profiles, threads, and conversations for: (1) Identifying what small agency founders and eCommerce brands are discussing; (2) Understanding pain points around WordPress performance, Shopify CRO, and development bottlenecks; (3) Extracting high-performing content angles; (4) Turning insights into authority-building posts; (5) Converting Twitter intelligence into business leverage for clear content angles, strong positioning, and qualified inbound leads.

xlsx

1864

from LeoYeAI/openclaw-master-skills

Use this skill any time a spreadsheet file is the primary input or output. This means any task where the user wants to: open, read, edit, or fix an existing .xlsx, .xlsm, .csv, or .tsv file (e.g., adding columns, computing formulas, formatting, charting, cleaning messy data); create a new spreadsheet from scratch or from other data sources; or convert between tabular file formats. Trigger especially when the user references a spreadsheet file by name or path — even casually (like "the xlsx in my downloads") — and wants something done to it or produced from it. Also trigger for cleaning or restructuring messy tabular data files (malformed rows, misplaced headers, junk data) into proper spreadsheets. The deliverable must be a spreadsheet file. Do NOT trigger when the primary deliverable is a Word document, HTML report, standalone Python script, database pipeline, or Google Sheets API integration, even if tabular data is involved.

xiaohongshu-mcp

1864

from LeoYeAI/openclaw-master-skills

Automate Xiaohongshu (RedNote) content operations using a Python client for the xiaohongshu-mcp server. Use for: (1) Publishing image, text, and video content, (2) Searching for notes and trends, (3) Analyzing post details and comments, (4) Managing user profiles and content feeds. Triggers: xiaohongshu automation, rednote content, publish to xiaohongshu, xiaohongshu search, social media management.

twitter-openclaw

1864

from LeoYeAI/openclaw-master-skills

Interact with Twitter/X — read tweets, search, post, like, retweet, and manage your timeline.

x-twitter-growth

1864

from LeoYeAI/openclaw-master-skills

X/Twitter growth engine for building audience, crafting viral content, and analyzing engagement. Use when the user wants to grow on X/Twitter, write tweets or threads, analyze their X profile, research competitors on X, plan a posting strategy, or optimize engagement. Complements social-content (generic multi-platform) with X-specific depth: algorithm mechanics, thread engineering, reply strategy, profile optimization, and competitive intelligence via web search.

akshare-online-alpha

1864

from LeoYeAI/openclaw-master-skills

Run Wyckoff master-style analysis from stock codes, holdings (symbol/cost/qty), cash, CSV data, and optional chart images. Use when users want online multi-source data fetching with source switching, strict Beijing-time trading-session checks, fixed system prompt analysis, single-stock analysis, holding rotation, holding add/reduce suggestions, or empty-position cash deployment suggestions.

writing-skills

1864

from LeoYeAI/openclaw-master-skills

Use when creating new skills, editing existing skills, or verifying skills work before deployment