Crawl4AI — LLM-Friendly Web Crawler

You are an expert in Crawl4AI, the open-source web crawler built for AI applications. You help developers extract clean, structured data from websites for LLM training, RAG pipelines, and content analysis — with automatic markdown conversion, JavaScript rendering, CSS-based extraction, LLM-powered structured extraction, and session management for multi-page crawling.

25 stars

Best use case

Crawl4AI — LLM-Friendly Web Crawler is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using Crawl4AI — LLM-Friendly Web Crawler should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/crawl4ai/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/TerminalSkills/skills/crawl4ai/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/crawl4ai/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How Crawl4AI — LLM-Friendly Web Crawler Compares

| Feature / Agent | Crawl4AI — LLM-Friendly Web Crawler | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

It configures your AI agent as a Crawl4AI expert that helps developers extract clean, structured data from websites for LLM training, RAG pipelines, and content analysis, covering automatic markdown conversion, JavaScript rendering, CSS-based extraction, LLM-powered structured extraction, and session management for multi-page crawling.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Crawl4AI — LLM-Friendly Web Crawler

You are an expert in Crawl4AI, the open-source web crawler built for AI applications. You help developers extract clean, structured data from websites for LLM training, RAG pipelines, and content analysis — with automatic markdown conversion, JavaScript rendering, CSS-based extraction, LLM-powered structured extraction, and session management for multi-page crawling.

## Core Capabilities

### Basic Crawling

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com/getting-started",
            config=CrawlerRunConfig(
                word_count_threshold=10,   # Skip blocks with <10 words
                cache_mode=CacheMode.ENABLED,
            ),
        )

        print(result.markdown)             # Clean markdown (LLM-ready)
        print(result.cleaned_html)         # Cleaned HTML
        print(result.media["images"])      # Extracted images
        print(result.links["internal"])    # Internal links
        print(result.links["external"])    # External links
        print(result.metadata)             # Title, description, keywords

asyncio.run(main())
```
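Once you have `result.markdown`, a common next step is splitting it into chunks for a RAG index. Below is a minimal sketch in plain Python (not a Crawl4AI API); the paragraph-boundary splitting and character cap are illustrative choices:

```python
# Split markdown into chunks on paragraph boundaries, capped at max_chars.
# A single paragraph longer than the cap still becomes its own chunk.
def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk when appending this paragraph would exceed the cap.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

In the crawl above you would call `chunk_markdown(result.markdown)` before embedding; a real pipeline would likely chunk on headings or token counts instead of raw characters.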

### LLM-Powered Structured Extraction

```python
import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    description: str
    rating: float | None
    features: list[str]

extraction = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token=os.environ["OPENAI_API_KEY"],
    schema=Product.model_json_schema(),
    instruction="Extract all products from this page",
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://shop.example.com/keyboards",
            config=CrawlerRunConfig(extraction_strategy=extraction),
        )

        products = [Product(**p) for p in json.loads(result.extracted_content)]
        for p in products:
            print(f"{p.name}: ${p.price} — {p.rating}★")

asyncio.run(main())
```

### CSS-Based Extraction (No LLM)

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product listings",
    "baseSelector": ".product-card",       # Repeating element
    "fields": [
        {"name": "title", "selector": "h3.product-title", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ],
}

extraction = JsonCssExtractionStrategy(schema)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://shop.example.com/all",
            config=CrawlerRunConfig(
                extraction_strategy=extraction,
                js_code="window.scrollTo(0, document.body.scrollHeight);",  # Scroll to load more
                wait_for=".product-card:nth-child(20)",                     # Wait for 20 items
            ),
        )
        products = json.loads(result.extracted_content)

asyncio.run(main())
```
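Scroll-triggered loading can capture the same card more than once, so a dedupe pass over the extracted list is often useful. A small sketch in plain Python; `dedupe_by_url` is an illustrative helper, not a Crawl4AI API:

```python
def dedupe_by_url(items: list[dict]) -> list[dict]:
    """Drop repeated entries, keeping the first occurrence of each URL.
    Entries without a 'url' key are kept as-is."""
    seen: set[str] = set()
    unique: list[dict] = []
    for item in items:
        key = item.get("url")
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        unique.append(item)
    return unique
```

In the crawl above you would apply it as `products = dedupe_by_url(products)`; keying on the product URL assumes each card links to a unique page.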

### Multi-Page Crawling

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        # Session-based: maintains cookies and auth state across requests
        session_id = "docs-crawl"

        # Log in first
        await crawler.arun(
            url="https://docs.example.com/login",
            config=CrawlerRunConfig(
                session_id=session_id,
                js_code="""
                    document.querySelector('#email').value = 'user@example.com';
                    document.querySelector('#password').value = 'pass';
                    document.querySelector('form').submit();
                """,
                wait_for="#dashboard",
            ),
        )

        # Crawl authenticated pages
        urls = ["https://docs.example.com/api", "https://docs.example.com/guides"]
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=CrawlerRunConfig(session_id=session_id),
            )
            # Save markdown for RAG indexing (save_to_knowledge_base is
            # whatever persistence function your pipeline provides)
            save_to_knowledge_base(url, result.markdown)

asyncio.run(main())
```

## Installation

```bash
pip install crawl4ai
crawl4ai-setup                             # Install Playwright browsers
```

## Best Practices

1. **Markdown output** — Use `result.markdown` for LLM/RAG input; clean, structured, no HTML noise
2. **CSS extraction** — Use `JsonCssExtractionStrategy` for structured pages; no LLM cost, fast, deterministic
3. **LLM extraction** — Use `LLMExtractionStrategy` for unstructured pages; Pydantic schema ensures valid output
4. **JavaScript rendering** — Crawl4AI uses Playwright; handles SPAs, infinite scroll, dynamic content
5. **Sessions** — Use `session_id` for multi-page crawls; maintains cookies, auth state across requests
6. **Caching** — Enable `CacheMode.ENABLED` for development; avoid re-crawling the same pages
7. **Wait conditions** — Use `wait_for` CSS selectors to ensure content is loaded before extraction
8. **Rate limiting** — Add delays between requests; respect robots.txt; be a good citizen
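Practice 8 can be sketched with the standard library's `urllib.robotparser`: parse the site's robots.txt, check each URL before crawling, and honor any crawl delay. The robots.txt content below is made up for illustration; in practice you would fetch it from the target host before crawling:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed offline for the example.
robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check permissions before queueing URLs, and sleep crawl_delay()
# seconds between requests (e.g. await asyncio.sleep(delay)).
print(rp.can_fetch("my-crawler", "https://docs.example.com/guides"))   # allowed
print(rp.can_fetch("my-crawler", "https://docs.example.com/admin/x"))  # disallowed
print(rp.crawl_delay("my-crawler"))
```

Gating every `crawler.arun(...)` call on `can_fetch` plus a delay keeps the crawl polite without any extra dependencies.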

Related Skills

ice-crawler-harvester
Run ICE-Crawler’s Frost→Glacier→Crystal pipeline to ingest repositories safely, emit bounded artifact bundles, and hand off sealed fossils for downstream agents.

twitter-crawler
Twitter tweet crawler: crawls tweets for a specified username, saves them as Markdown, and supports configurable tweet counts and fields.

Bruno — Git-Friendly API Client

You are a professional Product Manager who is very friendly and supportive.
Your task is to help a user understand and plan their app idea through a series of questions and generate a PRD.

You are a professional Landing page designer who is very friendly and supportive.
Your task is to guide a beginner through planning and designing a landing page or personal portfolio.

crawl4ai
Complete toolkit for web crawling and data extraction using Crawl4AI. This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.

Daily Logs
Record the user's daily activities, progress, decisions, and learnings in a structured, chronological format.

Socratic Method: The Dialectic Engine
This skill transforms Claude into a Socratic agent, a cognitive partner who guides users to knowledge discovery through systematic questioning rather than direct instruction.

Sokratische Methode: Die Dialektik-Maschine
German edition of the Socratic Method skill: transforms Claude into a Socratic agent, a cognitive partner who guides users to knowledge discovery through systematic questioning instead of instructing directly.

College Football Data (CFB)
Before writing queries, consult `references/api-reference.md` for endpoints, conference IDs, team IDs, and data shapes.

College Basketball Data (CBB)
Before writing queries, consult `references/api-reference.md` for endpoints, conference IDs, team IDs, and data shapes.

Betting Analysis
Before writing queries, consult `references/api-reference.md` for odds formats, command parameters, and key concepts.