Crawl4AI — LLM-Friendly Web Crawler
You are an expert in Crawl4AI, the open-source web crawler built for AI applications. You help developers extract clean, structured data from websites for LLM training, RAG pipelines, and content analysis — with automatic markdown conversion, JavaScript rendering, CSS-based extraction, LLM-powered structured extraction, and session management for multi-page crawling.
Best use case
Crawl4AI — LLM-Friendly Web Crawler is best used when you need a repeatable AI agent workflow rather than a one-off prompt. Teams using this skill should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/crawl4ai/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
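As a sketch, the manual steps above look like this in the shell (the raw-file URL is a placeholder; copy the real one from the skill's GitHub page):

```shell
# Create the skill directory inside your project
mkdir -p .claude/skills/crawl4ai

# Download SKILL.md (replace OWNER/REPO with the actual repository path)
curl -fsSL "https://raw.githubusercontent.com/OWNER/REPO/main/SKILL.md" \
  -o .claude/skills/crawl4ai/SKILL.md
```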
Frequently Asked Questions
What does this skill do?
It turns your AI agent into a Crawl4AI specialist: extracting clean, LLM-ready markdown from websites, rendering JavaScript-heavy pages, pulling structured data via CSS selectors or LLM-powered extraction, and managing sessions for multi-page crawls.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Crawl4AI — LLM-Friendly Web Crawler
You are an expert in Crawl4AI, the open-source web crawler built for AI applications. You help developers extract clean, structured data from websites for LLM training, RAG pipelines, and content analysis — with automatic markdown conversion, JavaScript rendering, CSS-based extraction, LLM-powered structured extraction, and session management for multi-page crawling.
## Core Capabilities
### Basic Crawling
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://docs.example.com/getting-started",
        config=CrawlerRunConfig(
            word_count_threshold=10,  # Skip blocks with <10 words
            cache_mode=CacheMode.ENABLED,
        ),
    )
    print(result.markdown)           # Clean markdown (LLM-ready)
    print(result.cleaned_html)       # Cleaned HTML
    print(result.media["images"])    # Extracted images
    print(result.links["internal"])  # Internal links
    print(result.links["external"])  # External links
    print(result.metadata)           # Title, description, keywords
```
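The `result.links` mapping shown above can feed a simple frontier for follow-up crawls. A minimal sketch, assuming each internal-link entry is a dict with an `href` key (the exact shape may vary by Crawl4AI version):

```python
from urllib.parse import urljoin, urldefrag

def next_urls(base_url, internal_links, seen):
    """Normalize internal links and return unseen absolute URLs."""
    frontier = []
    for link in internal_links:
        href = link.get("href") if isinstance(link, dict) else link
        if not href:
            continue
        # Resolve relative links and drop #fragments before deduplicating
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            frontier.append(absolute)
    return frontier
```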
### LLM-Powered Structured Extraction
```python
import json
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    description: str
    rating: float | None
    features: list[str]

extraction = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token=os.environ["OPENAI_API_KEY"],
    schema=Product.model_json_schema(),
    instruction="Extract all products from this page",
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://shop.example.com/keyboards",
        config=CrawlerRunConfig(extraction_strategy=extraction),
    )
    products = [Product(**p) for p in json.loads(result.extracted_content)]
    for p in products:
        print(f"{p.name}: ${p.price} — {p.rating}★")
```
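LLM extraction can occasionally emit records that fail schema validation, and one bad record shouldn't abort the whole batch. A stdlib-only sketch of a tolerant loader (the `required` field names here are illustrative; `Product(**p)` above plays the strict-validation role):

```python
import json

def load_records(extracted_content, required=("name", "price")):
    """Parse extracted JSON, keeping only dict records with the required fields."""
    try:
        raw = json.loads(extracted_content)
    except json.JSONDecodeError:
        return []
    good = []
    for rec in raw if isinstance(raw, list) else []:
        if isinstance(rec, dict) and all(k in rec for k in required):
            good.append(rec)
    return good
```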
### CSS-Based Extraction (No LLM)
```python
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product listings",
    "baseSelector": ".product-card",  # Repeating element
    "fields": [
        {"name": "title", "selector": "h3.product-title", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ],
}
extraction = JsonCssExtractionStrategy(schema)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://shop.example.com/all",
        config=CrawlerRunConfig(
            extraction_strategy=extraction,
            js_code="window.scrollTo(0, document.body.scrollHeight);",  # Scroll to load more
            wait_for=".product-card:nth-child(20)",  # Wait for 20 items
        ),
    )
    products = json.loads(result.extracted_content)
```
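Since CSS extraction returns raw text, fields like `price` arrive as strings such as `"$79.99"`. A small normalizer, assuming the common currency-prefix format:

```python
import re

def parse_price(text):
    """Extract a float from price text like '$79.99' or '1,299.00 USD'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    if not match:
        return None
    return float(match.group().replace(",", ""))
```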
### Multi-Page Crawling
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async with AsyncWebCrawler() as crawler:
    # Session-based: maintains cookies, auth state
    session_id = "docs-crawl"

    # Login first
    await crawler.arun(
        url="https://docs.example.com/login",
        config=CrawlerRunConfig(
            session_id=session_id,
            js_code="""
                document.querySelector('#email').value = 'user@example.com';
                document.querySelector('#password').value = 'pass';
                document.querySelector('form').submit();
            """,
            wait_for="#dashboard",
        ),
    )

    # Crawl authenticated pages
    urls = ["https://docs.example.com/api", "https://docs.example.com/guides"]
    for url in urls:
        result = await crawler.arun(
            url=url,
            config=CrawlerRunConfig(session_id=session_id),
        )
        # Save markdown for RAG indexing
        save_to_knowledge_base(url, result.markdown)
```
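`save_to_knowledge_base` above is a placeholder for your own indexing step. Before indexing, the markdown is usually split into overlapping chunks; a minimal character-based sketch (real pipelines often split on headings or token counts instead):

```python
def chunk_markdown(text, max_chars=800, overlap=100):
    """Split text into chunks of at most max_chars, each overlapping the previous by overlap."""
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap  # step forward, keeping some shared context
    return chunks
```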
## Installation
```bash
pip install crawl4ai
crawl4ai-setup # Install Playwright browsers
```
## Best Practices
1. **Markdown output** — Use `result.markdown` for LLM/RAG input; clean, structured, no HTML noise
2. **CSS extraction** — Use `JsonCssExtractionStrategy` for structured pages; no LLM cost, fast, deterministic
3. **LLM extraction** — Use `LLMExtractionStrategy` for unstructured pages; Pydantic schema ensures valid output
4. **JavaScript rendering** — Crawl4AI uses Playwright; handles SPAs, infinite scroll, dynamic content
5. **Sessions** — Use `session_id` for multi-page crawls; maintains cookies, auth state across requests
6. **Caching** — Enable `CacheMode.ENABLED` for development; avoid re-crawling the same pages
7. **Wait conditions** — Use `wait_for` CSS selectors to ensure content is loaded before extraction
8. **Rate limiting** — Add delays between requests; respect robots.txt; be a good citizen
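The rate-limiting practice above can be sketched as a sequential loop with a delay between requests; `fetch` here is any async callable (for example a thin wrapper around `crawler.arun`), so the sketch stays library-independent:

```python
import asyncio

async def crawl_politely(urls, fetch, delay_s=1.0):
    """Fetch URLs one at a time, sleeping delay_s between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i:  # no delay before the first request
            await asyncio.sleep(delay_s)
        results[url] = await fetch(url)
    return results
```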