crawl
Crawl any website and save pages as local markdown files. Use when you need to download documentation, knowledge bases, or web content for offline access or analysis. No code required - just provide a URL.
Best use case
crawl is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Crawl any website and save pages as local markdown files. Use when you need to download documentation, knowledge bases, or web content for offline access or analysis. No code required - just provide a URL.
Teams using crawl should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/crawl/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How crawl Compares
| Feature / Agent | crawl | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Crawl any website and save pages as local markdown files. Use when you need to download documentation, knowledge bases, or web content for offline access or analysis. No code required - just provide a URL.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Crawl Skill
Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.
## Prerequisites
**Tavily API Key Required** - Get your key at https://tavily.com
Add to `~/.claude/settings.json`:
```json
{
"env": {
"TAVILY_API_KEY": "tvly-your-api-key-here"
}
}
```
## Quick Start
### Using the Script
```bash
./scripts/crawl.sh '<json>' [output_dir]
```
**Examples:**
```bash
# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'
# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'
# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs
# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'
# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'
```
When `output_dir` is provided, each crawled page is saved as a separate markdown file.
### Basic Crawl
```bash
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 1,
"limit": 20
}'
```
### Focused Crawl with Instructions
```bash
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 2,
"instructions": "Find API documentation and code examples",
"chunks_per_source": 3,
"select_paths": ["/docs/.*", "/api/.*"]
}'
```
## API Reference
### Endpoint
```
POST https://api.tavily.com/crawl
```
### Headers
| Header | Value |
|--------|-------|
| `Authorization` | `Bearer <TAVILY_API_KEY>` |
| `Content-Type` | `application/json` |
### Request Body
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `url` | string | Required | Root URL to begin crawling |
| `max_depth` | integer | 1 | Levels deep to crawl (1-5) |
| `max_breadth` | integer | 20 | Links per page |
| `limit` | integer | 50 | Total pages cap |
| `instructions` | string | null | Natural language guidance for focus |
| `chunks_per_source` | integer | 3 | Chunks per page (1-5, requires instructions) |
| `extract_depth` | string | `"basic"` | `basic` or `advanced` |
| `format` | string | `"markdown"` | `markdown` or `text` |
| `select_paths` | array | null | Regex patterns to include |
| `exclude_paths` | array | null | Regex patterns to exclude |
| `allow_external` | boolean | true | Include external domain links |
| `timeout` | float | 150 | Max wait (10-150 seconds) |
### Response Format
```json
{
"base_url": "https://docs.example.com",
"results": [
{
"url": "https://docs.example.com/page",
"raw_content": "# Page Title\n\nContent..."
}
],
"response_time": 12.5
}
```
## Depth vs Performance
| Depth | Typical Pages | Time |
|-------|---------------|------|
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |
**Start with `max_depth=1`** and increase only if needed.
## Crawl for Context vs Data Collection
**For agentic use (feeding results into context):** Always use `instructions` + `chunks_per_source`. This returns only relevant chunks instead of full pages, preventing context window explosion.
**For data collection (saving to files):** Omit `chunks_per_source` to get full page content.
## Examples
### For Context: Agentic Research (Recommended)
Use when feeding crawl results into an LLM context:
```bash
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 2,
"instructions": "Find API documentation and authentication guides",
"chunks_per_source": 3
}'
```
Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.
### For Context: Targeted Technical Docs
```bash
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://example.com",
"max_depth": 2,
"instructions": "Find all documentation about authentication and security",
"chunks_per_source": 3,
"select_paths": ["/docs/.*", "/api/.*"]
}'
```
### For Data Collection: Full Page Archive
Use when saving content to files for later processing:
```bash
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://example.com/blog",
"max_depth": 2,
"max_breadth": 50,
"select_paths": ["/blog/.*"],
"exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
}'
```
Returns full page content - use the script with `output_dir` to save as markdown files.
## Map API (URL Discovery)
Use `map` instead of `crawl` when you only need URLs, not content:
```bash
curl --request POST \
--url https://api.tavily.com/map \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 2,
"instructions": "Find all API docs and guides"
}'
```
Returns URLs only (faster than crawl):
```json
{
"base_url": "https://docs.example.com",
"results": [
"https://docs.example.com/api/auth",
"https://docs.example.com/guides/quickstart"
]
}
```
## Tips
- **Always use `chunks_per_source` for agentic workflows** - prevents context explosion when feeding results to LLMs
- **Omit `chunks_per_source` only for data collection** - when saving full pages to files
- **Start conservative** (`max_depth=1`, `limit=20`) and scale up
- **Use path patterns** to focus on relevant sections
- **Use Map first** to understand site structure before full crawl
- **Always set a `limit`** to prevent runaway crawlsRelated Skills
firecrawl-cli
Firecrawl CLI for web scraping, crawling, and search. Scrape single pages or entire websites, map site URLs, and search the web with full content extraction. Returns clean markdown optimized for LLM context. Use for research, documentation extraction, competitive intelligence, and content monitoring.
crawl4ai
AI-powered web scraping framework for extracting structured data from websites.
finviz-crawler
Continuous financial news crawler for finviz.com with SQLite storage, article extraction, and query tool.
firecrawl
Web search and scraping via Firecrawl API. Use when you need to search the web, scrape websites (including JS-heavy pages), crawl entire sites, or extract structured data from web pages. Requires FIRECRAWL_API_KEY environment variable.
crawl-for-ai
Web scraping using local Crawl4AI instance.
paylock
Non-custodial SOL escrow for AI agent deals.
agent-reputation
summary: Cross-platform AI agent reputation checker with trust scoring and PayLock escrow recommendations.
Telecom Agent Skill
Turn your AI Agent into a Telecom Operator. Bulk calling, ChatOps, and Field Monitoring.
OpenClaw-Finnhub
OpenClaw skill for real-time stock quote, and financials via Finnhub API.
```markdown
# OpenClaw-Last.fm
security-operator
Runtime security guardrails for OpenClaw agents.
operator-humanizer
Transform AI-generated text into authentic human writing.