crawl

Crawl any website and save pages as local markdown files. Use when you need to download documentation, knowledge bases, or web content for offline access or analysis. No code required - just provide a URL.

7 stars

byDemerzels-lab

View on GitHub Installation ↓

Best use case

crawl is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using crawl should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/crawl/SKILL.md --create-dirs "https://raw.githubusercontent.com/Demerzels-lab/elsamultiskillagent/main/public/skills/barneyjm/crawl/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/crawl/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How crawl Compares

Feature / Agent	crawl	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Crawl Skill

Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.

## Prerequisites

**Tavily API Key Required** - Get your key at https://tavily.com

Add to `~/.claude/settings.json`:
```json
{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
```

## Quick Start

### Using the Script

```bash
./scripts/crawl.sh '<json>' [output_dir]
```

**Examples:**
```bash
# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'

# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'

# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs

# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'

# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'
```

When `output_dir` is provided, each crawled page is saved as a separate markdown file.

### Basic Crawl

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 1,
    "limit": 20
  }'
```

### Focused Crawl with Instructions

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and code examples",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
```

## API Reference

### Endpoint

```
POST https://api.tavily.com/crawl
```

### Headers

| Header | Value |
|--------|-------|
| `Authorization` | `Bearer <TAVILY_API_KEY>` |
| `Content-Type` | `application/json` |

### Request Body

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `url` | string | Required | Root URL to begin crawling |
| `max_depth` | integer | 1 | Levels deep to crawl (1-5) |
| `max_breadth` | integer | 20 | Links per page |
| `limit` | integer | 50 | Total pages cap |
| `instructions` | string | null | Natural language guidance for focus |
| `chunks_per_source` | integer | 3 | Chunks per page (1-5, requires instructions) |
| `extract_depth` | string | `"basic"` | `basic` or `advanced` |
| `format` | string | `"markdown"` | `markdown` or `text` |
| `select_paths` | array | null | Regex patterns to include |
| `exclude_paths` | array | null | Regex patterns to exclude |
| `allow_external` | boolean | true | Include external domain links |
| `timeout` | float | 150 | Max wait (10-150 seconds) |

### Response Format

```json
{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}
```

## Depth vs Performance

| Depth | Typical Pages | Time |
|-------|---------------|------|
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |

**Start with `max_depth=1`** and increase only if needed.

## Crawl for Context vs Data Collection

**For agentic use (feeding results into context):** Always use `instructions` + `chunks_per_source`. This returns only relevant chunks instead of full pages, preventing context window explosion.

**For data collection (saving to files):** Omit `chunks_per_source` to get full page content.

## Examples

### For Context: Agentic Research (Recommended)

Use when feeding crawl results into an LLM context:

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and authentication guides",
    "chunks_per_source": 3
  }'
```

Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.

### For Context: Targeted Technical Docs

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "max_depth": 2,
    "instructions": "Find all documentation about authentication and security",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
```

### For Data Collection: Full Page Archive

Use when saving content to files for later processing:

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/blog",
    "max_depth": 2,
    "max_breadth": 50,
    "select_paths": ["/blog/.*"],
    "exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
  }'
```

Returns full page content - use the script with `output_dir` to save as markdown files.

## Map API (URL Discovery)

Use `map` instead of `crawl` when you only need URLs, not content:

```bash
curl --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find all API docs and guides"
  }'
```

Returns URLs only (faster than crawl):

```json
{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}
```

## Tips

- **Always use `chunks_per_source` for agentic workflows** - prevents context explosion when feeding results to LLMs
- **Omit `chunks_per_source` only for data collection** - when saving full pages to files
- **Start conservative** (`max_depth=1`, `limit=20`) and scale up
- **Use path patterns** to focus on relevant sections
- **Use Map first** to understand site structure before full crawl
- **Always set a `limit`** to prevent runaway crawls

Related Skills

firecrawl-cli

from Demerzels-lab/elsamultiskillagent

Firecrawl CLI for web scraping, crawling, and search. Scrape single pages or entire websites, map site URLs, and search the web with full content extraction. Returns clean markdown optimized for LLM context. Use for research, documentation extraction, competitive intelligence, and content monitoring.

crawl4ai

from Demerzels-lab/elsamultiskillagent

AI-powered web scraping framework for extracting structured data from websites.

finviz-crawler

from Demerzels-lab/elsamultiskillagent

Continuous financial news crawler for finviz.com with SQLite storage, article extraction, and query tool.

firecrawl

from Demerzels-lab/elsamultiskillagent

Web search and scraping via Firecrawl API. Use when you need to search the web, scrape websites (including JS-heavy pages), crawl entire sites, or extract structured data from web pages. Requires FIRECRAWL_API_KEY environment variable.