harvest-deep-crawl
Multi-page deep crawling - documentation sites, wikis, knowledge bases
Best use case
harvest-deep-crawl is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using harvest-deep-crawl can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/harvest-deep-crawl/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
Frequently Asked Questions
What does this skill do?
harvest-deep-crawl performs multi-page deep crawling: it follows internal links on documentation sites, wikis, and knowledge bases to a specified depth and compiles the pages into a knowledge base.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Harvest Deep Crawl
Crawl multi-page websites following internal links to a specified depth. Ideal for building complete knowledge bases from documentation sites, wikis, and reference materials.
## Usage
```
/crawl <url> --depth <N>
```
## Examples
```bash
# Crawl docs site 3 levels deep
/crawl https://docs.example.com --depth 3
# Crawl a specific section
/crawl https://docs.example.com/api --depth 2
# Crawl with page limit
/crawl https://wiki.example.com --depth 5 --max-pages 50
```
## Parameters
| Param | Default | Description |
|-------|---------|-------------|
| `--depth` | 2 | Max link-following depth |
| `--max-pages` | 100 | Max pages to crawl |
| `--same-domain` | true | Stay on same domain |
| `--include` | * | URL pattern to include |
| `--exclude` | - | URL pattern to exclude |
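The pattern syntax for `--include` and `--exclude` is not specified above. As a minimal sketch, assuming shell-style globs, the filters might be applied to candidate URLs like this (`url_allowed` is a hypothetical helper name):

```python
# Sketch only: the skill does not document its pattern syntax, so this
# assumes shell-style globs via fnmatch. `url_allowed` is a hypothetical name.
from fnmatch import fnmatch

def url_allowed(url: str, include: str = "*", exclude: str | None = None) -> bool:
    if exclude and fnmatch(url, exclude):
        return False
    return fnmatch(url, include)

url_allowed("https://docs.example.com/api/auth", include="*/api/*")    # True
url_allowed("https://docs.example.com/blog/post", exclude="*/blog/*")  # False
```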
## How It Works
1. Start at root URL, extract all internal links
2. Follow links up to specified depth (BFS order)
3. Extract content from each page
4. Deduplicate pages with > 90% content overlap
5. Build table of contents from page hierarchy
6. Merge into coherent knowledge base
7. Save to `.claude/cache/agents/harvest/crawl-{domain}/`
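A minimal sketch of the loop those steps describe, not the skill's actual code. Assumptions: `requests` for fetching, a naive href regex for step 1, and difflib as a stand-in for whatever >90%-overlap metric the skill uses:

```python
import re
import time
from collections import deque
from difflib import SequenceMatcher
from urllib.parse import urljoin, urlparse

import requests

HREF = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def crawl(root: str, depth: int = 2, max_pages: int = 100) -> list[dict]:
    domain = urlparse(root).netloc
    queue = deque([(root, 0)])        # (url, depth) pairs; FIFO queue => BFS
    seen, pages = {root}, []

    while queue and len(pages) < max_pages:
        url, d = queue.popleft()
        html = requests.get(url, timeout=15).text
        time.sleep(0.5)               # politeness: stays under 2 requests/second

        # Step 4: skip near-duplicates (>90% content overlap)
        if any(SequenceMatcher(None, html, p["html"]).ratio() > 0.9 for p in pages):
            continue
        pages.append({"url": url, "depth": d, "html": html})

        if d >= depth:                # Step 2: stop following links at max depth
            continue
        for link in HREF.findall(html):
            absolute = urljoin(url, link)
            # --same-domain: only queue links on the starting domain
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, d + 1))
    return pages
```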
## Output Structure
```
crawl-{domain}-{timestamp}/
  index.md        # Table of contents + summary
  page-001.md     # First page content
  page-002.md     # Second page content
  ...
  metadata.json   # Crawl stats, URLs, timings
```
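`metadata.json` is described only as holding crawl stats, URLs, and timings; a purely illustrative example, with every field name assumed:

```json
{
  "root_url": "https://docs.example.com",
  "max_depth": 3,
  "pages_crawled": 42,
  "pages_skipped_as_duplicates": 5,
  "urls": [
    "https://docs.example.com",
    "https://docs.example.com/api"
  ],
  "started_at": "2025-01-01T12:00:00Z",
  "duration_seconds": 38.2
}
```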
## Crawl Engine
### Primary: crawl4ai (Docker port 11235)
```bash
curl -s http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.example.com"],
    "max_depth": 3,
    "same_domain": true,
    "word_count_threshold": 50
  }'
```
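The same request from Python, mirroring the curl payload above; the response schema is not documented here, so the sketch just prints whatever comes back:

```python
# Equivalent of the curl call above; only the request payload is taken from
# the docs — the shape of the response is not assumed.
import requests

resp = requests.post(
    "http://localhost:11235/crawl",
    json={
        "urls": ["https://docs.example.com"],
        "max_depth": 3,
        "same_domain": True,
        "word_count_threshold": 50,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```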
### Fallback: Manual Link Following
When Docker is unavailable:
1. WebFetch root URL
2. Parse links from markdown output
3. WebFetch each linked page (depth-limited)
4. Compile results
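WebFetch returns page content as markdown, so step 2 reduces to pulling URLs out of `[text](url)` links; a minimal sketch:

```python
# Sketch of step 2 of the fallback: extract absolute URLs from markdown
# [text](url) links. Relative links would still need resolving against
# the page URL (e.g. with urllib.parse.urljoin).
import re

MD_LINK = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")

def links_from_markdown(markdown: str) -> list[str]:
    return MD_LINK.findall(markdown)
```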
## Use Cases
| Scenario | Depth | Max Pages |
|----------|-------|-----------|
| API reference | 2-3 | 50 |
| Full documentation site | 3-5 | 100 |
| Wiki section | 2 | 30 |
| Changelog history | 1-2 | 20 |
| Tutorial series | 2-3 | 30 |
## Rules
- Respect robots.txt
- Max 2 requests/second
- Skip binary files (PDF, images, videos)
- Detect and skip infinite pagination
- Cache results for 24 hours

Related Skills
tldr-deep
Run full 5-layer analysis (AST, call graph, CFG, DFG, slice) on a specific function for deep debugging or understanding.
harvest-structured
Structured data extraction - tables, pricing, products, API endpoints with schema
harvest-single
Single page smart extraction - articles, docs, blog posts to clean markdown
harvest-monitor
Web change monitoring - track changes on pages, detect updates, changelog diffs
harvest-competitive
Competitive intelligence - extract features, pricing, tech stack from competitor sites
harvest-adaptive
Adaptive content summarization - auto-detect content type and produce relevant summary
firecrawl-scrape
Scrape web pages and extract content via Firecrawl MCP
deep-interview
Mathematically rigorous Socratic interview system that drives ambiguity below 20% before any code is written. One question per message, weighted ambiguity scoring, brownfield-aware, outputs a complete PRD. Replaces discovery-interview with a stricter protocol.
workflow-router
Goal-based workflow orchestration - routes tasks to specialist agents based on user goals
wiring
Wiring Verification
websocket-patterns
Connection management, room patterns, reconnection strategies, message buffering, and binary protocol design.
visual-verdict
Screenshot comparison QA for frontend development. Takes a screenshot of the current implementation, scores it across multiple visual dimensions, and returns a structured PASS/REVISE/FAIL verdict with concrete fixes. Use when implementing UI from a design reference or verifying visual correctness.