harvest-deep-crawl

Multi-page deep crawling - documentation sites, wikis, knowledge bases

422 stars

Best use case

harvest-deep-crawl is best used when you need a repeatable AI agent workflow instead of a one-off prompt: multi-page deep crawling of documentation sites, wikis, and knowledge bases.

Teams using harvest-deep-crawl should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/harvest-deep-crawl/SKILL.md --create-dirs "https://raw.githubusercontent.com/vibeeval/vibecosystem/main/skills/harvest-deep-crawl/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/harvest-deep-crawl/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How harvest-deep-crawl Compares

Feature                    harvest-deep-crawl    Standard Approach
Platform Support           Not specified         Limited / Varies
Context Awareness          High                  Baseline
Installation Complexity    Unknown               N/A

Frequently Asked Questions

What does this skill do?

It performs multi-page deep crawling of documentation sites, wikis, and knowledge bases, following internal links to a specified depth and merging the pages into a single knowledge base.

Where can I find the source code?

The source code lives in the vibeeval/vibecosystem repository on GitHub, under skills/harvest-deep-crawl/SKILL.md.

SKILL.md Source

# Harvest Deep Crawl

Crawl multi-page websites following internal links to a specified depth. Ideal for building complete knowledge bases from documentation sites, wikis, and reference materials.

## Usage

```
/crawl <url> --depth <N>
```

## Examples

```bash
# Crawl docs site 3 levels deep
/crawl https://docs.example.com --depth 3

# Crawl a specific section
/crawl https://docs.example.com/api --depth 2

# Crawl with page limit
/crawl https://wiki.example.com --depth 5 --max-pages 50
```

## Parameters

| Param | Default | Description |
|-------|---------|-------------|
| `--depth` | 2 | Max link-following depth |
| `--max-pages` | 100 | Max pages to crawl |
| `--same-domain` | true | Stay on same domain |
| `--include` | * | URL pattern to include |
| `--exclude` | - | URL pattern to exclude |

## How It Works

1. Start at root URL, extract all internal links
2. Follow links up to the specified depth (BFS order; sketched below)
3. Extract content from each page
4. Deduplicate pages with > 90% content overlap
5. Build table of contents from page hierarchy
6. Merge into coherent knowledge base
7. Save to `.claude/cache/agents/harvest/crawl-{domain}/`
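
A minimal Python sketch of the crawl loop above (steps 1–4). The `fetch` and `extract_links` callables are hypothetical stand-ins for the skill's real page fetcher and link parser, and the exact-hash check is a crude stand-in for the 90%-overlap deduplication:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(root, depth=2, max_pages=100, fetch=None, extract_links=None):
    """Breadth-first crawl: follow internal links to `depth`,
    stop at `max_pages`, and skip near-duplicate pages."""
    domain = urlparse(root).netloc
    queue = deque([(root, 0)])        # (url, depth) pairs, BFS order
    seen_urls = {root}
    seen_fingerprints = set()         # crude content dedup
    pages = []

    while queue and len(pages) < max_pages:
        url, d = queue.popleft()
        text = fetch(url)             # hypothetical: url -> markdown
        fingerprint = hash(" ".join(text.split()).lower())
        if fingerprint in seen_fingerprints:
            continue                  # stand-in for the >90%-overlap check
        seen_fingerprints.add(fingerprint)
        pages.append((url, text))

        if d >= depth:
            continue
        for link in extract_links(text):          # hypothetical parser
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc != domain:
                continue                          # --same-domain behavior
            if absolute not in seen_urls:
                seen_urls.add(absolute)
                queue.append((absolute, d + 1))
    return pages
```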

## Output Structure

```
crawl-{domain}-{timestamp}/
  index.md          # Table of contents + summary
  page-001.md       # First page content
  page-002.md       # Second page content
  ...
  metadata.json     # Crawl stats, URLs, timings
```

## Crawl Engine

### Primary: crawl4ai (Docker port 11235)

```bash
curl -s http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.example.com"],
    "max_depth": 3,
    "same_domain": true,
    "word_count_threshold": 50
  }'
```

### Fallback: Manual Link Following

When Docker is unavailable (see the sketch after this list):
1. WebFetch root URL
2. Parse links from markdown output
3. WebFetch each linked page (depth-limited)
4. Compile results
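
A rough sketch of step 2, pulling link targets out of WebFetch's markdown output. The regex and function name are illustrative assumptions, not the skill's actual parser:

```python
import re
from urllib.parse import urljoin, urlparse

MD_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")  # matches [text](target)

def links_from_markdown(markdown, base_url):
    """Resolve markdown link targets against the page URL,
    keeping only http(s) links and dropping #fragments."""
    links = []
    for target in MD_LINK.findall(markdown):
        absolute = urljoin(base_url, target.split("#")[0])
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links
```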

## Use Cases

| Scenario | Depth | Max Pages |
|----------|-------|-----------|
| API reference | 2-3 | 50 |
| Full documentation site | 3-5 | 100 |
| Wiki section | 2 | 30 |
| Changelog history | 1-2 | 20 |
| Tutorial series | 2-3 | 30 |

## Rules

- Respect robots.txt
- Max 2 requests/second (throttle sketched below)
- Skip binary files (PDF, images, videos)
- Detect and skip infinite pagination
- Cache results for 24 hours
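
A hedged sketch of the first two rules, using Python's standard `urllib.robotparser` for the robots.txt check and a monotonic-clock throttle for the 2 requests/second cap. The class name and user-agent string are illustrative assumptions:

```python
import time
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

class PoliteFetcher:
    """Honor robots.txt and cap the crawl at 2 requests/second."""

    def __init__(self, root, user_agent="harvest-deep-crawl"):  # name assumed
        self.user_agent = user_agent
        self.robots = RobotFileParser()
        self.robots.set_url(urljoin(root, "/robots.txt"))
        self.robots.read()
        self._last_request = 0.0

    def allowed(self, url):
        return self.robots.can_fetch(self.user_agent, url)

    def wait_turn(self):
        # Enforce at least 0.5 s between requests (2 req/s).
        elapsed = time.monotonic() - self._last_request
        if elapsed < 0.5:
            time.sleep(0.5 - elapsed)
        self._last_request = time.monotonic()
```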

Related Skills

All from vibeeval/vibecosystem (422 stars):

  • tldr-deep - Run full 5-layer analysis (AST, call graph, CFG, DFG, slice) on a specific function for deep debugging or understanding.
  • harvest-structured - Structured data extraction: tables, pricing, products, API endpoints with schema.
  • harvest-single - Single-page smart extraction: articles, docs, blog posts to clean markdown.
  • harvest-monitor - Web change monitoring: track changes on pages, detect updates, changelog diffs.
  • harvest-competitive - Competitive intelligence: extract features, pricing, tech stack from competitor sites.
  • harvest-adaptive - Adaptive content summarization: auto-detect content type and produce a relevant summary.
  • firecrawl-scrape - Scrape web pages and extract content via Firecrawl MCP.
  • deep-interview - Mathematically rigorous Socratic interview system that drives ambiguity below 20% before any code is written. One question per message, weighted ambiguity scoring, brownfield-aware, outputs a complete PRD. Replaces discovery-interview with a stricter protocol.
  • workflow-router - Goal-based workflow orchestration: routes tasks to specialist agents based on user goals.
  • wiring - Wiring verification.
  • websocket-patterns - Connection management, room patterns, reconnection strategies, message buffering, and binary protocol design.
  • visual-verdict - Screenshot comparison QA for frontend development. Takes a screenshot of the current implementation, scores it across multiple visual dimensions, and returns a structured PASS/REVISE/FAIL verdict with concrete fixes. Use when implementing UI from a design reference or verifying visual correctness.