skill-scrape

PDF annotation extraction routing to scrape-agent

438 stars

Best use case

skill-scrape is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

PDF annotation extraction routing to scrape-agent

Teams using skill-scrape should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/skill-scrape/SKILL.md --create-dirs "https://raw.githubusercontent.com/benbrastmckie/nvim/main/.claude/extensions/filetypes/skills/skill-scrape/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/skill-scrape/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How skill-scrape Compares

Feature / Agentskill-scrapeStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

PDF annotation extraction routing to scrape-agent

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Scrape Skill

Thin wrapper that routes PDF annotation extraction to the `scrape-agent`, which extracts highlights, comments, notes, and other annotations from PDF files.

## Context Pointers

Reference (do not load eagerly):
- Path: `.claude/context/formats/subagent-return.md`
- Purpose: Return validation
- Load at: Subagent execution only

Note: This skill is a thin wrapper. Context is loaded by the delegated agent, not this skill.

## Trigger Conditions

This skill activates when:

### Direct Invocation
- User explicitly runs `/scrape` command
- User requests annotation extraction in conversation

### Implicit Invocation (during task implementation)

When an implementing agent encounters any of these patterns:

**Plan step language patterns**:
- "Extract annotations from [file].pdf"
- "Extract highlights from [file].pdf"
- "Extract comments from [file]"
- "Scrape notes from [file].pdf"
- "Collect annotations in [file]"
- "Export PDF comments to markdown"
- "Gather highlights and notes from [file]"

**File extension detection**:
- Source file has extension: `.pdf`
- Target mentions: "annotations", "highlights", "comments", "notes"

**Task description keywords**:
- "annotation extraction"
- "PDF scraping"
- "extract highlights"
- "collect comments"
- "gather notes from PDF"

### When NOT to Trigger

Do not invoke for:
- Non-PDF files (.docx, .html, .txt, etc.)
- General document conversion (use skill-filetypes)
- Reading PDF text content without annotation context (use skill-filetypes)
- Operations on spreadsheets or presentations (use skill-spreadsheet, skill-presentation)

---

## Execution

### 1. Input Validation

Validate required inputs:
- `pdf_path` - Must be provided, must exist, must be a .pdf file
- `output_path` - Optional, defaults to `{basename}_annotations.md` in same directory
- `annotation_types` - Optional array, defaults to all types
- `output_format` - Optional, defaults to "markdown"

```bash
# Validate source exists
if [ ! -f "$pdf_path" ]; then
  return error "PDF file not found: $pdf_path"
fi

# Validate source is a PDF
if [[ "${pdf_path##*.}" != "pdf" ]]; then
  return error "Source must be a .pdf file: $pdf_path"
fi

# Determine output path if not provided
if [ -z "$output_path" ]; then
  source_dir=$(dirname "$pdf_path")
  source_base=$(basename "$pdf_path" .pdf)
  output_path="${source_dir}/${source_base}_annotations.md"
fi
```

### 2. Context Preparation

Prepare delegation context:

```json
{
  "pdf_path": "/absolute/path/to/document.pdf",
  "output_path": "/absolute/path/to/document_annotations.md",
  "annotation_types": ["highlights", "comments", "notes", "bookmarks"],
  "output_format": "markdown",
  "metadata": {
    "session_id": "sess_{timestamp}_{random}",
    "delegation_depth": 1,
    "delegation_path": ["orchestrator", "scrape", "skill-scrape"]
  }
}
```

### 3. Invoke Agent

**CRITICAL**: You MUST use the **Task** tool to spawn the scrape agent.

**Required Tool Invocation**:
```
Tool: Task (NOT Skill)
Parameters:
  - subagent_type: "scrape-agent"
  - prompt: [Include pdf_path, output_path, annotation_types, output_format, metadata]
  - description: "Extract annotations from {pdf_path} to {output_path}"
```

**DO NOT** use `Skill(scrape-agent)` - this will FAIL.
Agents live in `.claude/agents/` or extension agent directories, not `.claude/skills/`.
The Skill tool can only invoke skills from `.claude/skills/`.

The agent will:
- Open the PDF and enumerate all annotation objects
- Filter by requested annotation types
- Format annotations according to output_format
- Write structured output to output_path
- Return standardized JSON result

### 4. Return Validation

Validate return matches `subagent-return.md` schema:
- Status is one of: scraped, partial, failed
- Summary is non-empty and <100 tokens
- Artifacts array present with output file path
- Metadata contains session_id, agent_type, delegation info

### 5. Return Propagation

Return validated result to caller without modification.

---

## Return Format

See `.claude/context/formats/subagent-return.md` for full specification.

Expected successful return:
```json
{
  "status": "scraped",
  "summary": "Extracted 42 annotations from document.pdf: 28 highlights, 10 comments, 4 notes",
  "artifacts": [
    {
      "type": "implementation",
      "path": "/absolute/path/to/document_annotations.md",
      "summary": "Extracted annotations in markdown format"
    }
  ],
  "metadata": {
    "session_id": "sess_...",
    "agent_type": "scrape-agent",
    "delegation_depth": 2,
    "delegation_path": ["orchestrator", "scrape", "skill-scrape", "scrape-agent"],
    "annotation_count": 42,
    "annotation_breakdown": {
      "highlights": 28,
      "comments": 10,
      "notes": 4
    }
  },
  "next_steps": "Review extracted annotations"
}
```

---

## Error Handling

### Input Validation Errors
Return immediately with failed status if PDF not found or not a .pdf file.

### Unsupported Format
Return failed status with clear message about supported annotation types.

### Agent Errors
Pass through the agent's error return verbatim.

### Tool Not Available
Return failed status with installation instructions for PDF annotation tools (e.g., pdfannots, PyMuPDF).

---

## MUST NOT

- Run the postflight step (git commit, status update) — that is the command's responsibility
- Modify the return from scrape-agent before propagating
- Load context files eagerly — only reference them when needed
- Use the Skill tool to invoke scrape-agent (use Task tool instead)

Related Skills

skill-learn

438
from benbrastmckie/nvim

Scan codebase for FIX:/NOTE:/TODO:/QUESTION: tags and create structured tasks with interactive selection. Invoke for /learn command.

skill-deck

438
from benbrastmckie/nvim

Generate YC-style investor pitch decks in Typst

skill-todo

438
from benbrastmckie/nvim

Archive completed and abandoned tasks with CHANGE_LOG.md updates and memory harvest suggestions

skill-team-research

438
from benbrastmckie/nvim

Orchestrate multi-agent research with wave-based parallel execution. Spawns 2-4 teammates for diverse investigation angles and synthesizes findings.

skill-team-plan

438
from benbrastmckie/nvim

Orchestrate multi-agent planning with parallel plan generation. Spawns 2-3 teammates for diverse planning approaches and synthesizes into final plan with trade-off analysis.

skill-team-implement

438
from benbrastmckie/nvim

Orchestrate multi-agent implementation with parallel phase execution. Spawns teammates for independent phases and coordinates dependent phases. Includes debugger teammate for error recovery.

skill-status-sync

438
from benbrastmckie/nvim

Atomically update task status across TODO.md and state.json. For standalone use only.

skill-spawn

438
from benbrastmckie/nvim

Research blockers and spawn new tasks to overcome them, updating parent task dependencies

skill-researcher

438
from benbrastmckie/nvim

Conduct general research using web search, documentation, and codebase exploration. Invoke for general research tasks.

skill-refresh

438
from benbrastmckie/nvim

Manage Claude Code resources - terminate orphaned processes and clean up ~/.claude/ directory

skill-planner

438
from benbrastmckie/nvim

Create phased implementation plans from research findings. Invoke when a task needs an implementation plan.

skill-orchestrator

438
from benbrastmckie/nvim

Route commands to appropriate workflows based on task language and status. Invoke when executing /task, /research, /plan, /implement commands.