web-scrape

Intelligent web scraper with content extraction, multiple output formats, and error handling

242 stars

Best use case

web-scrape is best used when you need a repeatable AI agent workflow instead of a one-off prompt. It is especially useful for teams working in multi. Intelligent web scraper with content extraction, multiple output formats, and error handling

Intelligent web scraper with content extraction, multiple output formats, and error handling

Users should expect a more consistent workflow output, faster repeated execution, and less time spent rewriting prompts from scratch.

Practical example

Example input

Use the "web-scrape" skill to help with this workflow task. Context: Intelligent web scraper with content extraction, multiple output formats, and error handling

Example output

A structured workflow result with clearer steps, more consistent formatting, and an output that is easier to reuse in the next run.

When to use this skill

Use this skill when you want a reusable workflow rather than writing the same prompt again and again.

When not to use this skill

Do not use this when you only need a one-off answer and do not need a reusable workflow.
Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/web-scrape/SKILL.md --create-dirs "https://raw.githubusercontent.com/aiskillstore/marketplace/main/skills/21pounder/web-scrape/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/web-scrape/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How web-scrape Compares

Feature / Agent	web-scrape	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Intelligent web scraper with content extraction, multiple output formats, and error handling

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Web Scraping Skill v3.0

## Usage

```
/web-scrape <url> [options]
```

**Options:**
- `--format=markdown|json|text` - Output format (default: markdown)
- `--full` - Include full page content (skip smart extraction)
- `--screenshot` - Also save a screenshot
- `--scroll` - Scroll to load dynamic content (infinite scroll pages)

**Examples:**
```
/web-scrape https://example.com/article
/web-scrape https://news.site.com/story --format=json
/web-scrape https://spa-app.com/page --scroll --screenshot
```

---

## Execution Flow

### Phase 1: Navigate and Load

```
1. mcp__playwright__browser_navigate
   url: "<target URL>"

2. mcp__playwright__browser_wait_for
   time: 2  (allow initial render)
```

**If `--scroll` option:** Execute scroll sequence to trigger lazy loading:
```
3. mcp__playwright__browser_evaluate
   function: "async () => {
     for (let i = 0; i < 3; i++) {
       window.scrollTo(0, document.body.scrollHeight);
       await new Promise(r => setTimeout(r, 1000));
     }
     window.scrollTo(0, 0);
   }"
```

### Phase 2: Capture Content

```
4. mcp__playwright__browser_snapshot
   → Returns full accessibility tree with all text content
```

**If `--screenshot` option:**
```
5. mcp__playwright__browser_take_screenshot
   filename: "scraped_<domain>_<timestamp>.png"
   fullPage: true
```

### Phase 3: Close Browser

```
6. mcp__playwright__browser_close
```

---

## Smart Content Extraction

After getting the snapshot, apply intelligent extraction:

### Step 1: Identify Content Type

| Page Type | Indicators | Extraction Strategy |
|-----------|------------|---------------------|
| **Article/Blog** | `<article>`, long paragraphs, date/author | Extract main article body |
| **Product Page** | Price, "Add to Cart", specs | Extract title, price, description, specs |
| **Documentation** | Code blocks, headings hierarchy | Preserve structure and code |
| **List/Search** | Repeated item patterns | Extract as structured list |
| **Landing Page** | Hero section, CTAs | Extract key messaging |

### Step 2: Filter Noise

**ALWAYS REMOVE these elements from output:**
- Navigation menus and breadcrumbs
- Footer content (copyright, links)
- Sidebars (ads, related articles, social links)
- Cookie banners and popups
- Comments section (unless specifically requested)
- Share buttons and social widgets
- Login/signup prompts

### Step 3: Structure the Content

**For Articles:**
```markdown
# [Title]

**Source:** [URL]
**Date:** [if available]
**Author:** [if available]

---

[Main content in clean markdown]
```

**For Product Pages:**
```markdown
# [Product Name]

**Price:** [price]
**Availability:** [in stock/out of stock]

## Description
[product description]

## Specifications
| Spec | Value |
|------|-------|
| ... | ... |
```

---

## Output Formats

### Markdown (default)
Clean, readable markdown with proper headings, lists, and formatting.

### JSON
```json
{
  "url": "https://...",
  "title": "Page Title",
  "type": "article|product|docs|list",
  "content": {
    "main": "...",
    "metadata": {}
  },
  "extracted_at": "ISO timestamp"
}
```

### Text
Plain text with minimal formatting, suitable for further processing.

---

## Error Handling

### Navigation Errors

| Error | Detection | Action |
|-------|-----------|--------|
| **Timeout** | Page doesn't load in 30s | Report error, suggest retry |
| **404 Not Found** | "404" in title/content | Report "Page not found" |
| **403 Forbidden** | "403", "Access Denied" | Report access restriction |
| **CAPTCHA** | "captcha", "verify you're human" | Report CAPTCHA detected, cannot proceed |
| **Paywall** | "subscribe", "premium content" | Extract visible content, note paywall |

### Recovery Actions

```
If page load fails:
1. Report the specific error to user
2. Suggest: "Try again?" or "Different URL?"
3. Close browser cleanly

If content is blocked:
1. Report what was detected (CAPTCHA/paywall/geo-block)
2. Extract any available preview content
3. Suggest alternatives if applicable
```

---

## Advanced Scenarios

### Single Page Applications (SPA)
```
1. Navigate to URL
2. Wait longer (3-5 seconds) for JS hydration
3. Use browser_wait_for with specific text if known
4. Then snapshot
```

### Infinite Scroll Pages
```
1. Navigate
2. Execute scroll loop (see Phase 1)
3. Snapshot after scrolling completes
```

### Pages with Click-to-Reveal Content
```
1. Snapshot first to identify clickable elements
2. Use browser_click on "Read more" / "Show all" buttons
3. Wait briefly
4. Snapshot again for full content
```

### Multi-page Articles
```
1. Scrape first page
2. Identify "Next" or pagination links
3. Ask user: "Article has X pages. Scrape all?"
4. If yes, iterate through pages and combine
```

---

## Performance Guidelines

| Metric | Target | How |
|--------|--------|-----|
| **Speed** | < 15 seconds | Minimal waits, parallel where possible |
| **Token Usage** | < 5000 tokens | Smart extraction, not full DOM |
| **Reliability** | > 95% success | Proper error handling |

---

## Security Notes

- Never execute arbitrary JavaScript from the page
- Don't follow redirects to suspicious domains
- Don't submit forms or click login buttons
- Don't scrape pages that require authentication (unless user provides credentials flow)
- Respect robots.txt when mentioned by user

---

## Quick Reference

**Minimum viable scrape (4 tool calls):**
```
1. browser_navigate → 2. browser_wait_for → 3. browser_snapshot → 4. browser_close
```

**Full-featured scrape (with scroll + screenshot):**
```
1. browser_navigate
2. browser_wait_for
3. browser_evaluate (scroll)
4. browser_snapshot
5. browser_take_screenshot
6. browser_close
```

Remember: The goal is to deliver **clean, useful content** to the user, not raw HTML/DOM dumps.

Related Skills

firecrawl-scraper

242

from aiskillstore/marketplace

Deep web scraping, screenshots, PDF parsing, and website crawling using Firecrawl API

firecrawl-scrape

242

from aiskillstore/marketplace

Extract clean markdown from any URL, including JavaScript-rendered SPAs. Use this skill whenever the user provides a URL and wants its content, says "scrape", "grab", "fetch", "pull", "get the page", "extract from this URL", or "read this webpage". Handles JS-rendered pages, multiple concurrent URLs, and returns LLM-optimized markdown. Use this instead of WebFetch for any webpage content extraction.

twscrape

242

from aiskillstore/marketplace

Python library for scraping Twitter/X data using GraphQL API with account rotation and session management. Use when extracting tweets, user profiles, followers, trends, or building social media monitoring tools.

azure-quotas

242

from aiskillstore/marketplace

Check/manage Azure quotas and usage across providers. For deployment planning, capacity validation, region selection. WHEN: "check quotas", "service limits", "current usage", "request quota increase", "quota exceeded", "validate capacity", "regional availability", "provisioning limits", "vCPU limit", "how many vCPUs available in my subscription".

DevOps & Infrastructure

raindrop-io

242

from aiskillstore/marketplace

Manage Raindrop.io bookmarks with AI assistance. Save and organize bookmarks, search your collection, manage reading lists, and organize research materials. Use when working with bookmarks, web research, reading lists, or when user mentions Raindrop.io.

Data & Research

zlibrary-to-notebooklm

242

from aiskillstore/marketplace

自动从 Z-Library 下载书籍并上传到 Google NotebookLM。支持 PDF/EPUB 格式，自动转换，一键创建知识库。

discover-skills

242

from aiskillstore/marketplace

当你发现当前可用的技能都不够合适（或用户明确要求你寻找技能）时使用。本技能会基于任务目标和约束，给出一份精简的候选技能清单，帮助你选出最适配当前任务的技能。

web-performance-seo

242

from aiskillstore/marketplace

Fix PageSpeed Insights/Lighthouse accessibility "!" errors caused by contrast audit failures (CSS filters, OKLCH/OKLAB, low opacity, gradient text, image backgrounds). Use for accessibility-driven SEO/performance debugging and remediation.

project-to-obsidian

242

from aiskillstore/marketplace

将代码项目转换为 Obsidian 知识库。当用户提到 obsidian、项目文档、知识库、分析项目、转换项目时激活。【激活后必须执行】： 1. 先完整阅读本 SKILL.md 文件 2. 理解 AI 写入规则（默认到 00_Inbox/AI/、追加式、统一 Schema） 3. 执行 STEP 0: 使用 AskUserQuestion 询问用户确认 4. 用户确认后才开始 STEP 1 项目扫描 5. 严格按 STEP 0 → 1 → 2 → 3 → 4 顺序执行【禁止行为】： - 禁止不读 SKILL.md 就开始分析项目 - 禁止跳过 STEP 0 用户确认 - 禁止直接在 30_Resources 创建（先到 00_Inbox/AI/） - 禁止自作主张决定输出位置

obsidian-helper

242

from aiskillstore/marketplace

Obsidian 智能笔记助手。当用户提到 obsidian、日记、笔记、知识库、capture、review 时激活。【激活后必须执行】： 1. 先完整阅读本 SKILL.md 文件 2. 理解 AI 写入三条硬规矩（00_Inbox/AI/、追加式、白名单字段） 3. 按 STEP 0 → STEP 1 → ... 顺序执行 4. 不要跳过任何步骤，不要自作主张【禁止行为】： - 禁止不读 SKILL.md 就开始工作 - 禁止跳过用户确认步骤 - 禁止在非 00_Inbox/AI/ 位置创建新笔记（除非用户明确指定）

internationalizing-websites

242

from aiskillstore/marketplace

Adds multi-language support to Next.js websites with proper SEO configuration including hreflang tags, localized sitemaps, and language-specific content. Use when adding new languages, setting up i18n, optimizing for international SEO, or when user mentions localization, translation, multi-language, or specific languages like Japanese, Korean, Chinese.

google-official-seo-guide

242

from aiskillstore/marketplace

Official Google SEO guide covering search optimization, best practices, Search Console, crawling, indexing, and improving website search visibility based on official Google documentation