scrape-webpage

Scrape webpage content, extract metadata, download images, and prepare for import/migration to AEM Edge Delivery Services. Returns analysis JSON with paths, metadata, cleaned HTML, and local images.


by ddttom


Best use case

scrape-webpage is best used when you need a repeatable AI agent workflow instead of a one-off prompt.


Teams using scrape-webpage should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/scrape-webpage/SKILL.md --create-dirs "https://raw.githubusercontent.com/ddttom/webcomponents-with-eds/main/.claude/skills/scrape-webpage/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub
2. Place it in .claude/skills/scrape-webpage/SKILL.md inside your project
3. Restart your AI agent; it will auto-discover the skill

How scrape-webpage Compares

| Feature / Agent | scrape-webpage | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions


Where can I find the source code?

The source code is on GitHub in the ddttom/webcomponents-with-eds repository, under .claude/skills/scrape-webpage/.

SKILL.md Source

# Scrape Webpage

Extract content, metadata, and images from a webpage for import/migration.

## When to Use This Skill

Use this skill when:
- You are starting a page import and need to extract content from the source URL
- You need webpage analysis with local image downloads
- You want metadata extraction (Open Graph, JSON-LD, etc.)

**Invoked by:** page-import skill (Step 1)

## Prerequisites

Before using this skill, ensure:
- ✅ Node.js is available
- ✅ The Playwright npm package is installed (`npm install playwright`)
- ✅ Chromium browser is installed (`npx playwright install chromium`)
- ✅ Sharp image library is installed (`cd .claude/skills/scrape-webpage/scripts && npm install`)
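
The package checks above can be scripted. The following is a minimal sketch, not part of the skill: the helper names `canResolve` and `checkPrerequisites` are hypothetical, and it assumes Playwright was installed at the project root and Sharp inside the skill's `scripts` directory, as the commands above suggest. It only confirms that the packages resolve; it does not confirm that the Chromium binary has been downloaded.

```javascript
const path = require('node:path');

// Hypothetical helper: true if `packageName` can be resolved from `fromDir`.
function canResolve(packageName, fromDir = process.cwd()) {
  try {
    require.resolve(packageName, { paths: [fromDir] });
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: checks the two npm prerequisites listed above.
// Assumes the skill lives at .claude/skills/scrape-webpage under projectRoot.
function checkPrerequisites(projectRoot = process.cwd()) {
  const scriptsDir = path.join(projectRoot, '.claude/skills/scrape-webpage/scripts');
  return {
    playwright: canResolve('playwright', projectRoot),
    sharp: canResolve('sharp', scriptsDir),
  };
}
```

Calling `checkPrerequisites()` from the project root returns an object such as `{ playwright: true, sharp: false }`, which points to the install command that still needs to be run.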

## Related Skills

- **page-import** - Orchestrator that invokes this skill
- **identify-page-structure** - Uses this skill's output (screenshot, HTML, metadata)
- **generate-import-html** - Uses image mapping and paths from this skill

## Scraping Workflow

### Step 1: Run Analysis Script

**Command:**
```bash
node .claude/skills/scrape-webpage/scripts/analyze-webpage.js "https://example.com/page" --output ./import-work
```

**What the script does:**
1. Sets up network interception to capture all images
2. Loads page in headless Chromium
3. Scrolls through entire page to trigger lazy-loaded images
4. Downloads all images locally (converts WebP/AVIF/SVG to PNG)
5. Captures full-page screenshot for visual reference
6. Extracts metadata (title, description, Open Graph, JSON-LD, canonical)
7. **Fixes images in DOM** (background-image→img, picture elements, srcset→src, relative→absolute, inline SVG→img)
8. Extracts cleaned HTML (removes scripts/styles)
9. Replaces image URLs in HTML with local paths (./images/...)
10. Generates document paths (sanitized, lowercase, no .html extension)
11. Saves complete analysis with image mapping to metadata.json

**For detailed explanation:** See `resources/web-page-analysis.md`
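
Step 10 (sanitized, lowercase, no `.html` extension) can be illustrated with a short sketch. This is not the code in `scripts/analyze-webpage.js`; the function names `derivePaths` and `safeDecode`, the exact sanitizing rule (non-alphanumeric runs become hyphens), and the `index` fallback for a root URL are all assumptions made for illustration. The returned field names match the `paths` object in the metadata JSON.

```javascript
// Hypothetical helper: decode a path segment, falling back to the raw text
// if the percent-encoding is malformed.
function safeDecode(segment) {
  try {
    return decodeURIComponent(segment);
  } catch {
    return segment;
  }
}

// Sketch of document path generation. The real rules may differ.
function derivePaths(pageUrl) {
  const { pathname } = new URL(pageUrl);
  const segments = pathname
    .replace(/\.html?$/i, '') // drop a trailing .htm/.html extension
    .split('/')
    .map((segment) =>
      safeDecode(segment)
        .toLowerCase()
        .replace(/[^a-z0-9]+/g, '-') // assumed sanitizing rule
        .replace(/^-+|-+$/g, '')
    )
    .filter(Boolean);

  // Assumption: a bare domain maps to "index".
  if (segments.length === 0) segments.push('index');

  const relative = segments.join('/');
  return {
    documentPath: `/${relative}`,
    htmlFilePath: `${relative}.plain.html`,
    mdFilePath: `${relative}.md`,
    dirPath: segments.slice(0, -1).join('/'),
    filename: segments[segments.length - 1],
  };
}
```

Under these assumptions, `derivePaths('https://example.com/US/en/About.html')` yields a `documentPath` of `/us/en/about` and an `htmlFilePath` of `us/en/about.plain.html`.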

---

### Step 2: Verify Output

**Output files:**
- `./import-work/metadata.json` - Complete analysis with paths and image mapping
- `./import-work/screenshot.png` - Visual reference for layout comparison
- `./import-work/cleaned.html` - Main content HTML with local image paths
- `./import-work/images/` - All downloaded images (WebP/AVIF/SVG converted to PNG)

**Verify files exist:**
```bash
ls -lh ./import-work/metadata.json ./import-work/screenshot.png ./import-work/cleaned.html
ls -lh ./import-work/images/ | head -5
```
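
When the check needs to run from another Node.js script rather than the shell, the same verification might look like the sketch below. The helper name `findMissingOutputs` is hypothetical; the file names are the ones listed under "Output files".

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Output names as listed in this section.
const EXPECTED_OUTPUTS = ['metadata.json', 'screenshot.png', 'cleaned.html', 'images'];

// Hypothetical helper: returns the expected outputs that are missing from outputDir.
function findMissingOutputs(outputDir) {
  return EXPECTED_OUTPUTS.filter(
    (name) => !fs.existsSync(path.join(outputDir, name))
  );
}
```

`findMissingOutputs('./import-work')` returns an empty array when every expected output is present.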

---

### Step 3: Review Metadata JSON

**Output JSON structure:**
```json
{
  "url": "https://example.com/page",
  "timestamp": "2025-01-12T10:30:00.000Z",
  "paths": {
    "documentPath": "/us/en/about",
    "htmlFilePath": "us/en/about.plain.html",
    "mdFilePath": "us/en/about.md",
    "dirPath": "us/en",
    "filename": "about"
  },
  "screenshot": "./import-work/screenshot.png",
  "html": {
    "filePath": "./import-work/cleaned.html",
    "size": 45230
  },
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "og:image": "https://example.com/image.jpg",
    "canonical": "https://example.com/page"
  },
  "images": {
    "count": 15,
    "mapping": {
      "https://example.com/hero.jpg": "./images/a1b2c3d4e5f6.jpg",
      "https://example.com/logo.webp": "./images/f6e5d4c3b2a1.png"
    },
    "stats": {
      "total": 15,
      "converted": 3,
      "skipped": 12,
      "failed": 0
    }
  }
}
```

**Key fields:**
- `paths.documentPath` - Used for browser preview URL
- `paths.htmlFilePath` - Where to save final HTML file
- `images.mapping` - Original URLs → local paths
- `metadata` - Extracted page metadata
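
As an illustration of how `images.mapping` can be consumed downstream, the sketch below rewrites original image URLs in an HTML string to their local paths. The script already does this for `cleaned.html`; this hypothetical helper (`rewriteImageUrls`) only shows the idea for other HTML fragments and uses plain string replacement, which is an assumption rather than the script's actual method.

```javascript
// Hypothetical helper: replace every original image URL in `html` with its
// local path from `mapping` (the shape of images.mapping in metadata.json).
function rewriteImageUrls(html, mapping) {
  // Replace longer URLs first so a URL that is a prefix of another
  // (for example, the same image with a query string) does not clobber it.
  const urls = Object.keys(mapping).sort((a, b) => b.length - a.length);
  return urls.reduce(
    (out, url) => out.split(url).join(mapping[url]),
    html
  );
}
```

With the example mapping above, `<img src="https://example.com/hero.jpg">` becomes `<img src="./images/a1b2c3d4e5f6.jpg">`.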

---

## Output

This skill provides:
- ✅ metadata.json with paths, metadata, image mapping
- ✅ screenshot.png for visual reference
- ✅ cleaned.html with local image references
- ✅ images/ folder with all downloaded images

**Next step:** Pass these outputs to the identify-page-structure skill

---

## Troubleshooting

**Browser not installed:**
```bash
npx playwright install chromium
```

**Sharp not installed:**
```bash
cd .claude/skills/scrape-webpage/scripts && npm install
```

**Image download failures:**
- Check the `images.stats.failed` count in `metadata.json`
- Some images may require authentication or be blocked by CORS
- Failed images will be noted but won't stop the scraping process
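
The failure count can also be checked programmatically. The sketch below is a hypothetical helper (`summarizeImageStats`), assuming the `images.stats` shape shown in the output JSON structure; it takes the parsed contents of `metadata.json`.

```javascript
// Hypothetical helper: summarise images.stats from a parsed metadata.json object.
// Missing fields default to 0 so a partial file does not throw.
function summarizeImageStats(metadata) {
  const { total = 0, converted = 0, skipped = 0, failed = 0 } =
    metadata?.images?.stats ?? {};
  return {
    total,
    converted,
    skipped,
    failed,
    ok: failed === 0,
    message:
      failed === 0
        ? `All ${total} images downloaded`
        : `${failed} of ${total} images failed to download`,
  };
}
```

For example, pass it `JSON.parse(fs.readFileSync('./import-work/metadata.json', 'utf8'))` and inspect the `ok` flag before moving on to the next skill.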

**Lazy-loaded images not captured:**
- The script scrolls through the page to trigger lazy loading
- Some advanced lazy-loading techniques may need customization in `scripts/analyze-webpage.js`
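
One possible customization is a slower, incremental scroll that pauses after each step. The sketch below is not the code in `scripts/analyze-webpage.js`: the function name `scrollThroughPage` and its options (`stepPx`, `pauseMs`, `maxSteps`) are hypothetical, and it assumes a Playwright `page` object is passed in.

```javascript
// Hypothetical helper: scroll a Playwright page in small steps so that
// lazy-loading observers have time to fire. Returns the number of steps taken.
async function scrollThroughPage(page, { stepPx = 800, pauseMs = 250, maxSteps = 200 } = {}) {
  let steps = 0;
  while (steps < maxSteps) {
    // Runs in the browser: have we reached the bottom of the document?
    const atBottom = await page.evaluate(
      () =>
        window.scrollY + window.innerHeight >=
        document.documentElement.scrollHeight - 1
    );
    if (atBottom) break;

    await page.mouse.wheel(0, stepPx); // scroll down by stepPx pixels
    await page.waitForTimeout(pauseMs); // give lazy loaders time to react
    steps += 1;
  }
  return steps;
}
```

The `maxSteps` cap is there so that pages with infinite scroll do not keep the loop running indefinitely; the default values are illustrative and would need tuning per site.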

Related Skills

webapp-testing

from ddttom/webcomponents-with-eds

Toolkit for interacting with and testing local web applications using Playwright. Supports verifying frontend functionality, debugging UI behavior, capturing browser screenshots, and viewing browser logs.

theme-factory

from ddttom/webcomponents-with-eds

Toolkit for styling artifacts with a theme. These artifacts can be slides, docs, reports, HTML landing pages, etc. There are 10 pre-set themes with colors/fonts that you can apply to any artifact that has been created, or you can generate a new theme on-the-fly.

Testing Blocks

from ddttom/webcomponents-with-eds

Guide for testing code changes in AEM Edge Delivery projects including blocks, scripts, and styles. Use this skill after making code changes and before opening a pull request to validate functionality. Covers unit testing for utilities and logic, browser testing with Playwright/Puppeteer, linting, performance validation, and guidance on which tests to maintain vs use as throwaway validation.

template-skill

from ddttom/webcomponents-with-eds

Replace with description of the skill and when Claude should use it.

slack-gif-creator

from ddttom/webcomponents-with-eds

Toolkit for creating animated GIFs optimized for Slack, with validators for size constraints and composable animation primitives. This skill applies when users request animated GIFs or emoji animations for Slack from descriptions like "make me a GIF for Slack of X doing Y".

skill-developer

from ddttom/webcomponents-with-eds

Create and manage Claude Code skills following Anthropic best practices. Use when creating new skills, modifying skill-rules.json, understanding trigger patterns, working with hooks, debugging skill activation, or implementing progressive disclosure. Covers skill structure, YAML frontmatter, trigger types (keywords, intent patterns, file paths, content patterns), enforcement levels (block, suggest, warn), hook mechanisms (UserPromptSubmit, PreToolUse), session tracking, and the 500-line rule.

skill-creator

from ddttom/webcomponents-with-eds

Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.

preview-import

from ddttom/webcomponents-with-eds

Preview and verify imported content in local AEM Edge Delivery Services dev server. Validates rendering, compares with original page, and troubleshoots common issues.

page-import

from ddttom/webcomponents-with-eds

Import a single webpage from any URL to structured HTML content for authoring in AEM Edge Delivery Services. Scrapes the page, analyzes structure, maps to existing blocks, and generates HTML for immediate local preview. Also triggered by terms like "migrate", "migration", or "migrating".

page-decomposition

from ddttom/webcomponents-with-eds

Analyze content sequences within a section and provide neutral descriptions for AEM Edge Delivery Services. Invoked per section during page import to identify breaking points between default content and blocks.

mcp-builder

from ddttom/webcomponents-with-eds

Guide for creating high-quality MCP (Model Context Protocol) servers that enable LLMs to interact with external services through well-designed tools. Use when building MCP servers to integrate external APIs or services, whether in Python (FastMCP) or Node/TypeScript (MCP SDK).

jupyter-notebook-testing

from ddttom/webcomponents-with-eds

Create and manage Jupyter notebooks for testing Adobe Edge Delivery Services (EDS) blocks interactively in the browser using the ipynb-viewer block. Interactive JavaScript execution, overlay previews with backdrop, direct ES6 imports. Use when creating notebooks, testing blocks with ipynb files in browser, generating overlay previews, or creating executable documentation.