# scrapling-skill
Install, troubleshoot, and use Scrapling CLI to extract HTML, Markdown, or text from webpages. Use this skill whenever the user mentions Scrapling, `uv tool install scrapling`, `scrapling extract`, WeChat/mp.weixin articles, browser-backed page fetching, or needs help deciding between static and dynamic extraction.
## Best use case
scrapling-skill is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using scrapling-skill should expect more consistent output, faster repeated execution, and less prompt rewriting.
## When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
## When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
## Installation

### Claude Code / Cursor / Codex

### Manual Installation
- Download `SKILL.md` from GitHub
- Place it in `.claude/skills/scrapling-skill/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
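If you automate the manual steps, the placement can be sketched as a small POSIX-shell helper. The helper name is hypothetical, and it assumes `SKILL.md` has already been downloaded; adjust paths for your project:

```shell
#!/bin/sh
# Hypothetical helper: place an already-downloaded SKILL.md where the
# agent auto-discovers it. The directory layout follows the steps above.
install_skill() {
  src="$1"      # path to the downloaded SKILL.md
  project="$2"  # project root
  dest="$project/.claude/skills/scrapling-skill"
  mkdir -p "$dest"
  cp "$src" "$dest/SKILL.md"
  printf '%s\n' "$dest/SKILL.md"  # report where it landed
}
```

After running it, restart the agent so the new skill is discovered.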
## Frequently Asked Questions

### What does this skill do?
It installs, troubleshoots, and drives the Scrapling CLI to extract HTML, Markdown, or text from webpages. It also helps decide between static and browser-backed (dynamic) extraction, including for WeChat (`mp.weixin.qq.com`) articles.
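The static-versus-dynamic choice can be sketched as a tiny command dispatcher. The dispatcher itself is a hypothetical helper; only the two `scrapling extract` command shapes come from the SKILL.md source on this page:

```shell
#!/bin/sh
# Hypothetical dispatcher: print the Scrapling command for a chosen mode.
# "static" maps to `extract get`, "dynamic" to browser-backed `extract fetch`.
extract_cmd() {
  mode="$1"; url="$2"; outfile="$3"
  case "$mode" in
    static)  printf "scrapling extract get '%s' %s\n" "$url" "$outfile" ;;
    dynamic) printf "scrapling extract fetch '%s' %s --timeout 20000\n" "$url" "$outfile" ;;
    *)       echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}
```

Note the single quotes around the URL in the emitted command: they keep `?` and `&` from being interpreted by the shell.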
### Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
## SKILL.md Source
# Scrapling Skill

## Overview

Use Scrapling through its CLI as the default path. Start with the smallest working command, validate the saved output, and only escalate to browser-backed fetching when the static fetch does not contain the real page content.

Do not assume the user's Scrapling install is healthy. Verify it first.

## Default Workflow

Copy this checklist and keep it updated while working:

```text
Scrapling Progress:
- [ ] Step 1: Diagnose the local Scrapling install
- [ ] Step 2: Fix CLI extras or browser runtime if needed
- [ ] Step 3: Choose static or dynamic fetch
- [ ] Step 4: Save output to a file
- [ ] Step 5: Validate file size and extracted content
- [ ] Step 6: Escalate only if the previous path failed
```

## Step 1: Diagnose the Install

Run the bundled diagnostic script first:

```bash
python3 scripts/diagnose_scrapling.py
```

Use the result as the source of truth for the next step.

## Step 2: Fix the Install

### If the CLI was installed without extras

If `scrapling --help` fails with missing `click` or a message about installing Scrapling with extras, reinstall it with the CLI extra:

```bash
uv tool uninstall scrapling
uv tool install 'scrapling[shell]'
```

Do not default to `scrapling[all]` unless the user explicitly needs the broader feature set.

### If browser-backed fetchers are needed

Install the Playwright runtime:

```bash
scrapling install
```

If the install looks slow or opaque, read `references/troubleshooting.md` before guessing. Do not claim success until either:

- `scrapling install` reports that dependencies are already installed, or
- the diagnostic script confirms both Chromium and Chrome Headless Shell are present.

## Step 3: Choose the Fetcher

Use this decision rule:

- Start with `extract get` for normal pages, article pages, and most WeChat public articles.
- Use `extract fetch` when the static HTML does not contain the real content or the page depends on JavaScript rendering.
- Use `extract stealthy-fetch` only after `fetch` still fails because of anti-bot or challenge behavior. Do not make it the default.

## Step 4: Run the Smallest Useful Command

Always quote URLs in shell commands. This is mandatory in `zsh` when the URL contains `?`, `&`, or other special characters.

### Full page to HTML

```bash
scrapling extract get 'https://example.com' page.html
```

### Main content to Markdown

```bash
scrapling extract get 'https://example.com' article.md -s 'main'
```

### JS-rendered page with browser automation

```bash
scrapling extract fetch 'https://example.com' page.html --timeout 20000
```

### WeChat public article body

Use `#js_content` first. This is the default selector for article body extraction on `mp.weixin.qq.com` pages.

```bash
scrapling extract get 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' article.md -s '#js_content'
```

## Step 5: Validate the Output

After every extraction, verify the file instead of assuming success:

```bash
wc -c article.md
sed -n '1,40p' article.md
```

For HTML output, check that the expected title, container, or selector target is actually present:

```bash
rg -n '<title>|js_content|rich_media_title|main' page.html
```

If the file is tiny, empty, or missing the expected container, the extraction did not succeed. Go back to Step 3 and switch fetchers or selectors.

## Step 6: Handle Known Failure Modes

### Local TLS trust store problem

If `extract get` fails with `curl: (60) SSL certificate problem`, treat it as a local trust-store problem first, not a Scrapling content failure. Retry the same command with:

```bash
--no-verify
```

Only do this after confirming the failure matches the local certificate verification error pattern. Do not silently disable verification by default.
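The TLS fallback above can be gated on the error text so `--no-verify` never becomes the default. A minimal sketch; the exact curl message patterns matched here are an assumption:

```shell
#!/bin/sh
# Sketch: allow a --no-verify retry only when stderr matches the local
# trust-store failure, never for other errors. Patterns are assumptions.
should_retry_no_verify() {
  case "$1" in
    *'curl: (60)'*|*'SSL certificate problem'*) return 0 ;;  # local trust-store issue
    *) return 1 ;;                                           # anything else: do not weaken TLS
  esac
}
```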
### WeChat article pages

For `mp.weixin.qq.com`:

- Try `extract get` before `extract fetch`
- Use `-s '#js_content'` for the article body
- Validate the saved Markdown or HTML immediately

### Browser-backed fetch failures

If `extract fetch` fails:

1. Re-check the install with `python3 scripts/diagnose_scrapling.py`
2. Confirm Chromium and Chrome Headless Shell are present
3. Retry with a slightly longer timeout
4. Escalate to `stealthy-fetch` only if the site behavior justifies it

## Command Patterns

### Diagnose and smoke test a URL

```bash
python3 scripts/diagnose_scrapling.py --url 'https://example.com'
```

### Diagnose and smoke test a WeChat article body

```bash
python3 scripts/diagnose_scrapling.py \
  --url 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' \
  --selector '#js_content' \
  --no-verify
```

### Diagnose and smoke test a browser-backed fetch

```bash
python3 scripts/diagnose_scrapling.py \
  --url 'https://example.com' \
  --dynamic
```

## Guardrails

- Do not tell the user to reinstall blindly. Verify first.
- Do not default to the Python library API when the user is clearly asking about the CLI.
- Do not jump to browser-backed fetching unless the static result is missing the real content.
- Do not claim success from exit code alone. Inspect the saved file.
- Do not hardcode user-specific absolute paths into outputs or docs.

## Resources

- Installation and smoke test helper: `scripts/diagnose_scrapling.py`
- Verified failure modes and recovery paths: `references/troubleshooting.md`
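The Step 5 size check in the SKILL.md can be wrapped in a helper that fails loudly on thin output. This is a sketch; the helper name and the 500-byte floor are assumptions to tune per site:

```shell
#!/bin/sh
# Sketch: fail when an extracted file is empty or suspiciously small,
# so scripts go back to Step 3 instead of assuming success.
validate_extract() {
  file="$1"
  min_bytes="${2:-500}"  # assumed floor; tune per site
  [ -s "$file" ] || { echo "empty or missing: $file" >&2; return 1; }
  size=$(wc -c < "$file")
  if [ "$size" -lt "$min_bytes" ]; then
    echo "too small ($size bytes): $file" >&2
    return 1
  fi
  echo "ok: $file ($size bytes)"
}
```

A non-zero return means the extraction did not succeed and the fetcher or selector should change.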
## Related Skills

### Scrapling Web Fetch
Prefer this skill when the user wants to fetch webpage content, extract the article body, convert a webpage to markdown/text, or scrape the main text of a page.