web-content-fetcher

Extracts clean Markdown from a given URL. It prefers a Scrapling script with automatic stealth fallback and uses Jina Reader as an alternative.

492 stars

by shirenchuang

Complexity: medium

About this skill

`web-content-fetcher` is an AI agent skill that extracts the main article content from a web page and returns it as structured Markdown. Its primary path is a Scrapling script that starts with a fast HTTP fetch and switches to stealth mode (a headless browser) for JavaScript-rendered or anti-scraping sites. If Scrapling fails or its dependencies are not installed, the skill falls back to Jina Reader. The output preserves headings, links, images, lists, and code blocks, which makes it suitable as input for summarization, data extraction, and analysis. Typical sources include news articles, blog posts, documentation, and harder targets such as WeChat articles (微信公众号).

Best use case

The primary use case is giving AI agents a reliable way to fetch a URL and receive clean, structured Markdown. This suits content analysis, summarization, research, and data extraction workflows in which the agent consumes web pages as input without handling raw HTML or inconsistent formatting.

The output is a clean, well-formatted Markdown string containing the main textual and structural content (headings, links, images, lists, code blocks) of the provided URL.

Practical example

Example input

Read this article for me: https://www.nytimes.com/2023/10/26/technology/ai-advances.html

Example output

````markdown
# AI Advances Spark New Debates

## Ethical Considerations

Recent breakthroughs in artificial intelligence have intensified discussions around [ethics and societal impact](https://example.com/ethics-report). Experts are urging for regulations to ensure responsible development.

### Key Takeaways

*   Rapid pace of innovation.
*   Growing concerns about bias.
*   Need for global collaboration.

```python
# Example AI code snippet
def hello_ai():
    print("Hello from AI!")
```
````

When to use this skill

When you need to extract the main article content from a URL as clean Markdown.
When summarizing, analyzing, or processing text from news articles, blog posts, or documentation.
When traditional fetching methods fail due to JavaScript rendering or anti-scraping measures.
When you need to handle content from platforms like WeChat articles (微信公众号).

When not to use this skill

When you only need raw HTML or a full screenshot of a page.
When you need to interact with dynamic elements on a page (e.g., clicking buttons, filling forms).
For very high-volume, continuous scraping that exceeds Jina Reader's free tier limits or requires custom proxy rotation.

How web-content-fetcher Compares

| Feature / Agent | web-content-fetcher | Standard Approach |
|-----------------|---------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Medium | N/A |

Frequently Asked Questions

What does this skill do?

It returns the main content of a URL as clean Markdown, using a Scrapling script first and Jina Reader as a fallback.

How difficult is it to install?

Installation complexity is rated medium. The install steps are in the Install Dependencies section of the SKILL.md source.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Web Content Fetcher

Given a URL, return its main content as clean Markdown — headings, links, images, lists, code blocks all preserved.

## Extraction Strategy

Always try **one method per URL** — don't cascade blindly. Pick the right one upfront.

```
URL
 │
 ├─ 1. Scrapling script (preferred)
 │     Run fetch.py — check the domain routing table to decide fast vs --stealth.
 │     Works for most sites. Returns clean Markdown directly.
 │
 └─ 2. Jina Reader (fallback — only if Scrapling fails or dependencies not installed)
       web_fetch("https://r.jina.ai/<url>")
       Free tier: 200 req/day. Fast (~1-2s), good Markdown output.
       Does NOT work for: WeChat (403), some Chinese platforms.
```
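The Jina Reader fallback amounts to prefixing the target URL with the reader endpoint shown above. A minimal Python sketch of that step follows. The helper name `jina_reader_url` and the unsupported-host set are illustrative assumptions, not part of the skill; the set only mirrors the WeChat note in the diagram.

```python
from urllib.parse import urlparse

JINA_PREFIX = "https://r.jina.ai/"

# Assumption: hosts taken from the strategy notes (WeChat returns 403 via Jina).
JINA_UNSUPPORTED_HOSTS = {"mp.weixin.qq.com"}


def jina_reader_url(url: str) -> str:
    """Return the Jina Reader URL for `url`.

    Raises ValueError for hosts the notes say Jina Reader cannot fetch,
    so the caller does not spend a request on a known failure.
    """
    host = (urlparse(url).hostname or "").lower()
    if host in JINA_UNSUPPORTED_HOSTS:
        raise ValueError(f"Jina Reader is not expected to work for {host}")
    return JINA_PREFIX + url
```

The returned string is what would be passed to `web_fetch(...)` in the fallback branch.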

### Scrapling script

```bash
python3 <SKILL_DIR>/scripts/fetch.py "<url>" [max_chars] [--stealth]
```

`<SKILL_DIR>` is the directory where this SKILL.md lives. Resolve it before calling the script.

The script has two modes built in:
- **Default (fast):** HTTP fetch, ~1-3s, works for most sites
- **`--stealth`:** Headless browser, ~5-15s, for JS-rendered or anti-scraping sites

When run without `--stealth`, the script automatically falls back to stealth if the fast result has too little content. So you rarely need to specify `--stealth` manually — the only reason to force it is when you already know the site needs it (see routing table), which saves the initial fast attempt.
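The fallback behaviour described above can be sketched in Python as follows. This is not the code in `fetch.py`: the function name, the injected `fast_fetch` and `stealth_fetch` callables, and the 200-character threshold are all assumptions, since the source only says the fast result has "too little content".

```python
from typing import Callable

# Assumed threshold; the real script's cutoff is not documented here.
MIN_CONTENT_CHARS = 200


def fetch_with_fallback(
    url: str,
    fast_fetch: Callable[[str], str],
    stealth_fetch: Callable[[str], str],
    force_stealth: bool = False,
    min_chars: int = MIN_CONTENT_CHARS,
) -> tuple[str, str]:
    """Return (mode, markdown) for `url`.

    Tries the fast fetcher first unless stealth is forced, and falls back
    to the stealth fetcher when the fast result is too short.
    """
    if force_stealth:
        # Skips the fast attempt entirely, as --stealth does.
        return "stealth", stealth_fetch(url)
    content = fast_fetch(url)
    if len(content.strip()) >= min_chars:
        return "fast", content
    return "stealth", stealth_fetch(url)
```

Forcing stealth skips the initial fast attempt, which is the saving the paragraph above refers to.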

## Domain Routing

Use this table to pick the right mode on the first call:

| Domain | Command | Why |
|--------|---------|-----|
| `mp.weixin.qq.com` | `fetch.py <url> --stealth` | JS-rendered content |
| `zhuanlan.zhihu.com` | `fetch.py <url> --stealth` | Anti-scraping + JS |
| `juejin.cn` | `fetch.py <url> --stealth` | JS-rendered SPA |
| `sspai.com` | `fetch.py <url>` | Static HTML |
| `blog.csdn.net` | `fetch.py <url>` | Static HTML |
| `ruanyifeng.com` | `fetch.py <url>` | Static blog |
| `openai.com` | `fetch.py <url>` | Static HTML |
| `blog.google` | `fetch.py <url>` | Static HTML |
| Everything else | `fetch.py <url>` | Auto-fallback handles it |
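The routing table can be expressed as a small lookup. The sketch below is illustrative only: `needs_stealth` is a hypothetical helper, and treating subdomains of a listed domain as matching is an assumption the table does not state.

```python
from urllib.parse import urlparse

# Domains the routing table marks as requiring --stealth.
STEALTH_DOMAINS = {"mp.weixin.qq.com", "zhuanlan.zhihu.com", "juejin.cn"}


def needs_stealth(url: str) -> bool:
    """Return True when the URL's host is listed as stealth-only.

    Assumption: subdomains of a listed domain are treated as matching.
    Everything else returns False and relies on the script's auto-fallback.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in STEALTH_DOMAINS)
```

For example, `needs_stealth("https://mp.weixin.qq.com/s/xxx")` is true, while a `sspai.com` URL takes the default fast path.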

## Script Options

```bash
# Basic — auto-selects fast or stealth
python3 <SKILL_DIR>/scripts/fetch.py "https://sspai.com/post/73145"

# Force stealth for known JS-heavy sites
python3 <SKILL_DIR>/scripts/fetch.py "https://mp.weixin.qq.com/s/xxx" --stealth

# Limit output to 15000 characters (default: 30000)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com/article" 15000

# JSON output with metadata (url, mode, selector, content_length)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com" --json
```
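When calling the script from Python rather than a shell, the same options can be assembled as an argument list. This is a sketch under assumptions: `build_fetch_command` is a hypothetical helper, and combining `max_chars`, `--stealth`, and `--json` in one call is assumed to work because the examples above only show them separately.

```python
from pathlib import Path
from typing import Optional


def build_fetch_command(
    skill_dir: str,
    url: str,
    max_chars: Optional[int] = None,
    stealth: bool = False,
    json_output: bool = False,
) -> list[str]:
    """Build the argv list for fetch.py using the documented options."""
    script = Path(skill_dir) / "scripts" / "fetch.py"
    cmd = ["python3", str(script), url]
    if max_chars is not None:
        cmd.append(str(max_chars))  # positional character limit
    if stealth:
        cmd.append("--stealth")
    if json_output:
        cmd.append("--json")
    return cmd
```

The list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`. Passing a list avoids shell quoting problems with URLs that contain `&` or `?`.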

## Install Dependencies

First use only — the script checks and tells you if anything is missing:

```bash
pip install scrapling html2text
```

If on system-managed Python (macOS/Linux), add `--break-system-packages` or use a venv.
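The script performs its own dependency check, so the following is only a sketch of an equivalent pre-check an agent could run before deciding between Scrapling and Jina Reader. `missing_modules` is a hypothetical helper, and it assumes the import names match the pip package names above.

```python
import importlib.util

# Assumption: import names are the same as the pip package names.
REQUIRED_MODULES = ("scrapling", "html2text")


def missing_modules(modules=REQUIRED_MODULES) -> list[str]:
    """Return the top-level modules that cannot be found, without importing them."""
    return [name for name in modules if importlib.util.find_spec(name) is None]
```

An empty list means the Scrapling path is available; otherwise the returned names can be reported to the user or used to select the Jina Reader fallback.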

## Failure Rules

- Same URL fails once → give up, tell the user "unable to extract content from this URL"
- Do not retry — each failed call wastes context tokens
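The no-retry rule can be sketched as a wrapper that calls the fetcher exactly once. The names here are hypothetical, and treating an empty result as a failure is an assumption rather than something the rules state.

```python
from typing import Callable

FAILURE_MESSAGE = "unable to extract content from this URL"


def fetch_once(url: str, fetcher: Callable[[str], str]) -> tuple[bool, str]:
    """Call `fetcher` exactly once and never retry.

    Returns (True, content) on success, or (False, FAILURE_MESSAGE) if the
    fetcher raises or returns only whitespace (the latter is an assumption).
    """
    try:
        content = fetcher(url)
    except Exception:
        return False, FAILURE_MESSAGE
    if not content.strip():
        return False, FAILURE_MESSAGE
    return True, content
```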

