Web Content Extractor - 网页内容提取器

**Version**: 2.0

3,891 stars

Best use case

Web Content Extractor - 网页内容提取器 is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using Web Content Extractor - 网页内容提取器 should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/web-fetch-vx/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/3511815125/web-fetch-vx/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/web-fetch-vx/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How Web Content Extractor - 网页内容提取器 Compares

| Feature / Agent | Web Content Extractor - 网页内容提取器 | Standard Approach |
|------|------|------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

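It extracts clean content from WeChat articles, blogs, and news pages, stripping ads, navigation, and sidebars, and outputs Markdown, JSON, or plain text. The current version is 2.0.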

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Web Content Extractor - 网页内容提取器

**Version**: 2.0  
**Author**: OpenClaw Team  
**Updated**: 2026-03-15  
**License**: MIT

---

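## 📦 Skill Metadata

```yaml
name: web-content-extractor
version: 2.0.0
description: Extracts clean content from WeChat articles, blogs, and news pages, removing ads and sidebars
category: Content processing
tags: [web extraction, content cleaning, WeChat articles, Markdown]
author: OpenClaw Team
license: MIT
```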

---

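## 🎯 Overview

A web content extraction tool built on three engines (Readability + Firecrawl + Defuddle) and optimized for Chinese content. It supports WeChat articles, news sites, blogs, and other sources, automatically strips ads, navigation, and sidebars, and outputs clean Markdown.

**Core capabilities**:
- ✅ WeChat article extraction (mp.weixin.qq.com)
- ✅ News page cleaning
- ✅ Blog post parsing
- ✅ Metadata extraction (title/author/date)
- ✅ Multiple output formats (Markdown/JSON/plain text)
- ✅ Batch processing support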

---

## 🚀 Quick Start

### Basic Call

```python
# OpenClaw tool call
result = web_fetch(
    url="https://mp.weixin.qq.com/s/xxx",
    extractMode="markdown",
    maxChars=8000
)
```

### Full Parameters

| Parameter | Type | Required | Default | Description |
|------|------|------|--------|------|
| url | str | ✅ | - | Page URL |
| extractMode | str | ❌ | "markdown" | Output format (markdown/text/json) |
| maxChars | int | ❌ | 8000 | Maximum number of characters |
| includeMetadata | bool | ❌ | true | Whether to include metadata |
| timeout | int | ❌ | 30 | Timeout (seconds) |
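
The defaults in the table can be captured in a small validation layer before a call is made. The following is a minimal sketch, not part of the skill itself: `FetchParams` and `VALID_MODES` are hypothetical names, and the checks simply mirror the table.

```python
from dataclasses import dataclass, asdict

# Output formats listed in the parameter table.
VALID_MODES = {"markdown", "text", "json"}


@dataclass(frozen=True)
class FetchParams:
    """Hypothetical container mirroring the parameter table (not the skill's API)."""
    url: str
    extractMode: str = "markdown"
    maxChars: int = 8000
    includeMetadata: bool = True
    timeout: int = 30  # seconds

    def __post_init__(self):
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"url must be an http(s) URL: {self.url!r}")
        if self.extractMode not in VALID_MODES:
            raise ValueError(f"extractMode must be one of {sorted(VALID_MODES)}")
        if self.maxChars <= 0 or self.timeout <= 0:
            raise ValueError("maxChars and timeout must be positive")


params = FetchParams(url="https://mp.weixin.qq.com/s/xxx", maxChars=4000)
kwargs = asdict(params)  # could then be passed on as web_fetch(**kwargs)
```

Validating up front keeps typos such as an unsupported `extractMode` from reaching the tool call.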

---

## 📤 Input and Output

### Input Example

```json
{
  "url": "https://mp.weixin.qq.com/s/abcdefg",
  "extractMode": "markdown",
  "maxChars": 8000,
  "includeMetadata": true
}
```

### Output Example

```json
{
  "success": true,
  "url": "https://mp.weixin.qq.com/s/abcdefg",
  "title": "Article title",
  "author": "Author name",
  "publishDate": "2026-03-15",
  "content": "Body content in Markdown format...",
  "wordCount": 2500,
  "readTime": "10 minutes",
  "images": ["https://..."],
  "extractTime": 0.8
}
```
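
A caller will usually check `success` before reading the other fields. The sketch below shows one way to do that for a result shaped like the output example. `summarize_result` is a hypothetical helper, and the shape of a failed result (here assumed to carry `success: false` and a `url`) is not documented in the source.

```python
def summarize_result(result: dict) -> str:
    """Build a one-line summary from a result shaped like the output example."""
    # Assumption: a failed extraction sets "success" to False.
    if not result.get("success"):
        return f"FAILED {result.get('url', '<unknown url>')}"
    title = result.get("title") or "(untitled)"
    author = result.get("author") or "(unknown author)"
    words = result.get("wordCount", 0)
    images = len(result.get("images", []))
    return f"{title} by {author}: {words} words, {images} image(s)"


example = {
    "success": True,
    "url": "https://mp.weixin.qq.com/s/abcdefg",
    "title": "Article title",
    "author": "Author name",
    "wordCount": 2500,
    "images": ["https://..."],
}
print(summarize_result(example))
```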

---

## 🔧 Technical Architecture

### Three-Engine Design

```
                    User request
                          ↓
                ┌───────────────────┐
                │   Routing layer   │
                └───────────────────┘
                          ↓
        ┌─────────────────┼─────────────────┐
        ↓                 ↓                 ↓
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│   web_fetch   │ │   defuddle    │ │    browser    │
│    (fast)     │ │ (specialized) │ │  (fallback)   │
└───────────────┘ └───────────────┘ └───────────────┘
        ↓                 ↓                 ↓
                ┌───────────────────┐
                │ Aggregation layer │
                └───────────────────┘
                          ↓
                   Return to user
```

### Engine Comparison

| Engine | Speed | Success rate | Best for |
|------|------|--------|----------|
| web_fetch | <1s | 70% | WeChat articles / general web pages |
| defuddle | <1s | 75% | Blogs / news sites |
| browser | 5-10s | 90% | Complex SPAs / dynamic pages |
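
The diagram and table do not say how the routing layer picks an engine. One plausible reading is a fallback chain ordered from fastest to most robust. The sketch below illustrates that reading only: `extract_with_fallback` and the three stand-in engine functions are hypothetical and are not part of the skill's API.

```python
from typing import Callable, Optional

# An engine takes a URL and returns extracted text (or None / "" on a miss).
Engine = Callable[[str], Optional[str]]


def extract_with_fallback(url: str, engines: list[tuple[str, Engine]]) -> dict:
    """Try each engine in order and return the first non-empty result."""
    errors = {}
    for name, engine in engines:
        try:
            content = engine(url)
        except Exception as exc:  # an engine failure should not end the chain
            errors[name] = str(exc)
            continue
        if content and content.strip():
            return {"success": True, "engine": name, "content": content, "errors": errors}
        errors[name] = "empty content"
    return {"success": False, "engine": None, "content": "", "errors": errors}


# Stand-in engines for illustration; real engines would perform network I/O.
def fast_engine(url: str) -> str:
    return ""  # simulates a page that needs JavaScript rendering


def specialized_engine(url: str) -> str:
    raise TimeoutError("timed out")


def browser_engine(url: str) -> str:
    return "# Rendered article"


result = extract_with_fallback(
    "https://example.com/article",
    [("web_fetch", fast_engine), ("defuddle", specialized_engine), ("browser", browser_engine)],
)
```

Recording why each earlier engine was skipped makes it easier to see when the slow browser path is being hit more often than expected.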

---

## 📋 Use Cases

### Scenario 1: WeChat Article Extraction

```python
result = web_fetch(
    url="https://mp.weixin.qq.com/s/xxx",
    extractMode="markdown"
)
print(result["content"])
```

### Scenario 2: Batch Processing

```python
urls = ["url1", "url2", "url3"]
results = [web_fetch(url=u) for u in urls]
```
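
With a plain list comprehension, a single failing URL raises and discards the whole batch. A minimal sketch of a more forgiving loop follows. `fetch_batch` is a hypothetical helper that takes the fetch function as an argument, and `fake_fetch` stands in for `web_fetch` so the example runs without network access.

```python
def fetch_batch(urls, fetch):
    """Fetch each URL in turn, isolating failures so one bad URL does not stop the batch."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = fetch(url=url)
        except Exception as exc:
            failures[url] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_fetch(url):
    """Stand-in for web_fetch, used only to make this example self-contained."""
    if "bad" in url:
        raise ValueError("unreachable")
    return {"success": True, "url": url, "content": "..."}


results, failures = fetch_batch(
    ["https://example.com/a", "https://example.com/bad"], fake_fetch
)
```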

### Scenario 3: Extraction with Metadata

```python
result = web_fetch(
    url="https://example.com/article",
    includeMetadata=True
)
print(f"Title: {result['title']}")
print(f"Author: {result['author']}")
print(f"Word count: {result['wordCount']}")
```

---

## ⚠️ Limitations and Notes

### Unsupported Scenarios

- ❌ Pages that require login
- ❌ Paywalled content
- ❌ CAPTCHA-protected pages
- ❌ SPAs rendered purely with JavaScript (use the browser engine)

### Rate Limits

| Domain type | Request interval | Concurrency limit |
|----------|----------|----------|
| WeChat articles | 2 seconds | 1 |
| News sites | 1 second | 3 |
| Blogs | 1 second | 5 |
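
The source does not say whether these limits are enforced by the tool or left to the caller. If the caller has to honor them, the request interval can be enforced with a small limiter such as the sketch below. `IntervalLimiter`, `classify`, and `RATE_LIMITS` are hypothetical names. Only the WeChat host is known from the source, so every other URL is treated as a news site here, and the concurrency column is recorded but not enforced.

```python
import time
from urllib.parse import urlparse

# (minimum interval in seconds, concurrency limit) per domain type, from the table.
RATE_LIMITS = {"wechat": (2.0, 1), "news": (1.0, 3), "blog": (1.0, 5)}


def classify(url: str) -> str:
    """Hypothetical classifier: anything that is not WeChat is treated as news."""
    host = urlparse(url).hostname or ""
    return "wechat" if host == "mp.weixin.qq.com" else "news"


class IntervalLimiter:
    """Block until the minimum interval for the URL's domain type has passed."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # domain type -> time of the previous request

    def wait(self, url: str) -> float:
        kind = classify(url)
        interval, _concurrency = RATE_LIMITS[kind]
        last = self._last.get(kind)
        delay = 0.0 if last is None else max(0.0, interval - (self._clock() - last))
        if delay > 0:
            self._sleep(delay)
        self._last[kind] = self._clock()
        return delay
```

Calling `limiter.wait(url)` before each request spaces out calls to the same domain type. The clock and sleep functions are injectable so the limiter can be tested without real delays.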

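### Compliance Requirements

1. Extract only publicly accessible content
2. Respect the robots.txt protocol
3. Do not use for commercial purposes (unless authorized)
4. Preserve attribution to the original author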
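
For the robots.txt requirement, Python's standard library includes a parser. The sketch below checks a URL against robots.txt text that has already been downloaded. `is_allowed` and the sample rules are illustrative, and whether the skill performs this check itself is not stated in the source.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration only.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

allowed = is_allowed(SAMPLE_ROBOTS, "https://example.com/article")
blocked = is_allowed(SAMPLE_ROBOTS, "https://example.com/private/report")
```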

---

## 🎛️ Advanced Configuration

### Custom User-Agent

```python
result = web_fetch(
    url="https://example.com",
    userAgent="Mozilla/5.0 ..."
)
```

### Proxy Configuration

```python
result = web_fetch(
    url="https://example.com",
    proxy="http://proxy:port"
)
```

### Cache Control

```python
url = "https://example.com"

# Enable caching (1 hour)
result = web_fetch(url, cache=True, ttl=3600)

# Force a refresh
result = web_fetch(url, cache=False)
```
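
The source does not describe how the cache works internally. To show the semantics that `cache` and `ttl` suggest, here is a minimal in-memory sketch. `TTLCache` and `get_or_fetch` are hypothetical names, and the real skill may cache differently (for example, on disk).

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl seconds."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # url -> (expires_at, value)

    def get_or_fetch(self, url, fetch, ttl=3600, use_cache=True):
        now = self._clock()
        entry = self._store.get(url)
        if use_cache and entry is not None and entry[0] > now:
            return entry[1]  # fresh hit
        value = fetch(url)  # miss, expired entry, or forced refresh
        self._store[url] = (now + ttl, value)
        return value
```

Passing `use_cache=False` mirrors the forced refresh above: the page is fetched again and the stored entry is replaced.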

---

## 📊 Performance Metrics

| Metric | Value |
|------|------|
| Average response time | 0.8 s |
| P95 response time | 2.5 s |
| Success rate | 85% |
| Cache hit rate | 60% |
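
The source does not say how these figures were measured. As an illustration of what the average, P95, and success-rate rows mean, the sketch below computes them from a list of samples using the nearest-rank percentile. `summarize_runs` and the sample data are made up for the example.

```python
def summarize_runs(latencies, successes):
    """Compute mean latency, nearest-rank P95 latency, and success rate."""
    ordered = sorted(latencies)
    # Nearest-rank P95: the smallest value with at least 95% of samples at or below it.
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return {
        "mean": sum(ordered) / len(ordered),
        "p95": ordered[rank - 1],
        "success_rate": sum(successes) / len(successes),
    }


# Made-up samples: 20 latencies in seconds and 20 success flags.
sample_latencies = list(range(1, 21))
sample_successes = [True] * 17 + [False] * 3
stats = summarize_runs(sample_latencies, sample_successes)
```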

---

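## 🔍 Troubleshooting

### Problem 1: Extracted content is empty

**Cause**: The page requires JavaScript rendering  
**Fix**: Switch to the browser engine

### Problem 2: WeChat article extraction fails

**Cause**: The link has expired or anti-scraping measures are in place  
**Fix**:
1. Check that the link is still valid
2. Try the browser engine
3. Copy the content manually

### Problem 3: Extracted content is incomplete

**Cause**: The maxChars limit  
**Fix**: Increase the maxChars parameter or process the content in pages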
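
The source does not define what processing in pages involves. One reading is to fetch with a larger `maxChars` and then split the text into pages for downstream steps. The sketch below does that on paragraph boundaries; `split_into_pages` is a hypothetical helper, not a feature of the skill.

```python
def split_into_pages(content: str, max_chars: int = 8000) -> list[str]:
    """Split text on blank lines so that each page is at most max_chars long."""
    pages, current = [], ""
    for para in content.split("\n\n"):
        # Hard-split any single paragraph that is longer than the limit.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                pages.append(current)
                current = piece
    if current:
        pages.append(current)
    return pages


text = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 25
pages = split_into_pages(text, max_chars=22)
```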

---

## 📚 Dependencies

```json
{
  "readability": "^0.4.4",
  "firecrawl": "^1.0.0",
  "defuddle": "^3.0.0"
}
```

---

## 🤝 Contributing

1. Fork this repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

---

## 📄 License

MIT License - see [LICENSE](LICENSE)

---

## 📞 Support

- **Documentation**: https://docs.openclaw.ai/skills/web-content-extractor
- **Issue tracker**: https://github.com/openclaw/openclaw/issues
- **Community**: https://discord.com/invite/clawd

---

**Last updated**: 2026-03-15  
**Maintenance status**: ✅ Actively maintained

Related Skills

错敏信息检测 API (Sensitive Content Detection)

3891

from openclaw/skills

A FastAPI-based detection service that checks text for sensitive words, typos, and non-standard phrasing.

Content Moderation & Analysis

content-parser

3891

from openclaw/skills

Extract and parse content from URLs. Triggers on: user provides a URL to extract content from, another skill needs to parse source material, "parse this URL", "extract content", "解析链接", "提取内容".

Data & Research

content-creator-pro

3891

from openclaw/skills

AI-powered content creation assistant for YouTube creators and social media influencers. Generate scripts, titles, hooks, thumbnail concepts, and social captions using natural language.

Content Repurposer - Multi-Platform Content Adaptor

3891

from openclaw/skills

Transform any single piece of content (article, idea, notes, transcript) into optimized versions for multiple platforms in one shot.

content-automation

3891

from openclaw/skills

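Content creation automation Skill. Supports social media content generation, video script writing, and scheduled publishing task management. Triggers when the user needs to batch-generate content, automate social media operations, or create video scripts.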

recipe-video-extractor

3891

from openclaw/skills

Extract a structured cooking recipe from a shared video URL when the user sends `recipe <url>`. Prioritize caption/description and comments via browser automation, then use web search/fetch as fallback with clear source attribution.

daily-fun-content

3891

from openclaw/skills

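Daily fun content generator: every morning it searches the web and pre-caches a day's worth of jokes, trending memes, and conversation tips, including funny one-liners, explanations of internet memes, and examples of emotionally intelligent dialogue. Triggered by cron; content is cached to a file and picked at random on each heartbeat.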

wechat-content-creator

3891

from openclaw/skills

Create high-quality WeChat public account articles with high eCPM. Use when writing WeChat articles, optimizing titles, selecting topics, or improving content quality. Covers 8 golden opening templates, SCQA structure, long-tail keyword integration, high-value niches like legal, finance, career, and compliance guidelines. Triggers on requests like write WeChat article, 公众号文章, 爆款文案, title optimization, 选题, eCPM optimization, or 长尾关键词.

content-factory

3891

from openclaw/skills

Multi-agent content production system. One piece of source content becomes many formats — social posts, email, scripts, headlines, and more. Five specialized agent personas: Writer, Remixer, Editor, Scriptwriter, and Headline Machine.

youtube-content-manager

3891

from openclaw/skills

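YouTube content management back end, supporting AI topic generation, script writing, title optimization, SEO description generation, thumbnail copy suggestions, publishing record management, and data analytics. Integrates the SkillPay payment interface and charges 0.001 USDT per call.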

youtube-content-manager-pro

3891

from openclaw/skills

All-in-one YouTube Content Management Tool, AI generate topics, scripts, titles, SEO descriptions, tags, thumbnails, analytics. $0.005 USDT per use.

social-media-content-scraper-pro

3891

from openclaw/skills

Social Media Content Bulk Scraper, extract articles/posts from WeChat, Instagram, TikTok, YouTube, export to Markdown/HTML with full metadata. $0.005 USDT per use.