Web Content Extractor - 网页内容提取器

**Version**: 2.0

3,891 stars

Best use case

Web Content Extractor - 网页内容提取器 is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using Web Content Extractor - 网页内容提取器 should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/web-fetch-vx/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/3511815125/web-fetch-vx/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/web-fetch-vx/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How Web Content Extractor - 网页内容提取器 Compares

| Feature / Agent | Web Content Extractor - 网页内容提取器 | Standard Approach |
|------|------|------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

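It extracts clean content from WeChat articles, blogs, and news pages, removing ads and sidebars, and returns it as Markdown, JSON, or plain text.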

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Web Content Extractor - 网页内容提取器

**Version**: 2.0  
**Author**: OpenClaw Team  
**Updated**: 2026-03-15  
**License**: MIT

---

## 📦 Skill Metadata

```yaml
name: web-content-extractor
version: 2.0.0
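description: Extract clean content from WeChat articles, blogs, and news pages, removing ads and sidebars
category: Content processing
tags: [web extraction, content cleaning, WeChat articles, Markdown]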
author: OpenClaw Team
license: MIT
```

---

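## 🎯 Overview

A web content extraction tool built on three engines (Readability, Firecrawl, and Defuddle) and optimized for Chinese-language content. It supports sources such as WeChat articles, news sites, and blogs, automatically strips ads, navigation, and sidebars, and outputs clean Markdown.

**Core capabilities**:
- ✅ WeChat article extraction (mp.weixin.qq.com)
- ✅ News page cleaning
- ✅ Blog post parsing
- ✅ Metadata extraction (title/author/date)
- ✅ Multiple output formats (Markdown/JSON/plain text)
- ✅ Batch processing support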

---

## 🚀 Quick Start

### Basic call

```python
# OpenClaw tool call
result = web_fetch(
    url="https://mp.weixin.qq.com/s/xxx",
    extractMode="markdown",
    maxChars=8000
)
```

### Full parameter list

| Parameter | Type | Required | Default | Description |
|------|------|------|--------|------|
| url | str | ✅ | - | Page URL |
| extractMode | str | ❌ | "markdown" | Output format (markdown/text/json) |
| maxChars | int | ❌ | 8000 | Maximum number of characters |
| includeMetadata | bool | ❌ | true | Whether to include metadata |
| timeout | int | ❌ | 30 | Timeout (seconds) |
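
The defaults in the table can be captured in a small request object that validates arguments before a call is made. This is a minimal sketch: `FetchRequest` and `ALLOWED_MODES` are hypothetical names introduced for illustration and are not part of the skill, and the validation rules are assumptions inferred from the table.

```python
from dataclasses import dataclass, asdict

# Output formats listed in the parameter table.
ALLOWED_MODES = {"markdown", "text", "json"}


@dataclass(frozen=True)
class FetchRequest:
    """Hypothetical container mirroring the documented parameters."""
    url: str
    extractMode: str = "markdown"
    maxChars: int = 8000
    includeMetadata: bool = True
    timeout: int = 30

    def __post_init__(self):
        # Assumed rules: http(s) URLs only, known mode, positive limits.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"url must be http(s): {self.url!r}")
        if self.extractMode not in ALLOWED_MODES:
            raise ValueError(f"unsupported extractMode: {self.extractMode!r}")
        if self.maxChars <= 0 or self.timeout <= 0:
            raise ValueError("maxChars and timeout must be positive")


request = FetchRequest(url="https://mp.weixin.qq.com/s/xxx")
kwargs = asdict(request)  # keyword arguments matching the table
print(kwargs)
```

The resulting dictionary could then be passed on as `web_fetch(**kwargs)`, assuming the tool accepts these keyword names as shown in the basic call.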

---

## 📤 Input and Output

### Input example

```json
{
  "url": "https://mp.weixin.qq.com/s/abcdefg",
  "extractMode": "markdown",
  "maxChars": 8000,
  "includeMetadata": true
}
```

### Output example

```json
{
  "success": true,
  "url": "https://mp.weixin.qq.com/s/abcdefg",
  "title": "Article title",
  "author": "Author name",
  "publishDate": "2026-03-15",
  "content": "Body text in Markdown format...",
  "wordCount": 2500,
  "readTime": "10 minutes",
  "images": ["https://..."],
  "extractTime": 0.8
}
```
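
A caller will usually want to check the response before using it. The sketch below parses a JSON response shaped like the output example, rejects incomplete or failed results, and builds an attribution line. `parse_result`, `attribution_line`, and `REQUIRED_FIELDS` are hypothetical helpers, and the shape of a failed response is an assumption, since the source only shows a successful one.

```python
import json

# Fields this sketch treats as mandatory (an assumption, not a documented contract).
REQUIRED_FIELDS = ("success", "url", "title", "content")


def parse_result(raw: str) -> dict:
    """Parse a JSON response and fail loudly if it is incomplete or unsuccessful."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not data["success"]:
        raise RuntimeError(f"extraction failed for {data['url']}")
    return data


def attribution_line(data: dict) -> str:
    """Build a one-line credit from the metadata fields, with fallbacks."""
    author = data.get("author") or "unknown author"
    date = data.get("publishDate") or "undated"
    return f"{data['title']} by {author} ({date}), source: {data['url']}"


raw_response = json.dumps({
    "success": True,
    "url": "https://mp.weixin.qq.com/s/abcdefg",
    "title": "Article title",
    "author": "Author name",
    "publishDate": "2026-03-15",
    "content": "Body text in Markdown format...",
})
result = parse_result(raw_response)
print(attribution_line(result))
```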

---

## 🔧 Architecture

### Three-engine design

```
                    User request
                          ↓
               ┌─────────────────────┐
               │    Routing layer    │
               └─────────────────────┘
                          ↓
         ┌────────────────┼────────────────┐
         ↓                ↓                ↓
  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
  │  web_fetch  │  │  defuddle   │  │   browser   │
  │   (fast)    │  │(specialized)│  │ (fallback)  │
  └─────────────┘  └─────────────┘  └─────────────┘
         ↓                ↓                ↓
               ┌─────────────────────┐
               │  Aggregation layer  │
               └─────────────────────┘
                          ↓
                   Return to user
```

### Engine comparison

| Engine | Speed | Success rate | Best for |
|------|------|--------|----------|
| web_fetch | <1s | 70% | WeChat articles / general web pages |
| defuddle | <1s | 75% | Blogs / news sites |
| browser | 5-10s | 90% | Complex SPAs / dynamic pages |
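
The source does not say how the routing layer chooses an engine. One plausible strategy, sketched here under that assumption, is a sequential fallback: try the fast engines first and fall through to the slower browser engine only when the earlier ones fail or return nothing. `extract_with_fallback` and the three `fake_*` engines are hypothetical stand-ins, not the skill's real engines.

```python
from typing import Callable, Optional, Sequence, Tuple

Engine = Callable[[str], Optional[str]]


def extract_with_fallback(url: str, engines: Sequence[Tuple[str, Engine]]) -> Tuple[str, str]:
    """Try each engine in order; return (engine_name, content) for the first non-empty result."""
    errors = []
    for name, engine in engines:
        try:
            content = engine(url)
        except Exception as exc:  # a failing engine should not abort the chain
            errors.append(f"{name}: {exc}")
            continue
        if content and content.strip():
            return name, content
        errors.append(f"{name}: empty result")
    raise RuntimeError("all engines failed: " + "; ".join(errors))


# Stand-in engines that simulate the three outcomes.
def fake_web_fetch(url: str) -> str:
    return ""  # simulates an empty extraction


def fake_defuddle(url: str) -> str:
    raise TimeoutError("timed out")


def fake_browser(url: str) -> str:
    return "# Title\n\nBody"


engine_used, content = extract_with_fallback(
    "https://example.com/article",
    [("web_fetch", fake_web_fetch), ("defuddle", fake_defuddle), ("browser", fake_browser)],
)
print(engine_used)
```

Ordering the chain from fastest to slowest keeps the common case cheap while still reaching the engine with the highest listed success rate when needed.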

---

## 📋 Use Cases

### Scenario 1: WeChat article extraction

```python
result = web_fetch(
    url="https://mp.weixin.qq.com/s/xxx",
    extractMode="markdown"
)
print(result["content"])
```

### Scenario 2: Batch processing

```python
urls = ["url1", "url2", "url3"]
results = [web_fetch(url=u) for u in urls]
```

### Scenario 3: Extraction with metadata

```python
result = web_fetch(
    url="https://example.com/article",
    includeMetadata=True
)
print(f"Title: {result['title']}")
print(f"Author: {result['author']}")
print(f"Word count: {result['wordCount']}")
```

---

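## ⚠️ Limitations and Caveats

### Unsupported scenarios

- ❌ Pages that require login
- ❌ Paywalled content
- ❌ CAPTCHA-protected pages
- ❌ SPAs rendered purely by JavaScript (use the browser engine instead)

### Rate limits

| Domain type | Request interval | Concurrency limit |
|----------|----------|----------|
| WeChat articles | 2 s | 1 |
| News sites | 1 s | 3 |
| Blogs | 1 s | 5 |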
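
Batch jobs need to respect these request intervals. The sketch below enforces a minimum gap between requests of the same domain type. `Throttle` and `MIN_INTERVAL` are hypothetical names, the category keys are assumptions, and the concurrency limits from the table are not handled here.

```python
import time

# Minimum seconds between requests, taken from the rate-limit table.
MIN_INTERVAL = {"wechat": 2.0, "news": 1.0, "blog": 1.0}


class Throttle:
    """Enforce a minimum interval between requests of the same domain type."""

    def __init__(self, intervals, clock=time.monotonic, sleep=time.sleep):
        self.intervals = intervals
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.sleep = sleep
        self.last = {}      # domain type -> time of the most recent request

    def wait(self, kind: str) -> float:
        """Block until a request of this kind is allowed; return the seconds waited."""
        delay = 0.0
        previous = self.last.get(kind)
        if previous is not None:
            elapsed = self.clock() - previous
            delay = max(0.0, self.intervals[kind] - elapsed)
        if delay > 0:
            self.sleep(delay)
        self.last[kind] = self.clock()
        return delay


throttle = Throttle(MIN_INTERVAL)
waited = throttle.wait("wechat")  # the first request of a kind never waits
print(waited)
```

In a batch loop, calling `throttle.wait(kind)` before each `web_fetch` call would space out requests; how a URL is mapped to a domain type is left to the caller.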

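### Compliance requirements

1. Extract only publicly accessible content
2. Respect the robots.txt protocol
3. No commercial use (unless authorized)
4. Preserve original author attribution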
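
The robots.txt requirement can be checked with Python's standard `urllib.robotparser` before a URL is fetched. This is a sketch under assumptions: `is_allowed`, the sample rules, and the `my-extractor` user agent are illustrative, and the source does not say whether the skill performs this check itself.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules; a real check would download the site's own /robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

public_ok = is_allowed(SAMPLE_ROBOTS, "my-extractor", "https://example.com/article")
private_ok = is_allowed(SAMPLE_ROBOTS, "my-extractor", "https://example.com/private/page")
print(public_ok, private_ok)
```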

---

## 🎛️ Advanced Configuration

### Custom User-Agent

```python
result = web_fetch(
    url="https://example.com",
    userAgent="Mozilla/5.0 ..."
)
```

### Proxy configuration

```python
result = web_fetch(
    url="https://example.com",
    proxy="http://proxy:port"
)
```

### Cache control

```python
# Enable caching (1 hour)
result = web_fetch(url, cache=True, ttl=3600)

# Force a refresh
result = web_fetch(url, cache=False)
```
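
The source does not describe how the cache is implemented. As a rough mental model only, the `cache` and `ttl` arguments could behave like the time-to-live cache sketched below, where `TTLCache` and `fake_fetch` are hypothetical and the refresh semantics of `cache=False` are an assumption.

```python
import time


class TTLCache:
    """Wrap a fetch function with a per-URL time-to-live cache."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch
        self._clock = clock
        self._store = {}  # url -> (time stored, value)

    def get(self, url, cache=True, ttl=3600):
        now = self._clock()
        if cache and url in self._store:
            stored_at, value = self._store[url]
            if now - stored_at < ttl:
                return value  # still fresh: serve the cached copy
        # Cache disabled, entry missing, or entry expired: fetch and store again.
        value = self._fetch(url)
        self._store[url] = (now, value)
        return value


calls = []


def fake_fetch(url):
    # Stand-in for the real extraction call; records each invocation.
    calls.append(url)
    return {"url": url, "content": f"fetch #{len(calls)}"}


cached = TTLCache(fake_fetch)
first = cached.get("https://example.com")
second = cached.get("https://example.com")               # served from cache
third = cached.get("https://example.com", cache=False)   # forces a refresh
print(len(calls))
```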

---

## 📊 Performance Metrics

| Metric | Value |
|------|------|
| Average response time | 0.8 s |
| P95 response time | 2.5 s |
| Success rate | 85% |
| Cache hit rate | 60% |

---

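## 🔍 Troubleshooting

### Problem 1: Extracted content is empty

**Cause**: The page requires JavaScript rendering  
**Fix**: Switch to the browser engine

### Problem 2: WeChat article extraction fails

**Cause**: The link has expired or anti-scraping measures are in place  
**Fix**:
1. Check that the link is valid
2. Try the browser engine
3. Copy the content manually

### Problem 3: Extracted content is incomplete

**Cause**: The maxChars limit  
**Fix**: Increase the maxChars parameter or process the content in pages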
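
For long articles, one way to process content in pages is to fetch with a larger `maxChars` and then split the text on the caller's side. The sketch below splits Markdown on paragraph boundaries so that each page stays within a character budget. `split_into_pages` is a hypothetical helper, and the tool's own truncation behavior is not specified in the source.

```python
from typing import List


def split_into_pages(content: str, max_chars: int) -> List[str]:
    """Split text into pages of at most max_chars, breaking at blank lines where possible."""
    pages: List[str] = []
    current = ""
    for para in content.split("\n\n"):
        if not para:
            continue  # skip empty paragraphs produced by extra blank lines
        # A single paragraph longer than the budget is cut into hard slices.
        while len(para) > max_chars:
            if current:
                pages.append(current)
                current = ""
            pages.append(para[:max_chars])
            para = para[max_chars:]
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            pages.append(current)
            current = para
    if current:
        pages.append(current)
    return pages


pages = split_into_pages("aaaa\n\nbbbb\n\ncccccccccccc", max_chars=10)
print(pages)
```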

---

## 📚 Dependencies

```json
{
  "readability": "^0.4.4",
  "firecrawl": "^1.0.0",
  "defuddle": "^3.0.0"
}
```

---

## 🤝 Contributing

1. Fork this repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

---

## 📄 License

MIT License - see [LICENSE](LICENSE)

---

## 📞 Support

- **Documentation**: https://docs.openclaw.ai/skills/web-content-extractor
- **Issue tracker**: https://github.com/openclaw/openclaw/issues
- **Community**: https://discord.com/invite/clawd

---

**Last updated**: 2026-03-15  
**Maintenance status**: ✅ Actively maintained
