web-reader

Smart web reader: fetches articles or downloads videos and archives them, with support for analysis, summarization, and derivative work. Triggers: '下载这篇文章', '抓取文章', '保存文章', 'fetch URL', '分析这篇文章', '摘要', '总结文章', '下载视频', '抓取微信文章', '抓取飞书文档', '把这个链接保存下来', '下载B站视频', 'download article', 'analyze article', 'summarize'.

3,891 stars

Best use case

web-reader is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using web-reader should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/web-reader/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/alexxxiong/web-reader/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/web-reader/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How web-reader Compares

| Feature / Agent | web-reader | Standard Approach |
|------|------|------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

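It is a smart web reader: it fetches articles or downloads videos and archives them, and supports analysis, summarization, and derivative work. Triggers: '下载这篇文章', '抓取文章', '保存文章', 'fetch URL', '分析这篇文章', '摘要', '总结文章', '下载视频', '抓取微信文章', '抓取飞书文档', '把这个链接保存下来', '下载B站视频', 'download article', 'analyze article', 'summarize'.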

Where can I find the source code?

The source code is on GitHub, in the openclaw/skills repository under skills/alexxxiong/web-reader.

SKILL.md Source

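# Web Reader

Smart web reader: fetch articles (text and images) → archive to a specified directory → analyze, summarize, or derive follow-up work.

## Workflow

### Phase 1: Download the original (run first)

When the user provides a URL and asks to download, save, fetch, or analyze it, **download the original first**.

**Step 1: Determine the archive path**

Check whether the user has configured an archive directory by reading `~/.claude/web-reader.json`: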

```json
{
  "archive_dir": "/Volumes/Keybase (m4)/private/biggerbear/文档/微信摘录",
  "categories": ["AI漫剧短剧", "AI工具与技术", "OpenClaw", "开发技术", "商业与产品", "飞书文档"]
}
```

If the config file does not exist, use `~/Documents/docs/` as the default archive directory.
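The lookup in Step 1 can be sketched in Python as follows. This is a minimal illustration, not the skill's own code: the function name `resolve_archive_dir` is hypothetical, and falling back to the default when the file exists but lacks an `archive_dir` key is an assumption.

```python
import json
from pathlib import Path

DEFAULT_ARCHIVE_DIR = "~/Documents/docs/"  # default named in the text above


def resolve_archive_dir(config_path="~/.claude/web-reader.json"):
    """Return the archive directory as an expanded Path (hypothetical helper)."""
    path = Path(config_path).expanduser()
    archive_dir = DEFAULT_ARCHIVE_DIR
    if path.is_file():
        with path.open(encoding="utf-8") as fh:
            config = json.load(fh)
        # Assumption: a config without "archive_dir" also uses the default.
        archive_dir = config.get("archive_dir", DEFAULT_ARCHIVE_DIR)
    return Path(archive_dir).expanduser()
```

Calling `resolve_archive_dir()` with no arguments reads the path named in the text; passing another path makes the helper easy to exercise against a temporary file.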

**Step 2: Fetch the article**

```bash
python3 {SKILL_DIR}/fetcher.py "URL" -o "ARCHIVE_DIR" --category "CATEGORY"
```

For pages that require JavaScript rendering (the fetch returns a "JavaScript enabled" error), escalate to browser mode:

```bash
# First try scrapling fetch (browser)
scrapling extract fetch "URL" /tmp/article.md --network-idle

# If that still fails, use camoufox
python3 {SKILL_DIR}/fetcher.py "URL" -o "ARCHIVE_DIR" --method camoufox --category "CATEGORY"
```
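The escalation order above amounts to trying a list of fetch tiers until one returns usable content. A rough Python sketch of that control flow follows. The function `fetch_with_fallback` and the callables passed to it are hypothetical stand-ins for the real commands, and treating the string "JavaScript enabled" in the returned content as the failure signal is an assumption about where that error shows up.

```python
def fetch_with_fallback(url, strategies):
    """Try each (name, fetch_fn) in order; return (name, content) on first success.

    `strategies` is a list of hypothetical callables that take a URL and
    return the fetched text, e.g. wrappers around the commands shown above.
    """
    errors = []
    for name, fetch_fn in strategies:
        try:
            content = fetch_fn(url)
        except Exception as exc:  # one failed tier should not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        # Assumption: a JavaScript-gated page is detectable from its text.
        if content and "JavaScript enabled" not in content:
            return name, content
        errors.append(f"{name}: empty or JavaScript-gated response")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))
```

Collecting the per-tier errors and raising them together keeps the reason for each failed tier visible when every tier fails.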

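**Step 3: Handle WeChat images**

For WeChat Official Account articles (mp.weixin.qq.com), fetcher.py handles `data-src` images automatically.
If automatic handling fails (images appear as empty `![]()` placeholders), handle them manually:

1. Get the HTML with scrapling: `scrapling extract get "URL" /tmp/wx.html -s "#js_content"`
2. Extract the `data-src="https://mmbiz.qpic.cn..."` URLs from the HTML
3. Download the images to the `ARCHIVE_DIR/CATEGORY/SLUG/` directory
4. Replace the empty placeholders in the markdown with the local paths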
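Steps 2 and 4 of the manual procedure can be sketched in Python as below. Both function names are hypothetical, the download step is left out, and the sketch assumes the empty placeholders appear in the markdown in the same order as the `data-src` attributes appear in the HTML.

```python
import html
import re

# Matches data-src attributes that point at the WeChat image host.
DATA_SRC_RE = re.compile(r'data-src="(https://mmbiz\.qpic\.cn[^"]*)"')


def extract_data_src_urls(page_html):
    """Return the mmbiz.qpic.cn image URLs in document order."""
    # Unescape entities such as &amp; that appear inside attribute values.
    return [html.unescape(url) for url in DATA_SRC_RE.findall(page_html)]


def fill_empty_placeholders(markdown, local_paths):
    """Replace each empty `![]()` with the next local path, in order."""
    paths = iter(local_paths)

    def replace(match):
        path = next(paths, None)
        # Leave the placeholder untouched if there are no paths left.
        return f"![]({path})" if path is not None else match.group(0)

    return re.sub(r"!\[\]\(\)", replace, markdown)
```

Duplicate URLs are kept on purpose, so that an image used twice in the article still lines up with its two placeholders.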

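**Step 4: Classify automatically**

Choose a category based on the article content. Reference keywords per category:
- **AI漫剧短剧** (AI comics and short dramas): 漫剧, 短剧, 动漫, seedance, AI视频创作
- **AI工具与技术** (AI tools and technology): AI工具, 大模型, scrapling, agent, API, 技术方案
- **OpenClaw**: openclaw, claude code, agent team
- **开发技术** (development): flutter, 编程, 开发, 逆向, IDE, 部署
- **商业与产品** (business and product): 产品分析, 商业模式, 市场, 付费, 创业
- **飞书文档** (Feishu docs): the feishu.cn domain

If the category is unclear, **ask the user** to choose one or create a new one.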
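The keyword list above could drive a simple scoring function like the one below. The source leaves the judgment to the agent reading the article, so this is only a sketch of one mechanical approach. The names `suggest_category` and `count_keyword`, the tie-breaking rule, and the choice to return `None` (meaning "ask the user") are all assumptions.

```python
import re
from urllib.parse import urlparse

# Keywords copied from the list above; the feishu.cn rule is handled by domain.
CATEGORY_KEYWORDS = {
    "AI漫剧短剧": ["漫剧", "短剧", "动漫", "seedance", "AI视频创作"],
    "AI工具与技术": ["AI工具", "大模型", "scrapling", "agent", "API", "技术方案"],
    "OpenClaw": ["openclaw", "claude code", "agent team"],
    "开发技术": ["flutter", "编程", "开发", "逆向", "IDE", "部署"],
    "商业与产品": ["产品分析", "商业模式", "市场", "付费", "创业"],
}


def count_keyword(text, keyword):
    """Count occurrences; ASCII keywords must not sit inside a longer Latin word."""
    if keyword.isascii():
        pattern = r"(?<![A-Za-z])" + re.escape(keyword) + r"(?![A-Za-z])"
        return len(re.findall(pattern, text, flags=re.IGNORECASE))
    return text.count(keyword)


def suggest_category(url, text):
    """Return the best-scoring category, or None when the result is unclear."""
    host = urlparse(url).hostname or ""
    if host == "feishu.cn" or host.endswith(".feishu.cn"):
        return "飞书文档"
    scores = {
        category: sum(count_keyword(text, keyword) for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores.values())
    winners = [category for category, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None  # no match or a tie: the caller should ask the user
    return winners[0]
```

The letter-boundary check keeps a short keyword such as `IDE` from matching inside words like "video", while still matching when it sits directly next to Chinese text.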

**Step 5: Confirm the result**

Report the archive details:
- Article title
- Save path
- Number of images
- Category

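### Phase 2: Follow-up exploration (user-triggered)

After the article is downloaded, wait for the user's instruction:

| Command | Action |
|------|------|
| **分析** (analyze) | Read the full article, extract its core claims, evidence, and methodology, and give a structured analysis |
| **摘要** (abstract) | Produce a concise 3-5 sentence summary that keeps the key data and conclusions |
| **总结** (summarize) | List the key points section by section, suited to quick review |
| **衍生** (derive) | Based on the article, propose directions for further exploration, related topics, and practical suggestions |
| **提炼** (extract) | Extract directly reusable methods, tools, configuration, and code snippets |
| **对比** (compare) | Compare with other previously downloaded articles (the comparison target must be specified) |
| **洗稿** (rewrite) | Rewrite from the original, keeping the core information but fully rewording it (used together with /wxpub) |

When running an analysis, read the archived markdown file directly; there is no need to fetch again.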

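## Smart routing

| Platform | Method | Notes |
|------|------|------|
| mp.weixin.qq.com | scrapling | Extracts `data-src` images, handles SVG placeholders |
| *.feishu.cn | virtual scroll | Collects content blocks while scrolling, downloads images inside the browser |
| zhuanlan.zhihu.com | scrapling | `.Post-RichText` selector |
| www.zhihu.com | scrapling | `.RichContent` selector |
| www.toutiao.com | scrapling | Handles toutiaoimg base64 placeholders |
| www.xiaohongshu.com | camoufox | Anti-scraping measures require a stealth browser |
| www.weibo.com | camoufox | Anti-scraping measures require a stealth browser |
| bilibili.com / b23.tv | yt-dlp | Video download |
| youtube.com / youtu.be | yt-dlp | Video download |
| douyin.com | yt-dlp | Video download |
| Other URLs | scrapling | Generic fetch with a three-tier fallback strategy |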
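A routing table like this maps naturally onto a host-suffix lookup. The sketch below is one way to express it, not the logic in fetcher.py. The name `pick_method` is hypothetical, the returned strings reuse the `--method` values from the CLI reference (so mapping the Feishu row to `feishu` is an assumption), and matching subdomains such as `www.bilibili.com` against `bilibili.com` is also an assumption.

```python
from urllib.parse import urlparse

# (host suffix, method) pairs copied from the routing table above.
ROUTES = [
    ("mp.weixin.qq.com", "scrapling"),
    ("feishu.cn", "feishu"),
    ("zhuanlan.zhihu.com", "scrapling"),
    ("www.zhihu.com", "scrapling"),
    ("www.toutiao.com", "scrapling"),
    ("www.xiaohongshu.com", "camoufox"),
    ("www.weibo.com", "camoufox"),
    ("bilibili.com", "ytdlp"),
    ("b23.tv", "ytdlp"),
    ("youtube.com", "ytdlp"),
    ("youtu.be", "ytdlp"),
    ("douyin.com", "ytdlp"),
]


def pick_method(url):
    """Return the fetch method for a URL, defaulting to generic scrapling."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, method in ROUTES:
        # Match the host itself or any subdomain of it, never a partial label.
        if host == suffix or host.endswith("." + suffix):
            return method
    return "scrapling"
```

Checking for a leading dot before the suffix avoids false matches on unrelated hosts that merely end with the same characters.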

## Installing dependencies

Install as needed:

| Dependency | Purpose | Install |
|------|------|------|
| scrapling | Article fetching | `pip install scrapling` |
| yt-dlp | Video download | `pip install yt-dlp` |
| camoufox | Anti-detection browser | `pip install camoufox && python3 -m camoufox fetch` |
| html2text | HTML to Markdown conversion | `pip install html2text` |

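## CLI reference

```
python3 {SKILL_DIR}/fetcher.py [URL] [OPTIONS]

Arguments:
  url                    URL to fetch

Options:
  -o, --output DIR       Output directory (default: current directory)
  -q, --quality N        Video quality (default: 1080)
  --method METHOD        Force a method: scrapling, camoufox, ytdlp, feishu
  --selector CSS         Force a CSS selector
  --urls-file FILE       File with a list of URLs (one per line, # for comments)
  --audio-only           Extract audio only
  --no-images            Skip image download
  --cookies-browser NAME Browser to take cookies from (e.g. chrome, firefox)
  --category NAME        Archive subdirectory name
  --json-output          JSON output (for programmatic use)
```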
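The `--urls-file` format (one URL per line, `#` for comments) is simple enough to illustrate with a short parser. This sketch is not taken from fetcher.py. The name `read_urls_file` is hypothetical, and it assumes only whole-line comments are recognized, since a `#` later in a line may be a legitimate URL fragment.

```python
def read_urls_file(path):
    """Return the URLs listed in `path`, skipping blank lines and # comments."""
    urls = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            # Assumption: only lines that start with # are comments.
            if not line or line.startswith("#"):
                continue
            urls.append(line)
    return urls
```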

## Configuration

Create `~/.claude/web-reader.json` to configure the archive directory:

```json
{
  "archive_dir": "~/Documents/docs/articles",
  "categories": ["技术", "产品", "行业"]
}
```

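## Platform-specific handling

### WeChat Official Accounts
- Images use the `data-src` attribute and are hosted on `mmbiz.qpic.cn`
- The visible `<img>` is an SVG lazy-loading placeholder
- Image downloads require the `Referer: https://mp.weixin.qq.com/` request header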
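The Referer requirement can be met with the standard library alone. The sketch below shows one way to attach the header; the function names are hypothetical and this is not necessarily how fetcher.py performs the download.

```python
import urllib.request

WECHAT_REFERER = "https://mp.weixin.qq.com/"  # header value from the list above


def build_image_request(image_url):
    """Build a GET request for a WeChat image with the required Referer."""
    return urllib.request.Request(image_url, headers={"Referer": WECHAT_REFERER})


def download_image(image_url, dest_path, timeout=30):
    """Download one image to dest_path and return its size in bytes.

    This performs a real network call when invoked.
    """
    request = build_image_request(image_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = response.read()
    with open(dest_path, "wb") as fh:
        fh.write(data)
    return len(data)
```

Keeping request construction separate from the download makes the header logic checkable without touching the network.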

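### Feishu docs
- Virtual scrolling: content that has scrolled past is removed from the DOM
- An injected JS collector captures `[data-block-id]` content blocks while scrolling
- Images return 401: they must be downloaded inside the browser context with `fetch(url, {credentials: 'include'})`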
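Because scrolled-past blocks leave the DOM, the collector has to accumulate blocks across scroll positions and deduplicate them by block id. The real collector is JavaScript injected into the page; the Python sketch below only illustrates that accumulation step. It assumes each scroll step yields `(block_id, text)` pairs, and the function name and the keep-the-longest-text rule are assumptions.

```python
def merge_block_snapshots(snapshots):
    """Merge per-scroll snapshots of (block_id, text) pairs into ordered text.

    Blocks are keyed by id so that a block seen at several scroll positions
    is kept once, in the position where it was first seen.
    """
    merged = {}  # dicts preserve insertion order, so first-seen order is kept
    for snapshot in snapshots:
        for block_id, text in snapshot:
            if block_id not in merged:
                merged[block_id] = text
            elif len(text) > len(merged[block_id]):
                # Assumption: a longer capture means the block had finished
                # rendering, so it replaces the earlier partial text.
                merged[block_id] = text
    return list(merged.values())
```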

### Bilibili
- Short links (b23.tv) are resolved automatically
- For premium-member (大会员) content, use `--cookies-browser chrome`

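## Troubleshooting

| Problem | Solution |
|------|----------|
| Page requires JavaScript | Use `--method camoufox`, or try `scrapling extract fetch` first |
| WeChat images are empty | Manually extract the `data-src` URLs from the HTML |
| Feishu returns a login page | The document may require authentication |
| Bilibili 403 | Use `--cookies-browser chrome` |
| Content is too short | Try `--method camoufox` |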
