wechat-article-extractor

Extract metadata and content from WeChat Official Account articles. Use when user needs to parse WeChat article URLs (mp.weixin.qq.com), extract article info (title, author, content, publish time, cover image), or convert WeChat articles to structured data. Supports various article types including posts, videos, images, voice messages, and reposts.

33 stars

by theneoai

Best use case

wechat-article-extractor is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using wechat-article-extractor should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/wechat-article-extractor/SKILL.md --create-dirs "https://raw.githubusercontent.com/theneoai/awesome-skills/main/skills/persona/content/wechat-article-extractor/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/wechat-article-extractor/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How wechat-article-extractor Compares

Feature / Agent	wechat-article-extractor	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

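What does this skill do?

It extracts metadata and content (title, author, content, publish time, cover image) from WeChat Official Account article URLs on mp.weixin.qq.com and returns them as structured data.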

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# WeChat Article Extractor

Extract metadata and content from WeChat Official Account (微信公众号) articles.

## Quick Start 🚀

### Command Line Usage (Recommended)

The easiest way to use this skill is via the CLI command:

```bash
# Basic usage - extracts and saves as markdown
npx wechat-article-extractor https://mp.weixin.qq.com/s/xxx

# Specify output path
npx wechat-article-extractor https://mp.weixin.qq.com/s/xxx --output ./articles/post.md

# Output JSON format
npx wechat-article-extractor https://mp.weixin.qq.com/s/xxx --json

# Show help
npx wechat-article-extractor --help
```

### Programmatic Usage

```javascript
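const { extract } = require('./scripts/extract.js');

// Top-level await is not available in CommonJS, so wrap the call in an async function.
async function main() {
  const result = await extract('https://mp.weixin.qq.com/s?__biz=...');
  if (result.done) {
    console.log(result.data.msg_title);
    console.log(result.data.msg_content);
  } else {
    console.error(result.code, result.msg);
  }
}
main();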
```

## Capabilities

- Parse WeChat article URLs (`mp.weixin.qq.com`)
- Extract article metadata: title, author, description, publish time
- Extract account info: name, avatar, alias, description
- Get article content (HTML)
- Get cover image URL
- Support multiple article types: post, video, image, voice, text, repost
- Handle various error cases: deleted content, expired links, access limits

## CLI Options

| Option | Description | Default |
|--------|-------------|---------|
| `<URL>` | WeChat article URL | Required |
| `--output <path>` | Output file path | `./wechat-article.md` |
| `--format <format>` | Output format (markdown\|json\|html) | `markdown` |
| `--json` | Output JSON format | false |
| `-h, --help` | Show help | - |

### Examples

```bash
# Extract article and save to custom location
npx wechat-article-extractor https://mp.weixin.qq.com/s/xxx --output ./my-article.md

# Get JSON output for processing
npx wechat-article-extractor https://mp.weixin.qq.com/s/xxx --json > article.json

# From within the skill directory
npm run extract https://mp.weixin.qq.com/s/xxx
```

## Usage in Scripts

### Basic Extraction from URL

```javascript
const { extract } = require('./scripts/extract.js');

(async () => {
  const result = await extract('https://mp.weixin.qq.com/s?__biz=...');
  // On success, resolves to: { done: true, code: 0, data: {...} }
  console.log(result.code);
})();
```

### Extraction from HTML

```javascript
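// Run inside an async function, with `extract` imported as shown above.
const sourceUrl = 'https://mp.weixin.qq.com/s/xxx';
const html = await fetch(sourceUrl).then(r => r.text());
const result = await extract(html, { url: sourceUrl });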
```

### Advanced Options

```javascript
const result = await extract(url, {
  shouldReturnContent: true,      // Return HTML content (default: true)
  shouldReturnRawMeta: false,     // Return raw metadata (default: false)
  shouldFollowTransferLink: true, // Follow migrated account links (default: true)
  shouldExtractMpLinks: false,    // Extract embedded mp.weixin links (default: false)
  shouldExtractTags: false,       // Extract article tags (default: false)
  shouldExtractRepostMeta: false  // Extract repost source info (default: false)
});
```

## Response Format

### Success Response

```javascript
{
  done: true,
  code: 0,
  data: {
    // Account info
    account_name: "公众号名称", // account name
    account_alias: "微信号", // WeChat ID
    account_avatar: "头像URL", // avatar URL
    account_description: "功能介绍", // account description
    account_id: "原始ID", // original ID
    account_biz: "biz参数", // biz parameter
    account_biz_number: 1234567890,
    account_qr_code: "二维码URL", // QR code URL

    // Article info
    msg_title: "文章标题", // article title
    msg_desc: "文章摘要", // article summary
    msg_content: "HTML内容", // HTML content
    msg_cover: "封面图URL", // cover image URL
    msg_author: "作者", // author
    msg_type: "post", // post|video|image|voice|text|repost
    msg_has_copyright: true,
    msg_publish_time: Date,
    msg_publish_time_str: "2024/01/15 10:30:00",

    // Link params
    msg_link: "文章链接", // article link
    msg_source_url: "阅读原文链接", // "Read original" link
    msg_sn: "sn参数", // sn parameter
    msg_mid: 1234567890,
    msg_idx: 1
  }
}
```
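
Because the `data` object is flat, it can be mapped to other shapes with little code. The following is a minimal sketch and not part of the skill: `toFrontMatter` is a hypothetical helper that renders a few of the documented fields as a YAML-style front matter block. It assumes `msg_publish_time` is a JavaScript `Date` and that missing fields arrive as `null` or `undefined`.

```javascript
// Hypothetical helper (not part of the skill): render selected fields of a
// successful result's `data` object as a YAML-style front matter block.
function toFrontMatter(data) {
  // Assumption: msg_publish_time is a Date; fall back to an empty string otherwise.
  const published = data.msg_publish_time instanceof Date
    ? data.msg_publish_time.toISOString()
    : '';

  const fields = {
    title: data.msg_title,
    author: data.msg_author,
    account: data.account_name,
    type: data.msg_type,
    published,
    source: data.msg_link,
  };

  const lines = Object.entries(fields).map(
    // JSON.stringify quotes and escapes each value; null/undefined become "".
    ([key, value]) => `${key}: ${JSON.stringify(value ?? '')}`
  );
  return ['---', ...lines, '---'].join('\n');
}

// Example with made-up values shaped like the success response above.
const sample = {
  msg_title: 'Hello',
  msg_author: null,
  account_name: 'Demo Account',
  msg_type: 'post',
  msg_publish_time: new Date(Date.UTC(2024, 0, 15, 2, 30, 0)),
  msg_link: 'https://mp.weixin.qq.com/s/xxx',
};
console.log(toFrontMatter(sample));
```

The field selection and key names here are arbitrary; adjust them to whatever the downstream tool expects.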

### Error Response

```javascript
{
  done: false,
  code: 1001,
  msg: "无法获取文章信息" // "Unable to get article info"
}
```

## Error Codes

| Code | Message | Description |
|------|---------|-------------|
| 1000 | 文章获取失败 | General failure |
| 1001 | 无法获取文章信息 | Missing title or publish time |
| 1002 | 请求失败 | HTTP request failed |
| 1003 | 响应为空 | Empty response |
| 1004 | 访问过于频繁 | Rate limited |
| 1005 | 脚本解析失败 | Script parsing error |
| 1006 | 公众号已迁移 | Account migrated |
| 2001 | 请提供文章内容或链接 | Missing input |
| 2002 | 链接已过期 | Link expired |
| 2003 | 内容涉嫌侵权 | Content removed (copyright) |
| 2004 | 无法获取迁移后的链接 | Migration link failed |
| 2005 | 内容已被发布者删除 | Content deleted by author |
| 2006 | 内容因违规无法查看 | Content blocked |
| 2007 | 内容发送失败 | Failed to send |
| 2008 | 系统出错 | System error |
| 2009 | 不支持的链接 | Unsupported URL |
| 2010 | 内容获取失败 | Content fetch failed |
| 2011 | 涉嫌过度营销 | Marketing/spam content |
| 2012 | 账号已被屏蔽 | Account blocked |
| 2013 | 账号已自主注销 | Account deleted |
| 2014 | 内容被投诉 | Content reported |
| 2015 | 账号处于迁移流程中 | Account migrating |
| 2016 | 冒名侵权 | Impersonation |
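
A caller will usually want to group these codes rather than handle each one individually. The sketch below shows one possible grouping. Which codes count as transient is an assumption made for illustration (the table does not say), and `classifyResult` is a hypothetical helper, not part of the skill.

```javascript
// Assumption: these codes describe transient conditions (request failed,
// empty response, rate limited). The grouping is illustrative only.
const RETRYABLE_CODES = new Set([1002, 1003, 1004]);

// Hypothetical helper: turn a result object into a coarse outcome label.
function classifyResult(result) {
  if (result.done) return 'ok';
  if (RETRYABLE_CODES.has(result.code)) return 'retry';
  // Everything else is treated as not worth retrying with the same input.
  return 'permanent';
}

console.log(classifyResult({ done: true, code: 0, data: {} }));                   // ok
console.log(classifyResult({ done: false, code: 1004, msg: '访问过于频繁' }));       // retry
console.log(classifyResult({ done: false, code: 2005, msg: '内容已被发布者删除' })); // permanent
```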

## Dependencies

Required npm packages:
- `cheerio` - HTML parsing
- `dayjs` - Date formatting
- `request-promise` - HTTP requests
- `qs` - Query string parsing
- `lodash.unescape` - HTML entities

## Notes

- Handles various WeChat page structures and anti-scraping measures
- Automatically detects article type from page content
- Supports extracting from Sogou WeChat search results (`weixin.sogou.com`)
- Some fields may be null depending on article type and page structure
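
Since unsupported links are reported with code 2009, a caller may prefer to reject them before making any request. The following is a rough sketch of such a pre-check. `isSupportedUrl` is a hypothetical helper, and treating the two hosts named in this document as the complete set of accepted hosts is an assumption.

```javascript
// Hosts mentioned in this document. Treating these as the full set of
// accepted hosts is an assumption made for illustration.
const SUPPORTED_HOSTS = new Set(['mp.weixin.qq.com', 'weixin.sogou.com']);

// Hypothetical pre-check: true when `input` parses as an http(s) URL on a known host.
function isSupportedUrl(input) {
  let parsed;
  try {
    parsed = new URL(input);
  } catch {
    return false; // not a parseable absolute URL
  }
  const isHttp = parsed.protocol === 'http:' || parsed.protocol === 'https:';
  return isHttp && SUPPORTED_HOSTS.has(parsed.hostname);
}

console.log(isSupportedUrl('https://mp.weixin.qq.com/s/xxx')); // true
console.log(isSupportedUrl('https://example.com/s/xxx'));      // false
console.log(isSupportedUrl('not a url'));                      // false
```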

## Troubleshooting

### "MODULE_NOT_FOUND" error
Ensure you're running the command from within the skill directory or using `npx`:
```bash
cd ~/.claude/skills/wechat-article-extractor  # adjust to wherever the skill is installed
npm run extract <URL>
```

### "访问过于频繁" (access too frequent) error
Wait a few minutes before trying again. This is WeChat's rate limiting.
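
For unattended scripts, the waiting can be automated. The sketch below retries only when the result carries code 1004 and doubles the wait between attempts. It is an illustration under stated assumptions: `extractWithBackoff` and its options (`maxAttempts`, `baseDelayMs`, `sleep`) are hypothetical names, and the one-minute starting delay is a guess, not a documented WeChat limit.

```javascript
// Hypothetical wrapper: retry `extractFn(url)` while it reports code 1004
// (rate limited), waiting longer before each new attempt.
async function extractWithBackoff(extractFn, url, options = {}) {
  const {
    maxAttempts = 3,
    baseDelayMs = 60000, // assumption: start with a one-minute wait
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;

  let result;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    result = await extractFn(url);
    const rateLimited = !result.done && result.code === 1004;
    // Stop on success, on any other error, or when attempts are exhausted.
    if (!rateLimited || attempt === maxAttempts) break;
    // Exponential backoff: baseDelayMs, then 2x, 4x, ...
    await sleep(baseDelayMs * 2 ** (attempt - 1));
  }
  return result;
}
```

With the real extractor this would be called as `extractWithBackoff(extract, url)`, where `extract` comes from `./scripts/extract.js`. The `sleep` option exists so that tests can replace the real timer.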

### Link expired
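If you see "链接已过期" (link expired, code 2002), the link itself is no longer valid. This is distinct from deletion by the author (code 2005) or blocking by the platform (code 2006), which are reported separately. Try again with a freshly copied link to the article.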

Related Skills

wechat-writer-kit

from theneoai/awesome-skills

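End-to-end writing assistant for WeChat Official Account articles (general edition). Supports multi-account management and works for any account niche. Triggers when the user says "今天写什么" (what should I write today), "帮我写公众号" (help me write for my Official Account), "选题" (topic selection), "公众号选题" (Official Account topic selection), "写文章" (write an article), "公众号文章" (Official Account article), "更新公众号风格" (update the account style), "这篇文章写得好" (this article is well written), or "新建公众号" (create a new Official Account). Supports the full workflow: initialize or switch the account profile → search trending topics to generate topic ideas → confirm the topic → write the article → archive to WeCom (企业微信) Docs → keep learning the style.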

wechat-publisher

from theneoai/awesome-skills

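Say goodbye to garbled formatting in Official Account posts. Publish an article with a single command while preserving 100% of the styling. Supports three color templates (cyan, orange, and purple) with one-click switching. Includes the full workflow for automatic publishing to the WeChat drafts box, and supports generating cover images with Chinese text.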

wechat-article-reviewer

from theneoai/awesome-skills

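Review assistant for WeChat Official Account articles. A review is triggered when theneoai finishes an article and sends its content with an @铁蛋队长 mention. Checks whether the article meets WeChat Official Account publishing standards: word count, depth of content, factual accuracy, title quality, originality, sensitive words, stylistic consistency, and layout conventions. When problems are found, it provides specific revision suggestions and returns the article to theneoai for rewriting (up to 3 times); after 3 failed reviews it notifies lucas for manual intervention.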

write-skill

from theneoai/awesome-skills

Meta-skill for creating high-quality SKILL.md files. Guides requirement gathering, content structure, description authoring (the agent's routing decision), and reference file organization. Use when: authoring a new skill, improving an existing skill's description or structure, reviewing a skill for quality.

caveman

from theneoai/awesome-skills

Ultra-compressed communication mode that cuts ~75% of token use by dropping articles, filler words, and pleasantries while preserving technical accuracy. Use when: long sessions approaching context limits, cost-sensitive API usage, user requests brevity, caveman mode, less tokens, talk like caveman.

zoom-out

from theneoai/awesome-skills

Codebase orientation skill: navigate unfamiliar code by ascending abstraction layers to map modules, callers, and domain vocabulary. Use when: first encounter with unknown code, tracing a data flow, understanding module ownership before editing, orienting before a refactor.

to-prd

from theneoai/awesome-skills

Converts conversation context into a structured Product Requirements Document (PRD) and publishes it to the project issue tracker. Do NOT interview the user — synthesize what is already known. Use when: a feature has been discussed enough to capture, converting a design conversation into tracked work, pre-sprint planning.

tdd-workflow

from theneoai/awesome-skills

Test-driven development workflow using vertical slices (tracer bullets). Enforces behavior-first testing through public interfaces. Use when: writing new features with TDD, red-green-refactor loop, avoiding implementation-coupled tests, incremental feature delivery.

issue-triage

from theneoai/awesome-skills

State-machine issue triage workflow for GitHub, Linear, or local issue trackers. Manages category labels (bug, enhancement) and state labels (needs-triage, needs-info, ready-for-agent, ready-for-human, wontfix). Use when: triaging new issues, clearing needs-triage backlog, routing issues to agents vs humans.

debug-diagnose

from theneoai/awesome-skills

Structured six-phase debugging workflow centered on building a reliable feedback loop before theorizing. Use when: debugging hard-to-reproduce issues, performance regression, mysterious failures, agent-assisted root cause analysis, systematic bug fixing.

architecture-review

from theneoai/awesome-skills

Codebase architecture review using module depth analysis. Surfaces shallow modules, tight coupling, and locality violations. Proposes deepening opportunities. Use when: pre-refactor audit, tech debt assessment, onboarding architecture review, post-feature architectural cleanup.

vault-secrets-expert

from theneoai/awesome-skills

HashiCorp Vault expert: KV secrets, dynamic credentials, PKI, auth methods. Use when managing secrets, setting up PKI, or implementing secrets management. Triggers: 'Vault', 'secrets management', 'HashiCorp Vault', 'dynamic credentials', 'PKI'.