content-parser

Extract and parse content from URLs. Triggers on: user provides a URL to extract content from, another skill needs to parse source material, "parse this URL", "extract content", "解析链接", "提取内容".

3,891 stars
Complexity: medium

About this skill

This skill lets AI agents extract and parse textual content from a given URL. It serves as a preprocessing step: agents ingest web content through a standardized, structured output containing the main content body, relevant metadata (title, author, publish date), and associated references, ready for downstream analysis, summarization, or content generation. It hides the complexity of web scraping so agents can work from clean, parsed data. The skill triggers when a user provides a URL for content extraction or when another skill needs source material parsed from a web link, and it responds to commands like "parse this URL" or "extract content," including their Chinese equivalents. Its core purpose is to normalize content from different platforms into a consistent data format, making web information readily usable by AI agents.

Best use case

The primary use case is giving AI agents a robust, reliable way to access, understand, and use information from the public web. It benefits developers building agents for research, summarization, content creation, or data analysis by simplifying data acquisition and preparation from online sources.

Output

A structured JSON object containing the extracted content body, metadata (title, author, publish date), and any references found at the provided URL.

Practical example

Example input

Please extract the content from this article: https://www.theverge.com/2023/10/26/23933010/google-pixel-8-pro-review

Example output

```json
{
  "url": "https://www.theverge.com/2023/10/26/23933010/google-pixel-8-pro-review",
  "title": "Google Pixel 8 Pro review: smarter than ever",
  "author": "Allison Johnson",
  ""publish_date": "2023-10-26T13:00:00Z",
  "body": "The Google Pixel 8 Pro is a phone with a lot going for it... [excerpt of parsed article body]",
  "keywords": ["Google", "Pixel", "Smartphone", "Review", "AI"],
  "references": []
}
```
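Downstream consumers can pick fields straight out of this object. For example, with jq, assuming the object above has been saved to a hypothetical `result.json`:

```bash
# Pull selected metadata fields from the parsed result
jq -r '.title, .author, .publish_date' result.json

# Extract just the body text for downstream summarization
jq -r '.body' result.json > article.txt
```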

When to use this skill

  • User provides a URL and wants its content extracted or read.
  • Another skill requires parsing source material from a URL before generation.
  • User explicitly commands to "parse this URL" or "extract content from this link."
  • When raw text content, metadata, and references are needed from a web page.

When not to use this skill

  • User already possesses the text content and doesn't need URL parsing.
  • User's request involves generating audio or video content.
  • User wishes to read a local file (standard file reading tools are more appropriate).
  • The URL provided is invalid or points to a non-HTTP(S) resource.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/content-parser/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/0xfango/content-parser/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/content-parser/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill
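
Equivalently, the project-local install can be scripted with the same raw URL used in the one-liner above:

```bash
# Fetch SKILL.md into the project's .claude/skills directory
mkdir -p .claude/skills/content-parser
curl -o .claude/skills/content-parser/SKILL.md \
  "https://raw.githubusercontent.com/openclaw/skills/main/skills/0xfango/content-parser/SKILL.md"
```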

How content-parser Compares

| Feature | content-parser | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Medium | N/A |

Frequently Asked Questions

What does this skill do?

It extracts and parses content from URLs, returning the main body, metadata, and references as structured JSON. It triggers when a user provides a URL to extract content from, when another skill needs to parse source material, or on commands like "parse this URL", "extract content", "解析链接", or "提取内容".

How difficult is it to install?

The installation complexity is rated as medium. You can find the installation instructions above.

Where can I find the source code?

You can find the source code in the openclaw/skills repository on GitHub, the same repository referenced by the installation command above.

SKILL.md Source

## When to Use

- User provides a URL and wants to extract/read its content
- Another skill needs to parse source material from a URL before generation
- User says "parse this URL", "extract content from this link"
- User says "解析链接", "提取内容"

## When NOT to Use

- User already has text content and doesn't need URL parsing
- User wants to generate audio/video content (not content extraction)
- User wants to read a local file (use standard file reading tools)

## Purpose

Extract and normalize content from URLs across supported platforms. Returns structured data including content body, metadata, and references. Useful as a preprocessing step for content generation skills or standalone content extraction.

## Hard Constraints

- No shell scripts. Construct curl commands from the API reference files listed in Resources
- Always read `shared/authentication.md` for API key and headers
- Follow `shared/common-patterns.md` for polling, errors, and interaction patterns
- URL must be a valid HTTP(S) URL (a validation sketch follows this list)
- Always read config following `shared/config-pattern.md` before any interaction
- Never save files to `~/Downloads/` or `.listenhub/` — save to the current working directory
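
A minimal sketch of the HTTP(S) check named above; platform-specific normalization still follows `references/supported-platforms.md`:

```bash
# Reject anything that is not an HTTP(S) URL before any API interaction
case "$URL" in
  http://*|https://*) ;;  # valid scheme, continue
  *) echo "Error: not a valid HTTP(S) URL: $URL" >&2; exit 1 ;;
esac
```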

<HARD-GATE>
Use the AskUserQuestion tool for every multiple-choice step — do NOT print options as plain text. Ask one question at a time. Wait for the user's answer before proceeding to the next step. After collecting URL and options, confirm with the user before calling the extraction API.
</HARD-GATE>

## Step -1: API Key Check

Follow `shared/config-pattern.md` § API Key Check. If the key is missing, stop immediately.
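
A minimal sketch of that check, using the `LISTENHUB_API_KEY` variable that the curl commands below rely on (the authoritative procedure is in `shared/config-pattern.md`):

```bash
# Stop immediately if the API key is missing
if [ -z "${LISTENHUB_API_KEY:-}" ]; then
  echo "LISTENHUB_API_KEY is not set. Configure it before running this skill." >&2
  exit 1
fi
```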

## Step 0: Config Setup

Follow `shared/config-pattern.md` Step 0.

**If file doesn't exist** — ask location, then create immediately:
```bash
mkdir -p ".listenhub/content-parser"
echo '{"autoDownload":true}' > ".listenhub/content-parser/config.json"
CONFIG_PATH=".listenhub/content-parser/config.json"
# (or $HOME/.listenhub/content-parser/config.json for global)
```
Then run **Setup Flow** below.

**If file exists** — read config, display summary, and confirm:
```
Current config (content-parser):
  Auto-download: {yes / no}
```
Ask: "Use the saved configuration?" → **Confirm and continue** / **Reconfigure**

### Setup Flow (first run or reconfigure)

1. **autoDownload**: "Automatically save extracted content to the current working directory?"
   - "Yes (recommended)" → `autoDownload: true`
   - "No" → `autoDownload: false`

Save immediately:
```bash
NEW_CONFIG=$(echo "$CONFIG" | jq --argjson dl {true/false} '. + {"autoDownload": $dl}')
echo "$NEW_CONFIG" > "$CONFIG_PATH"
CONFIG=$(cat "$CONFIG_PATH")
```

## Interaction Flow

### Step 1: URL Input

Free text input. Ask the user:

> What URL would you like to extract content from?

### Step 2: Options (optional)

Ask if the user wants to configure extraction options:

```
Question: "Do you want to configure extraction options?"
Options:
  - "No, use defaults" — Extract with default settings
  - "Yes, configure options" — Set summarize, maxLength, or Twitter tweet count
```

If "Yes", ask follow-up questions:
- **Summarize**: "Generate a summary of the content?" (Yes/No)
- **Max Length**: "Set maximum content length?" (Free text, e.g., "5000")
- **Twitter count** (only if URL is Twitter/X profile): "How many tweets to fetch?" (1-100, default 20)

### Step 3: Confirm & Extract

Summarize:

```
Ready to extract content:

  URL: {url}
  Options: {summarize: true, maxLength: 5000, twitter.count: 50} / default

  Proceed?
```

Wait for explicit confirmation before calling the API.

## Workflow

1. **Validate URL**: Must be HTTP(S). Normalize if needed (see `references/supported-platforms.md`)
2. **Build request body**:
   ```json
   {
     "source": {
       "type": "url",
       "uri": "{url}"
     },
     "options": {
       "summarize": true/false,
       "maxLength": 5000,
       "twitter": {
         "count": 50
       }
     }
   }
   ```
   Omit `options` if user chose defaults.
3. **Submit (foreground)**: `POST /v1/content/extract` → extract `taskId` (see the sketch after this workflow's step list)
4. Tell the user extraction is in progress
5. **Poll (background)**: Run the following **exact** bash command with `run_in_background: true` and `timeout: 300000`. Note: status field is `.data.status` (not `processStatus`), interval is 5s, values are `processing`/`completed`/`failed`:

   ```bash
   TASK_ID="<id-from-step-3>"
   for i in $(seq 1 60); do
     RESULT=$(curl -sS "https://api.marswave.ai/openapi/v1/content/extract/$TASK_ID" \
       -H "Authorization: Bearer $LISTENHUB_API_KEY" 2>/dev/null)
     STATUS=$(echo "$RESULT" | tr -d '\000-\037\177' | jq -r '.data.status // "processing"')
     case "$STATUS" in
       completed) echo "$RESULT"; exit 0 ;;
       failed) echo "FAILED: $RESULT" >&2; exit 1 ;;
       *) sleep 5 ;;
     esac
   done
   echo "TIMEOUT" >&2; exit 2
   ```
6. When notified, **download and present result**:

   If `autoDownload` is `true`:
   - Write `{taskId}-extracted.md` to the **current directory** — full extracted content in markdown
   - Write `{taskId}-extracted.json` to the **current directory** — full raw API response data

   ```bash
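   # CONTENT_MD is the extracted markdown body and RESULT the raw poll response from step 5;
   # the exact field carrying the markdown body is defined in shared/api-content-extract.md.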
   echo "$CONTENT_MD" > "${TASK_ID}-extracted.md"
   echo "$RESULT" > "${TASK_ID}-extracted.json"
   ```

   Present:
   ```
   Content extraction complete!

   Source: {url}
   Title: {metadata.title}
   Length: ~{character count} characters
   Credits used: {credits}

   Saved to the current directory:
     {taskId}-extracted.md
     {taskId}-extracted.json
   ```

7. Show a preview of the extracted content (first ~500 chars)
8. Offer to use content in another skill (e.g. `/podcast`, `/tts`)

**Estimated time**: 10-30 seconds depending on content size and platform.
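
A minimal sketch of the submit and preview steps (3 and 7), assuming the task id is returned at `.data.taskId`; the authoritative request and response shapes live in `shared/api-content-extract.md`:

```bash
# Step 3: submit the extraction request and capture the task id
# (.data.taskId is an assumed response path; confirm it in shared/api-content-extract.md)
RESPONSE=$(curl -sS -X POST "https://api.marswave.ai/openapi/v1/content/extract" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"source\": {\"type\": \"url\", \"uri\": \"$URL\"}}")
TASK_ID=$(echo "$RESPONSE" | jq -r '.data.taskId')

# Step 7: once the background poll reports completion, preview ~500 characters
head -c 500 "${TASK_ID}-extracted.md"
```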

## API Reference

- Content extract: `shared/api-content-extract.md`
- Supported platforms: `references/supported-platforms.md`
- Polling: `shared/common-patterns.md` § Async Polling
- Error handling: `shared/common-patterns.md` § Error Handling
- Config pattern: `shared/config-pattern.md`

## Example

**User**: "Parse this article: https://en.wikipedia.org/wiki/Topology"

**Agent workflow**:
1. URL: `https://en.wikipedia.org/wiki/Topology`
2. Options: defaults (omit options)
3. Submit extraction

```bash
curl -sS -X POST "https://api.marswave.ai/openapi/v1/content/extract" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "url",
      "uri": "https://en.wikipedia.org/wiki/Topology"
    }
  }'
```

4. Poll until complete:

```bash
curl -sS "https://api.marswave.ai/openapi/v1/content/extract/69a7dac700cf95938f86d9bb" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY"
```

5. Present extracted content preview and offer next actions.

---

**User**: "Extract recent tweets from @elonmusk, get 50 tweets"

**Agent workflow**:
1. URL: `https://x.com/elonmusk`
2. Options: `{"twitter": {"count": 50}}`
3. Submit extraction

```bash
curl -sS -X POST "https://api.marswave.ai/openapi/v1/content/extract" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "url",
      "uri": "https://x.com/elonmusk"
    },
    "options": {
      "twitter": {
        "count": 50
      }
    }
  }'
```

4. Poll until complete, present results.

Related Skills

All of the skills below are from openclaw/skills, in the Data & Research category.

  • tavily-search: Use Tavily API for real-time web search and content extraction. Use when: user needs real-time web search results, research, or current information from the web. Requires Tavily API key.
  • baidu-search: Search the web using Baidu AI Search Engine (BDSE). Use for live information, documentation, or research topics.
  • notebooklm: An OpenClaw skill for the unofficial Google NotebookLM Python API. Supports content generation (podcasts, videos, slides, quizzes, mind maps, and more), document management, and research automation. Triggers when the user wants NotebookLM to generate audio overviews, videos, or study materials, or to manage a knowledge base.
  • openclaw-search: Intelligent search for agents. Multi-source retrieval with confidence scoring: web, academic, and Tavily in one unified API.
  • aisa-tavily: AI-optimized web search via AIsa's Tavily API proxy. Returns concise, relevant results for AI agents through AIsa's unified API gateway.
  • Market Sizing — TAM/SAM/SOM Calculator: Build defensible market sizing for any product, pitch deck, or business case. Top-down and bottom-up methodologies combined.
  • Data Analyst — AfrexAI ⚡📊: Transform raw data into decisions. Not just charts — answers.
  • Competitor Monitor: Tracks and analyzes competitor moves — pricing changes, feature launches, hiring, and positioning shifts.
  • afrexai-competitive-intel: Complete competitive intelligence system — market mapping, product teardowns, pricing intel, win/loss analysis, battlecards, and strategic monitoring. Goes far beyond SEO to cover the full business landscape.
  • trending-news-aggregator: Intelligent trending-news aggregator. Automatically scrapes trending news from multiple platforms, analyzes trends with AI, and supports scheduled pushes and popularity scoring. Core features: daily aggregation of trending topics across platforms (Weibo, Zhihu, Baidu, and others); smart categorization (tech, finance, society, international, and more); a popularity-scoring algorithm; incremental detection that flags newly trending items; AI trend analysis.
  • search-cluster: Aggregated search across Google CSE, GNews RSS, Wikipedia, Reddit, and Scrapling.
  • data-analysis-partner: An intelligent data-analysis skill. Give it a CSV/Excel file and an analysis request; it outputs a self-contained HTML report with interactive ECharts charts.