web-reader-pro

Advanced web content extraction skill for OpenClaw that uses a multi-tier fallback strategy (Jina → Scrapling → WebFetch) with intelligent routing, caching, quality scoring, and domain learning. Use when: reading article content, extracting web page text, scraping dynamic JS-heavy pages, or fetching WeChat official account articles.

3,891 stars
Complexity: medium

About this skill

Web Reader Pro is an OpenClaw skill that extracts content from web pages through a three-tier fallback chain. It tries Jina Reader first for speed, falls back to Scrapling with Playwright for dynamic, JavaScript-heavy sites, and finally uses a basic WebFetch for simple pages, so most page types are covered by at least one tier.

The skill also monitors Jina API quota to prevent overages, caches results to avoid repeated fetches, and scores extraction quality so that poor results escalate to the next tier. It learns which tier works best for each domain over time, and it retries with exponential backoff on transient network failures and rate limits.

This makes it useful for AI agents that need to read web content reliably, especially from complex or frequently changing sources, without handling the retrieval details themselves.

Best use case

The primary use case is enabling AI agents to reliably extract clean, structured content from diverse web pages, including articles, blogs, and dynamic sites, for analysis, summarization, or integration into other tasks. Developers and users building AI applications that require robust web data input will benefit most from its advanced features and resilience.


The skill returns clean extracted text (for example, the article body or the main page text) from the specified URL, even for complex web pages.

Practical example

Example input

Read the main article content from the following URL: `https://www.example.com/a-long-article-about-ai-research`

Example output

The field of Artificial Intelligence has seen tremendous growth in recent years, with advancements across various domains such as natural language processing, computer vision, and machine learning. Large language models like GPT-4 and Claude 3 are revolutionizing how we interact with technology, offering capabilities that were once confined to science fiction...

When to use this skill

  • Reading article content from various websites.
  • Extracting clean text from general web pages.
  • Scraping content from dynamic, JavaScript-heavy sites.
  • Fetching articles from specific platforms like WeChat Official Accounts.

When not to use this skill

  • When reading local files or accessing internal, non-web resources.
  • When interacting with highly structured APIs that provide JSON/XML directly.
  • For bulk, large-scale data harvesting that requires dedicated, high-throughput scraping infrastructure beyond single-page extraction.
  • When only needing to retrieve basic page metadata (e.g., URL, title) without full content.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/web-reader-pro/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/0xcjl/web-reader-pro/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/web-reader-pro/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How web-reader-pro Compares

| Feature / Agent | web-reader-pro | Standard Approach |
|-----------------|----------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | medium | N/A |

SKILL.md Source

# Web Reader Pro - OpenClaw Skill

## Overview

Web Reader Pro is an advanced web content extraction skill for OpenClaw that uses a multi-tier fallback strategy with intelligent routing, caching, and quality assessment.

## Features

### 1. Three-Tier Fallback Strategy
- **Tier 1: Jina Reader API** - Fast, reliable, best for most websites
- **Tier 2: Scrapling + Playwright** - Dynamic content rendering for JS-heavy sites
- **Tier 3: WebFetch Fallback** - Basic extraction for simple pages
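
The tier order above can be pictured as a simple chain that tries each extractor in turn. The following is a minimal sketch, not the skill's actual implementation: `fetch_with_fallback` and the three stub extractors are hypothetical names that stand in for the real Jina, Scrapling, and WebFetch calls.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical extractor signature: takes a URL, returns extracted text,
# or None / an empty string when nothing useful was found.
Extractor = Callable[[str], Optional[str]]


def fetch_with_fallback(url: str, tiers: List[Tuple[str, Extractor]]) -> Dict[str, object]:
    """Try each tier in order and return the first non-empty result."""
    errors: Dict[str, str] = {}
    for name, extract in tiers:
        try:
            content = extract(url)
        except Exception as exc:  # a failing tier should not stop the chain
            errors[name] = str(exc)
            continue
        if content:
            return {"tier_used": name, "content": content, "errors": errors}
        errors[name] = "empty result"
    raise RuntimeError(f"all tiers failed for {url}: {errors}")


# Stub extractors standing in for the real tier implementations.
def jina_stub(url: str) -> Optional[str]:
    raise TimeoutError("simulated Jina timeout")


def scrapling_stub(url: str) -> Optional[str]:
    return "rendered page text"


def webfetch_stub(url: str) -> Optional[str]:
    return "raw page text"


result = fetch_with_fallback(
    "https://example.com",
    [("jina", jina_stub), ("scrapling", scrapling_stub), ("webfetch", webfetch_stub)],
)
print(result["tier_used"])  # scrapling
```

Recording the per-tier errors alongside the result keeps the reason for each escalation visible to the caller.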

### 2. Jina Quota Monitoring
- Tracks API call count with persistent counter
- Warning alerts when approaching quota limits
- Automatic fallback to lower-tier methods when quota exhausted
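
A persistent counter of this kind could be kept in a small JSON file. The sketch below is illustrative only: the `QuotaCounter` class, the file name, and the 80% warning ratio are assumptions, while the default limit of 100000 matches the `WEB_READER_JINA_QUOTA` default listed under Environment Variables.

```python
import json
import tempfile
from pathlib import Path


class QuotaCounter:
    """Persistent call counter with a warning threshold (hypothetical helper)."""

    def __init__(self, path, limit=100000, warn_ratio=0.8):
        self.path = Path(path)
        self.limit = limit
        self.warn_ratio = warn_ratio  # assumed: warn at 80% of the quota
        self.count = self._load()

    def _load(self):
        try:
            return int(json.loads(self.path.read_text())["count"])
        except (OSError, KeyError, TypeError, ValueError):
            return 0  # missing or corrupt file: start from zero

    def record_call(self):
        self.count += 1
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps({"count": self.count}))

    def status(self):
        return {
            "count": self.count,
            "limit": self.limit,
            "percentage": round(100 * self.count / self.limit, 2),
            "warning": self.count >= self.warn_ratio * self.limit,
            "exhausted": self.count >= self.limit,
        }


with tempfile.TemporaryDirectory() as tmp:
    counter = QuotaCounter(Path(tmp) / "jina_quota.json", limit=10)
    for _ in range(8):
        counter.record_call()
    # A fresh instance reads the persisted count back from disk.
    reloaded = QuotaCounter(Path(tmp) / "jina_quota.json", limit=10)
    status_snapshot = reloaded.status()

print(status_snapshot)
```

A caller would skip Tier 1 once `exhausted` is true and surface a warning once `warning` is true.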

### 3. Smart Cache Layer
- Short-term caching (configurable TTL, default 1 hour)
- Cache key based on URL hash
- Reduces redundant API calls
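
One way to realize a URL-hash cache with a TTL is a directory of JSON files named by the hash. This is a sketch under stated assumptions: `cache_key`, `FileCache`, the choice of SHA-256, and the on-disk layout are hypothetical, and only the 3600-second default TTL comes from the description above.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def cache_key(url):
    """Stable, file-safe key derived from the URL (SHA-256 is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


class FileCache:
    def __init__(self, cache_dir, ttl=3600):
        self.cache_dir = Path(cache_dir)
        self.ttl = ttl
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        return self.cache_dir / f"{cache_key(url)}.json"

    def set(self, url, value, now=None):
        stored_at = time.time() if now is None else now
        self._path(url).write_text(json.dumps({"stored_at": stored_at, "value": value}))

    def get(self, url, now=None):
        path = self._path(url)
        if not path.exists():
            return None
        entry = json.loads(path.read_text())
        current = time.time() if now is None else now
        if current - entry["stored_at"] > self.ttl:
            path.unlink()  # expired: drop the stale entry
            return None
        return entry["value"]


with tempfile.TemporaryDirectory() as tmp:
    cache = FileCache(tmp, ttl=3600)
    cache.set("https://example.com", {"title": "Example"}, now=0)
    fresh = cache.get("https://example.com", now=1800)    # within the TTL
    expired = cache.get("https://example.com", now=7200)  # past the TTL

print(fresh, expired)  # {'title': 'Example'} None
```

The optional `now` argument exists only so that expiry can be demonstrated without waiting an hour.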

### 4. Extraction Quality Scoring
- Scores based on: word count, title detection, content density
- Minimum quality threshold (default: 200 words + valid title)
- Auto-escalation to next tier if quality below threshold
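
The scoring inputs listed above could be combined in many ways; the source does not give the formula. The sketch below assumes a 60/20/20 weighting, a 500-word saturation point, and a simple alphanumeric-share measure of density, all of which are illustrative choices. Only the threshold (200 words plus a non-empty title) follows the stated default.

```python
def quality_score(title, content):
    """Score an extraction from 0 to 100 using assumed weights."""
    word_count = len(content.split())
    # Word count: up to 60 points, saturating at 500 words (assumed).
    word_points = min(word_count / 500, 1.0) * 60
    # Title detection: 20 points for any non-blank title (assumed).
    title_points = 20 if title and title.strip() else 0
    # Content density: share of non-whitespace characters that are alphanumeric.
    non_ws = [c for c in content if not c.isspace()]
    density = sum(c.isalnum() for c in non_ws) / len(non_ws) if non_ws else 0.0
    density_points = density * 20
    return round(word_points + title_points + density_points)


def meets_threshold(title, content, min_words=200):
    """Default threshold from the feature list: 200 words plus a valid title."""
    return len(content.split()) >= min_words and bool(title and title.strip())


sample = "word " * 250
print(quality_score("Example", sample), meets_threshold("Example", sample))  # 70 True
```

A result that fails `meets_threshold` would be handed to the next tier instead of being returned.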

### 5. Domain-Level Routing Learning
- Learns optimal extraction tier per domain
- Persists learned routes in local JSON database
- Adapts based on historical success rates
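
Per-domain learning can be sketched as success and failure counts per tier, stored in a JSON file. The `DomainRoutes` class, its file layout, and the `min_attempts` cutoff are assumptions made for illustration; the actual `routes.json` format is not documented here.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import urlparse


class DomainRoutes:
    """Tracks per-domain tier outcomes (hypothetical helper)."""

    def __init__(self, path):
        self.path = Path(path)
        try:
            # Layout: {domain: {tier: {"ok": int, "fail": int}}}
            self.stats = json.loads(self.path.read_text())
        except (OSError, ValueError):
            self.stats = {}

    def record(self, url, tier, success):
        domain = urlparse(url).netloc.lower()
        entry = self.stats.setdefault(domain, {}).setdefault(tier, {"ok": 0, "fail": 0})
        entry["ok" if success else "fail"] += 1
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.stats, indent=2))

    def preferred_tier(self, url, min_attempts=3):
        """Return the tier with the best success rate, or None if unknown."""
        domain = urlparse(url).netloc.lower()
        best, best_rate = None, 0.0
        for tier, s in self.stats.get(domain, {}).items():
            attempts = s["ok"] + s["fail"]
            if attempts < min_attempts:
                continue  # too little history to trust
            rate = s["ok"] / attempts
            if rate > best_rate:
                best, best_rate = tier, rate
        return best


with tempfile.TemporaryDirectory() as tmp:
    routes = DomainRoutes(Path(tmp) / "routes.json")
    for _ in range(3):
        routes.record("https://spa.example.com/page", "jina", success=False)
        routes.record("https://spa.example.com/page", "scrapling", success=True)
    reloaded = DomainRoutes(Path(tmp) / "routes.json")
    learned = reloaded.preferred_tier("https://spa.example.com/other")
    unknown = reloaded.preferred_tier("https://never-seen.example.org/")

print(learned, unknown)  # scrapling None
```

Returning `None` for an unseen domain lets the caller fall back to the default tier order.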

### 6. Retry with Exponential Backoff
- Configurable max retries per tier (default: 3)
- Exponential backoff: 1s, 2s, 4s, 8s...
- Respects rate limits and transient failures
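
The backoff schedule above doubles the delay after each failed attempt. A minimal sketch follows, assuming a hypothetical `retry_with_backoff` helper; the injectable `sleep` parameter is only there so that the delays can be observed without actually waiting.

```python
import time


def retry_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func, retrying on any exception with delays of 1s, 2s, 4s, ..."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)
            attempt += 1


delays = []
calls = {"n": 0}


def flaky():
    """Fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


outcome = retry_with_backoff(flaky, sleep=delays.append)
print(outcome, delays)  # ok [1.0, 2.0]
```

With the default of three retries, a call that keeps failing waits 1s, 2s, and 4s before the error is re-raised, at which point the next tier would take over.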

## Installation

```bash
# Install dependencies
pip install -r requirements.txt

# Install Scrapling (requires Node.js)
./scripts/install_scrapling.sh

# Or install Scrapling manually
npm install -g @scrapinghub/scrapling
```

## Usage

### Basic Usage

```python
from scripts.web_reader_pro import WebReaderPro

reader = WebReaderPro()
result = reader.fetch("https://example.com")
print(result['title'])
print(result['content'])
```

### Advanced Configuration

```python
from scripts.web_reader_pro import WebReaderPro

reader = WebReaderPro(
    jina_api_key="your-jina-key",      # Optional: set via env JINA_API_KEY
    cache_ttl=3600,                      # Cache TTL in seconds (default: 3600)
    quality_threshold=200,               # Min word count for quality (default: 200)
    max_retries=3,                       # Max retries per tier (default: 3)
    enable_learning=True,                # Enable domain learning (default: True)
    scrapling_path="/usr/local/bin/scrapling"  # Path to scrapling binary
)
```

## Result Format

```python
{
    "title": "Page Title",
    "content": "Extracted content in markdown...",
    "url": "https://example.com",
    "tier_used": "jina|scrapling|webfetch",
    "quality_score": 85,
    "cached": False,
    "domain_learned_tier": "jina",
    "extracted_at": "2024-01-01T00:00:00Z"
}
```
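
For callers that want type hints, the result shape above can be described with a `TypedDict`. This is a sketch: the `FetchResult` and `describe` names are hypothetical, and treating `domain_learned_tier` as optional is an assumption for domains with no learned route.

```python
from datetime import datetime, timezone
from typing import Optional, TypedDict


class FetchResult(TypedDict):
    title: str
    content: str
    url: str
    tier_used: str                       # "jina", "scrapling", or "webfetch"
    quality_score: int
    cached: bool
    domain_learned_tier: Optional[str]   # assumed to be None when nothing is learned
    extracted_at: str                    # UTC timestamp, e.g. "2024-01-01T00:00:00Z"


def describe(result: FetchResult) -> str:
    """One-line summary of where a result came from."""
    source = "cache" if result["cached"] else result["tier_used"]
    return f'{result["title"]} (score {result["quality_score"]}, via {source})'


def parse_extracted_at(result: FetchResult) -> datetime:
    """Parse the timestamp format shown in the result example."""
    parsed = datetime.strptime(result["extracted_at"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


example: FetchResult = {
    "title": "Page Title",
    "content": "Extracted content in markdown...",
    "url": "https://example.com",
    "tier_used": "jina",
    "quality_score": 85,
    "cached": False,
    "domain_learned_tier": "jina",
    "extracted_at": "2024-01-01T00:00:00Z",
}
print(describe(example))  # Page Title (score 85, via jina)
```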

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `JINA_API_KEY` | Jina Reader API key (required for Tier 1) | None |
| `WEB_READER_CACHE_DIR` | Cache directory path | `~/.openclaw/cache/web-reader-pro/` |
| `WEB_READER_LEARNING_DB` | Learning database path | `~/.openclaw/data/web-reader-pro/routes.json` |
| `WEB_READER_JINA_QUOTA` | Jina quota limit | `100000` |

## API Reference

### WebReaderPro.fetch(url, force_refresh=False)

Fetch and extract content from a URL.

**Parameters:**
- `url` (str): Target URL
- `force_refresh` (bool): Bypass cache if True

**Returns:** Dict with title, content, metadata

### WebReaderPro.fetch_with_tier(url, preferred_tier)

Fetch using a specific tier (bypassing automatic selection).

**Parameters:**
- `url` (str): Target URL
- `preferred_tier` (str): "jina", "scrapling", or "webfetch"

### WebReaderPro.get_jina_status()

Get current Jina API quota usage.

**Returns:** Dict with count, limit, percentage, warnings

### WebReaderPro.clear_cache(url=None)

Clear cache for specific URL or all URLs.

**Parameters:**
- `url` (str, optional): Specific URL to clear, or None for all

### WebReaderPro.get_domain_routes()

Get learned domain-to-tier mappings.

**Returns:** Dict of domain -> preferred tier

## Tier Comparison

| Tier | Speed | JS Rendering | Best For | Cost |
|------|-------|--------------|----------|------|
| Jina | Fast | No | Static pages, articles | API calls |
| Scrapling | Medium | Yes | SPAs, dynamic content | CPU |
| WebFetch | Fastest | No | Simple pages, fallbacks | Free |

## License

MIT
