web-scraping

Web scraping tools for fetching and extracting data from web pages

170 stars

Best use case

web-scraping is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Web scraping tools for fetching and extracting data from web pages

Teams using web-scraping should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/web-scraping/SKILL.md --create-dirs "https://raw.githubusercontent.com/adoresever/AGI_Ananas/main/26.2.21openclaw-viking/.agents/skills/web-scraping/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/web-scraping/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How web-scraping Compares

Feature / Agentweb-scrapingStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Web scraping tools for fetching and extracting data from web pages

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

## Web Scraping

You have web scraping tools for fetching and extracting data from web pages:

**Single page:**
- `scrape_url` — fetch a URL and get cleaned text content + metadata (title, description, link count)
  - Use format="text" (default) for most tasks — strips all HTML
  - Use format="markdown" to preserve headings, links, lists, bold/italic
  - Use format="html" only when you need raw HTML

**Link discovery:**
- `extract_links` — fetch a page and extract all links with text and type (internal/external)
  - Use the `pattern` parameter to filter by regex (e.g. `"\\.pdf$"` for PDF links)
  - Links are deduplicated and resolved to absolute URLs

**Multi-page research:**
- `scrape_multiple` — fetch up to 10 URLs in parallel for comparison/research
  - One failure doesn't block others (uses Promise.allSettled)

**Best practices:**
- Prefer "text" format for content extraction, "markdown" for preserving structure
- Don't scrape the same domain more than 5 times per minute
- Combine with `store_deliverable` to save scraped content as job evidence
- For very large pages, the content is limited to 5MB

Related Skills

opentwitter

170
from adoresever/AGI_Ananas

Twitter/X data via the 6551 API. Supports user profiles, tweet search, user tweets, follower events, deleted tweets, and KOL followers.

opennews

170
from adoresever/AGI_Ananas

Crypto news search, AI ratings, trading signals, and real-time updates via the OpenNews 6551 API. Supports keyword search, coin filtering, source filtering, AI score ranking, and WebSocket live feeds.

agent-reach

170
from adoresever/AGI_Ananas

Give your AI agent eyes to see the entire internet. Read and search across Twitter/X, Reddit, YouTube, GitHub, Bilibili, XiaoHongShu, Instagram, LinkedIn, Boss直聘, RSS, and any web page — all from a single CLI. Use when: (1) reading content from URLs (tweets, Reddit posts, articles, videos), (2) searching across platforms (web, Twitter, Reddit, GitHub, YouTube, Bilibili, XiaoHongShu, Instagram, LinkedIn, Boss直聘), (3) user asks to configure/enable a platform channel, (4) checking channel health or updating Agent Reach. Triggers: "search Twitter/Reddit/YouTube", "read this URL", "find posts about", "搜索", "读取", "查一下", "看看这个链接", "帮我配", "帮我添加", "帮我安装".

searxng-search

170
from adoresever/AGI_Ananas

使用自建SearXNG搜索引擎搜索互联网内容。触发词:搜索、查一下、帮我查、查找、搜一下、帮我搜索。

multi-search-engine

170
from adoresever/AGI_Ananas

Multi search engine integration with 17 engines (8 CN + 9 Global). Supports advanced search operators, time filters, site search, privacy engines, and WolframAlpha knowledge queries. No API keys required.

weather

170
from adoresever/AGI_Ananas

Get current weather and forecasts via wttr.in or Open-Meteo. Use when: user asks about weather, temperature, or forecasts for any location. NOT for: historical weather data, severe weather alerts, or detailed meteorological analysis. No API key needed.

wacli

170
from adoresever/AGI_Ananas

Send WhatsApp messages to other people or search/sync WhatsApp history via the wacli CLI (not for normal user chats).

voice-call

170
from adoresever/AGI_Ananas

Start voice calls via the OpenClaw voice-call plugin.

video-frames

170
from adoresever/AGI_Ananas

Extract frames or short clips from videos using ffmpeg.

trello

170
from adoresever/AGI_Ananas

Manage Trello boards, lists, and cards via the Trello REST API.

tmux

170
from adoresever/AGI_Ananas

Remote-control tmux sessions for interactive CLIs by sending keystrokes and scraping pane output.

things-mac

170
from adoresever/AGI_Ananas

Manage Things 3 via the `things` CLI on macOS (add/update projects+todos via URL scheme; read/search/list from the local Things database). Use when a user asks OpenClaw to add a task to Things, list inbox/today/upcoming, search tasks, or inspect projects/areas/tags.