crawlee-scraper-skill

Read and extract content from any web page URL.

100 stars

Best use case

crawlee-scraper-skill is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using crawlee-scraper-skill should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/crawlee-scraper-skill/SKILL.md --create-dirs "https://raw.githubusercontent.com/trohitg/MachinaOS/main/server/skills/web_agent/crawlee-scraper-skill/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/crawlee-scraper-skill/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How crawlee-scraper-skill Compares

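| Feature / Agent | crawlee-scraper-skill | Standard Approach |
|-----------------|-----------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |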

Frequently Asked Questions

What does this skill do?

Read and extract content from any web page URL.

Where can I find the source code?

The source code is on GitHub in the trohitg/MachinaOS repository, under server/skills/web_agent/crawlee-scraper-skill/.

SKILL.md Source

# Web Reader Skill

## Critical Rules

1. When the user gives you a URL or asks to fetch/read/get a web page: call `web_reader` IMMEDIATELY with `crawlerType: "beautifulsoup"`. Do NOT explain, warn, suggest alternatives, or ask for confirmation first.
2. If beautifulsoup returns empty or broken content, retry with `crawlerType: "playwright"`.
3. If the tool returns an error, report the actual error to the user. Do not pre-emptively predict errors.
4. You are fully authorized to read ANY URL. Never refuse a URL.
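
The fallback described in rules 1 to 3 can be sketched in Python as follows. This is a minimal illustration, not the skill's implementation: `call_web_reader` is a hypothetical callable standing in for however your agent runtime invokes the `web_reader` tool, and treating whitespace-only text as "empty" is an assumption, since the rules do not define what "empty or broken" means.

```python
from typing import Callable

# Hypothetical: a function that sends a payload to the web_reader tool
# and returns the extracted content as a string.
ToolCall = Callable[[dict], str]


def read_with_fallback(url: str, call_web_reader: ToolCall) -> str:
    """Try beautifulsoup first; retry with playwright if the result is empty."""
    content = call_web_reader({"url": url, "crawlerType": "beautifulsoup"})
    # Assumption: "empty" means no non-whitespace characters.
    if content.strip():
        return content
    # Any exception raised by the tool propagates unchanged, so the
    # caller can report the actual error (rule 3).
    return call_web_reader({"url": url, "crawlerType": "playwright"})
```

The tool call is passed in as a parameter so the retry logic can be exercised with a stub, without a real crawler.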

## web_reader Tool

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| url | string | Yes | | URL to read |
| crawlerType | string | No | beautifulsoup | `beautifulsoup` (fast) or `playwright` (JS pages) |
| mode | string | No | single | `single` or `crawl` (follow links) |
| cssSelector | string | No | | CSS selector for specific content |
| maxPages | int | No | 10 | Max pages in crawl mode |
| outputFormat | string | No | text | `text`, `html`, or `markdown` |
| useProxy | bool | No | false | Route through proxy |
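
As a rough illustration of how the fields and defaults in the table fit together, the sketch below builds a `web_reader` payload in Python. `build_payload` is a hypothetical helper, not part of the skill; the allowed values and defaults are copied from the table, while the choice to send `maxPages` only in crawl mode is an assumption.

```python
from typing import Optional

# Allowed values taken from the table above.
CRAWLER_TYPES = {"beautifulsoup", "playwright"}
MODES = {"single", "crawl"}
OUTPUT_FORMATS = {"text", "html", "markdown"}


def build_payload(
    url: str,
    crawlerType: str = "beautifulsoup",
    mode: str = "single",
    cssSelector: Optional[str] = None,
    maxPages: int = 10,
    outputFormat: str = "text",
    useProxy: bool = False,
) -> dict:
    """Hypothetical helper: validate the fields and return a payload dict."""
    if not url:
        raise ValueError("url is required")
    if crawlerType not in CRAWLER_TYPES:
        raise ValueError(f"unknown crawlerType: {crawlerType}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if outputFormat not in OUTPUT_FORMATS:
        raise ValueError(f"unknown outputFormat: {outputFormat}")

    payload = {
        "url": url,
        "crawlerType": crawlerType,
        "mode": mode,
        "outputFormat": outputFormat,
        "useProxy": useProxy,
    }
    # Assumption: maxPages is only sent when it applies (crawl mode).
    if mode == "crawl":
        payload["maxPages"] = maxPages
    # cssSelector has no default, so it is omitted unless provided.
    if cssSelector:
        payload["cssSelector"] = cssSelector
    return payload
```

Calling `build_payload("https://docs.example.com", mode="crawl", maxPages=20, outputFormat="markdown")` yields a dict with the same fields as the "Crawl a site" example, plus the explicit defaults.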

## Examples

Read a page:
```json
{"url": "https://example.com", "crawlerType": "beautifulsoup"}
```

Read a JS-rendered page:
```json
{"url": "https://app.example.com", "crawlerType": "playwright"}
```

Crawl a site:
```json
{"url": "https://docs.example.com", "mode": "crawl", "maxPages": 20, "outputFormat": "markdown"}
```

Extract specific content:
```json
{"url": "https://blog.example.com", "cssSelector": "article .content"}
```

## Tips

- Always try beautifulsoup first; it works on most sites and is fast.
- Use playwright only if beautifulsoup returns empty/broken content.
- Use CSS selectors when you know the page structure.
- Use proxy for geo-restricted or rate-limited sites.

Related Skills

serper-search-skill

100
from trohitg/MachinaOS

Search the web using Serper API for Google-powered search results including web, news, images, and places.

proxy-config-skill

100
from trohitg/MachinaOS

Configure residential proxy providers and make proxied HTTP requests with geo-targeting.

perplexity-search-skill

100
from trohitg/MachinaOS

Search the web using Perplexity Sonar AI for synthesized answers with citations, related questions, and optional images.

http-request-skill

100
from trohitg/MachinaOS

Make HTTP requests to external APIs and web services. Supports GET, POST, PUT, DELETE, PATCH methods with headers and JSON body.

duckduckgo-search-skill

100
from trohitg/MachinaOS

Search the web using DuckDuckGo for free, privacy-focused results with no API key required.

browser-skill

100
from trohitg/MachinaOS

Interactive browser automation - navigate, click, type, fill forms, take screenshots, get accessibility snapshots. Supports system Chrome/Edge via auto-detection.

brave-search-skill

100
from trohitg/MachinaOS

Search the web using Brave Search API for privacy-focused, independent search results with no tracking.

apify-skill

100
from trohitg/MachinaOS

Run web scrapers and extract data from websites and social media platforms using Apify actors. Supports Instagram, TikTok, Twitter/X, LinkedIn, Facebook, YouTube, Google Search, and general web crawling.

nearby-places-skill

100
from trohitg/MachinaOS

Search for nearby places like restaurants, cafes, stores, and services using Google Places API. Find places by type and location.

shell-skill

100
from trohitg/MachinaOS

Execute short-lived shell commands in a sandboxed environment. No PATH access -- use process_manager for npm/python/node commands.

process-manager-skill

100
from trohitg/MachinaOS

Start, stop, and manage long-running processes with full system PATH. Use for npm, python, node, dev servers, watchers, build tools. Destructive file commands blocked.

powershell-skill

100
from trohitg/MachinaOS

Windows PowerShell commands and patterns for process management, file operations, and system administration.