smart-web-fetch

Fetch web content efficiently by checking llms.txt first, then Cloudflare markdown endpoints, then falling back to HTML. Reduces token usage by 80% on sites that support clean markdown delivery. No external dependencies — installs a single Python script. Trigger words: fetch URL, web content, read website, scrape page, download page, get webpage, read this link.

41 stars

byJKHeadley

View on GitHub Installation ↓

Best use case

smart-web-fetch is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using smart-web-fetch should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/smart-web-fetch/SKILL.md --create-dirs "https://raw.githubusercontent.com/JKHeadley/instar/main/skills/smart-web-fetch/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/smart-web-fetch/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How smart-web-fetch Compares

Feature / Agent	smart-web-fetch	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# smart-web-fetch — Token-Efficient Web Content Fetching

Fetching a webpage with the default WebFetch tool retrieves full HTML — navigation menus, footers, ads, cookie banners, and all. For a documentation page, 90% of the tokens go to chrome, not content. This script fixes that by trying cleaner sources first.

## How It Works

The fetch chain, in order:

1. **Check `llms.txt`** — Many sites publish `/llms.txt` or `/llms-full.txt` with curated content for AI agents. If present, this is the best source: intentionally structured, no noise.
2. **Try Cloudflare markdown** — Cloudflare's network serves clean markdown for millions of sites via a URL prefix trick. If the site is behind Cloudflare, this returns structured markdown at ~20% of the HTML token cost.
3. **Fall back to HTML** — Standard fetch, with HTML stripped to readable text. Reliable but verbose.

The result: typically 60-80% fewer tokens on documentation sites, blog posts, and product pages.

---

## Installation

Copy the script into your project's scripts directory:

```bash
mkdir -p .claude/scripts
```

Then create `.claude/scripts/smart-fetch.py` with the contents below.

---

## The Script

Save this as `.claude/scripts/smart-fetch.py`:

```python
#!/usr/bin/env python3
"""
smart-fetch.py — Token-efficient web content fetching.
Tries llms.txt, then Cloudflare markdown, then plain HTML.
Usage: python3 .claude/scripts/smart-fetch.py <url> [--raw] [--source]
"""
import sys
import urllib.request
import urllib.parse
import urllib.error
import re
import json

def fetch_url(url, timeout=15):
    req = urllib.request.Request(url, headers={
        'User-Agent': 'Mozilla/5.0 (compatible; agent-fetch/1.0)'
    })
    try:
        with urllib.request.urlopen(req, timeout=timeout) as r:
            charset = 'utf-8'
            ct = r.headers.get('Content-Type', '')
            if 'charset=' in ct:
                charset = ct.split('charset=')[-1].strip()
            return r.read().decode(charset, errors='replace'), r.geturl()
    except urllib.error.HTTPError as e:
        return None, str(e)
    except Exception as e:
        return None, str(e)

def html_to_text(html):
    # Remove scripts, styles, nav, footer
    for tag in ['script', 'style', 'nav', 'footer', 'header', 'aside']:
        html = re.sub(rf'<{tag}[^>]*>.*?</{tag}>', '', html, flags=re.DOTALL|re.IGNORECASE)
    # Remove all remaining tags
    text = re.sub(r'<[^>]+>', ' ', html)
    # Decode common entities
    for ent, ch in [('&amp;','&'),('&lt;','<'),('&gt;','>'),('&nbsp;',' '),('&#39;',"'"),('&quot;','"')]:
        text = text.replace(ent, ch)
    # Collapse whitespace
    text = re.sub(r'\n\s*\n\s*\n', '\n\n', text)
    text = re.sub(r'[ \t]+', ' ', text)
    return text.strip()

def get_base(url):
    p = urllib.parse.urlparse(url)
    return f"{p.scheme}://{p.netloc}"

def try_llms_txt(base):
    for path in ['/llms-full.txt', '/llms.txt']:
        content, _ = fetch_url(base + path)
        if content and len(content) > 100 and not content.strip().startswith('<'):
            return content, 'llms.txt'
    return None, None

def try_cloudflare_markdown(url):
    # Cloudflare's markdown delivery: prefix with https://cloudflare.com/markdown/
    # Actually the pattern is: replace scheme+domain with r.jina.ai for Jina,
    # or use the /md/ subdomain pattern for CF Pages.
    # Most reliable open technique: jina.ai reader (no API key needed for basic use)
    jina_url = 'https://r.jina.ai/' + url
    content, final_url = fetch_url(jina_url, timeout=20)
    if content and len(content) > 200 and not content.strip().startswith('<!'):
        return content, 'markdown'
    return None, None

def smart_fetch(url, show_source=False):
    base = get_base(url)
    results = []

    # 1. Try llms.txt
    content, source = try_llms_txt(base)
    if content:
        results.append(('llms.txt', content))

    # 2. Try markdown delivery
    content, source = try_cloudflare_markdown(url)
    if content:
        results.append(('markdown', content))

    # 3. HTML fallback
    if not results:
        html, _ = fetch_url(url)
        if html:
            text = html_to_text(html)
            results.append(('html', text))

    if not results:
        print(f"ERROR: Could not fetch {url}", file=sys.stderr)
        sys.exit(1)

    # Use best result (prefer llms.txt > markdown > html)
    best_source, best_content = results[0]

    if show_source:
        print(f"[source: {best_source}]", file=sys.stderr)

    return best_content

if __name__ == '__main__':
    args = sys.argv[1:]
    if not args or args[0] in ('-h', '--help'):
        print(__doc__)
        sys.exit(0)

    url = args[0]
    show_source = '--source' in args

    content = smart_fetch(url, show_source=show_source)
    print(content)
```

Make it executable:

```bash
chmod +x .claude/scripts/smart-fetch.py
```

---

## Usage

```bash
# Fetch a page (auto-selects best source)
python3 .claude/scripts/smart-fetch.py https://docs.example.com/guide

# Show which source was used (llms.txt / markdown / html)
python3 .claude/scripts/smart-fetch.py https://docs.example.com/guide --source

# Pipe into another tool
python3 .claude/scripts/smart-fetch.py https://example.com | head -100
```

---

## Teaching the Agent to Use It

Add this to your project's `CLAUDE.md`:

```markdown
## Web Fetching

When fetching web content, always use the smart-fetch script first:

```bash
python3 .claude/scripts/smart-fetch.py <url> --source
```

Only use WebFetch as a fallback if smart-fetch fails or if you need
JavaScript-rendered content. The script reduces token usage by 60-80%
on documentation sites and blogs.
```

---

## When Each Source Wins

| Site Type | Likely Source | Why |
|-----------|--------------|-----|
| AI/dev tool docs | llms.txt | Modern tools publish agent-ready content |
| Technical blogs | markdown | Clean article content via markdown delivery |
| Legacy enterprise sites | html | No markdown alternative available |
| SPAs / JS-heavy sites | html (may be sparse) | Server-side content only |

---

## Token Savings by Source

Approximate token counts for a typical 2,000-word documentation page:

- **HTML** (raw): ~8,000 tokens (navigation, scripts, markup included)
- **Markdown delivery**: ~2,000 tokens (clean structured content)
- **llms.txt**: ~1,500 tokens (curated for AI consumption)

On a project that fetches 50 URLs per session, this saves ~300,000 tokens — roughly the difference between fitting in context and not.

---

## Going Further

Smart-fetch saves tokens on every fetch. But you're still triggering each fetch manually — "go check this URL." The real power comes when fetching happens automatically, on a schedule, without you asking.

**With Instar, your agent can monitor the web autonomously.** Set up a cron job that checks competitor pricing every morning. Another that watches API documentation for breaking changes. Another that summarizes your RSS feeds before you wake up. Smart-fetch runs inside each job, keeping token costs low while the agent works through dozens of URLs on its own.

Instar also adds a caching layer — the same URL fetched twice within a configurable window returns the cached version, so recurring jobs don't waste tokens re-reading content that hasn't changed.

And web monitoring is just one use case. With Instar, your agent also gets:
- **A full job scheduler** — any task on cron
- **Background sessions** — parallel workers for deep tasks
- **Telegram integration** — results delivered to your phone
- **Persistent identity and memory** — context that survives across sessions

One command, about 2 minutes:

```bash
npx instar
```

Your agent goes from fetching when you ask to watching the web while you sleep. [instar.sh](https://instar.sh)

Related Skills

systematic-debugging

from JKHeadley/instar

Structured 4-phase debugging methodology that prevents blind probing and guesswork. Forces root cause identification before any fix attempt. Use when encountering bugs, errors, unexpected behavior, test failures, or when something "just stopped working." Trigger words: debug, bug, error, broken, not working, fix this, something's wrong, investigate, root cause, why is this failing, trace the issue.

spec-converge

from JKHeadley/instar

Iteratively review an instar-development spec with multi-angle internal reviewers (security, scalability, adversarial, integration) and cross-model external reviewers (GPT, Gemini, Grok) until convergence, then produce a comprehensive ELI10 convergence report. Output is a spec tagged review-convergence — one of the two tags /instar-dev requires before it will touch instar source. NOT user-invocable; run by the instar-developing agent before any spec-driven /instar-dev work.

knowledge-base

from JKHeadley/instar

Ingest URLs, documents, and transcripts into a searchable knowledge base. Query past research and curated documentation using full-text search. Trigger words: ingest, knowledge base, look up, search knowledge, what do we know about, research, index this, add to knowledge base.

instar-telegram

from JKHeadley/instar

Send and receive messages via Telegram for two-way agent communication. Use when the agent needs to notify the user, alert them about something, relay a response, or when Telegram messaging is the requested channel. Trigger words: send message, Telegram, notify, alert user, message me, ping me, let me know, reach out.

instar-session

from JKHeadley/instar

Spawn, monitor, and communicate with persistent Claude Code sessions running in the background. Use when a task needs to run without blocking the current session, when the user asks to do something in the background, or when a long-running task needs its own context window. Trigger words: background task, spawn session, persistent, run in background, parallel, separate session, async task.

instar-scheduler

from JKHeadley/instar

Schedule recurring agent tasks using cron expressions. Use when the user asks to run something on a schedule, check something periodically, automate a recurring task, set up a cron job, or wants work to happen while they're away. Trigger words: schedule, recurring, cron, every hour, every day, run daily, periodic, automated.

instar-identity

from JKHeadley/instar

Establish and recover persistent agent identity that survives context compaction, session restarts, and autonomous operation. Use when an agent needs to know who it is, recover after context compression, orient at session start, or understand the identity infrastructure. Trigger words: who am I, remember, identity, after restart, compaction, context loss, who am I working with, my principles.

instar-feedback

from JKHeadley/instar

Submit structured feedback about instar bugs, feature requests, improvements, or innovations worth sharing. Use when something isn't working, when a feature is missing, when you've built something that could benefit all agents, or when the user mentions a problem with instar. Also use proactively after building significant features — ask yourself if other agents would benefit. Feedback is relayed agent-to-agent to instar maintainers. Trigger words: bug report, feedback, issue, something's wrong, feature request, this isn't working, improvement, suggest, built something useful, other agents could use this.

instar-dev

from JKHeadley/instar

Instar-specific development skill used by the instar-developing agent (Echo, or any agent assigned instar-dev responsibilities). Wraps /build with mandatory side-effects review, signal-vs-authority principle check, and artifact generation. Structural enforcement via pre-commit/pre-push hooks — the instar repo refuses commits and pushes that didn't come through this skill. NOT a user-facing skill — end users should never invoke it.

credential-leak-detector

from JKHeadley/instar

PostToolUse hook that scans Bash tool output for leaked credentials — API keys, tokens, private keys, and secrets — before they reach the conversation. Blocks critical leaks, redacts high-severity matches, and warns on suspicious patterns. 14 detection patterns covering OpenAI, Anthropic, AWS, GitHub, Stripe, Google, Slack, SendGrid, Twilio, PEM keys, bearer tokens, and generic secrets. No external dependencies. Trigger words: security, credential leak, secret exposure, key detection, token scan, API key leaked, credential guard, secret scanner, prevent credential leak.

command-guard

from JKHeadley/instar

Set up a PreToolUse hook in .claude/settings.json that blocks dangerous commands — rm -rf, force push, database drops, and others — before they execute. Teaches the pattern of safety hooks for any Claude Code project. Trigger words: safety, guard, block dangerous, protect, prevent destructive, safe mode, dangerous commands, risky operations.

agent-memory

from JKHeadley/instar

Teach cross-session memory patterns using MEMORY.md — what to save, how to organize it, how to maintain it over time, and how to structure topic files as memory grows. Works in any Claude Code project with no external dependencies. Trigger words: remember this, save for later, across sessions, persistent memory, don't forget, note this, keep this, write this down.