web-archive-scraper

Search the Wayback Machine for archived versions of websites. Extract cached pages, customer lists, testimonials, and partner directories from sites that have changed or gone offline. Uses the free CDX API — no API key needed.

380 stars

Best use case

web-archive-scraper is best used when you need a repeatable AI agent workflow rather than a one-off prompt.
Teams using web-archive-scraper should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/web-archive-scraper/SKILL.md --create-dirs "https://raw.githubusercontent.com/gooseworks-ai/goose-skills/main/skills/capabilities/web-archive-scraper/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/web-archive-scraper/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How web-archive-scraper Compares

| Feature | web-archive-scraper | Standard Approach |
|---------|---------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

It searches the Wayback Machine for archived snapshots of a URL via the free CDX API (no API key needed) and can fetch the cached page content — useful for recovering customer lists, testimonials, and partner directories from sites that have changed or gone offline.

Where can I find the source code?

The source code is in the gooseworks-ai/goose-skills repository on GitHub.

SKILL.md Source

# Web Archive Scraper

Search the Wayback Machine (Internet Archive) for archived snapshots of websites. Fetch cached page content to find customer lists, testimonials, partner directories, and other information from sites that have changed or shut down.

## Quick Start

Only dependency is `requests`. No API key needed.

```bash
# Find all snapshots of a URL
python3 skills/web-archive-scraper/scripts/search_archive.py \
  --url "https://botkeeper.com/customers"

# Search with date range
python3 skills/web-archive-scraper/scripts/search_archive.py \
  --url "https://botkeeper.com" --from 2025-01-01 --to 2026-02-01

# Search all pages under a domain (prefix match)
python3 skills/web-archive-scraper/scripts/search_archive.py \
  --url "https://botkeeper.com" --match prefix --limit 50

# Fetch the actual archived page content
python3 skills/web-archive-scraper/scripts/search_archive.py \
  --url "https://botkeeper.com/customers" --fetch

# Output formats
python3 skills/web-archive-scraper/scripts/search_archive.py --url URL --output json
python3 skills/web-archive-scraper/scripts/search_archive.py --url URL --output csv
python3 skills/web-archive-scraper/scripts/search_archive.py --url URL --output summary
```

## How It Works

1. **CDX API search** — Queries `web.archive.org/cdx/search/cdx` for snapshots matching the URL
2. **Filtering** — Filters by date range, HTTP status code, and MIME type
3. **Dedup** — Collapses to one snapshot per day by default to avoid redundant results
4. **Content fetch** — Optionally fetches the raw archived HTML (using `id_` modifier to skip Wayback toolbar)
5. **Text extraction** — Strips HTML tags for readable text output when fetching content
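
Steps 1–4 can be sketched in a few lines of Python. This is a hypothetical simplification, not the script's actual implementation, but the CDX parameters shown (`output=json`, `filter=statuscode:200`, `collapse=timestamp:8`) and the `id_` URL modifier are standard Wayback Machine API features:

```python
import requests

CDX_API = "https://web.archive.org/cdx/search/cdx"

def parse_cdx_rows(rows):
    """CDX JSON output is a list of lists; the first row is the header."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]

def search_snapshots(url, limit=25):
    """Query the CDX API for archived snapshots of a URL."""
    params = {
        "url": url,
        "output": "json",
        "limit": limit,
        "filter": "statuscode:200",       # only successful captures
        "collapse": "timestamp:8",        # dedup to one snapshot per day (YYYYMMDD)
    }
    resp = requests.get(CDX_API, params=params, timeout=30)
    resp.raise_for_status()
    return parse_cdx_rows(resp.json())

def raw_archive_url(timestamp, original_url):
    """The id_ modifier returns the original bytes, skipping the Wayback toolbar."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"
```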

## CLI Reference

| Flag | Default | Description |
|------|---------|-------------|
| `--url` | *required* | Target URL to search in the archive |
| `--match` | exact | Match type: `exact`, `prefix`, `host`, `domain` |
| `--from` | none | Start date (YYYY-MM-DD) |
| `--to` | none | End date (YYYY-MM-DD) |
| `--limit` | 25 | Max number of snapshots to return |
| `--fetch` | false | Fetch and display the content of the most recent snapshot |
| `--fetch-all` | false | Fetch content of ALL matched snapshots (use with small --limit) |
| `--status` | 200 | HTTP status filter (set to "any" to include all) |
| `--output` | json | Output format: `json`, `csv`, `summary` |
| `--collapse` | day | Dedup level: `none`, `day`, `month`, `year` |

## Output Schema

```json
{
  "url": "https://botkeeper.com/customers",
  "timestamp": "20250915143022",
  "datetime": "2025-09-15T14:30:22",
  "status_code": "200",
  "mime_type": "text/html",
  "archive_url": "https://web.archive.org/web/20250915143022/https://botkeeper.com/customers",
  "raw_url": "https://web.archive.org/web/20250915143022id_/https://botkeeper.com/customers",
  "content": "..."
}
```

The `content` field is only populated when `--fetch` or `--fetch-all` is used.
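
The tag stripping in step 5 of "How It Works" might look roughly like the following — a stdlib-only sketch (using `html.parser`), not the script's actual extraction logic:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```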

## Cost

Free. The Wayback Machine CDX API requires no authentication or API key. Rate limit is ~15 requests/minute.
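
If you script many queries back to back, a simple client-side throttle keeps you under that limit. This helper is a hypothetical addition, not part of the skill:

```python
import time

MIN_INTERVAL = 60 / 15  # ~15 requests/minute => at least 4 s between calls

_last_call = 0.0

def throttled(fn, *args, **kwargs):
    """Run fn, sleeping first if the previous call was under MIN_INTERVAL ago."""
    global _last_call
    wait = MIN_INTERVAL - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.monotonic()
    return fn(*args, **kwargs)
```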

## Common Use Cases

- **Find customer lists from shut-down companies** (e.g., botkeeper.com)
- **Recover testimonials/case studies** before a site redesign
- **Track how a competitor's messaging changed over time**
- **Find partner directories** that have been removed
