site-content-catalog

Crawl a website's sitemap and blog index to build a complete content inventory. Lists every page with URL, title, publish date, content type, and topic cluster. Groups content by category and topic. Optionally deep-reads top N pages for quality analysis and funnel stage tagging. Use before SEO audits, content gap analysis, or brand voice extraction.

381 stars

Best use case

site-content-catalog is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using site-content-catalog should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/site-content-catalog/SKILL.md --create-dirs "https://raw.githubusercontent.com/gooseworks-ai/goose-skills/main/skills/capabilities/site-content-catalog/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/site-content-catalog/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How site-content-catalog Compares

| Feature | site-content-catalog | Standard Approach |
|---------|----------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Crawl a website's sitemap and blog index to build a complete content inventory. Lists every page with URL, title, publish date, content type, and topic cluster. Groups content by category and topic. Optionally deep-reads top N pages for quality analysis and funnel stage tagging. Use before SEO audits, content gap analysis, or brand voice extraction.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Site Content Catalog

Crawl a website's sitemap and blog to build a complete content inventory — every page cataloged with URL, title, date, content type, and topic cluster. Groups content by category, identifies publishing patterns, and optionally deep-analyzes top pages.

## Quick Start

```bash
# Basic content inventory
python3 skills/site-content-catalog/scripts/catalog_content.py \
  --domain "example.com"

# With deep analysis of top 20 pages
python3 skills/site-content-catalog/scripts/catalog_content.py \
  --domain "example.com" --deep-analyze 20

# Output to specific file
python3 skills/site-content-catalog/scripts/catalog_content.py \
  --domain "example.com" --output clients/acme/research/content-inventory.json
```

## Inputs

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| domain | Yes | — | Domain to catalog (e.g., "example.com") |
| deep-analyze | No | 0 | Number of top pages to deep-read for content analysis |
| output | No | stdout | Path to save JSON output |
| include-non-blog | No | true | Also catalog landing pages, docs, etc. (not just blog) |

## Cost

- **Sitemap/RSS crawling:** Free (direct HTTP requests)
- **Apify sitemap extractor (fallback):** ~$0.50 per site
- **Deep analysis:** Free (WebFetch on individual pages)

## Process

### Phase 1: Discover All Pages

The script attempts multiple methods to find all pages on a site, in order:

#### A) Sitemap.xml
1. Fetch `https://[domain]/sitemap.xml`
2. If it's a sitemap index, recursively fetch all child sitemaps
3. Common alternate locations: `/sitemap_index.xml`, `/sitemap-index.xml`, `/wp-sitemap.xml`
4. Check `robots.txt` for `Sitemap:` directives
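
The sitemap steps above can be sketched with the standard library alone (a minimal illustration, not necessarily how `catalog_content.py` implements it — fetching is omitted; these helpers only parse the responses):

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Return ('index', child_sitemap_urls) for a sitemap index,
    or ('urlset', page_urls) for a regular sitemap."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == NS + "sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(NS + "loc") if el.text]
    return kind, locs

def sitemaps_from_robots(robots_txt):
    """Collect Sitemap: directives from robots.txt (the key is case-insensitive)."""
    return [line.split(":", 1)[1].strip()
            for line in robots_txt.splitlines()
            if line.lower().startswith("sitemap:")]
```

When `parse_sitemap` returns `"index"`, the caller fetches each child URL and parses it the same way, recursing until only `urlset` documents remain.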

#### B) RSS/Atom Feeds
1. Check `/feed`, `/rss`, `/atom.xml`, `/blog/feed`, etc.
2. Extract posts with titles, dates, and URLs
3. RSS typically only surfaces recent content (last 10-50 posts)
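
Extracting posts from an RSS 2.0 feed is a straightforward XML walk. A minimal sketch (Atom feeds, which use a namespace and `<entry>` elements, would need a parallel branch not shown here):

```python
import xml.etree.ElementTree as ET

def parse_rss_items(feed_xml):
    """Extract title/url/date for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [{
        "title": (item.findtext("title") or "").strip(),
        "url": (item.findtext("link") or "").strip(),
        "date": (item.findtext("pubDate") or "").strip(),
    } for item in root.iter("item")]
```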

#### C) Blog Index Crawl
1. Fetch `/blog`, `/resources`, `/insights`, `/news`, `/articles`
2. Extract links from the page
3. Follow pagination if present (`/blog/page/2`, `?page=2`, etc.)
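
Link extraction and pagination probing from steps 2-3 can be sketched with `html.parser` (illustrative only; `pagination_candidates` is a hypothetical helper covering the two URL shapes named above):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute hrefs from <a> tags on a fetched index page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the index page URL
                self.links.append(urljoin(self.base_url, href))

def pagination_candidates(index_url, page):
    """Common next-page URL shapes to probe (path style and query style)."""
    base = index_url.rstrip("/")
    return [f"{base}/page/{page}", f"{base}?page={page}"]
```

The crawler keeps requesting `pagination_candidates(url, n)` for increasing `n` until a page returns no new post links.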

#### D) Site: Search (fallback)
1. WebSearch: `site:[domain]` to estimate total indexed pages
2. WebSearch: `site:[domain]/blog` to find blog content
3. WebSearch: `site:[domain] intitle:` to discover page title patterns

#### E) Apify Sitemap Extractor (fallback for JS-heavy sites)
- Actor: `onescales/sitemap-url-extractor`
- Use when sitemap.xml is missing and the site is JS-rendered

### Phase 2: Classify Each Page

For each discovered URL, classify by:

#### Content Type
Classify based on URL patterns and page titles:

| Type | URL Patterns | Examples |
|------|-------------|----------|
| `blog-post` | `/blog/`, `/posts/`, `/articles/` | How-to guides, opinion pieces |
| `case-study` | `/case-study/`, `/customers/`, `/success-stories/` | Customer stories |
| `comparison` | `/vs/`, `/compare/`, `/alternative/` | X vs Y pages |
| `landing-page` | `/solutions/`, `/use-cases/`, `/for-/` | Product marketing pages |
| `docs` | `/docs/`, `/help/`, `/documentation/`, `/api/` | Technical documentation |
| `changelog` | `/changelog/`, `/releases/`, `/whats-new/` | Product updates |
| `pricing` | `/pricing/` | Pricing page |
| `about` | `/about/`, `/team/`, `/careers/` | Company pages |
| `legal` | `/privacy/`, `/terms/`, `/security/` | Legal/compliance |
| `resource` | `/resources/`, `/guides/`, `/ebooks/`, `/webinars/` | Gated/downloadable content |
| `glossary` | `/glossary/`, `/dictionary/`, `/terms/` | SEO glossary pages |
| `integration` | `/integrations/`, `/apps/`, `/marketplace/` | Integration pages |
| `other` | — | Anything else |
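
A first-match-wins classifier over the URL patterns in the table might look like this (a sketch, not necessarily the script's exact rules; note `/terms/` appears under both `legal` and `glossary` above, so pattern order resolves the tie):

```python
import re

# Ordered: first match wins; legal precedes glossary to claim "/terms/".
TYPE_PATTERNS = [
    ("case-study",   r"/(case-stud(?:y|ies)|customers|success-stories)/"),
    ("comparison",   r"/(vs|compare|alternative)/"),
    ("blog-post",    r"/(blog|posts|articles)/"),
    ("docs",         r"/(docs|help|documentation|api)/"),
    ("changelog",    r"/(changelog|releases|whats-new)/"),
    ("pricing",      r"/pricing/"),
    ("legal",        r"/(privacy|terms|security)/"),
    ("glossary",     r"/(glossary|dictionary)/"),
    ("integration",  r"/(integrations|apps|marketplace)/"),
    ("resource",     r"/(resources|guides|ebooks|webinars)/"),
    ("landing-page", r"/(solutions|use-cases|for-[a-z]+)/"),
    ("about",        r"/(about|team|careers)/"),
]

def classify_url(url):
    # Normalize: lowercase, guarantee a trailing slash so "/pricing" matches
    path = url.lower().rstrip("/") + "/"
    for content_type, pattern in TYPE_PATTERNS:
        if re.search(pattern, path):
            return content_type
    return "other"
```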

#### Topic Cluster
Group by extracting topic signals from URL slugs and titles:
- Extract keywords from URL path segments
- Group similar keywords into clusters (e.g., "aws-cost", "cloud-spending", "finops" → "Cloud Cost Management")
- Use simple keyword co-occurrence for clustering
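
One way to implement the co-occurrence grouping is single-link clustering over shared slug keywords (a sketch under the assumption that sharing any non-stopword keyword links two URLs; naming each cluster, e.g. "Cloud Cost Management", is a separate labeling step not shown):

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

STOPWORDS = {"the", "a", "an", "to", "of", "and", "for", "how", "your", "in", "with"}

def slug_keywords(url):
    """Tokenize the last URL path segment into lowercase keywords."""
    slug = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return [t for t in re.split(r"[-_]+", slug.lower()) if t and t not in STOPWORDS]

def cluster_by_shared_keyword(urls):
    """Group URLs that share at least one slug keyword (union-find)."""
    keyword_to_urls = defaultdict(set)
    for url in urls:
        for kw in slug_keywords(url):
            keyword_to_urls[kw].add(url)

    parent = {u: u for u in urls}

    def find(u):  # path-halving find
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for members in keyword_to_urls.values():
        members = sorted(members)
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    clusters = defaultdict(list)
    for u in urls:
        clusters[find(u)].append(u)
    return list(clusters.values())
```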

### Phase 3: Analyze Publishing Patterns

From the dated content (primarily blog posts):
- **Total content pieces** by type
- **Publishing frequency:** Posts per month over last 12 months
- **Trend:** Increasing, decreasing, or stable output
- **Recency:** Date of most recent publish
- **Author diversity:** Unique authors (if extractable from RSS)
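
The cadence metrics above reduce to simple date arithmetic. A sketch (the half-vs-half trend comparison is one plausible heuristic, not necessarily the script's):

```python
from collections import Counter
from datetime import date

def publishing_cadence(post_dates, today):
    """Average posts/month over the trailing 12 months, plus a rough trend
    (last 6 months vs. the prior 6)."""
    months = Counter((d.year, d.month) for d in post_dates
                     if (today - d).days <= 365)
    total = sum(months.values())
    recent = sum(n for (y, m), n in months.items()
                 if (today.year - y) * 12 + (today.month - m) < 6)
    older = total - recent
    trend = ("increasing" if recent > older
             else "decreasing" if recent < older else "stable")
    return {
        "posts_per_month_avg": round(total / 12, 1),
        "trend": trend,
        "most_recent": max(post_dates).isoformat() if post_dates else None,
    }
```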

### Phase 4: Deep Analysis (Optional)

If `--deep-analyze N` is specified, fetch the top N pages (prioritizing blog posts) and extract:
- **Word count** (approximate)
- **Target keyword** (inferred from title + H1 + URL)
- **Funnel stage:** TOFU (awareness), MOFU (consideration), BOFU (decision)
- **Content depth:** Shallow (<500 words), Medium (500-1500), Deep (1500+)
- **Has images/video:** Boolean
- **Has CTA:** Boolean (detected by common CTA patterns)
- **Internal links count**
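
The depth bucketing and CTA detection are simple heuristics; a sketch (the CTA phrase list is illustrative, assumed rather than taken from the script):

```python
import re

def content_depth(word_count):
    """Bucket per the thresholds above: <500 shallow, 500-1500 medium, 1500+ deep."""
    if word_count < 500:
        return "shallow"
    if word_count < 1500:
        return "medium"
    return "deep"

# Hypothetical common CTA phrases; the real detector may use a longer list.
CTA_PATTERNS = re.compile(
    r"(book a demo|start free trial|get started|contact sales|sign up)", re.I)

def has_cta(page_text):
    return bool(CTA_PATTERNS.search(page_text))
```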

### Phase 5: Output

#### JSON Output (default)
```json
{
  "domain": "example.com",
  "crawl_date": "2026-02-25",
  "total_pages": 347,
  "discovery_methods": ["sitemap.xml", "rss"],
  "pages": [
    {
      "url": "https://example.com/blog/reduce-aws-costs",
      "title": "How to Reduce Your AWS Bill by 40%",
      "date": "2025-11-15",
      "type": "blog-post",
      "topic_cluster": "Cloud Cost Optimization",
      "deep_analysis": {
        "word_count": 2100,
        "target_keyword": "reduce aws costs",
        "funnel_stage": "TOFU",
        "content_depth": "deep",
        "has_images": true,
        "has_cta": true
      }
    }
  ],
  "summary": {
    "by_type": {"blog-post": 89, "landing-page": 23, "case-study": 12, ...},
    "by_topic": {"Cloud Cost Optimization": 34, "FinOps": 18, ...},
    "publishing_cadence": {
      "posts_per_month_avg": 4.2,
      "trend": "increasing",
      "most_recent": "2026-02-20"
    }
  }
}
```

#### Markdown Summary (also generated)
```markdown
# Content Inventory: example.com
**Crawled:** 2026-02-25 | **Total pages:** 347

## Content by Type
| Type | Count | % |
|------|-------|---|
| Blog Posts | 89 | 25.6% |
| Landing Pages | 23 | 6.6% |
| ...

## Content by Topic Cluster
| Topic | Posts | Most Recent |
|-------|-------|-------------|
| Cloud Cost Optimization | 34 | 2026-02-20 |
| ...

## Publishing Cadence
- Average: 4.2 posts/month
- Trend: Increasing (3.1 → 5.4 over last 6 months)
- Most recent: 2026-02-20

## Full Catalog
| # | Date | Type | Topic | Title | URL |
|---|------|------|-------|-------|-----|
| 1 | 2026-02-20 | blog-post | Cloud Cost | How to Reduce... | https://... |
```

## Tips

- **Sitemap.xml is the best source.** Most well-maintained sites have one; a missing sitemap is itself a negative SEO signal.
- **RSS only shows recent content.** If you need the full catalog, sitemap is essential. RSS is supplementary.
- **Deep analysis is optional but valuable.** Use it when feeding into brand-voice-extractor or when you need funnel stage mapping.
- **JS-rendered sites** may need the Apify fallback. Signs: sitemap.xml returns HTML, or blog page returns mostly JavaScript.
- **Combine with seo-domain-analyzer** to overlay traffic data on the content inventory — see which content actually performs.

## Dependencies

- Python 3.8+
- `requests` library (`pip install requests`)
- `APIFY_API_TOKEN` env var (only for Apify fallback mode)

Related Skills

seo-content-engine

381 stars
from gooseworks-ai/goose-skills

Build and run an SEO content engine: audit current state, identify gaps, build keyword architecture, generate content calendar, draft content.

seo-content-audit

Comprehensive SEO footprint analysis that orchestrates site-content-catalog, seo-domain-analyzer, and brand-voice-extractor into a single deep-dive report. Catalogs all content, pulls real SEO metrics, runs competitor analysis, builds topic/keyword and content-type gap matrices, extracts brand voice, and produces a prioritized recommendations report. The complete SEO audit for any company.

kol-content-monitor

Track what key opinion leaders (KOLs) in your space are posting on LinkedIn and Twitter/X. Surfaces trending narratives, high-engagement topics, and early signals of emerging conversations before they peak. Chains linkedin-profile-post-scraper and twitter-scraper. Use when a marketing team wants to ride trends rather than create them from scratch, or when a founder wants to know which topics are resonating with their audience.

icp-website-audit

End-to-end website audit through ICP eyes. Builds synthetic personas (if they don't already exist), runs a structured scorecard review of the client's site, then runs a head-to-head competitive comparison against top competitors. Produces a single consolidated report with persona feedback, competitive positioning, and prioritized recommendations. The complete "how do our buyers actually experience our site vs the competition?" workflow.

content-repurposer

Take a long-form asset (blog post, webinar, podcast, LinkedIn article) and generate 10+ derivative pieces ready to publish: LinkedIn posts, tweets/X threads, email snippets, short-form hooks, and pull-quotes. Pure reasoning skill — no scripts, no scraping. Use when a founder or marketer has created one piece of content and needs to distribute it across multiple channels without writing each variant from scratch.

content-brief-factory

Generate detailed, differentiated content briefs at scale. Each brief includes SERP analysis, competing page breakdown, unique angles from real customer language (reviews, Reddit), internal linking plan, and SERP feature targets. Batch mode produces 10-50 briefs in one run. Crushes generic "keyword density" briefs from tools like Surfer or Clearscope.

competitor-content-tracker

Monitor competitor content across blogs, LinkedIn, and Twitter/X on a recurring basis. Surfaces new posts, trending topics, and content gaps you can own. Chains blog-scraper, linkedin-profile-post-scraper, and twitter-scraper. Use when you want a weekly digest of what competitors are publishing and which topics are generating engagement.

icp-website-review

Evaluate a website, landing page, content, or any online asset through the eyes of pre-built synthetic ICP personas. Loads personas from icp-persona-builder output, then runs them against target URLs. Supports three modes: structured scorecard, freeform focus group, and head-to-head competitive comparison. Reusable — run against the same site after changes, or against new content anytime.

content-asset-creator

Creates beautiful, branded HTML content assets — industry reports, landing pages, comparison sheets, one-pagers — from structured data. Uses Gamma API (preferred), v0.dev Platform API, or a self-hosted HTML template system with Tailwind CSS. Outputs self-contained HTML files that can be hosted as web pages or converted to PDF.

signal-detection-pipeline

Detect buying signals from multiple sources, qualify leads, and generate outreach context

outbound-prospecting-engine

End-to-end outbound prospecting: detect intent signals, research companies, find decision-maker contacts, personalize messaging, launch campaign.

event-prospecting-pipeline

Find attendees at conferences/events, research their companies, qualify against ICP, and launch outreach