doc-scraper

Generic web scraper for extracting and organizing Snowflake documentation with intelligent caching and configurable spider depth. Scrapes any section of docs.snowflake.com controlled by --base-path.

31 stars

by sfc-gh-dflippo


Best use case

doc-scraper is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using doc-scraper should expect more consistent output, faster repeated runs, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/doc-scraper/SKILL.md --create-dirs "https://raw.githubusercontent.com/sfc-gh-dflippo/snowflake-dbt-demo/main/.claude/skills/doc-scraper/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/doc-scraper/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How doc-scraper Compares

| Feature / Agent         | doc-scraper   | Standard Approach |
| ----------------------- | ------------- | ----------------- |
| Platform Support        | Not specified | Limited / Varies  |
| Context Awareness       | High          | Baseline          |
| Installation Complexity | Unknown       | N/A               |

Frequently Asked Questions

What does this skill do?

Generic web scraper for extracting and organizing Snowflake documentation with intelligent caching and configurable spider depth. Scrapes any section of docs.snowflake.com controlled by --base-path.

Where can I find the source code?

The source code is on GitHub in the sfc-gh-dflippo/snowflake-dbt-demo repository, under .claude/skills/doc-scraper/.

SKILL.md Source

# Snowflake Documentation Scraper

Scrapes docs.snowflake.com sections to Markdown with SQLite caching (7-day expiration).

## Usage

**First-time setup** (auto-installs uv and doc-scraper):

```bash
python3 .claude/skills/doc-scraper/scripts/doc_scraper.py
```

**Subsequent runs:**

```bash
doc-scraper --output-dir=./snowflake-docs
doc-scraper --output-dir=./snowflake-docs --base-path="/en/sql-reference/"
doc-scraper --output-dir=./snowflake-docs --spider-depth=2
```
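When the scraper is driven from a script rather than typed by hand, the invocations above can be assembled programmatically. The following is a minimal sketch, assuming `doc-scraper` is already on the `PATH` after first-time setup; `build_command` is a hypothetical helper, not part of the tool, and it uses only the flags listed under Command Options.

```python
import shlex
from typing import List, Optional


def build_command(
    output_dir: str,
    base_path: Optional[str] = None,
    spider_depth: Optional[int] = None,
    limit: Optional[int] = None,
    dry_run: bool = False,
) -> List[str]:
    """Assemble a doc-scraper argument list (hypothetical helper)."""
    # --output-dir is the only required option.
    cmd = ["doc-scraper", f"--output-dir={output_dir}"]
    # Optional flags are added only when set, so the tool's own defaults apply otherwise.
    if base_path is not None:
        cmd.append(f"--base-path={base_path}")
    if spider_depth is not None:
        cmd.append(f"--spider-depth={spider_depth}")
    if limit is not None:
        cmd.append(f"--limit={limit}")
    if dry_run:
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "./snowflake-docs", base_path="/en/sql-reference/", limit=5, dry_run=True
    )
    # Print the command for review; to execute it, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Combining `--limit` with `--dry-run`, as in the demo, is a cheap way to check a new `--base-path` before committing to a full scrape.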

## Command Options

| Option           | Default           | Description                                                              |
| ---------------- | ----------------- | ------------------------------------------------------------------------ |
| `--output-dir`   | **Required**      | Output directory for scraped docs                                        |
| `--base-path`    | `/en/migrations/` | URL section to scrape                                                    |
| `--spider-depth` | `1`               | Link depth: 0 = seeds only, 1 = + linked pages, 2 = + second-level links |
| `--limit`        | None              | Maximum number of URLs to scrape (for testing)                           |
| `--dry-run`      | -                 | Preview without writing                                                  |
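The `--spider-depth` values can be read as a depth-limited breadth-first traversal of the link graph. The sketch below illustrates that reading on a small in-memory graph; `crawl_order` and the example URLs are hypothetical, and this is not the scraper's actual implementation.

```python
from collections import deque
from typing import Dict, Iterable, List


def crawl_order(
    links: Dict[str, List[str]], seeds: Iterable[str], spider_depth: int
) -> List[str]:
    """Return pages reachable from the seeds within spider_depth link hops."""
    seen = set()
    order: List[str] = []
    # Each queue entry carries the number of hops taken from a seed page.
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        # Follow outgoing links only while the depth budget allows another hop.
        if depth < spider_depth:
            for target in links.get(url, []):
                if target not in seen:
                    queue.append((target, depth + 1))
    return order


if __name__ == "__main__":
    # Hypothetical link graph, for illustration only.
    graph = {
        "/en/migrations/": ["/en/migrations/a", "/en/migrations/b"],
        "/en/migrations/a": ["/en/migrations/a/details"],
    }
    for depth in (0, 1, 2):
        print(depth, crawl_order(graph, ["/en/migrations/"], depth))
```

Under this reading, each extra level of depth can multiply the number of pages visited, which is why lowering `--spider-depth` is the first remedy when a run collects too many pages.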

## Output

```text
output-dir/
├── SKILL.md              # Auto-generated index
├── scraper_config.yaml   # Editable config (auto-created)
├── .cache/               # SQLite cache (auto-managed)
└── en/migrations/*.md    # Scraped pages with frontmatter
```
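The tree above suggests that each page's URL path is mirrored under the output directory with a `.md` extension. A minimal sketch of that mapping follows, assuming this is how paths are derived; `output_path_for` is a hypothetical helper and the scraper's real naming rules may differ.

```python
from pathlib import Path, PurePosixPath


def output_path_for(output_dir: str, url_path: str) -> Path:
    """Map a docs URL path to a Markdown file under output_dir (assumed layout)."""
    # Drop leading/trailing slashes so "/en/migrations/foo/" becomes "en/migrations/foo".
    relative = PurePosixPath(url_path.strip("/"))
    if not relative.name:
        raise ValueError("url_path must contain at least one path segment")
    # Append ".md" to the last segment rather than replacing any existing suffix.
    relative = relative.with_name(relative.name + ".md")
    return Path(output_dir).joinpath(*relative.parts)


if __name__ == "__main__":
    # The page name "guide" is made up for illustration.
    print(output_path_for("./snowflake-docs", "/en/migrations/guide"))
```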

## Configuration

Auto-created at `{output-dir}/scraper_config.yaml`:

```yaml
rate_limiting:
  max_concurrent_threads: 4
spider:
  max_pages: 1000
  allowed_paths: ["/en/"]
scraped_pages:
  expiration_days: 7
```
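To illustrate what `expiration_days` implies for a SQLite-backed cache, here is a minimal sketch using Python's standard `sqlite3` module. The `pages` table, its columns, and the function names are assumptions made for the example and do not describe the scraper's real cache schema.

```python
import sqlite3
import time
from typing import Optional

SECONDS_PER_DAY = 86_400


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a cache database with a hypothetical pages table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, body TEXT NOT NULL, fetched_at REAL NOT NULL)"
    )
    return conn


def put(conn: sqlite3.Connection, url: str, body: str, now: Optional[float] = None) -> None:
    """Store a page body together with the time it was fetched."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, body, fetched_at) VALUES (?, ?, ?)",
        (url, body, now),
    )
    conn.commit()


def get_fresh(
    conn: sqlite3.Connection,
    url: str,
    expiration_days: int = 7,
    now: Optional[float] = None,
) -> Optional[str]:
    """Return the cached body, or None if it is missing or older than expiration_days."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT body, fetched_at FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return None
    body, fetched_at = row
    if now - fetched_at > expiration_days * SECONDS_PER_DAY:
        return None  # Stale entry: the caller should fetch the page again.
    return body
```

Passing `now` explicitly keeps the expiry logic easy to test without waiting for real time to pass.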

## Troubleshooting

| Issue            | Solution                              |
| ---------------- | ------------------------------------- |
| Too many pages   | Lower `--spider-depth` or edit config |
| Missing pages    | Increase `--spider-depth`             |
| Cache corruption | Delete `{output-dir}/.cache/` (rare)  |
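
The cache reset from the last row can be scripted. This is a minimal sketch, assuming the cache lives entirely in `{output-dir}/.cache/` as shown in the output tree; `clear_cache` is a hypothetical helper, and it leaves scraped Markdown files and `scraper_config.yaml` untouched.

```python
import shutil
from pathlib import Path


def clear_cache(output_dir: str) -> bool:
    """Delete {output_dir}/.cache/ if present; return True when something was removed."""
    cache_dir = Path(output_dir) / ".cache"
    if not cache_dir.is_dir():
        return False
    # Remove the cache directory and everything inside it.
    shutil.rmtree(cache_dir)
    return True
```

After clearing the cache, the next run should be expected to fetch every page again.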

Related Skills

task-master

from sfc-gh-dflippo/snowflake-dbt-demo

AI-powered task management for structured, specification-driven development. Use this skill when you need to manage complex projects with PRDs, break down tasks into subtasks, track dependencies, and maintain organized development workflows across features and branches.

task-master-viewer

from sfc-gh-dflippo/snowflake-dbt-demo

Launch a Streamlit GUI for Task Master tasks.json editing. Use when users want a visual interface instead of CLI/MCP commands.

task-master-install

from sfc-gh-dflippo/snowflake-dbt-demo

Install and initialize task-master for AI-powered task management and specification-driven development. Use this skill when users ask you to parse a new PRD, when starting a new project that needs structured task management, when users mention wanting task breakdown or project planning, or when implementing specification-driven development workflows.

streamlit-development

from sfc-gh-dflippo/snowflake-dbt-demo

Developing, testing, and deploying Streamlit data applications on Snowflake. Use this skill when you're building interactive data apps, setting up local development environments, testing with pytest or Playwright, or deploying apps to Snowflake using Streamlit in Snowflake.

snowflake-connections

from sfc-gh-dflippo/snowflake-dbt-demo

Configuring Snowflake connections using connections.toml (for Snowflake CLI, Streamlit, Snowpark) or profiles.yml (for dbt) with multiple authentication methods (SSO, key pair, username/password, OAuth), managing multiple environments, and overriding settings with environment variables. Use this skill when setting up Snowflake CLI, Streamlit apps, dbt, or any tool requiring Snowflake authentication and connection management.

snowflake-cli

from sfc-gh-dflippo/snowflake-dbt-demo

Executing SQL, managing Snowflake objects, deploying applications, and orchestrating data pipelines using the Snowflake CLI (snow) command. Use this skill when you need to run SQL scripts, deploy Streamlit apps, execute Snowpark procedures, manage stages, automate Snowflake operations from CI/CD pipelines, or work with variables and templating.

skills-sync

from sfc-gh-dflippo/snowflake-dbt-demo

Manage and synchronize AI agent skills from local SKILL.md files and remote Git repositories, generating Cursor rules with Agent Skills specification XML. This skill should be used when users need to sync skills, add/remove skill repositories, or set up the skills infrastructure.

schemachange

from sfc-gh-dflippo/snowflake-dbt-demo

Deploying and managing Snowflake database objects using version control with schemachange. Use this skill when you need to manage database migrations for objects not handled by dbt, implement CI/CD pipelines for schema changes, or coordinate deployments across multiple environments.

playwright-mcp

from sfc-gh-dflippo/snowflake-dbt-demo

Browser testing, web scraping, and UI validation using Playwright MCP. Use this skill when you need to test Streamlit apps, validate web interfaces, test responsive design, check accessibility, or automate browser interactions through MCP tools.

devcontainer-setup

from sfc-gh-dflippo/snowflake-dbt-demo

Create Universal DevContainers optimized for AI agentic workflows with Claude Code, Snowflake CLI, Cortex Code, and dbt. Use when setting up development containers, configuring devcontainer.json, scaffolding AI-ready environments, or when the user mentions devcontainers, containerized development, or Docker development environments.

dbt-testing

from sfc-gh-dflippo/snowflake-dbt-demo

dbt testing strategies using dbt_constraints for database-level enforcement, generic tests, and singular tests. Use this skill when implementing data quality checks, adding primary/foreign key constraints, creating custom tests, or establishing comprehensive testing frameworks across bronze/silver/gold layers.

dbt-projects-snowflake-setup

from sfc-gh-dflippo/snowflake-dbt-demo

Step-by-step setup guide for dbt Projects on Snowflake including prerequisites, external access integration, Git API integration, event table configuration, and automated scheduling. Use this skill when setting up dbt Projects on Snowflake for the first time or troubleshooting setup issues.