orthogonal-extract-webpage-data

Extract structured data from web pages using AI

380 stars

Best use case

orthogonal-extract-webpage-data is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using orthogonal-extract-webpage-data can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/orthogonal-extract-webpage-data/SKILL.md --create-dirs "https://raw.githubusercontent.com/gooseworks-ai/goose-skills/main/skills/capabilities/orthogonal-extract-webpage-data/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/orthogonal-extract-webpage-data/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How orthogonal-extract-webpage-data Compares

| Feature / Agent | orthogonal-extract-webpage-data | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Extract structured data from web pages using AI

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Extract Webpage Data

## Setup

Read your credentials from ~/.gooseworks/credentials.json:
```bash
export GOOSEWORKS_API_KEY=$(python3 -c "import json;print(json.load(open('$HOME/.gooseworks/credentials.json'))['api_key'])")
export GOOSEWORKS_API_BASE=$(python3 -c "import json;print(json.load(open('$HOME/.gooseworks/credentials.json')).get('api_base','https://api.gooseworks.ai'))")
```

If ~/.gooseworks/credentials.json does not exist, tell the user to run: `npx gooseworks login`
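
The existence check described above can be sketched as a small guard to run before exporting the variables. This is a minimal sketch, not part of the skill itself: the function name `require_gooseworks_credentials` and its optional path argument are hypothetical.

```bash
# Hypothetical helper (not part of the skill): succeed only if the
# credentials file exists. The optional first argument overrides the
# default path, which is mainly useful for testing.
require_gooseworks_credentials() {
  local cred_file="${1:-$HOME/.gooseworks/credentials.json}"
  if [ ! -f "$cred_file" ]; then
    # Same hint the skill asks the agent to give the user.
    echo "Missing $cred_file. Run: npx gooseworks login" >&2
    return 1
  fi
}
```

Calling `require_gooseworks_credentials || exit 1` before the two `export` lines avoids a Python traceback when the file is absent.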

All endpoints use Bearer auth: `-H "Authorization: Bearer $GOOSEWORKS_API_KEY"`


Extract structured data from any web page using AI. Turn messy HTML into clean, organized data.

## When to Use

- User wants to extract specific data from a website
- User asks to scrape information from a page
- User needs structured data from unstructured content
- User wants to pull product info, contact details, etc.
- Converting web content to usable data

## How It Works

Uses Olostep, Scrapegraph, or Riveter APIs for AI-powered data extraction.

## Usage

### Simple Scrape with Olostep

```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"olostep","path":"/v1/scrapes","body":{"url_to_scrape":"https://example.com/products"}}'
```

### AI-Powered Extraction with Scrapegraph

```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"scrapegraph","path":"/v1/smartscraper","body":{"website_url":"https://example.com/team","user_prompt":"Extract all team members with their names, titles, and LinkedIn URLs"}}'
```

### Schema-Based Extraction with Riveter

```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"riveter","path":"/v1/scrape","body":{"url":"https://example.com","schema":{"name":"string","price":"number","description":"string"}}}'
```

### Get AI Answer from Web

```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"olostep","path":"/v1/answers","body":{"task":"Find the pricing for Notion Teams plan from their website"}}'
```

### Crawl Multiple Pages

```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"olostep","path":"/v1/crawls","body":{"start_url":"https://example.com","max_pages":10}}'
```
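
Every call above posts the same `{api, path, body}` envelope to the same endpoint, so the repetition can be factored into a wrapper. The sketch below assumes the environment variables from Setup are exported. The names `orthogonal_payload` and `orthogonal_run` are hypothetical and not provided by the skill.

```bash
# Hypothetical helper: build the {api, path, body} envelope used by every call.
#   $1 = api name, $2 = upstream path, $3 = request body
# $3 must already be valid JSON; it is inserted verbatim, not escaped.
orthogonal_payload() {
  printf '{"api":"%s","path":"%s","body":%s}' "$1" "$2" "$3"
}

# Hypothetical helper: POST the envelope to the proxy endpoint shown above.
# Takes the same three arguments as orthogonal_payload.
orthogonal_run() {
  curl -s -X POST "$GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run" \
    -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(orthogonal_payload "$1" "$2" "$3")"
}
```

With these defined, the simple Olostep scrape becomes `orthogonal_run olostep /v1/scrapes '{"url_to_scrape":"https://example.com/products"}'`. Because the body is passed through unescaped, the caller is responsible for producing valid JSON.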

## Parameters

### Olostep Scrape
- **url_to_scrape** (required) - URL to scrape
- **formats** - Output formats (markdown, html, text)

### Scrapegraph
- **website_url** (required) - URL to scrape
- **user_prompt** (required) - Natural language description of what to extract

### Riveter
- **url** (required) - URL to scrape
- **schema** - JSON object describing the data to extract, mapping each field name to a type (e.g., `{"name":"string","price":"number"}`)

### Olostep Answer
- **task** (required) - Natural language task/question

### Olostep Crawl
- **start_url** - URL where the crawl starts
- **max_pages** - Maximum number of pages to crawl

## Response

### Olostep Response
Returns a scrape object:
- **id** (string) - Scrape ID (e.g., `scrape_z926lxxon3`)
- **result.markdown_content** (string|null) - Page content as markdown
- **result.html_content** (string|null) - Raw HTML (if requested via `formats`)
- **result.text_content** (string|null) - Plain text (if requested)
- **result.markdown_hosted_url** (string|null) - S3 URL for large content
- **result.links_on_page** (array) - Links found on the page
- **result.screenshot_hosted_url** (string|null) - Screenshot URL (if requested)
- **result.page_metadata** (object) - Page metadata, including the page's `status_code`
- **credits_consumed** (integer) - Credits used for this scrape

**Async crawls**: POST `/v1/crawls` returns an `id`. Poll with GET `/v1/crawls/{id}` until complete.

### Scrapegraph Response
Returns structured extraction result:
- **request_id** (string) - Unique request identifier
- **status** (string) - `completed` or `pending`
- **result** (object) - AI-extracted data matching your prompt (dynamic keys)
- **error** (string) - Empty on success, error message on failure

**Note**: For large pages, the POST may return `status: "pending"`. Poll with GET `/v1/smartscraper/{request_id}` until `status` is `completed`.
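
The polling described above could look roughly like the loop below. It is a sketch under stated assumptions: `poll_until` is a hypothetical name, and the source does not show how a GET is sent through the proxy, so the command that fetches and prints the current `status` is left to the caller.

```bash
# Hypothetical helper: poll until a status command prints the wanted value.
#   $1 = name of a function or command (no arguments) that prints the status
#   $2 = status that means "done" (Scrapegraph reports "completed")
#   $3 = maximum number of attempts (default 30, an arbitrary choice)
#   $4 = seconds to sleep between attempts (default 2, an arbitrary choice)
# Returns 0 when done, 1 if attempts run out, 2 if the status command fails.
poll_until() {
  local status_cmd="$1" done_value="$2" max_attempts="${3:-30}" interval="${4:-2}"
  local attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$("$status_cmd") || return 2
    if [ "$status" = "$done_value" ]; then
      return 0
    fi
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 1
}
```

A caller would define its own function, for example one that requests `/v1/smartscraper/{request_id}` and prints the `status` field, then run `poll_until that_function completed`. Bounding the attempts keeps an agent from waiting indefinitely on a job that never finishes.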

### Riveter Response
Returns scrape result:
- **request_status** (string) - `success` or `error`
- **message** (string) - Human-readable status
- **text** (string) - Extracted page text content
- **url** (string) - URL that was scraped
- **status_code** (integer) - HTTP status of the page
- **run_key** (string) - Unique run identifier
- **base_url_for_links** (string) - Base URL for resolving relative links
- **riveter_app_link** (string) - Link to view run in Riveter dashboard
- **credit_used** (integer) - Credits consumed

## Examples

**User:** "Get all the product names and prices from this page"
```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"scrapegraph","path":"/v1/smartscraper","body":{"website_url":"https://example.com/products","user_prompt":"Extract all products with name, price, and description"}}'
```

**User:** "Scrape the team page and get everyone's info"
```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"scrapegraph","path":"/v1/smartscraper","body":{"website_url":"https://example.com/about/team","user_prompt":"Extract team members: name, role, bio, photo URL, LinkedIn"}}'
```

**User:** "What are Stripe's API pricing details?"
```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"olostep","path":"/v1/answers","body":{"task":"Find Stripe API pricing breakdown from stripe.com/pricing"}}'
```

**User:** "Get all blog post titles and dates from this blog"
```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"riveter","path":"/v1/scrape","body":{"url":"https://blog.example.com","schema":{"posts":[{"title":"string","date":"string","url":"string"}]}}}'
```

## Error Handling

- **504** - Olostep timed out on a slow page; retry or try a simpler URL
- **400** - Missing required parameters (`url_to_scrape` for Olostep, `website_url` + `user_prompt` for Scrapegraph, `url` for Riveter)
- Scrapegraph returns an `error` field in the response body; check it even on a 200 status
- Riveter returns `request_status: "error"` with details in `message`
- Some sites block automated scraping; try a different API if one fails
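
The "try a different API if one fails" advice can be sketched as a fallback chain. The helper name `try_in_order` is hypothetical, and each candidate is assumed to be a caller-defined function that exits non-zero on failure, for instance by running `curl` with `-f` so that HTTP errors such as 400 or 504 produce a non-zero exit code.

```bash
# Hypothetical helper: run each candidate command in order and print the
# output of the first one that exits 0. Returns 1 if every candidate fails.
# Each argument is the name of a function or command that takes no arguments.
try_in_order() {
  local candidate output
  for candidate in "$@"; do
    if output=$("$candidate"); then
      printf '%s\n' "$output"
      return 0
    fi
    echo "candidate '$candidate' failed, trying next" >&2
  done
  return 1
}
```

Exit codes only capture transport-level and HTTP-level failures. Body-level failures, such as a non-empty Scrapegraph `error` field or a Riveter `request_status` of `error`, still need to be checked inside each candidate function.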

## Tips

- Scrapegraph is best for natural language extraction
- Riveter is best when you know the exact schema you want
- Olostep is great for general scraping and AI answers
- These tools handle rendering for dynamic (JavaScript-heavy) sites
- Be specific in your prompts for better extraction results
- Some sites may block automated access

Related Skills

crustdata-supabase

381
from gooseworks-ai/goose-skills

Search CrustData People Search API for ICP-matching leads with automatic Supabase deduplication. Queries existing leads in the database, passes them as exclude_profiles to CrustData, fetches only net-new leads, and upserts results. Supports pagination, rate limiting, test mode, and reusable configs.

visual-brand-extractor

380
from gooseworks-ai/goose-skills

Extract visual branding (colors, typography, layout patterns) from a client's website and generate a style preset compatible with the HTML slides skill and a brand config JSON for the content asset creator. Uses WebFetch to read pages and analyzes CSS/HTML to identify the color palette, font pairings, and aesthetic patterns.

orthogonal-yc-batch-evaluator

380
from gooseworks-ai/goose-skills

Evaluate YC batch companies for investment — scrapes the YC directory, researches each company and its founders (work history, LinkedIn, website), assesses founder-company fit, and exports to Google Sheets with priority rankings. Use when asked to evaluate YC companies, research a YC batch, screen startups, or do due diligence on YC companies.

orthogonal-website-screenshot

380
from gooseworks-ai/goose-skills

Take screenshots of websites and web pages

orthogonal-weather

380
from gooseworks-ai/goose-skills

Get current weather and forecasts using free APIs (no API key required). Use when asked about weather, temperature, forecasts, or climate conditions for any location.

orthogonal-weather-forecast

380
from gooseworks-ai/goose-skills

Get weather forecasts - temperature, precipitation, wind, and conditions

orthogonal-vhs-terminal-recordings

380
from gooseworks-ai/goose-skills

Create polished terminal GIF recordings using VHS (Video Hardware Software) by Charmbracelet. Use when asked to create terminal demos, CLI gifs, command-line recordings, or animated terminal screenshots for documentation, READMEs, or marketing.

orthogonal-verify-email

380
from gooseworks-ai/goose-skills

Verify if an email address is valid and deliverable

orthogonal-valyu

380
from gooseworks-ai/goose-skills

Web search, AI answers, content extraction, and async deep research

orthogonal-uptime-monitor

380
from gooseworks-ai/goose-skills

Monitor website uptime - check availability, response times, and status

orthogonal-twitter-profile-lookup

380
from gooseworks-ai/goose-skills

Look up Twitter/X profiles - get bio, followers, tweets, and engagement

orthogonal-tomba

380
from gooseworks-ai/goose-skills

Email finder and verifier - find emails from domains, LinkedIn, or company search