orthogonal-extract-webpage-data
Extract structured data from web pages using AI
Best use case
orthogonal-extract-webpage-data is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using orthogonal-extract-webpage-data should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it at .claude/skills/orthogonal-extract-webpage-data/SKILL.md inside your project
- Restart your AI agent — it will auto-discover the skill
How orthogonal-extract-webpage-data Compares
| Feature / Agent | orthogonal-extract-webpage-data | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Extract structured data from web pages using AI
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Extract Webpage Data
## Setup
Read your credentials from ~/.gooseworks/credentials.json:
```bash
export GOOSEWORKS_API_KEY=$(python3 -c "import json;print(json.load(open('$HOME/.gooseworks/credentials.json'))['api_key'])")
export GOOSEWORKS_API_BASE=$(python3 -c "import json;print(json.load(open('$HOME/.gooseworks/credentials.json')).get('api_base','https://api.gooseworks.ai'))")
```
If ~/.gooseworks/credentials.json does not exist, tell the user to run: `npx gooseworks login`
All endpoints use Bearer auth: `-H "Authorization: Bearer $GOOSEWORKS_API_KEY"`
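Before exporting the keys, it can help to guard against a missing credentials file. The sketch below wraps the check in a function (`check_creds` is a hypothetical helper, not part of the skill) so the path can be passed in explicitly:

```shell
# Sketch: guard against a missing credentials file before exporting keys.
# check_creds is an illustrative helper; the path is a parameter so the
# check works against any location.
check_creds() {
  local creds_path="$1"
  if [ -f "$creds_path" ]; then
    echo "credentials found"
  else
    echo "Run: npx gooseworks login"
  fi
}

check_creds "$HOME/.gooseworks/credentials.json"
```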
Extract structured data from any web page using AI. Turn messy HTML into clean, organized data.
## When to Use
- User wants to extract specific data from a website
- User asks to scrape information from a page
- User needs structured data from unstructured content
- User wants to pull product info, contact details, etc.
- Converting web content to usable data
## How It Works
Uses Olostep, Scrapegraph, or Riveter APIs for AI-powered data extraction.
## Usage
### Simple Scrape with Olostep
```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
-H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
-H "Content-Type: application/json" \
-d '{"api":"olostep","path":"/v1/scrapes","body":{"url_to_scrape":"https://example.com/products"}}'
```
### AI-Powered Extraction with Scrapegraph
```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
-H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
-H "Content-Type: application/json" \
-d '{"api":"scrapegraph","path":"/v1/smartscraper","body":{"website_url":"https://example.com/team","user_prompt":"Extract all team members with their names, titles, and LinkedIn URLs"}}'
```
### Schema-Based Extraction with Riveter
```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
-H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
-H "Content-Type: application/json" \
-d '{"api":"riveter","path":"/v1/scrape","body":{"url":"https://example.com","schema":{"name":"string","price":"number","description":"string"}}}'
```
### Get AI Answer from Web
```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
-H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
-H "Content-Type: application/json" \
-d '{"api":"olostep","path":"/v1/answers","body":{"task":"Find the pricing for Notion Teams plan from their website"}}'
```
### Crawl Multiple Pages
```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
-H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
-H "Content-Type: application/json" \
-d '{"api":"olostep","path":"/v1/crawls","body":{"start_url":"https://example.com","max_pages":10}}'
```
## Parameters
### Olostep Scrape
- **url_to_scrape** (required) - URL to scrape
- **formats** - Output formats (markdown, html, text)
### Scrapegraph
- **website_url** (required) - URL to scrape
- **user_prompt** (required) - Natural language description of what to extract
### Riveter
- **url** (required) - URL to scrape
- **schema** - JSON schema defining the data structure to extract
### Olostep Answer
- **task** (required) - Natural language task/question
## Response
### Olostep Response
Returns a scrape object:
- **id** (string) - Scrape ID (e.g., `scrape_z926lxxon3`)
- **result.markdown_content** (string|null) - Page content as markdown
- **result.html_content** (string|null) - Raw HTML (if requested via `formats`)
- **result.text_content** (string|null) - Plain text (if requested)
- **result.markdown_hosted_url** (string|null) - S3 URL for large content
- **result.links_on_page** (array) - Links found on the page
- **result.screenshot_hosted_url** (string|null) - Screenshot URL (if requested)
- **result.page_metadata** (object) - `status_code` of the page
- **credits_consumed** (integer) - Credits used for this scrape
**Async crawls**: POST `/v1/crawls` returns an `id`. Poll with GET `/v1/crawls/{id}` until complete.
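As a sketch, the documented fields can be pulled from a response with python3, which the Setup section already relies on. The sample JSON below is illustrative, built from the field names above, not captured from the API:

```shell
# Illustrative Olostep scrape response using the documented field names.
RESPONSE='{"id":"scrape_z926lxxon3","result":{"markdown_content":"# Products","links_on_page":["https://example.com/a"],"page_metadata":{"status_code":200}},"credits_consumed":1}'

# Pull out the markdown content and the credit usage.
echo "$RESPONSE" | python3 -c '
import json, sys
d = json.load(sys.stdin)
print(d["result"]["markdown_content"])  # "# Products"
print(d["credits_consumed"])            # 1
'
```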
### Scrapegraph Response
Returns structured extraction result:
- **request_id** (string) - Unique request identifier
- **status** (string) - `completed` or `pending`
- **result** (object) - AI-extracted data matching your prompt (dynamic keys)
- **error** (string) - Empty on success, error message on failure
**Note**: For large pages, the POST may return `status: "pending"`. Poll with GET `/v1/smartscraper/{request_id}` until `status` is `completed`.
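A minimal sketch of the pending check, using an illustrative response body (the `request_id` value is made up):

```shell
# Illustrative pending response; field names are from the list above.
RESPONSE='{"request_id":"req_123","status":"pending","result":null,"error":""}'

# Read the status and decide whether another poll is needed.
STATUS=$(echo "$RESPONSE" | python3 -c 'import json,sys;print(json.load(sys.stdin)["status"])')
if [ "$STATUS" = "pending" ]; then
  echo "poll GET /v1/smartscraper/req_123 again"
fi
```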
### Riveter Response
Returns scrape result:
- **request_status** (string) - `success` or `error`
- **message** (string) - Human-readable status
- **text** (string) - Extracted page text content
- **url** (string) - URL that was scraped
- **status_code** (integer) - HTTP status of the page
- **run_key** (string) - Unique run identifier
- **base_url_for_links** (string) - Base URL for resolving relative links
- **riveter_app_link** (string) - Link to view run in Riveter dashboard
- **credit_used** (integer) - Credits consumed
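Since Riveter signals failure through `request_status` rather than the HTTP code, a response check can be sketched like this (`riveter_ok` is a hypothetical helper, and both sample bodies are illustrative):

```shell
# Illustrative responses using the documented Riveter fields.
OK='{"request_status":"success","message":"ok","text":"Example Domain","status_code":200,"credit_used":1}'
ERR='{"request_status":"error","message":"blocked by site","credit_used":0}'

# Exit 0 when request_status is "success", 1 otherwise.
riveter_ok() {
  echo "$1" | python3 -c '
import json, sys
d = json.load(sys.stdin)
sys.exit(0 if d.get("request_status") == "success" else 1)
'
}

riveter_ok "$OK"  && echo "use .text"          # prints "use .text"
riveter_ok "$ERR" || echo "inspect .message"   # prints "inspect .message"
```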
## Examples
**User:** "Get all the product names and prices from this page"
```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
-H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
-H "Content-Type: application/json" \
-d '{"api":"scrapegraph","path":"/v1/smartscraper","body":{"website_url":"https://example.com/products","user_prompt":"Extract all products with name, price, and description"}}'
```
**User:** "Scrape the team page and get everyone's info"
```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
-H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
-H "Content-Type: application/json" \
-d '{"api":"scrapegraph","path":"/v1/smartscraper","body":{"website_url":"https://example.com/about/team","user_prompt":"Extract team members: name, role, bio, photo URL, LinkedIn"}}'
```
**User:** "What are Stripe's API pricing details?"
```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
-H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
-H "Content-Type: application/json" \
-d '{"api":"olostep","path":"/v1/answers","body":{"task":"Find Stripe API pricing breakdown from stripe.com/pricing"}}'
```
**User:** "Get all blog post titles and dates from this blog"
```bash
curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
-H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
-H "Content-Type: application/json" \
-d '{"api":"riveter","path":"/v1/scrape","body":{"url":"https://blog.example.com","schema":{"posts":[{"title":"string","date":"string","url":"string"}]}}}'
```
## Error Handling
- **504** - Olostep timeout on slow pages — retry or try a simpler URL
- **400** - Missing required parameters (`url_to_scrape` for Olostep, `website_url` + `user_prompt` for Scrapegraph, `url` for Riveter)
- Scrapegraph returns `error` field in response body — check it even on 200 status
- Riveter returns `request_status: "error"` with details in `message`
- Some sites block automated scraping — try a different API if one fails
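The "check `error` even on 200" and "try a different API" points above can be combined into a small decision helper. This is only a sketch: `needs_fallback` is a hypothetical function, and the sample response is illustrative:

```shell
# Sketch: decide whether a Scrapegraph response warrants a fallback attempt.
# The error field is checked even on HTTP 200, per the note above.
needs_fallback() {
  echo "$1" | python3 -c '
import json, sys
d = json.load(sys.stdin)
# Non-empty error -> exit 0 (true in shell) -> fall back.
sys.exit(1 if d.get("error", "") == "" else 0)
'
}

RESPONSE='{"request_id":"req_1","status":"completed","result":{},"error":"site blocked the request"}'
if needs_fallback "$RESPONSE"; then
  echo "retry with riveter or olostep"
fi
```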
## Tips
- Scrapegraph is best for natural language extraction
- Riveter is best when you know the exact schema you want
- Olostep is great for general scraping and AI answers
- For dynamic sites (JavaScript-heavy), these tools handle rendering
- Be specific in your prompts for better extraction results
- Some sites may block automated access