nlweb-schema-org-grounding

Prepare and structure site content as Schema.org JSON-LD for NLWeb ingestion — covers the supported types (Recipe, Product, Movie, Event, Article, RealEstate, Course, etc.), per-type behavior in NLWeb's tool routing, JSON-LD embedding patterns in HTML, sites.xml registration, and how the `schema_object` flows through ranking back to agent results. Use when authoring or auditing the structured data on a site that will be exposed via NLWeb.

by OrcaQubits

Best use case

nlweb-schema-org-grounding is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using nlweb-schema-org-grounding should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/nlweb-schema-org-grounding/SKILL.md --create-dirs "https://raw.githubusercontent.com/OrcaQubits/agentic-commerce-skills-plugins/main/dist/antigravity/nlweb-protocol/.agent/skills/nlweb-schema-org-grounding/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/nlweb-schema-org-grounding/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How nlweb-schema-org-grounding Compares

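| Feature / Agent | nlweb-schema-org-grounding | Standard Approach |
|-----------------|----------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |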

SKILL.md Source

# NLWeb Schema.org Grounding

## Before writing code

**Fetch live references**:
1. Fetch https://schema.org/ for the canonical Schema.org vocabulary.
2. Fetch https://github.com/nlweb-ai/NLWeb/blob/main/config/site_types.xml in the live repo for the **exact list of supported Schema.org types** and the tool inheritance tree per type.
3. Fetch https://github.com/nlweb-ai/NLWeb/blob/main/docs/nlweb-prompts.md for how per-type prompts and `<returnStruc>` shapes work.
4. Web-search `schema.org JSON-LD validator` — Google's Rich Results Test is a quick way to validate before ingest.
5. Check `AskAgent/python/methods/recipe_substitution.py`, `accompaniment.py`, `compare_items.py` for examples of how type-specific tools consume the `schema_object`.

## Conceptual Architecture

### Why Schema.org Matters to NLWeb

NLWeb's defining design choice: **results carry their full Schema.org object back to the agent**. Unlike a generic RAG system that returns text chunks, NLWeb returns structured JSON-LD, so an agent receiving a `Recipe` result gets `recipeIngredient`, `cookTime`, `nutrition`, and `recipeYield`, not just a paragraph of text. This is what makes NLWeb results *agent-actionable*.

R.V. Guha (NLWeb's author) co-created Schema.org for exactly this reason — the data was already structured; NLWeb finally exposes it to agents.

### Schema.org Types NLWeb Knows About

`site_types.xml` enumerates the types with per-type tool / prompt overrides. Common types (verify the live file):

| Type | Use Case | Type-Specific Tools |
|------|----------|---------------------|
| `Recipe` | Cooking sites | recipe_substitution, accompaniment |
| `Product` | E-commerce | compare_items, item_details |
| `Movie` / `TVSeries` | Streaming/reviews | compare_items |
| `Event` | Calendars, ticketing | item_details |
| `Article` / `NewsArticle` / `BlogPosting` | News, blogs | summarize-mode default |
| `RealEstate` / `Apartment` / `House` | Listings | item_details, compare_items |
| `Course` | EdTech | item_details |
| `Restaurant` / `LocalBusiness` | Maps, directories | accompaniment, item_details |
| `Book` | Catalogs | compare_items, item_details |
| `Person` / `Organization` | Profiles | item_details |

NLWeb falls back to a default tool set for any Schema.org type not explicitly enumerated.

### JSON-LD Embedding Patterns

Schema.org JSON-LD is typically embedded in HTML via a `<script type="application/ld+json">` tag:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Classic Tomato Soup",
  "url": "https://example.com/recipes/tomato-soup",
  "image": "https://example.com/images/tomato-soup.jpg",
  "author": { "@type": "Person", "name": "Jane Doe" },
  "datePublished": "2025-09-12",
  "description": "A simple weeknight tomato soup.",
  "recipeIngredient": ["6 ripe tomatoes", "1 onion", "..."],
  "recipeInstructions": [...],
  "nutrition": { "@type": "NutritionInformation", "calories": "200 calories" },
  "cookTime": "PT30M",
  "recipeYield": "4 servings"
}
</script>
```

NLWeb's URL-list ingest path extracts this directly. The richer the JSON-LD, the more useful the result.
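
To preview what an ingest step would find on a page, the embedded blocks can be pulled out with the Python standard library. This is a minimal sketch, not NLWeb's actual extraction code; the names `JsonLdExtractor` and `extract_jsonld` are hypothetical, and skipping malformed blocks silently is an assumption made for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed objects from <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.objects = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # assumption: skip malformed blocks instead of failing
            # One script tag may hold a single object or a list of objects.
            self.objects.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_jsonld(html: str) -> list:
    """Return every JSON-LD object embedded in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.objects
```

Running `extract_jsonld` over the page source and checking that the list is non-empty is a quick local check before pointing an ingest job at the URL.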

### The `schema_object` Field in Responses

Every NLWeb result contains:

```json
{
  "url": "...",
  "name": "...",
  "site": "...",
  "score": 0.87,
  "description": "...",
  "schema_object": { /* the full JSON-LD as ingested */ }
}
```

Agents can pattern-match on `schema_object.@type` to render appropriately, extract specific properties (e.g., `offers.price` for products), or chain to a follow-up tool call.
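
As an illustration of that pattern-matching, the sketch below branches on `@type` and reads `offers.price` for products. The function names are hypothetical, and the handling of list-valued `@type` and `offers` is an assumption about what real-world JSON-LD may contain, not documented NLWeb behavior.

```python
def first_type(schema_object: dict) -> str:
    """Return @type as a string; JSON-LD allows @type to be a list."""
    value = schema_object.get("@type", "")
    if isinstance(value, list):
        return value[0] if value else ""
    return value


def summarize_result(result: dict) -> str:
    """Build a one-line, type-aware summary of a single NLWeb result."""
    obj = result.get("schema_object") or {}
    name = obj.get("name") or result.get("name", "")
    kind = first_type(obj)

    if kind == "Product":
        offers = obj.get("offers") or {}
        # offers may be a single Offer or a list of Offers.
        if isinstance(offers, list):
            offers = offers[0] if offers else {}
        if isinstance(offers, dict) and "price" in offers:
            return f"{name}: {offers['price']} {offers.get('priceCurrency', '')}".strip()
        return f"{name}: price unavailable"

    if kind == "Recipe":
        count = len(obj.get("recipeIngredient", []))
        return f"{name}: {count} ingredients, cook time {obj.get('cookTime', 'unknown')}"

    # Unknown types fall back to the plain name.
    return name
```

A bare-string `offers` value lands in the "price unavailable" branch, which is the failure mode an `Offer` object avoids.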

### sites.xml and Per-Site Registration

In addition to `config_nlweb.yaml`'s `sites:` allowlist, the demo data ships with a `sites.xml`-style registry tying site names to crawl sources and Schema.org type defaults. Check the live repo for the current registration convention — this is an area that's been evolving.

### Schema.org Required Fields by Type (high-signal subset)

| Type | Always include |
|------|----------------|
| Recipe | name, url, image, recipeIngredient, recipeInstructions, cookTime, recipeYield |
| Product | name, url, image, description, offers (price, priceCurrency, availability) |
| Article | headline, url, image, author, datePublished, description, articleBody (or summary) |
| Event | name, url, startDate, location, description |
| Movie | name, url, image, director, datePublished, genre, description |
| RealEstate | name, url, image, address, numberOfRooms, floorSize, price |

The fewer fields populated, the worse the result quality — especially for `mode=generate` answers.
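
The table can be turned into a small pre-ingest audit. The sketch below covers three of the rows; `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and treating empty strings, lists, and objects as missing is an assumption.

```python
# Subset of the table above; extend with the remaining types as needed.
REQUIRED_FIELDS = {
    "Recipe": ["name", "url", "image", "recipeIngredient",
               "recipeInstructions", "cookTime", "recipeYield"],
    "Product": ["name", "url", "image", "description", "offers"],
    "Event": ["name", "url", "startDate", "location", "description"],
}


def missing_fields(obj: dict) -> list:
    """List the high-signal fields that are absent or empty for obj's @type."""
    kind = obj.get("@type")
    required = REQUIRED_FIELDS.get(kind, []) if isinstance(kind, str) else []
    return [field for field in required if obj.get(field) in (None, "", [], {})]
```

Types that are not in the mapping return an empty list, so the check only flags types it knows about.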

### Per-Type Prompt and Tool Inheritance

`site_types.xml` defines a tree:
- Root prompts apply to all types
- Per-type overrides specialize ranking, summarization, and tool selection

This is **mixed-mode programming** in action — small, type-aware LLM calls drive the response.

## Implementation Guidance

### Auditing an Existing Site

Before ingest:
1. Visit a representative page and view source — look for `<script type="application/ld+json">`.
2. Validate with Google's Rich Results Test or the Schema.org validator.
3. Confirm the `@type` is one that NLWeb's `site_types.xml` knows about. If it is not, results still work but use the default prompts.

### Authoring JSON-LD for NLWeb

- **Always set `@context: "https://schema.org"`** — NLWeb's parser keys off this.
- **Always include `url`** — it's the deduplication key across retrieval backends.
- **Use specific subtypes** (e.g., `Recipe` not `CreativeWork`) so type-specific tools activate.
- **Embed images and dates** — agents use them for rendering and freshness checks.
- **Nest related objects** with `@type` discriminators (e.g., `author` as `Person`, `offers` as `Offer`).
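
The checklist above can be applied when generating markup from templates. The following is a minimal sketch with hypothetical helper names (`product_jsonld`, `to_script_tag`); the default `availability` value is an illustrative assumption.

```python
import json


def product_jsonld(name, url, image, description, price, currency,
                   availability="https://schema.org/InStock"):
    """Build a Product object with a nested, typed Offer."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",          # specific subtype, not a generic Thing
        "name": name,
        "url": url,                  # should be an absolute URL
        "image": image,
        "description": description,
        "offers": {
            "@type": "Offer",        # typed object, not a bare string
            "price": price,
            "priceCurrency": currency,
            "availability": availability,
        },
    }


def to_script_tag(obj: dict) -> str:
    """Serialize obj into an embeddable JSON-LD script element."""
    payload = json.dumps(obj, indent=2, ensure_ascii=False)
    # Escape "</" so a string value cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'
```

Generating the tag from a dictionary rather than by string concatenation keeps quoting and nesting valid by construction.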

### Validating Schema Quality Post-Ingest

After loading, hit a result and inspect `schema_object`:

```bash
curl 'http://localhost:8000/ask?query=quick+dinners&site=recipes&streaming=false&mode=list' | jq '.results[0].schema_object'
```

If `schema_object` is missing key fields, fix the source HTML — not NLWeb's config.
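
The same check can be scripted. This sketch mirrors the `curl` call above using only the standard library; it assumes the non-streaming response has the `results[].schema_object` shape implied by the `jq` filter, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def ask_url(base: str, query: str, site: str) -> str:
    """Build the /ask URL with the same parameters as the curl example."""
    params = {"query": query, "site": site, "streaming": "false", "mode": "list"}
    return f"{base}/ask?{urlencode(params)}"


def schema_objects(response: dict) -> list:
    """Return the schema_object of each result (None where it is absent)."""
    return [item.get("schema_object") for item in response.get("results", [])]


def fetch_schema_objects(base: str, query: str, site: str) -> list:
    """Call a running NLWeb instance and return the schema objects it sends back."""
    with urlopen(ask_url(base, query, site)) as resp:
        return schema_objects(json.load(resp))
```

For example, `fetch_schema_objects("http://localhost:8000", "quick dinners", "recipes")` issues the same request as the command above; a `None` entry in the returned list means that result carried no `schema_object`.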

### Adding a New Schema.org Type

If you want a custom domain (say, podcast episodes typed as `PodcastEpisode`) with type-specific tools:
1. Add a `<site_type>` entry in `site_types.xml` referencing your `@type` value.
2. Define type-specific prompts in `prompts.xml` (or inherit defaults).
3. Optionally write a handler in `methods/` (see `nlweb-tools-framework`).
4. Reload and re-test.

### Mapping Non-Schema.org Sources

If your source isn't JSON-LD (CSV, proprietary API), map fields to Schema.org **at ingest time**, not query time. Update `rss2schema.py` or write a small adapter that emits Schema.org JSON before calling `db_load`. The richer the mapping, the better the agent experience.
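
A small adapter of that kind might look like the sketch below, which maps CSV rows to `Product` objects. The column names (`title`, `link`, `summary`, `price`, `currency`) are hypothetical, and the sketch stops at producing Schema.org dictionaries; how they are handed to `db_load` depends on the loader's input format, which should be checked in the live repo.

```python
import csv
import io


def row_to_product(row: dict) -> dict:
    """Map one CSV row (hypothetical column names) to a Schema.org Product."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": row["title"],
        "url": row["link"],
        "description": row["summary"],
        "offers": {
            "@type": "Offer",
            "price": row["price"],
            "priceCurrency": row["currency"],
        },
    }


def csv_to_jsonld(csv_text: str) -> list:
    """Convert CSV text with a header row into a list of Product objects."""
    return [row_to_product(row) for row in csv.DictReader(io.StringIO(csv_text))]
```

Keeping the mapping in one function makes it easy to extend with more properties as the source data allows.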

### Common Pitfalls

- **`@type` is missing or non-Schema.org** — results work but type-specific tools never fire.
- **`url` is relative** — breaks deduplication; always emit absolute URLs.
- **Date format is non-ISO** — `datePublished: "2025-09-12"` works; `"Sept 12, 2025"` does not.
- **`offers` is a bare string instead of an `Offer` object** — agents lose the price field.
- **Description is too short / too generic** — ranking suffers because retrieval relies on description embeddings.

Always validate JSON-LD with an external tool before assuming ingest will work — silent parser failures are common.
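
Several of these pitfalls can be caught mechanically before ingest. The sketch below is a hypothetical `lint` helper, not part of NLWeb; it checks only `@type` presence, absolute URLs, ISO dates, and bare-string `offers`, and it does not verify that the type exists in Schema.org.

```python
from datetime import datetime
from urllib.parse import urlparse


def is_iso_date(value) -> bool:
    """True if value parses as an ISO 8601 date or datetime string."""
    if not isinstance(value, str):
        return False
    try:
        # Older Python versions do not accept a trailing "Z", so normalize it.
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False


def lint(obj: dict) -> list:
    """Return a list of pitfall descriptions found in a JSON-LD object."""
    problems = []
    if not obj.get("@type"):
        problems.append("missing @type")
    url = obj.get("url")
    if not isinstance(url, str) or not urlparse(url).scheme or not urlparse(url).netloc:
        problems.append("url is missing or relative")
    if "datePublished" in obj and not is_iso_date(obj["datePublished"]):
        problems.append("datePublished is not ISO 8601")
    if isinstance(obj.get("offers"), str):
        problems.append("offers is a bare string, not an Offer object")
    return problems
```

An empty list means none of the checked pitfalls were found; it does not replace an external validator.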

Related Skills

webmcp-tool-schemas

from OrcaQubits/agentic-commerce-skills-plugins

Design JSON Schemas for WebMCP tool inputs and outputs — proper types, constraints, nested objects, and agent-friendly documentation. Use when defining or refining tool schemas for agent consumption.

ucp-schema-authoring

from OrcaQubits/agentic-commerce-skills-plugins

Author custom UCP schemas and extensions — create capability schemas, extension schemas, and type definitions using JSON Schema 2020-12 composition. Use when extending UCP with custom capabilities or building domain-specific extensions.

nlweb-tools-framework

from OrcaQubits/agentic-commerce-skills-plugins

Design and implement NLWeb tools — the per-Schema.org-type handlers that turn a query into a specialized response (search, item_details, compare_items, ensemble, recipe_substitution, accompaniment, conversation_search, etc.). Covers `tools.xml`, the ToolSelector router, builtin handlers in `methods/`, writing a custom tool with a `<returnStruc>` contract, and disabling tool selection for raw retrieval. Use when extending NLWeb beyond the default query → results flow.

nlweb-setup

from OrcaQubits/agentic-commerce-skills-plugins

Bootstrap a local NLWeb development environment from scratch — clone the repo, configure .env, install Python deps via `nlweb init-python`, run `nlweb init` for interactive LLM/retrieval selection, load sample Schema.org data, and verify with `nlweb check`. Use when starting a new NLWeb deployment from zero.

nlweb-retrieval-backends

from OrcaQubits/agentic-commerce-skills-plugins

Choose and configure NLWeb retrieval backends — Qdrant (local + remote), Azure AI Search, Elasticsearch, OpenSearch (with/without k-NN), Postgres pgvector, Milvus, Snowflake Cortex Search, Cloudflare AutoRAG, Shopify MCP, and Bing Web Search. Covers `config_retrieval.yaml`, the single `write_endpoint` rule, parallel read-fanout with URL dedup, and per-backend setup pages. Use when picking a retrieval store, migrating between backends, or debugging "results are empty."

nlweb-prompts-customization

from OrcaQubits/agentic-commerce-skills-plugins

Customize NLWeb's LLM prompts and per-Schema.org-type behavior via `prompts.xml` and `site_types.xml` — covers the `<promptString>` template format, `<returnStruc>` JSON schemas, prompt inheritance, decontextualization/ranking/generate templates, per-site overrides, and pitfalls of editing prompts in place. Use when tuning answer quality, supporting a new domain, or localizing prompts.

nlweb-mcp-server

from OrcaQubits/agentic-commerce-skills-plugins

Expose NLWeb as an MCP (Model Context Protocol) server — JSON-RPC 2.0 endpoint at /mcp, the `ask` / `list_sites` / `who` tools, MCP protocol version 2024-11-05, and integration with ChatGPT, Claude, Gemini, and other agent clients. Use when wiring NLWeb to an AI agent via MCP or building an MCP client that consumes an NLWeb site.

nlweb-llm-providers

from OrcaQubits/agentic-commerce-skills-plugins

Configure NLWeb LLM and embedding providers — OpenAI, Azure OpenAI (default), Anthropic, Google Gemini, DeepSeek on Azure, Llama on Azure, HuggingFace, Inception Labs, Snowflake Cortex, Ollama, Pi Labs. Covers `config_llm.yaml` high/low tier model selection, the ModelRouter cost/quality routing logic, `config_embedding.yaml`, and adding a custom provider. Use when picking models, tuning cost, or wiring a new LLM backend.

nlweb-data-loading

from OrcaQubits/agentic-commerce-skills-plugins

Ingest site content into NLWeb's vector store using `db_load.py` — supports RSS/Atom feeds, Schema.org JSON-LD, sitemap-driven URL lists, and CSV. Covers chunking, embedding computation, site partitioning, batch sizing, delete-and-reload, and per-backend write_endpoint targeting. Use when bootstrapping a site's index, refreshing content, or migrating between retrieval backends.

nlweb-chatgpt-appsdk

from OrcaQubits/agentic-commerce-skills-plugins

Integrate NLWeb with ChatGPT's Apps SDK — the Node.js MCP server in `openai-apps-sdk-integration/`, the `nlweb-list` tool, the React widget at `ui://widget/nlweb-list.html`, and the port-8100 AppSDK adapter that translates NLWeb's message list to OpenAI Apps SDK envelopes. Use when publishing an NLWeb site as a ChatGPT app or wiring NLWeb results into an Apps SDK widget.

nlweb-auth-multitenancy

from OrcaQubits/agentic-commerce-skills-plugins

Configure NLWeb authentication and multi-tenant deployments — OAuth providers (GitHub, Google, Microsoft, Facebook), session storage, the `sites:` allowlist in `config_nlweb.yaml`, conversation persistence per authenticated user, and per-tenant data isolation. Use when adding login to an NLWeb instance, hosting multiple customers on one deployment, or persisting conversation history.

nlweb-ask-endpoint

from OrcaQubits/agentic-commerce-skills-plugins

Implement and consume the NLWeb /ask REST endpoint — request shape (GET/POST, query-string and v0.55 structured body), SSE streaming response, modes (list/summarize/generate), in-stream "message_type" headers, error envelopes, and client-side parsing. Use when building an NLWeb server route, calling /ask from a custom agent, or debugging /ask responses.