nlweb-data-loading

Ingest site content into NLWeb's vector store using `db_load.py` — supports RSS/Atom feeds, Schema.org JSON-LD, sitemap-driven URL lists, and CSV. Covers chunking, embedding computation, site partitioning, batch sizing, delete-and-reload, and per-backend write_endpoint targeting. Use when bootstrapping a site's index, refreshing content, or migrating between retrieval backends.

by OrcaQubits

Best use case

nlweb-data-loading is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using nlweb-data-loading should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/nlweb-data-loading/SKILL.md --create-dirs "https://raw.githubusercontent.com/OrcaQubits/agentic-commerce-skills-plugins/main/dist/antigravity/nlweb-protocol/.agent/skills/nlweb-data-loading/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/nlweb-data-loading/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How nlweb-data-loading Compares

Feature / Agent	nlweb-data-loading	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# NLWeb Data Loading

## Before writing code

**Fetch live docs**:
1. Fetch https://github.com/nlweb-ai/NLWeb/blob/main/docs/tools-database-load.md for the canonical `db_load.py` reference.
2. Inspect `AskAgent/python/data_loading/db_load.py` and `db_load_utils.py` in the live repo for exact CLI flags; new flags have been added in recent releases.
3. Check `AskAgent/python/data_loading/rss2schema.py` for how RSS items map to Schema.org `Article` objects.
4. Confirm the **embedding provider** used at ingest matches `preferred_provider` in `config_embedding.yaml` on the query side. A mismatch degrades or breaks retrieval without an obvious error.
5. For partner backends, check `docs/setup-snowflake.md`, `docs/setup-cloudflare-autorag.md`, etc. for backend-specific ingest steps (some bypass `db_load.py`).

## Conceptual Architecture

### What db_load Does

`db_load.py` is the canonical ingest pipeline. Given a source and a site name, it:

1. **Fetches** the source (RSS feed, JSON-LD URL, sitemap-derived URL list, CSV).
2. **Normalizes** each item to a Schema.org JSON object (uses `rss2schema.py` for feeds; passes JSON-LD through; maps CSV columns by convention).
3. **Chunks** long text fields (description, body) if needed.
4. **Computes embeddings** via the configured embedding provider in `config_embedding.yaml`.
5. **Writes** to the `write_endpoint` configured in `config_retrieval.yaml`.
6. **Tags** every record with the `site` value so retrieval can partition.
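
The six stages above can be pictured with a toy pipeline. This is only a sketch and does not mirror `db_load.py` internals: `chunk_text`, `fake_embed`, and the in-memory `store` list are hypothetical stand-ins for the real chunker, embedding provider, and write endpoint.

```python
import hashlib
import json


def chunk_text(text, max_chars=200):
    """Split text into pieces of at most max_chars characters (hypothetical chunker)."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [""]


def fake_embed(text, dim=8):
    """Deterministic stand-in for an embedding provider; NOT a real embedding."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def ingest(items, site, store):
    """Normalize, chunk, embed, tag, and write each item into an in-memory store."""
    for item in items:
        # Normalize: make sure every record is a Schema.org-style dict with a type.
        schema_object = {"@type": "Article", **item}
        for n, chunk in enumerate(chunk_text(schema_object.get("description", ""))):
            store.append({
                "url": schema_object["url"],
                "chunk": n,
                "site": site,  # tag used later to partition retrieval
                "embedding": fake_embed(chunk),
                "schema_json": json.dumps(schema_object),
            })
    return store


records = ingest(
    [{"url": "https://example.com/a", "headline": "A", "description": "x" * 450}],
    site="my-blog",
    store=[],
)
print(len(records), records[0]["site"])
```

Every stored record carries the same `site` value, which is the only thing the retrieval side needs in order to scope a query.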

### Supported Source Types

| Source | Detection | Notes |
|--------|-----------|-------|
| RSS / Atom feed | URL ending `.rss`, `.xml`, `/feed`, or an RSS/Atom response content type | Mapped to the Schema.org `Article` type |
| Schema.org JSON-LD | URL returns `application/ld+json` or HTML with embedded JSON-LD | Preserved as-is |
| Sitemap.xml | URL ending `sitemap.xml` | Crawled for child URLs |
| URL list file | `--url-list path.txt` flag | One URL per line; each fetched and parsed for JSON-LD |
| CSV | `.csv` extension | Column-to-Schema.org mapping by convention; see docs |
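
The suffix rules in the table can be sketched as a small classifier. The function name and return labels are hypothetical, content-type sniffing is left out, and the real detection order in `db_load.py` may differ. One detail worth noting under these assumptions: `sitemap.xml` also ends in `.xml`, so it has to be checked before the generic feed rule.

```python
from urllib.parse import urlparse


def detect_source_type(source, url_list=False):
    """Guess the source type from a path or URL suffix (assumed rules, see table)."""
    if url_list:
        return "url-list"
    path = urlparse(source).path.lower().rstrip("/")
    if path.endswith("sitemap.xml"):
        return "sitemap"  # must be tested before the generic .xml feed rule
    if path.endswith(".csv"):
        return "csv"
    if path.endswith((".rss", ".xml", "/feed")):
        return "feed"
    # Anything else: fetch the page and look for JSON-LD.
    return "json-ld"


for src in ("https://example.com/feed.xml", "https://example.com/sitemap.xml", "products.csv"):
    print(src, "->", detect_source_type(src))
```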

### Site Partitioning

Every record carries a `site` field. Queries filter by `site=<name>` to scope retrieval. **Choose site names carefully** — they're user-visible in `/sites` and become part of the agent UX. Conventions:
- Lowercase, no spaces; use hyphens or underscores as separators
- One site per logical content domain (not per RSS feed; aggregate related feeds under one site)
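
A pre-flight check for the naming conventions might look like the sketch below. The regular expression is an assumption derived only from the bullets above; NLWeb itself is not known to enforce it, and dots are rejected here simply because the conventions do not mention them.

```python
import re

# Lowercase letters and digits, separated by single hyphens or underscores.
SITE_NAME = re.compile(r"[a-z0-9]+(?:[-_][a-z0-9]+)*")


def is_valid_site_name(name):
    """Return True if name follows the (assumed) site naming convention."""
    return SITE_NAME.fullmatch(name) is not None


for candidate in ("my-blog", "my_store", "My Blog", "-bad"):
    print(candidate, is_valid_site_name(candidate))
```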

### Embedding Dimension Trap

The most common ingest bug: data was loaded with embedding model A (dim 1536), but at query time `config_embedding.yaml` points to model B (dim 768). If the backend enforces dimension constraints, the query fails outright. The quieter failure is two different models that happen to share a dimension: queries run, but the similarity scores are meaningless and retrieval silently returns garbage. **Always verify the embedding provider and model haven't changed between ingest and query.**
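
One way to guard against this trap is to record the embedding settings at ingest time and compare them before querying. The sketch below assumes you keep that metadata yourself; the dict keys (`provider`, `model`, `dim`) and the function name are hypothetical, not part of NLWeb.

```python
def embedding_config_problems(ingested, current):
    """Compare ingest-time and query-time embedding settings; return a list of problems."""
    problems = []
    if ingested["dim"] != current["dim"]:
        problems.append(
            f"dimension mismatch: index has {ingested['dim']}, "
            f"query side produces {current['dim']}"
        )
    if (ingested["provider"], ingested["model"]) != (current["provider"], current["model"]):
        problems.append(
            "provider/model mismatch: vectors are not comparable even if dimensions agree"
        )
    return problems


at_ingest = {"provider": "provider-a", "model": "model-a", "dim": 1536}
at_query = {"provider": "provider-b", "model": "model-b", "dim": 768}
for problem in embedding_config_problems(at_ingest, at_query):
    print(problem)
```

An empty list means the two sides agree; anything else is a reason to re-ingest or revert the config before trusting results.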

### Write Endpoint Selection

`db_load.py` writes to **one** endpoint at a time — the `write_endpoint` in `config_retrieval.yaml`, or override with `--database <endpoint-name>`. If you need data in multiple backends, run `db_load` multiple times changing the write endpoint each time.

### Delete and Reload

Sites can be wiped:

```bash
python -m data_loading.db_load --only-delete delete-site <site-name>
```

Without `--only-delete`, the loader does **upsert by URL** — re-running on the same source updates existing records but leaves stale ones. For full refresh, delete first, then load.
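
Why upsert-by-URL leaves stale records can be seen with a plain dictionary keyed by URL. This is a simulation of the behaviour described above, not the loader's code; both function names are hypothetical.

```python
def upsert_by_url(index, records):
    """Insert or overwrite records keyed by URL; nothing is ever removed."""
    for record in records:
        index[record["url"]] = record
    return index


def full_refresh(index, site, records):
    """Drop every record for the site, then load the fresh records."""
    kept = {url: rec for url, rec in index.items() if rec["site"] != site}
    return upsert_by_url(kept, records)


index = {
    "https://example.com/a": {"url": "https://example.com/a", "site": "my-blog", "headline": "Old A"},
    "https://example.com/b": {"url": "https://example.com/b", "site": "my-blog", "headline": "B"},
}
feed_today = [
    {"url": "https://example.com/a", "site": "my-blog", "headline": "New A"},
    {"url": "https://example.com/c", "site": "my-blog", "headline": "C"},
]

after_upsert = upsert_by_url(dict(index), feed_today)
after_refresh = full_refresh(dict(index), "my-blog", feed_today)
print(sorted(after_upsert))   # /b is still present although the feed dropped it
print(sorted(after_refresh))  # only /a and /c remain
```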

### Batch Sizing

`--batch-size N` controls how many records are embedded and written per round-trip. The default (around 100; confirm in `db_load.py`) suits most loads. Increase it for large ingests only if your embedding provider's rate limits allow.
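
Batching itself is simple to picture: records are grouped into fixed-size slices, and each slice costs one embedding call plus one write. The helper below is a generic sketch, not the loader's implementation.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


sizes = [len(batch) for batch in batched(range(250), 100)]
print(sizes)
```

With 250 records and a batch size of 100 that is three round-trips; doubling the batch size would cut it to two but send larger requests to the embedding provider.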

### Parallel Loading

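`data_loading/parallel_db_load.sh` runs multiple loaders concurrently across sources. Use it for a cold start across dozens of feeds. Watch the embedding provider's rate limits: Azure OpenAI, for example, enforces per-deployment tokens-per-minute and requests-per-minute quotas and returns HTTP 429 once they are exceeded.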
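
If you prefer to control concurrency from Python rather than the shell script, a bounded thread pool is one option. This sketch is not a port of `parallel_db_load.sh`; `load_one` is a hypothetical callable you supply (for example, a wrapper that shells out to `db_load`), and `max_workers` is the knob to lower when the embedding provider starts throttling.

```python
from concurrent.futures import ThreadPoolExecutor


def load_all(sources, load_one, max_workers=4):
    """Run load_one(source, site) for each (source, site) pair with bounded concurrency.

    Returns a dict mapping each pair to its result, or to the exception it raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_one, src, site): (src, site) for src, site in sources}
        for future, key in futures.items():
            try:
                results[key] = future.result()
            except Exception as exc:  # keep going so one bad feed does not stop the rest
                results[key] = exc
    return results
```

Collecting exceptions instead of raising lets a cold start across many feeds finish and report which sources need a retry.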

## Implementation Guidance

### Loading an RSS Feed

```bash
python -m data_loading.db_load https://example.com/feed.xml my-blog
```

Each item in the feed becomes a Schema.org `Article` with `headline`, `description`, `url`, `datePublished` populated from the RSS fields. Embeddings come from concatenating headline + description (verify exact field selection in `rss2schema.py`).
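
The field mapping described above can be sketched with the standard library. This is an illustration for plain RSS 2.0 only, not the code in `rss2schema.py`; the function names are hypothetical, Atom feeds are not handled, and the headline-plus-description embedding text is the assumption stated in the paragraph above.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def to_iso8601(rfc822_date):
    """Convert an RSS pubDate to ISO 8601; return the input unchanged if unparseable."""
    try:
        return parsedate_to_datetime(rfc822_date).isoformat()
    except (TypeError, ValueError):
        return rfc822_date


def rss_items_to_articles(rss_xml):
    """Map each RSS 2.0 <item> to a Schema.org Article dict."""
    root = ET.fromstring(rss_xml)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "@context": "https://schema.org",
            "@type": "Article",
            "headline": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            "datePublished": to_iso8601((item.findtext("pubDate") or "").strip()),
        })
    return articles


def embedding_text(article):
    """Text that would be embedded under the headline + description assumption."""
    return f"{article['headline']}\n{article['description']}".strip()
```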

### Loading Schema.org JSON-LD

For sites that already serve JSON-LD (Recipe, Product, Event, Article, Movie, etc.), point `db_load` at a sitemap or URL list:

```bash
python -m data_loading.db_load --url-list urls.txt my-recipes
```

Each URL is fetched; the embedded JSON-LD is extracted and indexed verbatim. This is the **highest-fidelity ingest path**: the agent gets the full `schema_object` back at query time.
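
Extracting embedded JSON-LD from a fetched page can be sketched with the standard-library HTML parser. This is an illustration under the assumption that the data sits in `<script type="application/ld+json">` tags; the class and function names are hypothetical and the loader's own extraction may differ.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.objects = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed blocks rather than abort the page
            self.objects.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_jsonld(html):
    """Return a list of JSON-LD objects found in an HTML document."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.objects
```

Because the objects are kept as parsed, every property the page published survives into the index.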

### Loading CSV

```bash
python -m data_loading.db_load products.csv my-store
```

CSV columns must follow the Schema.org property naming convention (or the column-mapping rules in `db_load_utils.py` — verify). For products, columns like `name`, `description`, `url`, `image`, `offers.price`, `offers.priceCurrency` are common.
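
Dotted column names such as `offers.price` suggest a nested object. The sketch below shows one plausible expansion; it is an assumption about the convention, not the mapping in `db_load_utils.py`, and the default `Product` type and both function names are hypothetical.

```python
import csv
import io


def row_to_schema(row, schema_type):
    """Expand dotted column names into nested dicts, skipping empty cells."""
    obj = {"@type": schema_type}
    for column, value in row.items():
        if column is None or value is None or value == "":
            continue
        *parents, leaf = column.split(".")
        target = obj
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = value
    return obj


def csv_to_schema(csv_text, schema_type="Product"):
    """Parse CSV text into a list of Schema.org-style dicts."""
    return [row_to_schema(row, schema_type) for row in csv.DictReader(io.StringIO(csv_text))]


sample = "name,url,offers.price,offers.priceCurrency\nMug,https://example.com/mug,12.50,USD\n"
print(csv_to_schema(sample))
```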

### Overriding the Write Endpoint

```bash
python -m data_loading.db_load --database azure_ai_search source.xml my-site
```

Useful for loading the same source into several backends, or for moving from a dev `qdrant_local` index to prod Azure AI Search by re-running the load against the production endpoint.
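
Loading the same source into several backends amounts to one `db_load` run per endpoint. The helper below only builds the command lines, using the `--database` flag shown above; the function name is hypothetical and the endpoint names must match entries in your `config_retrieval.yaml`.

```python
import sys


def load_commands(source, site, endpoints):
    """Build one db_load command line per write endpoint."""
    return [
        [sys.executable, "-m", "data_loading.db_load", "--database", endpoint, source, site]
        for endpoint in endpoints
    ]


for command in load_commands("source.xml", "my-site", ["qdrant_local", "azure_ai_search"]):
    print(" ".join(command[1:]))
```

Each list can be passed to `subprocess.run(command, check=True)` from the directory where `data_loading` is importable.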

### Incremental Refresh Pattern

```bash
# Daily — incremental upsert (existing records updated, new added, stale left)
python -m data_loading.db_load https://example.com/feed.xml my-blog

# Weekly — full refresh
python -m data_loading.db_load --only-delete delete-site my-blog
python -m data_loading.db_load https://example.com/feed.xml my-blog
```

### Verifying a Load

After ingest:
- `curl http://localhost:8000/sites` — your site should appear
- `curl 'http://localhost:8000/ask?query=test&site=my-blog&streaming=false&mode=list'` — should return non-empty results
- Inspect a result's `schema_object` field — confirm it has the Schema.org properties you expect
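
The checks above can be scripted. The sketch below builds the same `/ask` URL as the curl example and reports which expected properties a `schema_object` lacks. It assumes the `schema_object` has already been parsed into a dict; both function names are hypothetical, and the response envelope is deliberately not modelled here.

```python
from urllib.parse import urlencode


def ask_url(base_url, query, site, mode="list", streaming=False):
    """Build the /ask URL used to smoke-test a freshly loaded site."""
    params = {
        "query": query,
        "site": site,
        "streaming": str(streaming).lower(),
        "mode": mode,
    }
    return f"{base_url.rstrip('/')}/ask?{urlencode(params)}"


def missing_properties(schema_object, expected):
    """Return the expected Schema.org properties absent from a schema_object dict."""
    return [prop for prop in expected if prop not in schema_object]


print(ask_url("http://localhost:8000", "test", "my-blog"))
print(missing_properties({"headline": "Hello", "url": "https://example.com/hello"},
                         ["headline", "url", "datePublished"]))
```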

### Backend-Specific Ingest

Some retrieval backends bypass `db_load.py` entirely:

- **Cloudflare AutoRAG** — ingest is managed by Cloudflare; you upload to R2 and AutoRAG indexes for you. See `docs/setup-cloudflare-autorag.md`.
- **Snowflake Cortex Search** — data lives in Snowflake tables; Cortex Search indexes are created via SQL. NLWeb just queries.
- **Shopify MCP** — no ingest; NLWeb proxies to Shopify's MCP endpoint live.
- **Bing Web Search** — no ingest; live web search.

### Common Failures

- **`db_load` hangs on embedding** — your embedding provider is rate-limiting. Reduce `--batch-size` or switch provider.
- **Records load but never appear in `/ask`** — check `sites:` allowlist in `config_nlweb.yaml`; check that `write_endpoint` and the enabled read endpoints actually overlap.
- **Loaded RSS but `schema_object` is sparse** — RSS doesn't carry rich Schema.org metadata. Either accept it or move to JSON-LD ingest.
- **Embedding dim mismatch** — re-ingest with the correct provider, or change `config_embedding.yaml` to match what was ingested.

Always cross-check flags against the live `db_load.py` — argument names drift release to release.

Related Skills

woo-data-stores

from OrcaQubits/agentic-commerce-skills-plugins

Work with WooCommerce CRUD data stores — WC_Product, WC_Order, WC_Customer, WC_Coupon data objects, custom data stores, HPOS migration, and getters/setters. Use when creating or modifying WooCommerce data objects or implementing custom data stores.

spree-data-model

from OrcaQubits/agentic-commerce-skills-plugins

Navigate Spree's canonical data model — the Catalog (Product/Variant/OptionType/Taxon/Property/Metafield), Pricing (Price/PriceList), Order graph (Order/LineItem/Adjustment/Shipment/Payment/PaymentSession/Refund/Reimbursement), Inventory (StockLocation/StockItem/StockMovement), Shipping (ShippingMethod/Zone), Promotions, Identity (User/Role/Address/StoreCredit/GiftCard), Taxes, and the v5.4+ Markets + Store multi-region model. Use when designing a feature that touches Spree models, writing decorators, or building admin/storefront UIs.

nlweb-tools-framework

from OrcaQubits/agentic-commerce-skills-plugins

Design and implement NLWeb tools — the per-Schema.org-type handlers that turn a query into a specialized response (search, item_details, compare_items, ensemble, recipe_substitution, accompaniment, conversation_search, etc.). Covers `tools.xml`, the ToolSelector router, builtin handlers in `methods/`, writing a custom tool with a `<returnStruc>` contract, and disabling tool selection for raw retrieval. Use when extending NLWeb beyond the default query → results flow.

nlweb-setup

from OrcaQubits/agentic-commerce-skills-plugins

Bootstrap a local NLWeb development environment from scratch — clone the repo, configure .env, install Python deps via `nlweb init-python`, run `nlweb init` for interactive LLM/retrieval selection, load sample Schema.org data, and verify with `nlweb check`. Use when starting a new NLWeb deployment from zero.

nlweb-schema-org-grounding

from OrcaQubits/agentic-commerce-skills-plugins

Prepare and structure site content as Schema.org JSON-LD for NLWeb ingestion — covers the supported types (Recipe, Product, Movie, Event, Article, RealEstate, Course, etc.), per-type behavior in NLWeb's tool routing, JSON-LD embedding patterns in HTML, sites.xml registration, and how the `schema_object` flows through ranking back to agent results. Use when authoring or auditing the structured data on a site that will be exposed via NLWeb.

nlweb-retrieval-backends

from OrcaQubits/agentic-commerce-skills-plugins

Choose and configure NLWeb retrieval backends — Qdrant (local + remote), Azure AI Search, Elasticsearch, OpenSearch (with/without k-NN), Postgres pgvector, Milvus, Snowflake Cortex Search, Cloudflare AutoRAG, Shopify MCP, and Bing Web Search. Covers `config_retrieval.yaml`, the single `write_endpoint` rule, parallel read-fanout with URL dedup, and per-backend setup pages. Use when picking a retrieval store, migrating between backends, or debugging "results are empty."

nlweb-prompts-customization

from OrcaQubits/agentic-commerce-skills-plugins

Customize NLWeb's LLM prompts and per-Schema.org-type behavior via `prompts.xml` and `site_types.xml` — covers the `<promptString>` template format, `<returnStruc>` JSON schemas, prompt inheritance, decontextualization/ranking/generate templates, per-site overrides, and pitfalls of editing prompts in place. Use when tuning answer quality, supporting a new domain, or localizing prompts.

nlweb-mcp-server

from OrcaQubits/agentic-commerce-skills-plugins

Expose NLWeb as an MCP (Model Context Protocol) server — JSON-RPC 2.0 endpoint at /mcp, the `ask` / `list_sites` / `who` tools, MCP protocol version 2024-11-05, and integration with ChatGPT, Claude, Gemini, and other agent clients. Use when wiring NLWeb to an AI agent via MCP or building an MCP client that consumes an NLWeb site.

nlweb-llm-providers

from OrcaQubits/agentic-commerce-skills-plugins

Configure NLWeb LLM and embedding providers — OpenAI, Azure OpenAI (default), Anthropic, Google Gemini, DeepSeek on Azure, Llama on Azure, HuggingFace, Inception Labs, Snowflake Cortex, Ollama, Pi Labs. Covers `config_llm.yaml` high/low tier model selection, the ModelRouter cost/quality routing logic, `config_embedding.yaml`, and adding a custom provider. Use when picking models, tuning cost, or wiring a new LLM backend.

nlweb-chatgpt-appsdk

from OrcaQubits/agentic-commerce-skills-plugins

Integrate NLWeb with ChatGPT's Apps SDK — the Node.js MCP server in `openai-apps-sdk-integration/`, the `nlweb-list` tool, the React widget at `ui://widget/nlweb-list.html`, and the port-8100 AppSDK adapter that translates NLWeb's message list to OpenAI Apps SDK envelopes. Use when publishing an NLWeb site as a ChatGPT app or wiring NLWeb results into an Apps SDK widget.

nlweb-auth-multitenancy

from OrcaQubits/agentic-commerce-skills-plugins

Configure NLWeb authentication and multi-tenant deployments — OAuth providers (GitHub, Google, Microsoft, Facebook), session storage, the `sites:` allowlist in `config_nlweb.yaml`, conversation persistence per authenticated user, and per-tenant data isolation. Use when adding login to an NLWeb instance, hosting multiple customers on one deployment, or persisting conversation history.

nlweb-ask-endpoint

from OrcaQubits/agentic-commerce-skills-plugins

Implement and consume the NLWeb /ask REST endpoint — request shape (GET/POST, query-string and v0.55 structured body), SSE streaming response, modes (list/summarize/generate), in-stream "message_type" headers, error envelopes, and client-side parsing. Use when building an NLWeb server route, calling /ask from a custom agent, or debugging /ask responses.