civil-judgment-taiwan-vectorstore

Ingest Taiwan civil court judgments (HTML or PDF) — exclusively covering Taiwan civil cases — into Qdrant with Ollama embeddings, preserving traceability, deduplication, and incremental updates.

3,891 stars

by openclaw


Best use case

civil-judgment-taiwan-vectorstore is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using civil-judgment-taiwan-vectorstore should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/civil-judgment-taiwan-vectorstore/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/alex02131926/civil-judgment-taiwan-vectorstore/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/civil-judgment-taiwan-vectorstore/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How civil-judgment-taiwan-vectorstore Compares

Feature / Agent	civil-judgment-taiwan-vectorstore	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Ingest Taiwan civil court judgments (HTML or PDF) — exclusively covering Taiwan civil cases — into Qdrant with Ollama embeddings, preserving traceability, deduplication, and incremental updates.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Taiwan Civil Judgment → Vector DB (Qdrant) Ingestion

**Scope: Taiwan civil court judgments only** (民事判決). This skill ingests Taiwan civil cases (HTML **or PDF** files) into Qdrant. All parsing, chunking, and embedding logic lives in `scripts/ingest.py` — your job is to **run the script**, not to reimplement the pipeline.

---

## Quick Start (follow these steps in order)

### Step 1 — Activate venv

```bash
source {baseDir}/.venv/bin/activate
```

### Step 2 — Identify the run folder

The user will provide an **absolute path** to a run folder.

Example: `/path/to/output/judicialyuan/20260305_142030`

Verify it exists and has HTML or PDF files:
```bash
ls <RUN_FOLDER>/archive/ | grep -E '\.(html|pdf)$' | head -5
```

If the listing shows no `archive/*.html` or `archive/*.pdf` files, **stop and tell the user** that the folder has no ingestible data.

### Step 3 — Run ingestion

Use absolute paths throughout — no `cd` needed:

```bash
python3 {baseDir}/scripts/ingest.py \
  --run-folder <RUN_FOLDER>
```

The script handles everything: pre-flight checks, collection auto-creation (creates `civil_case_doc` / `civil_case_chunk` if they don't exist), canonicalization, chunking, embedding, Qdrant upsert, manifest + report writing.

**Re-running the same command on the same folder is always safe** — deterministic IDs mean upsert = overwrite. No special `--resume` flag needed; just run the same command again.

### Step 4 — Check the result

**Successful output looks like:**
```
OK files=42 processed=42 skipped=0 errored=0 doc_points=42 chunk_points=187
manifest=<RUN_FOLDER>/ingest_manifest.jsonl
report=<RUN_FOLDER>/ingest_report.md
```
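
The first line of that output is a flat list of `key=value` pairs, so the numbers needed for the Step 5 report can be pulled out mechanically. A minimal sketch, assuming the line has exactly the shape of the sample above; `parse_summary` is a hypothetical helper and not part of `ingest.py`:

```python
def parse_summary(line: str) -> dict:
    """Parse a line such as 'OK files=42 processed=42 ...' into a dict."""
    status, *pairs = line.split()
    counts = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        counts[key] = int(value)
    return {"status": status, **counts}


summary = parse_summary(
    "OK files=42 processed=42 skipped=0 errored=0 doc_points=42 chunk_points=187"
)
print(summary["doc_points"], summary["chunk_points"])  # 42 187
```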

**Read the report** (human-readable stats summary):
```bash
cat <RUN_FOLDER>/ingest_report.md
```

If there are errors, check the **manifest** (machine-readable, one JSON line per file) for per-file diagnosis:
```bash
grep -E '"status": ?"(skipped|error|partial)"' "<RUN_FOLDER>/ingest_manifest.jsonl"
```

### Step 5 — Report to user

Tell the user:
- How many docs were ingested (`doc_points`)
- How many chunks were created (`chunk_points`)
- Whether any were skipped or errored
- Where the report file is

**Done.** Do not proceed to additional steps unless the user asks.

---

## DO NOT rules (critical)

- **DO NOT** write your own HTML parsing, chunking, or embedding code. `ingest.py` handles all of this.
- **DO NOT** modify parsing/chunking logic casually. Only change heading detection or chunk fallback when the user explicitly asks to improve PDF/OCR robustness, and validate on a small sample before re-running a large batch.
- **DO NOT** call Qdrant or Ollama APIs directly. The script does this.
- **DO NOT** use `verify=False` or skip SSL verification for any HTTP request.
- **DO NOT** modify or delete files under `archive/`. The raw HTML/PDF files are the immutable source of truth.
- **DO NOT** change chunking defaults (`--max-chars`, `--overlap-chars`) unless the user explicitly asks.

---

## Hard constraints

- **Raw HTML/PDF is source of truth**; never overwrite it.
- **Deterministic**: same input → same canonical text → same SHA-256 → same Qdrant point IDs. Safe to re-run.
- **Traceability**: every Qdrant point carries `doc_url` + `local_path`.
- **Batched upserts** (≤ 64 points/batch) to avoid Qdrant 32MB payload limit.
- **`parser_version`** in every point's metadata. Current: `v3.5-sentence-boundary`.
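
To make the determinism and batching constraints concrete, here is a rough sketch of how a stable point ID can be derived from canonical text and how points can be split into batches of at most 64. This is an illustration only: the namespace UUID, the `point_id` and `batches` helpers, and the exact ID recipe are assumptions, and `ingest.py` may derive its IDs differently.

```python
import hashlib
import uuid
from typing import Iterator, List, Sequence

# Hypothetical namespace; the real script may use another scheme entirely.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")
MAX_BATCH = 64  # upper bound stated in the hard constraints


def point_id(canonical_text: str, chunk_index: int = -1) -> str:
    """Same text and index always produce the same UUID string."""
    digest = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()
    return str(uuid.uuid5(NAMESPACE, f"{digest}:{chunk_index}"))


def batches(points: Sequence[dict], size: int = MAX_BATCH) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` points."""
    for start in range(0, len(points), size):
        yield list(points[start:start + size])


text = "主文\n原告之訴駁回。"
points = [
    {
        "id": point_id(text, i),
        "payload": {
            "doc_url": "https://example.invalid/doc",  # placeholder URL
            "local_path": "archive/fjud_detail_001.html",
            "parser_version": "v3.5-sentence-boundary",
        },
    }
    for i in range(150)
]
print([len(b) for b in batches(points)])  # [64, 64, 22]
```

Because the ID depends only on the content hash and the chunk index, a second run over unchanged input produces the same IDs, which is what turns an upsert into an overwrite.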

---

## Troubleshooting

### `PREFLIGHT_FAILED: Qdrant not reachable`

Qdrant is down or unreachable at the default/configured URL.

```bash
# Check if Qdrant is running
curl -s http://localhost:6333/collections | head -1

# If not running, start it (or ask the user)
```

### `PREFLIGHT_FAILED: Ollama not reachable`

```bash
# Check Ollama
curl -s http://localhost:11434/api/tags | head -5
```

### `PREFLIGHT_FAILED: Ollama model missing: bge-m3:latest`

```bash
ollama pull bge-m3:latest
```

Then re-run Step 3.
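
For reference, the model check can be reproduced by looking for the model name in the JSON returned by Ollama's `/api/tags` endpoint. The sketch below is an assumption about how such a check might look, not the script's actual pre-flight code; `model_available` and `fetch_tags` are hypothetical names, and the sample response is hand-written.

```python
import json
import urllib.request


def model_available(tags: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    return any(m.get("name") == model for m in tags.get("models", []))


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> dict:
    """Fetch and parse the model list from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# Offline example using a hand-written response of the same shape.
sample = {"models": [{"name": "bge-m3:latest"}, {"name": "llama3:8b"}]}
print(model_available(sample, "bge-m3:latest"))            # True
print(model_available(sample, "nomic-embed-text:latest"))  # False
```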

### `PREFLIGHT_FAILED: No archive/*.html or archive/*.pdf found`

The run folder exists but has no archived detail pages. Check:
- Is this the correct run folder?

### Output shows `skipped > 0` or `errored > 0`

Check `ingest_manifest.jsonl` for per-file details:
```bash
grep -E '"status": ?"(skipped|error|partial)"' "<RUN_FOLDER>/ingest_manifest.jsonl"
```

| Manifest status | Meaning | Action |
|-----------------|---------|--------|
| `ok` | Doc + all chunks ingested | None |
| `partial` | Doc upserted, but some section chunks failed embedding | Check Ollama stability; can re-run safely |
| `skipped` | Doc-level embedding failed — nothing upserted for this doc | Check Ollama; re-run safely |
| `error` | HTML read/parse failed | Check if the HTML file is corrupted |

Re-running is always safe — use the exact same command. No special flags needed; deterministic IDs → upsert/overwrite.
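
If grep output is hard to read, the manifest can also be summarized by parsing each line as JSON, which does not depend on how the file is spaced. A small sketch, assuming every line is a JSON object with a `status` field; the `local_path` field in the sample records and the `summarize_manifest` helper are assumptions for illustration:

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def summarize_manifest(lines: Iterable[str]) -> Tuple[Counter, List[dict]]:
    """Count records per status and collect every record that is not 'ok'."""
    counts: Counter = Counter()
    problems: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        status = record.get("status", "unknown")
        counts[status] += 1
        if status != "ok":
            problems.append(record)
    return counts, problems


sample = [
    '{"status": "ok", "local_path": "archive/fjud_detail_001.html"}',
    '{"status": "partial", "local_path": "archive/fjud_detail_002.html"}',
    '{"status":"error","local_path":"archive/fjud_detail_003.pdf"}',
]
counts, problems = summarize_manifest(sample)
print(dict(counts))   # {'ok': 1, 'partial': 1, 'error': 1}
print(len(problems))  # 2
```

For a real run, pass an open file handle on `ingest_manifest.jsonl` instead of the sample list.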

### Override service endpoints

```bash
# Via environment variables
OLLAMA_URL=http://localhost:11434 QDRANT_URL=http://localhost:6333 \
  python3 {baseDir}/scripts/ingest.py --run-folder "..."

# Via CLI flags (take precedence over env vars)
python3 {baseDir}/scripts/ingest.py --run-folder "..." \
  --ollama http://localhost:11434 --qdrant http://localhost:6333
```

Default endpoints:

| Service | Default | Env override |
|---------|---------|--------------|
| Ollama | `http://localhost:11434` | `$OLLAMA_URL` |
| Qdrant | `http://localhost:6333` | `$QDRANT_URL` |
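
The precedence described here (CLI flag, then environment variable, then built-in default) is commonly implemented by using the environment value as the argparse default. The following is a sketch of that pattern under that assumption, not a copy of the script's argument handling; `build_parser` and `resolve` are hypothetical names.

```python
import argparse
import os
from typing import List, Mapping, Optional


def build_parser(env: Mapping[str, str]) -> argparse.ArgumentParser:
    """Flags default to the env value, which defaults to the built-in URL."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--run-folder", required=True)
    parser.add_argument(
        "--ollama", default=env.get("OLLAMA_URL", "http://localhost:11434")
    )
    parser.add_argument(
        "--qdrant", default=env.get("QDRANT_URL", "http://localhost:6333")
    )
    return parser


def resolve(argv: List[str], env: Optional[Mapping[str, str]] = None):
    """Parse argv against the given environment (os.environ by default)."""
    env = os.environ if env is None else env
    return build_parser(env).parse_args(argv)


args = resolve(
    ["--run-folder", "/tmp/run", "--qdrant", "http://qdrant:6333"],
    env={"QDRANT_URL": "http://ignored:6333", "OLLAMA_URL": "http://ollama:11434"},
)
print(args.ollama)  # http://ollama:11434 (from the environment)
print(args.qdrant)  # http://qdrant:6333 (flag wins over the environment)
```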

### Test with a small batch first

```bash
python3 {baseDir}/scripts/ingest.py --run-folder "..." --limit 5
```

---

## Input folder structure (expected)

```
<run_folder>/
  archive/
    fjud_detail_001.html               ← HTML input
    fjud_detail_002.html
    fjud_detail_003.pdf                ← PDF input (also supported)
    fint_detail_001.html               (if system=both)
  results_fjud.jsonl                   (optional)
  results_fint.jsonl                   (optional)
```

The script discovers all `archive/*.html` and `archive/*.pdf` files automatically (sorted by filename). HTML and PDF files can coexist in the same run folder.
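
The discovery rule can be pictured as follows. This is a sketch of the described behavior (match `.html` and `.pdf` under `archive/`, sort by filename, optionally keep the first N), written under the assumption that matching is case-sensitive like the glob patterns; `discover` is a hypothetical function name.

```python
import tempfile
from pathlib import Path
from typing import List

SUFFIXES = {".html", ".pdf"}


def discover(run_folder: Path, limit: int = 0) -> List[Path]:
    """Return archive/*.html and archive/*.pdf sorted by filename."""
    archive = run_folder / "archive"
    files = sorted(
        (p for p in archive.iterdir() if p.is_file() and p.suffix in SUFFIXES),
        key=lambda p: p.name,
    )
    # limit=0 means "no limit", matching the --limit default.
    return files[:limit] if limit > 0 else files


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "archive"
    archive.mkdir()
    for name in ["fjud_detail_002.html", "fjud_detail_003.pdf",
                 "fjud_detail_001.html", "notes.txt"]:
        (archive / name).write_text("", encoding="utf-8")
    found = [p.name for p in discover(Path(tmp))]
    first_two = [p.name for p in discover(Path(tmp), limit=2)]

print(found)      # ['fjud_detail_001.html', 'fjud_detail_002.html', 'fjud_detail_003.pdf']
print(first_two)  # ['fjud_detail_001.html', 'fjud_detail_002.html']
```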

**v1 limitation**: The `system` metadata field is currently hardcoded to `FJUD`. If a run folder contains both FJUD and FINT files, FINT files will be ingested but mislabeled as `FJUD`. This does not affect chunking or embeddings — only the `system` metadata field on the resulting Qdrant points.
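
Until that is addressed, downstream consumers can tell the two systems apart from the `local_path` payload rather than the `system` field. A sketch, assuming the `fjud_` / `fint_` filename prefixes shown in the folder layout are reliable; `infer_system` is a hypothetical helper and not something `ingest.py` provides:

```python
def infer_system(filename: str, default: str = "FJUD") -> str:
    """Guess the source system from a filename prefix such as 'fint_'."""
    prefix = filename.split("_", 1)[0].upper()
    return prefix if prefix in {"FJUD", "FINT"} else default


print(infer_system("fjud_detail_003.pdf"))   # FJUD
print(infer_system("fint_detail_001.html"))  # FINT
print(infer_system("unknown.html"))          # FJUD (falls back to the default)
```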

---

## CLI reference

```
python3 scripts/ingest.py --run-folder <PATH> [options]
```

| Flag | Default | Description |
|------|---------|-------------|
| `--run-folder` | (required) | Path to an input folder |
| `--ollama` | `$OLLAMA_URL` or `http://localhost:11434` | Ollama endpoint |
| `--qdrant` | `$QDRANT_URL` or `http://localhost:6333` | Qdrant endpoint |
| `--embed-model` | `bge-m3:latest` | Ollama embedding model |
| `--vector-size` | `1024` | Vector dimension |
| `--max-chars` | `900` | Max chars per chunk (500–1000) |
| `--overlap-chars` | `150` | Overlap between chunks (10–20% of max-chars) |
| `--limit` | `0` (no limit) | Process only first N files sorted by filename (lexicographic order); for testing |
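
To show how `--max-chars` and `--overlap-chars` interact, here is a deliberately naive fixed-window chunker. It is not the script's algorithm: the current parser is described as sentence-boundary aware, whereas this sketch cuts at fixed character offsets. `chunk_text` is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 900, overlap_chars: int = 150) -> List[str]:
    """Split text into windows of max_chars that share overlap_chars."""
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be in [0, max_chars)")
    step = max_chars - overlap_chars  # how far each window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(2000))
chunks = chunk_text(sample)
print([len(c) for c in chunks])              # [900, 900, 500]
print(chunks[0][-150:] == chunks[1][:150])   # True
```

With the defaults each window advances 750 characters, so the 150-character overlap is about 17% of the chunk size, inside the 10–20% range given in the table.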

---

## Outputs

- **Qdrant collections**: `civil_case_doc` (1 point/doc), `civil_case_chunk` (many points/doc). Auto-created if they don't exist.
- **`ingest_report.md`**: human-readable summary (doc/chunk counts, error counts). **Read this first** after ingestion.
- **`ingest_manifest.jsonl`**: machine-readable, one JSON line per doc with status (`ok` / `partial` / `skipped` / `error`). **Read this to diagnose specific file failures** (grep for non-`ok` statuses). Both files overlap on aggregate counts; the manifest adds per-file detail.

---

## Roadmap
- **v1** (current): doc + section-aware chunks
- **v2**: candidate issue extraction (爭點抽取)
- **v3**: issue-level index (`civil_case_issue` collection)

---

## Internal details

For metadata schema, canonicalization rules, section-splitting patterns, and chunking implementation, see [`references/internals.md`](references/internals.md).

---

## Lessons learned / operational gotchas
- Qdrant rejects non-UUID/non-integer point IDs (`400 Bad Request`). The script uses deterministic UUIDs — do not change the ID generation logic.
- Qdrant rejects payloads > 32MB. The script batches at 64 points — do not increase batch size.
- Re-running on the same folder is safe: deterministic IDs mean upsert = overwrite.
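- Section headings in Taiwan judgments are not formatted consistently (e.g. 「理　由」 with a fullwidth space, or compatibility characters such as 「⽂」). The parser now normalizes headings first; if it still cannot split the text into sections, it falls back to chunking `full`, so a document never ends up with only doc-level points.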
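
The heading problem in the last item can be illustrated with Unicode compatibility normalization. This sketch assumes NFKC folding plus whitespace removal is enough for the two cases mentioned; the parser's real normalization rules are in `references/internals.md` and may differ. `normalize_heading` is a hypothetical name.

```python
import re
import unicodedata


def normalize_heading(line: str) -> str:
    """Fold compatibility characters, then drop all whitespace."""
    folded = unicodedata.normalize("NFKC", line)
    return re.sub(r"\s+", "", folded)


# U+3000 is the fullwidth (ideographic) space.
print(normalize_heading("理\u3000由"))  # 理由
# U+2F42 is the Kangxi-radical form of 文 (U+6587).
print(normalize_heading("主\u2f42"))    # 主文
```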
