privacy-parser-pii-extraction
Extract structured PII spans from text using the OpenAI Privacy Filter 1.5B model reversed — returns what, where, and which type instead of masking.
Best use case
privacy-parser-pii-extraction is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using privacy-parser-pii-extraction should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/privacy-parser-pii-extraction/SKILL.md inside your project
- Restart your AI agent — it will auto-discover the skill
How privacy-parser-pii-extraction Compares
| Feature / Agent | privacy-parser-pii-extraction | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Extract structured PII spans from text using the OpenAI Privacy Filter 1.5B model reversed — returns what, where, and which type instead of masking.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Privacy Parser — PII Span Extraction
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.
**privacy-parser** is the inverse of OpenAI's Privacy Filter. Where the filter masks PII with `<REDACTED>`, this library returns structured spans — label, text, and character offsets — using the same 1.5B `opf` model weights and label taxonomy.
## Installation
```bash
# Clone the repo (includes both subpackages)
git clone https://github.com/chiefautism/privacy-parser
cd privacy-parser
uv venv
uv pip install -e ./privacy-filter # installs the opf model + weights loader
uv pip install -e ./pii_parser # installs the parser library
```
First run downloads the `opf` 1.5B checkpoint (~3 GB) to `~/.opf/privacy_filter/`.
## Quick Start
```python
from pii_parser.hybrid import HybridPIIParser

parser = HybridPIIParser(device="cpu")  # or "cuda" / "mps"
result = parser.parse(
    "Hi Quindle Testwick (quindle.testwick@openai.com / +1-415-555-0102), "
    "account 40702810500001234567, 14 Beautiful Ct, Anytown USA, "
    "password Priv4cy-Filt3r-2026."
)
for span in result.spans:
    print(f"{span.label:18} {span.text}")
```
Output:
```
private_person     Quindle Testwick
private_email      quindle.testwick@openai.com
private_phone      +1-415-555-0102
account_number     40702810500001234567
private_address    14 Beautiful Ct, Anytown USA
secret             Priv4cy-Filt3r-2026
```
## Three Backends
Choose the backend based on your speed/accuracy tradeoff:
| Backend | Weights | Speed | F1 | When to use |
|-------------------|---------|------------|-------|------------------------------------|
| `PIIParser` | none | µs | 1.000 | Tests, known-format structured data |
| `ModelPIIParser` | 1.5B | ~500ms CPU | 0.733 | Model-only, no post-processing |
| `HybridPIIParser` | 1.5B | ~600ms CPU | 0.929 | **Production — ship this one** |
```python
# Regex-only (no model, instant, high precision on structured formats)
from pii_parser import PIIParser
parser = PIIParser()
# Model-only (raw BIOES logits → Viterbi → spans)
from pii_parser.model import ModelPIIParser
parser = ModelPIIParser(device="cpu")
# Hybrid: model + span-merge + regex backstop (recommended)
from pii_parser.hybrid import HybridPIIParser
parser = HybridPIIParser(device="cpu")
```
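To give a feel for what a regex-only pass involves, here is a minimal sketch. The `Span` dataclass, the `PATTERNS` table, and `regex_spans` are illustrative assumptions; they are not the classes or patterns shipped in `pii_parser`, and real-world formats need far stricter patterns.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in mirroring the span fields used in this document."""
    label: str
    text: str
    start: int
    end: int


# Illustrative patterns only (assumed, not taken from the library).
PATTERNS = {
    "private_email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "account_number": re.compile(r"\b\d{16,20}\b"),
}


def regex_spans(text: str) -> list[Span]:
    """Return every pattern match as a Span, ordered by start offset."""
    spans = [
        Span(label, m.group(), m.start(), m.end())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(spans, key=lambda s: s.start)


sample = "Mail bob@example.com about account 40702810500001234567."
found = regex_spans(sample)
for s in found:
    print(f"[{s.start}:{s.end}] {s.label} {s.text!r}")
```

Because each pattern targets one rigid format, a pass like this is fast and precise on structured data but cannot find free-form entities such as names or addresses, which is where the model-backed parsers come in.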
## Span Object
Each `span` in `result.spans` has:
```python
span.label # str — one of the 8 label types
span.text # str — the extracted substring
span.start # int — char offset in original string
span.end # int — char offset (exclusive)
```
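Since `end` is exclusive, slicing the original string with `start:end` should reproduce `span.text`. The sketch below checks that invariant using a hypothetical `Span` stand-in with the four fields listed above and a hand-built span rather than real parser output; `offsets_match` is an illustrative helper, not part of the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in with the four fields described above."""
    label: str
    text: str
    start: int
    end: int


def offsets_match(source: str, spans) -> bool:
    """True when every span's text equals source[start:end]."""
    return all(source[s.start:s.end] == s.text for s in spans)


original = "  Email Bob at bob@example.com"
email = "bob@example.com"
start = original.index(email)
spans = [Span("private_email", email, start, start + len(email))]

ok_on_original = offsets_match(original, spans)
ok_on_stripped = offsets_match(original.strip(), spans)
print(ok_on_original)  # True: offsets refer to the string that was parsed
print(ok_on_stripped)  # False: stripping two leading spaces shifted the text
```

A check like this is cheap to run before redacting or exporting, and it catches the common mistake of applying offsets to a modified copy of the text.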
## Label Taxonomy (opf v2)
```
private_person — full names of individuals
private_email — email addresses
private_phone — phone numbers (any format)
private_address — street/postal addresses
private_url — personal/private URLs
private_date — dates tied to individuals
account_number — bank/card/account identifiers
secret — passwords, tokens, API keys
```
## Common Patterns
### Batch processing
```python
from pii_parser.hybrid import HybridPIIParser

parser = HybridPIIParser(device="cpu")
texts = [
    "Email Bob at bob@example.com",
    "SSN: 123-45-6789, DOB: 1990-03-15",
    "Token: ghp_abc123XYZ",
]
for text in texts:
    result = parser.parse(text)
    if result.spans:
        print(f"Text: {text!r}")
        for s in result.spans:
            print(f"  [{s.start}:{s.end}] {s.label} → {s.text!r}")
        print()
```
### Filter by label type
```python
result = parser.parse(long_document)
emails = [s for s in result.spans if s.label == "private_email"]
phones = [s for s in result.spans if s.label == "private_phone"]
secrets = [s for s in result.spans if s.label == "secret"]
accounts = [s for s in result.spans if s.label == "account_number"]
```
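Each comprehension above walks `result.spans` once. When several labels are needed, a single pass that buckets spans by label is an alternative. The sketch below assumes a hypothetical `Span` stand-in and hand-written spans (the offsets are made up for illustration); `group_by_label` is not a library function.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in mirroring the span fields used in this document."""
    label: str
    text: str
    start: int
    end: int


def group_by_label(spans) -> dict:
    """Bucket spans by label in one pass, preserving their original order."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span.label].append(span)
    return dict(grouped)


# Hand-written example spans, not real parser output.
spans = [
    Span("private_email", "bob@example.com", 13, 28),
    Span("private_phone", "555-0100", 33, 41),
    Span("private_email", "jane@corp.io", 46, 58),
]
by_label = group_by_label(spans)
emails = by_label.get("private_email", [])
secrets = by_label.get("secret", [])  # empty list when the label never occurs
print({label: len(items) for label, items in by_label.items()})
```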
### Redact after inspection
```python
def redact(text: str, spans) -> str:
    """Replace extracted PII with [LABEL] tokens."""
    chars = list(text)
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        chars[span.start:span.end] = f"[{span.label.upper()}]"
    return "".join(chars)


source = "Call Alice at 555-0100 re: account 9988776655."
result = parser.parse(source)
clean = redact(source, result.spans)
# "Call [PRIVATE_PERSON] at [PRIVATE_PHONE] re: account [ACCOUNT_NUMBER]."
```
### Export to JSON
```python
import json

result = parser.parse("Jane Doe, jane@corp.io, +44 20 7946 0958")
payload = [
    {"label": s.label, "text": s.text, "start": s.start, "end": s.end}
    for s in result.spans
]
print(json.dumps(payload, indent=2))
```
### GPU acceleration
```python
import torch
from pii_parser.hybrid import HybridPIIParser

# Fall back to CPU when no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
parser = HybridPIIParser(device=device)
```
## CLI
```bash
# Parse a string directly
python -m pii_parser.cli_model "Alice paid 40702810500001234567 on 2026-05-17."
# Pipe text from a file
cat dump.txt | python -m pii_parser.cli_model -
```
## Architecture
```
text
↓
opf 1.5B → BIOES logits → Viterbi (tuned transitions) → char spans
↓
span-merge (glues multi-token names: "Quindle" + "Testwick" → one span)
↓
regex backstop (URL, secret, account_number — fills model gaps)
↓
result.spans[]
```
- **BIOES tagging**: Beginning / Inside / Outside / End / Single — standard NER scheme
- **Viterbi**: enforces valid tag transitions (no I- without B-)
- **Span-merge**: heuristic that joins adjacent same-label spans separated only by whitespace
- **Regex backstop**: high-precision patterns for labels the 1.5B model under-predicts (secrets, account numbers, URLs)
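The decode and span-merge stages can be sketched as follows. This is a simplified illustration under stated assumptions: the token offsets and tags are written by hand rather than produced by the model, the tag sequence is assumed to be already valid (the Viterbi step is omitted), and `Span`, `decode_bioes`, and `merge_adjacent` are hypothetical names, not the library's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in mirroring the span fields used in this document."""
    label: str
    text: str
    start: int
    end: int


def decode_bioes(text, tokens):
    """Turn (start, end, tag) token triples into character spans.

    Tags look like "B-private_email", "I-...", "E-...", "S-secret" or "O".
    """
    spans, open_start = [], None
    for start, end, tag in tokens:
        if tag == "O":
            open_start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":  # single-token entity
            spans.append(Span(label, text[start:end], start, end))
        elif prefix == "B":  # remember where the entity began
            open_start = start
        elif prefix == "E" and open_start is not None:  # close the entity
            spans.append(Span(label, text[open_start:end], open_start, end))
            open_start = None
        # "I" tokens need no action: they sit inside the open entity.
    return spans


def merge_adjacent(text, spans):
    """Join neighbouring same-label spans separated only by whitespace."""
    merged = []
    for span in sorted(spans, key=lambda s: s.start):
        prev = merged[-1] if merged else None
        if prev and prev.label == span.label and text[prev.end:span.start].isspace():
            merged[-1] = Span(prev.label, text[prev.start:span.end], prev.start, span.end)
        else:
            merged.append(span)
    return merged


text = "Hi Quindle Testwick, mail quindle.testwick@openai.com"
tokens = [  # hand-written tags, including a name split into two S- tokens
    (0, 2, "O"),
    (3, 10, "S-private_person"),
    (11, 19, "S-private_person"),
    (19, 20, "O"),
    (21, 25, "O"),
    (26, 33, "B-private_email"),
    (33, 34, "I-private_email"),
    (34, 42, "I-private_email"),
    (42, 53, "E-private_email"),
]
decoded = decode_bioes(text, tokens)
final = merge_adjacent(text, decoded)
for s in final:
    print(f"[{s.start}:{s.end}] {s.label} {s.text!r}")
```

Here the decoder yields three spans, and the merge step glues "Quindle" and "Testwick" into one `private_person` span because only a space separates them. Rebuilding the merged text from the original string keeps the offsets and the text consistent.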
## Running Tests / Benchmarks
```bash
# Full fixture suite + latency benchmark
python pii_parser/tests/test_hybrid.py
```
Expected output:
```
Fixture F1: 0.929
Scenarios: 8/8 passed
Latency: ~600 ms CPU
```
## Troubleshooting
**Slow first run** — The checkpoint (~3 GB) downloads to `~/.opf/privacy_filter/` on first use. Subsequent runs load from cache.
**CUDA out of memory** — Use `device="cpu"` or reduce batch size; the 1.5B model requires ~3 GB VRAM on GPU.
**Low recall on secrets/URLs** — Use `HybridPIIParser` (not `ModelPIIParser`); the regex backstop specifically covers these labels.
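**Span text doesn't match offsets** — Offsets are character indices into the exact string passed to `parse()`. If you strip or otherwise preprocess the string after parsing, or apply the offsets to a different copy of the text, `text[start:end]` will no longer match `span.text`. Parse the final string and slice that same string.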
**Import error on `privacy_filter`** — Ensure you installed both packages: `uv pip install -e ./privacy-filter` AND `uv pip install -e ./pii_parser`.
**Model not found** — Delete `~/.opf/privacy_filter/` and re-run to trigger a fresh download.
Related Skills
openai-privacy-filter
OpenAI Privacy Filter — bidirectional token-classification model for PII detection and masking in text
kordoc-korean-document-parser
Parse HWP, HWPX, and PDF Korean documents to Markdown using kordoc — supports CLI, programmatic API, and MCP server integration.
zeroboot-vm-sandbox
Sub-millisecond VM sandboxes for AI agents using copy-on-write KVM forking via Zeroboot
yourvpndead-vpn-detection
Android app that detects VPN/proxy servers (VLESS/xray/sing-box) via local SOCKS5 vulnerability, exposing exit IPs and server configs without root
xata-postgres-platform
Expert skill for Xata open-source cloud-native Postgres platform with copy-on-write branching, scale-to-zero, and Kubernetes deployment
x-mentor-skill-nuwa
AI-powered X (Twitter) content strategy skill that distills methodologies from 6 top creators + open-source algorithm data into actionable writing, growth, and monetization guidance.
wx-favorites-report
End-to-end pipeline to extract, decrypt, and visualize WeChat Mac favorites from encrypted SQLite DB into an interactive HTML report.
wterm-web-terminal
Web terminal emulator with Zig/WASM core, DOM rendering, and React/vanilla JS bindings
worldmonitor-intelligence-dashboard
Real-time global intelligence dashboard with AI-powered news aggregation, geopolitical monitoring, and infrastructure tracking
witr-process-inspector
CLI and TUI tool that explains why processes, services, and ports are running by tracing causality chains across supervisors, containers, and shells.
wildworld-dataset
WildWorld large-scale action-conditioned world modeling dataset with 108M+ frames from a photorealistic ARPG game, featuring per-frame annotations, 450+ actions, and explicit state information for generative world modeling research.