Content Deduplication

> Text similarity detection and deduplication using normalization, fingerprinting, and configurable similarity thresholds.

by SufficientDaikon

Best use case

Content Deduplication is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using Content Deduplication should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/content-deduplication/SKILL.md --create-dirs "https://raw.githubusercontent.com/SufficientDaikon/archon/main/skills/content-deduplication/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub
2. Place it in .claude/skills/content-deduplication/SKILL.md inside your project
3. Restart your AI agent; it will auto-discover the skill

How Content Deduplication Compares

| Feature / Agent | Content Deduplication | Standard Approach |
|-----------------|-----------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

It detects similar text and removes duplicates using normalization, fingerprinting, and configurable similarity thresholds.

Where can I find the source code?

The source code is in the SufficientDaikon/archon repository on GitHub, under skills/content-deduplication/.

SKILL.md Source

# Content Deduplication

> Text similarity detection and deduplication using normalization, fingerprinting, and configurable similarity thresholds.

## Identity

You are a **Deduplication Engineer** — you build systems that detect near-duplicate content through text normalization, fingerprinting for exact matches, and similarity scoring for fuzzy matches.

- You are **layered** — cheap fingerprint check first, expensive similarity only when needed
- You are **threshold-driven** — similarity cutoffs are configurable per use case
- You **preserve originals** — dedup selects which to keep, never silently destroys content

## When to Use

Use this skill when:
- The user has a corpus with potential duplicate or near-duplicate content
- The user needs similarity scoring between text items
- The user asks for "find duplicates", "deduplication", or "content similarity"

Keywords: `find duplicates`, `deduplication`, `content similarity`, `remove duplicates`, `similarity score`

Do NOT use this skill when:
- Comparing structured data (use database UNIQUE constraints)
- Comparing binary files (use hash comparison)

## Workflow

### Step 1: Build Normalizer
1. Lowercase all text
2. Strip punctuation
3. Remove stop words (the, a, an, is, etc.)
4. Collapse runs of whitespace to a single space and trim the ends
5. Return normalized string
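
The steps above can be sketched roughly as follows. This is a minimal illustration, not the skill's reference implementation: the stop-word list is hypothetical beyond the four words named above, and replacing punctuation with a space (rather than deleting it) is an assumption made so that `fox,brown` does not fuse into one word.

```typescript
// Hypothetical stop-word list: the workflow only names "the, a, an, is, etc.",
// so every entry beyond those four is an assumption.
const STOP_WORDS = new Set(["the", "a", "an", "is", "are", "of", "to", "and"]);

function normalizeContent(text: string | null | undefined): string {
  if (!text) return ""; // empty or null input normalizes to an empty string
  return text
    .toLowerCase()                       // 1. lowercase
    .replace(/[^\p{L}\p{N}\s]/gu, " ")   // 2. replace punctuation with a space
    .split(/\s+/)                        // 4. splitting on whitespace runs collapses them
    .filter((word) => word !== "" && !STOP_WORDS.has(word)) // 3. drop stop words
    .join(" ");                          // 5. single-spaced normalized string
}
```

Under these assumptions, `normalizeContent("The Quick,  Brown FOX!")` yields `"quick brown fox"`. The Unicode property escapes (`\p{L}`, `\p{N}`) keep non-ASCII letters and digits intact; they need an ES2018 or later compile target.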

### Step 2: Create Fingerprinter
1. Hash normalized content with SHA-256 or similar
2. Identical fingerprints mean identical normalized content, which gives O(1) average-case exact-match detection
3. Store fingerprints in a `Map<hash, item>` for fast lookup
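
A possible shape for the fingerprinter, assuming a Node.js runtime so that `createHash` from `node:crypto` is available. The `normalize` helper is a deliberately reduced stand-in for the Step 1 normalizer, and `indexByFingerprint` is a hypothetical name used only for illustration.

```typescript
import { createHash } from "node:crypto";

// Reduced stand-in for the Step 1 normalizer (lowercase + whitespace collapse only).
function normalize(text: string): string {
  return text.toLowerCase().trim().replace(/\s+/g, " ");
}

// SHA-256 hex digest of the normalized text.
function getContentFingerprint(text: string): string {
  return createHash("sha256").update(normalize(text), "utf8").digest("hex");
}

// Hypothetical helper: index items by fingerprint. The first item seen for each
// hash is kept; later exact matches are reported rather than discarded.
function indexByFingerprint(items: string[]): {
  seen: Map<string, string>;
  exactDuplicates: string[];
} {
  const seen = new Map<string, string>();
  const exactDuplicates: string[] = [];
  for (const item of items) {
    const hash = getContentFingerprint(item);
    if (seen.has(hash)) exactDuplicates.push(item);
    else seen.set(hash, item);
  }
  return { seen, exactDuplicates };
}
```

Each lookup in the `Map` is constant time on average, so a pass over the corpus finds all exact matches (after normalization) in linear time.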

### Step 3: Implement Similarity Scorer
1. Split text into bigrams or trigrams (overlapping character pairs/triples)
2. Calculate Jaccard index: `|intersection| / |union|`
3. Return 0-1 score (0 = completely different, 1 = identical)
4. Alternative: cosine similarity on TF-IDF vectors for longer texts
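
A small sketch of the n-gram Jaccard scorer described above. The helper names `charNgrams` and `jaccard` are illustrative, and two edge-case choices are assumptions rather than something the skill specifies: a string shorter than `n` is treated as a single gram, and two empty inputs score 1.

```typescript
// Set of overlapping character n-grams (n = 2 for bigrams, 3 for trigrams).
function charNgrams(text: string, n = 2): Set<string> {
  const grams = new Set<string>();
  if (text.length === 0) return grams;
  if (text.length < n) {
    grams.add(text); // assumption: a too-short string counts as one gram
    return grams;
  }
  for (let i = 0; i + n <= text.length; i++) grams.add(text.slice(i, i + n));
  return grams;
}

// Jaccard index: |intersection| / |union|, in the range 0..1.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1; // assumption: two empty inputs are identical
  let intersection = 0;
  a.forEach((gram) => {
    if (b.has(gram)) intersection++;
  });
  return intersection / (a.size + b.size - intersection);
}

function calculateSimilarity(a: string, b: string, n = 2): number {
  return jaccard(charNgrams(a, n), charNgrams(b, n));
}
```

For example, `"night"` and `"nacht"` each produce four bigrams and share only `ht`, so the score is 1/7. Inputs are expected to be normalized before they reach `calculateSimilarity`.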

### Step 4: Add Threshold Config
1. Default: 0.8 (80% similar = duplicate)
2. Strict: 0.9 (for near-exact match scenarios; true exact matches are already caught by the fingerprint)
3. Lenient: 0.6 (for topic-level dedup)
4. Make threshold a parameter, not hardcoded
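
One way to expose the threshold as a parameter is sketched below. The preset values come from the list above; the names `THRESHOLD_PRESETS`, `resolveThreshold`, and `isDuplicate` are hypothetical, and treating a score equal to the threshold as a duplicate is an assumption.

```typescript
// Preset values taken from the list above.
const THRESHOLD_PRESETS = { default: 0.8, strict: 0.9, lenient: 0.6 } as const;
type ThresholdPreset = keyof typeof THRESHOLD_PRESETS;

// Accepts a preset name or a raw number in [0, 1]; omitted means the default preset.
function resolveThreshold(threshold: ThresholdPreset | number = "default"): number {
  const value = typeof threshold === "number" ? threshold : THRESHOLD_PRESETS[threshold];
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    throw new RangeError(`threshold must be between 0 and 1, got ${value}`);
  }
  return value;
}

// Assumption: a score exactly at the threshold counts as a duplicate.
function isDuplicate(score: number, threshold?: ThresholdPreset | number): boolean {
  return score >= resolveThreshold(threshold);
}
```

With this shape, `isDuplicate(0.85)` is true under the default preset and false under `"strict"`, and callers can still pass any custom number.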

### Step 5: Build Batch Deduplicator
1. `findDuplicates(items[])` returns groups of similar items
2. `deduplicate(items[], threshold?)` returns unique items
3. Use fingerprint pre-filter for O(n) exact matches
4. Only run expensive similarity on non-exact-match pairs
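
The following sketch shows one way the two functions could fit together, assuming Node.js for `node:crypto`. The `DuplicateGroup` shape, the `groupItems` helper, and the keep-the-first-item policy are assumptions; the normalizer and scorer are simplified ASCII-only stand-ins for the earlier steps.

```typescript
import { createHash } from "node:crypto";

interface DuplicateGroup {
  kept: string;                                  // first item seen in the group
  duplicates: { item: string; score: number }[]; // later items, scored against `kept`
}

// Simplified stand-ins for the normalizer, fingerprinter, and scorer.
const normalize = (text: string): string =>
  text.toLowerCase().replace(/[^a-z0-9\s]/g, " ").trim().replace(/\s+/g, " ");
const fingerprint = (normalized: string): string =>
  createHash("sha256").update(normalized, "utf8").digest("hex");

function bigrams(text: string): Set<string> {
  const grams = new Set<string>();
  for (let i = 0; i + 2 <= text.length; i++) grams.add(text.slice(i, i + 2));
  return grams;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((gram) => { if (b.has(gram)) shared++; });
  return shared / (a.size + b.size - shared);
}

// Hypothetical helper: assigns every item to a group, singletons included.
function groupItems(items: string[], threshold: number): DuplicateGroup[] {
  const groups: DuplicateGroup[] = [];
  const byHash = new Map<string, DuplicateGroup>(); // fingerprint of each kept item
  const representatives: { grams: Set<string>; group: DuplicateGroup }[] = [];
  for (const item of items) {
    const normalized = normalize(item);
    const hash = fingerprint(normalized);
    const exact = byHash.get(hash);
    if (exact) { // cheap path: identical to a kept item after normalization
      exact.duplicates.push({ item, score: 1 });
      continue;
    }
    const grams = bigrams(normalized); // expensive path: score against each kept item
    let best: { group: DuplicateGroup; score: number } | undefined;
    for (const rep of representatives) {
      const score = jaccard(grams, rep.grams);
      if (score >= threshold && (!best || score > best.score)) best = { group: rep.group, score };
    }
    if (best) {
      best.group.duplicates.push({ item, score: best.score });
    } else {
      const group: DuplicateGroup = { kept: item, duplicates: [] };
      groups.push(group);
      byHash.set(hash, group);
      representatives.push({ grams, group });
    }
  }
  return groups;
}

// Groups that contain at least one duplicate, with scores attached.
function findDuplicates(items: string[], threshold = 0.8): DuplicateGroup[] {
  return groupItems(items, threshold).filter((g) => g.duplicates.length > 0);
}

// One representative per group; the input array is never mutated.
function deduplicate(items: string[], threshold = 0.8): string[] {
  return groupItems(items, threshold).map((g) => g.kept);
}
```

Exact matches never reach the scorer, but non-matching items are still compared against every kept item, so the worst case remains quadratic in the number of distinct items. Step 6 is where that cost is addressed.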

### Step 6: Optimize for Scale
1. Pre-filter with fingerprints (hash equality = exact dup)
2. Use min-hash or LSH for approximate nearest neighbors at scale
3. Early termination: stop scoring a pair once its first N bigrams already show similarity below the threshold (a heuristic that trades some recall for speed)
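
As a rough illustration of the min-hash idea, the sketch below builds a fixed-length signature per item and estimates Jaccard similarity from the fraction of matching signature slots. Everything here is an assumption for demonstration: the function names are hypothetical, the seeded FNV-1a variant with a final mixing step is a convenience hash rather than a vetted hash family, and 64 hash functions is an arbitrary default. LSH banding, which buckets signatures to find candidate pairs, is not shown.

```typescript
// Seeded 32-bit hash: FNV-1a over UTF-16 code units plus a final bit mixer.
// Each seed behaves like a different hash function (illustrative quality only).
function seededHash(value: string, seed: number): number {
  let hash = (0x811c9dc5 ^ Math.imul(seed + 1, 0x9e3779b1)) >>> 0;
  for (let i = 0; i < value.length; i++) {
    hash ^= value.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  hash ^= hash >>> 16;
  hash = Math.imul(hash, 0x85ebca6b);
  hash ^= hash >>> 13;
  hash = Math.imul(hash, 0xc2b2ae35);
  hash ^= hash >>> 16;
  return hash >>> 0; // unsigned 32-bit result
}

function charBigrams(text: string): Set<string> {
  const grams = new Set<string>();
  for (let i = 0; i + 2 <= text.length; i++) grams.add(text.slice(i, i + 2));
  return grams;
}

// Signature slot i holds the minimum of hash function i over all grams.
function minhashSignature(grams: Set<string>, numHashes = 64): number[] {
  const signature = new Array<number>(numHashes).fill(0xffffffff);
  grams.forEach((gram) => {
    for (let i = 0; i < numHashes; i++) {
      const h = seededHash(gram, i);
      if (h < signature[i]) signature[i] = h;
    }
  });
  return signature;
}

// Fraction of matching slots approximates the Jaccard index of the two gram sets.
function estimateJaccard(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("signatures must have the same non-zero length");
  }
  let matches = 0;
  for (let i = 0; i < a.length; i++) if (a[i] === b[i]) matches++;
  return matches / a.length;
}
```

Signatures are computed once per item and have a fixed size, so comparing two items costs `numHashes` integer comparisons regardless of text length. The estimate is approximate; pairs that score near the threshold can be re-checked with the exact Jaccard scorer.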

## Rules

### DO:
- Normalize before comparing (case, whitespace, punctuation)
- Use fingerprints as a fast pre-filter before similarity scoring
- Make the similarity threshold configurable
- Return similarity scores alongside duplicate groups
- Handle empty/null content gracefully

### DON'T:
- Don't compare raw text — always normalize first
- Don't use O(n^2) pairwise comparison without pre-filtering
- Don't hardcode the similarity threshold
- Don't silently remove content — return both items in a duplicate pair
- Don't ignore short texts; with few n-grams, one changed character moves the score sharply, so they need different thresholds

## Output Format

- **Primary output**: Deduplication module
- **Format**: TypeScript source file
- **Location**: `src/lib/similarity/` or `src/similarity/`

### Output Template
```
src/similarity/
  index.ts          # normalizeContent(), calculateSimilarity(), findDuplicates(), deduplicate()
  fingerprint.ts    # getContentFingerprint()
  algorithms.ts     # bigram/trigram generation, Jaccard index
```

## Resources

| Resource | Type | Description |
|----------|------|-------------|
| `resources/similarity-algorithms.md` | reference | Comparison of similarity algorithms with trade-offs |

## Handoff

- **Next agent**: None (terminal skill)
- **Artifact produced**: Deduplication module
- **User instruction**: "Use `findDuplicates(items)` to detect near-duplicates and `deduplicate(items)` to remove them"

## Platform Notes

| Platform | Notes |
|----------|-------|
| Claude Code | Full file creation support |
