deduplication
Event deduplication with canonical selection, reputation scoring, and hash-based grouping for multi-source data aggregation. Handles both ID-based and content-based deduplication.
Best use case
deduplication is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Event deduplication with canonical selection, reputation scoring, and hash-based grouping for multi-source data aggregation. Handles both ID-based and content-based deduplication.
Teams using deduplication should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/deduplication/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How deduplication Compares
| Feature / Agent | deduplication | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Event deduplication with canonical selection, reputation scoring, and hash-based grouping for multi-source data aggregation. Handles both ID-based and content-based deduplication.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Event Deduplication
Canonical selection with reputation scoring and hash-based grouping for multi-source data.
## When to Use This Skill
- Aggregating data from multiple sources (news, events, products)
- Same content appears from different outlets/sources
- Need to pick the "best" version from duplicates
- Tracking deduplication metrics for optimization
## Core Concepts
Simple URL deduplication isn't enough. Production needs:
- Grouping by semantic similarity (same story, different outlets)
- Canonical selection (pick the "best" version)
- Reputation scoring (prefer authoritative sources)
- Both ID-based and content-based deduplication
Two modes:
1. ID-based: When sources have unique IDs, keep the "best" version when IDs collide
2. Content-based: Group by semantic similarity, select canonical from each group
## Implementation
### TypeScript
```typescript
import { createHash } from 'crypto';
interface DeduplicationResult<T> {
items: T[];
originalCount: number;
dedupedCount: number;
reductionPercent: number;
duplicateGroups?: number;
}
// ============================================
// ID-Based Deduplication
// ============================================
function deduplicateById<T extends { id: string }>(
items: T[],
preferFn: (existing: T, candidate: T) => T
): DeduplicationResult<T> {
const seen = new Map<string, T>();
for (const item of items) {
const existing = seen.get(item.id);
if (existing) {
seen.set(item.id, preferFn(existing, item));
} else {
seen.set(item.id, item);
}
}
const dedupedItems = Array.from(seen.values());
const reductionPercent = items.length > 0
? Math.round((1 - dedupedItems.length / items.length) * 100)
: 0;
return {
items: dedupedItems,
originalCount: items.length,
dedupedCount: dedupedItems.length,
reductionPercent,
};
}
// ============================================
// Content-Based Deduplication
// ============================================
interface Article {
title: string;
url: string;
domain: string;
publishedAt: string;
tone?: number;
}
/**
* Generate deduplication key from content
* Groups by: normalized title + source country + date
*/
function generateDedupKey(article: Article): string {
const normalizedTitle = article.title
.toLowerCase()
.replace(/[^\w\s]/g, '')
.trim()
.slice(0, 50);
const dateStr = article.publishedAt?.slice(0, 10).replace(/-/g, '') || 'unknown';
return `${normalizedTitle}|${dateStr}`;
}
/**
* Generate unique ID from URL
*/
function generateEventId(url: string): string {
return createHash('md5').update(url).digest('hex').slice(0, 12);
}
/**
* Source reputation scoring
*/
function getReputationScore(domain: string): number {
// Tier 1: Wire services and major international
const tier1 = ['reuters.com', 'apnews.com', 'bbc.com', 'bbc.co.uk',
'aljazeera.com', 'france24.com', 'dw.com'];
if (tier1.some(r => domain.includes(r))) return 100;
// Tier 2: Major newspapers
const tier2 = ['nytimes.com', 'washingtonpost.com', 'theguardian.com',
'ft.com', 'economist.com', 'wsj.com'];
if (tier2.some(r => domain.includes(r))) return 75;
// Tier 3: Regional/national
const tier3 = ['cnn.com', 'foxnews.com', 'nbcnews.com', 'abcnews.go.com'];
if (tier3.some(r => domain.includes(r))) return 50;
return 10;
}
/**
* Select canonical article from duplicate group
*/
function selectCanonical<T extends Article>(
group: { item: T; source: string }[]
): { item: T; source: string } {
return group.reduce((best, current) => {
const bestScore = getReputationScore(best.item.domain) +
Math.abs(best.item.tone || 0);
const currentScore = getReputationScore(current.item.domain) +
Math.abs(current.item.tone || 0);
return currentScore > bestScore ? current : best;
});
}
/**
* Deduplicate articles from multiple sources
*/
function deduplicateArticles<T extends Article>(
sourceResults: { sourceName: string; articles: T[] }[]
): DeduplicationResult<T & { source: string }> {
const groups = new Map<string, { item: T; source: string }[]>();
let totalArticles = 0;
// Group articles by dedup key
for (const { sourceName, articles } of sourceResults) {
for (const article of articles) {
totalArticles++;
const key = generateDedupKey(article);
if (!groups.has(key)) {
groups.set(key, []);
}
groups.get(key)!.push({ item: article, source: sourceName });
}
}
// Select canonical article from each group
const items: (T & { source: string })[] = [];
for (const group of groups.values()) {
const canonical = selectCanonical(group);
items.push({ ...canonical.item, source: canonical.source });
}
const reductionPercent = totalArticles > 0
? Math.round((1 - items.length / totalArticles) * 100)
: 0;
console.log(`[Dedup] ${totalArticles} → ${items.length} (${reductionPercent}% reduction)`);
return {
items,
originalCount: totalArticles,
dedupedCount: items.length,
reductionPercent,
duplicateGroups: groups.size,
};
}
```
## Usage Examples
### ID-Based Deduplication
```typescript
const events = await fetchEvents();
const result = deduplicateById(events, (existing, candidate) => {
// Prefer events with coordinates
if (!existing.lat && candidate.lat) return candidate;
// Prefer higher sentiment magnitude
if (Math.abs(candidate.sentiment) > Math.abs(existing.sentiment)) {
return candidate;
}
return existing;
});
console.log(`Reduced ${result.reductionPercent}% duplicates`);
```
### Multi-Source Aggregation
```typescript
const results = await Promise.all([
fetchFromSourceA(),
fetchFromSourceB(),
fetchFromSourceC(),
]);
const { items, reductionPercent } = deduplicateArticles([
{ sourceName: 'source-a', articles: results[0] },
{ sourceName: 'source-b', articles: results[1] },
{ sourceName: 'source-c', articles: results[2] },
]);
// items now contains canonical articles with source attribution
```
## Best Practices
1. Semantic grouping - Group by normalized content, not just URL
2. Reputation scoring - Prefer authoritative sources as canonical
3. Best version selection - When IDs collide, keep version with most data
4. Reduction tracking - Log how much deduplication helped
5. Source attribution - Track which source the canonical came from
## Common Mistakes
- Simple URL deduplication (misses same story from different outlets)
- Random selection from duplicates (lose quality signal)
- No normalization (case/punctuation differences create false negatives)
- Not tracking reduction metrics (can't optimize)
- Hardcoded source lists (make configurable)
## Related Patterns
- batch-processing - Process deduplicated items efficiently
- validation-quarantine - Validate before deduplication
- checkpoint-resume - Track which files have been deduplicatedRelated Skills
bgo
Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.
linear
Managing Linear issues, projects, and teams. Use when working with Linear tasks, creating issues, updating status, querying projects, or managing team workflows.
lilhomie
Control HomeKit devices via REST API. Use when controlling lights, switches, scenes, or checking device status in the user's home.
lightfriend-add-frontend-page
Step-by-step guide for adding new pages to the Yew frontend
library-writer
This skill should be used when writing software libraries, packages, or modules following battle-tested patterns for clean, minimal, production-ready code. It applies when creating new libraries, refactoring existing ones, designing library APIs, or when clean, dependency-minimal library code is needed. Triggers on requests like "create a library", "write a package", "design a module API", or mentions of professional library development.
Library Management
User library, favorites, and reading progress
library-doc
Index and search library documentation locally for offline use. Invoke when user asks to index docs, search library topics, or list indexed libraries.
libraries-dependencies-mastery
Complete mastery of essential modern web development libraries and dependencies. Cover Next.js, React, TypeScript, Tailwind CSS, Firebase, Zustand, redux-toolkit, react-hook-form, Zod, shadcn/ui, lucide-react, Stripe, and more. Learn setup, integration patterns, advanced usage, performance optimization, troubleshooting, common pitfalls, and version management. Includes quick reference guides, in-depth tutorials, complete examples for e-commerce and SaaS, configuration files, type definitions, error handling, and production patterns. Master how libraries work together and solve real-world challenges.
librarian
Expert in searching official documentation, APIs, and best practices. Use when you need accurate information from authoritative sources.
librarian-indexer
Meta-skill that indexes, optimizes, and auto-generates Claude skills with GitOps automation, OCA GitHub bot integration, and Odoo developer tools. Use for skill creation, CI/CD workflows, OCA module management, and advanced Odoo development.
libpdf-helper
Work with @libpdf/core - modern TypeScript PDF library for parsing, modifying, and generating PDFs. Use when (1) starting new @libpdf/core project, (2) migrating from pdf-lib/pdf.js/pdfkit, (3) understanding @libpdf/core API, (4) solving PDF tasks (forms, signatures, encryption, merging, text extraction), or (5) choosing between PDF libraries.
lexiang
腾讯乐享知识库 API 接口文档。包含通讯录管理、团队管理、知识库管理、知识节点管理、任务管理、自定义属性管理、操作日志、AI助手、单点登录、素材管理、导出任务管理等接口。