deduplication

Event deduplication with canonical selection, reputation scoring, and hash-based grouping for multi-source data aggregation. Handles both ID-based and content-based deduplication.

by diegosouzapw

Best use case

deduplication is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using deduplication should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/deduplication/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/deduplication/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/deduplication/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How deduplication Compares

Feature / Agent	deduplication	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

Where can I find the source code?

The source is in the diegosouzapw/awesome-omni-skill repository on GitHub, at skills/data-ai/deduplication/SKILL.md.

SKILL.md Source

# Event Deduplication

Canonical selection with reputation scoring and hash-based grouping for multi-source data.

## When to Use This Skill

- Aggregating data from multiple sources (news, events, products)
- Same content appears from different outlets/sources
- Need to pick the "best" version from duplicates
- Tracking deduplication metrics for optimization

## Core Concepts

URL-only deduplication is not enough, because the same story published by different outlets has a different URL each time. A production pipeline needs:
- Content-based grouping (same story, different outlets)
- Canonical selection (pick the "best" version)
- Reputation scoring (prefer authoritative sources)
- Both ID-based and content-based deduplication

Two modes:
1. ID-based: when sources share unique IDs, keep the "best" version whenever IDs collide
2. Content-based: group items by a key derived from normalized content (title plus publication date in the implementation below), then select a canonical item from each group
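
To make the content-based mode concrete, here is a minimal sketch of key-based grouping. The `Headline` type and the `contentKey` helper are hypothetical names used only for illustration. An exact-match key like this only approximates similarity: headlines that differ in case, punctuation, or spacing collapse together, but reworded headlines still land in different groups.

```typescript
// Hypothetical minimal record: only the fields the key needs.
interface Headline {
  title: string;
  publishedAt: string; // assumed ISO 8601, e.g. "2024-05-01T09:30:00Z"
}

// Content key: lowercased title without punctuation, with runs of
// whitespace collapsed, plus the calendar date (first 10 characters).
function contentKey(h: Headline): string {
  const title = h.title
    .toLowerCase()
    .replace(/[^\w\s]/g, '')
    .replace(/\s+/g, ' ')
    .trim();
  return `${title}|${h.publishedAt.slice(0, 10)}`;
}

const a: Headline = { title: 'Storm Hits Coast!', publishedAt: '2024-05-01T09:30:00Z' };
const b: Headline = { title: 'storm hits  coast', publishedAt: '2024-05-01T14:05:00Z' };
const c: Headline = { title: 'Storm hits coast', publishedAt: '2024-05-02T08:00:00Z' };

// Same story on the same day: the keys match despite case and punctuation.
const sameStory = contentKey(a) === contentKey(b);
// Same title on a different day: the keys differ.
const sameAcrossDays = contentKey(a) === contentKey(c);
```

Including the date in the key is a trade-off: it keeps recurring headlines apart, but it also splits a story that two outlets publish on either side of midnight.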

## Implementation

### TypeScript

```typescript
import { createHash } from 'crypto';

interface DeduplicationResult<T> {
  items: T[];
  originalCount: number;
  dedupedCount: number;
  reductionPercent: number;
  duplicateGroups?: number;
}

// ============================================
// ID-Based Deduplication
// ============================================

function deduplicateById<T extends { id: string }>(
  items: T[],
  preferFn: (existing: T, candidate: T) => T
): DeduplicationResult<T> {
  const seen = new Map<string, T>();
  
  for (const item of items) {
    const existing = seen.get(item.id);
    if (existing) {
      seen.set(item.id, preferFn(existing, item));
    } else {
      seen.set(item.id, item);
    }
  }
  
  const dedupedItems = Array.from(seen.values());
  const reductionPercent = items.length > 0
    ? Math.round((1 - dedupedItems.length / items.length) * 100)
    : 0;
  
  return {
    items: dedupedItems,
    originalCount: items.length,
    dedupedCount: dedupedItems.length,
    reductionPercent,
  };
}

// ============================================
// Content-Based Deduplication
// ============================================

interface Article {
  title: string;
  url: string;
  domain: string;
  publishedAt: string;
  tone?: number;
}

/**
 * Generate deduplication key from content
 * Groups by: normalized title (first 50 characters) + publication date
 */
function generateDedupKey(article: Article): string {
  const normalizedTitle = article.title
    .toLowerCase()
    .replace(/[^\w\s]/g, '')
    .trim()
    .slice(0, 50);

  const dateStr = article.publishedAt?.slice(0, 10).replace(/-/g, '') || 'unknown';

  return `${normalizedTitle}|${dateStr}`;
}

/**
 * Generate unique ID from URL
 */
function generateEventId(url: string): string {
  return createHash('md5').update(url).digest('hex').slice(0, 12);
}

/**
 * Source reputation scoring.
 * Matches the listed domain itself or any of its subdomains (e.g. "www.bbc.co.uk").
 * A plain substring check would score "microsoft.com" as "ft.com".
 */
function getReputationScore(domain: string): number {
  const host = domain.toLowerCase();
  const inTier = (tier: string[]) =>
    tier.some(r => host === r || host.endsWith(`.${r}`));

  // Tier 1: Wire services and major international
  const tier1 = ['reuters.com', 'apnews.com', 'bbc.com', 'bbc.co.uk', 
                 'aljazeera.com', 'france24.com', 'dw.com'];
  if (inTier(tier1)) return 100;
  
  // Tier 2: Major newspapers
  const tier2 = ['nytimes.com', 'washingtonpost.com', 'theguardian.com', 
                 'ft.com', 'economist.com', 'wsj.com'];
  if (inTier(tier2)) return 75;
  
  // Tier 3: Regional/national
  const tier3 = ['cnn.com', 'foxnews.com', 'nbcnews.com', 'abcnews.go.com'];
  if (inTier(tier3)) return 50;
  
  // Unknown sources get a low default score
  return 10;
}

/**
 * Select canonical article from duplicate group
 */
function selectCanonical<T extends Article>(
  group: { item: T; source: string }[]
): { item: T; source: string } {
  return group.reduce((best, current) => {
    const bestScore = getReputationScore(best.item.domain) + 
                      Math.abs(best.item.tone || 0);
    const currentScore = getReputationScore(current.item.domain) + 
                         Math.abs(current.item.tone || 0);
    
    return currentScore > bestScore ? current : best;
  });
}

/**
 * Deduplicate articles from multiple sources
 */
function deduplicateArticles<T extends Article>(
  sourceResults: { sourceName: string; articles: T[] }[]
): DeduplicationResult<T & { source: string }> {
  const groups = new Map<string, { item: T; source: string }[]>();
  let totalArticles = 0;

  // Group articles by dedup key
  for (const { sourceName, articles } of sourceResults) {
    for (const article of articles) {
      totalArticles++;
      const key = generateDedupKey(article);
      
      if (!groups.has(key)) {
        groups.set(key, []);
      }
      groups.get(key)!.push({ item: article, source: sourceName });
    }
  }

  // Select canonical article from each group
  const items: (T & { source: string })[] = [];
  
  for (const group of groups.values()) {
    const canonical = selectCanonical(group);
    items.push({ ...canonical.item, source: canonical.source });
  }

  const reductionPercent = totalArticles > 0 
    ? Math.round((1 - items.length / totalArticles) * 100)
    : 0;

  console.log(`[Dedup] ${totalArticles} → ${items.length} (${reductionPercent}% reduction)`);

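  return {
    items,
    originalCount: totalArticles,
    dedupedCount: items.length,
    reductionPercent,
    // Count only groups that actually contained duplicates (more than one article)
    duplicateGroups: Array.from(groups.values()).filter(g => g.length > 1).length,
  };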
}
```

## Usage Examples

### ID-Based Deduplication

```typescript
const events = await fetchEvents();

const result = deduplicateById(events, (existing, candidate) => {
  // Prefer events with coordinates (compare against null so a latitude of 0 still counts)
  if (existing.lat == null && candidate.lat != null) return candidate;
  // Prefer higher sentiment magnitude
  if (Math.abs(candidate.sentiment) > Math.abs(existing.sentiment)) {
    return candidate;
  }
  return existing;
});

console.log(`Reduced ${result.reductionPercent}% duplicates`);
```

### Multi-Source Aggregation

```typescript
const results = await Promise.all([
  fetchFromSourceA(),
  fetchFromSourceB(),
  fetchFromSourceC(),
]);

const { items, reductionPercent } = deduplicateArticles([
  { sourceName: 'source-a', articles: results[0] },
  { sourceName: 'source-b', articles: results[1] },
  { sourceName: 'source-c', articles: results[2] },
]);

// items now contains canonical articles with source attribution
```

## Best Practices

1. Semantic grouping - Group by normalized content, not just URL
2. Reputation scoring - Prefer authoritative sources as canonical
3. Best version selection - When IDs collide, keep the version with the most data
4. Reduction tracking - Log how much deduplication helped
5. Source attribution - Track which source the canonical came from
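
For practice 3, one possible `preferFn` to pass to `deduplicateById` keeps whichever record has more populated fields. This is a sketch that assumes "most data" means a count of non-empty top-level fields. The names `countPopulated`, `preferMoreComplete`, and `GeoEvent` are hypothetical.

```typescript
// Count top-level fields that hold a value (null, undefined and '' count as empty).
function countPopulated<T extends object>(record: T): number {
  return (Object.keys(record) as (keyof T)[]).filter((key) => {
    const value: unknown = record[key];
    return value !== null && value !== undefined && value !== '';
  }).length;
}

// Same signature as preferFn: (existing, candidate) => the one to keep.
// Ties go to the existing record, so the first-seen version wins.
function preferMoreComplete<T extends object>(existing: T, candidate: T): T {
  return countPopulated(candidate) > countPopulated(existing) ? candidate : existing;
}

// Hypothetical event shape for the example.
type GeoEvent = { id: string; title: string; lat?: number; lon?: number };

const sparse: GeoEvent = { id: 'e1', title: 'Flood warning' };
const rich: GeoEvent = { id: 'e1', title: 'Flood warning', lat: 0, lon: 51.5 };

// `rich` is kept: it has four populated fields against two.
// A latitude of 0 counts as populated because the check is not truthiness-based.
const kept = preferMoreComplete(sparse, rich);
```

Field count is a crude proxy for quality. A weighted score over the fields that matter for your data may work better.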

## Common Mistakes

- Simple URL deduplication (misses same story from different outlets)
- Random selection from duplicates (lose quality signal)
- No normalization (case/punctuation differences create false negatives)
- Not tracking reduction metrics (can't optimize)
- Hardcoded source lists (make configurable)
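
The last point can be addressed by passing the source tiers in as data. The sketch below is one way to do it, not the skill's own API: `ReputationConfig`, `matchesDomain`, and `createReputationScorer` are hypothetical names, and the tier values are only examples.

```typescript
// Hypothetical config shape: reputation tiers are data, not code.
interface ReputationConfig {
  tiers: { score: number; domains: string[] }[];
  defaultScore: number;
}

// True when `host` is `domain` itself or one of its subdomains.
function matchesDomain(host: string, domain: string): boolean {
  return host === domain || host.endsWith('.' + domain);
}

// Build a scoring function from config. Tiers are checked in order and the
// first tier containing a matching domain decides the score.
function createReputationScorer(config: ReputationConfig): (host: string) => number {
  return (host) => {
    const normalized = host.toLowerCase();
    for (const tier of config.tiers) {
      if (tier.domains.some((d) => matchesDomain(normalized, d))) {
        return tier.score;
      }
    }
    return config.defaultScore;
  };
}

const scoreOf = createReputationScorer({
  tiers: [
    { score: 100, domains: ['reuters.com', 'apnews.com'] },
    { score: 75, domains: ['nytimes.com', 'ft.com'] },
  ],
  defaultScore: 10,
});

const wire = scoreOf('www.reuters.com'); // subdomain of a tier-1 entry
const paper = scoreOf('FT.com');         // case-insensitive exact match
const other = scoreOf('microsoft.com');  // ends in "ft.com" but is not a subdomain of it
```

With this shape the tier list can be loaded from a JSON file or environment-specific settings, and tests can supply a small fixed config.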

## Related Patterns

- batch-processing - Process deduplicated items efficiently
- validation-quarantine - Validate before deduplication
- checkpoint-resume - Track which files have been deduplicated
