deduplication

Event deduplication with canonical selection, reputation scoring, and hash-based grouping for multi-source data aggregation. Handles both ID-based and content-based deduplication.

181 stars

bymajiayu000

View on GitHub Installation ↓

Best use case

deduplication is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Event deduplication with canonical selection, reputation scoring, and hash-based grouping for multi-source data aggregation. Handles both ID-based and content-based deduplication.

Teams using deduplication should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/deduplication-dadbodgeoff-drift/SKILL.md --create-dirs "https://raw.githubusercontent.com/majiayu000/claude-skill-registry/main/skills/data-access/deduplication-dadbodgeoff-drift/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/deduplication-dadbodgeoff-drift/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How deduplication Compares

Feature / Agent	deduplication	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Event deduplication with canonical selection, reputation scoring, and hash-based grouping for multi-source data aggregation. Handles both ID-based and content-based deduplication.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Event Deduplication

Canonical selection with reputation scoring and hash-based grouping for multi-source data.

## When to Use This Skill

- Aggregating data from multiple sources (news, events, products)
- Same content appears from different outlets/sources
- Need to pick the "best" version from duplicates
- Tracking deduplication metrics for optimization

## Core Concepts

Simple URL deduplication isn't enough. Production needs:
- Grouping by semantic similarity (same story, different outlets)
- Canonical selection (pick the "best" version)
- Reputation scoring (prefer authoritative sources)
- Both ID-based and content-based deduplication

Two modes:
1. ID-based: When sources have unique IDs, keep the "best" version when IDs collide
2. Content-based: Group by semantic similarity, select canonical from each group

## Implementation

### TypeScript

```typescript
import { createHash } from 'crypto';

interface DeduplicationResult<T> {
  items: T[];
  originalCount: number;
  dedupedCount: number;
  reductionPercent: number;
  duplicateGroups?: number;
}

// ============================================
// ID-Based Deduplication
// ============================================

function deduplicateById<T extends { id: string }>(
  items: T[],
  preferFn: (existing: T, candidate: T) => T
): DeduplicationResult<T> {
  const seen = new Map<string, T>();
  
  for (const item of items) {
    const existing = seen.get(item.id);
    if (existing) {
      seen.set(item.id, preferFn(existing, item));
    } else {
      seen.set(item.id, item);
    }
  }
  
  const dedupedItems = Array.from(seen.values());
  const reductionPercent = items.length > 0
    ? Math.round((1 - dedupedItems.length / items.length) * 100)
    : 0;
  
  return {
    items: dedupedItems,
    originalCount: items.length,
    dedupedCount: dedupedItems.length,
    reductionPercent,
  };
}

// ============================================
// Content-Based Deduplication
// ============================================

interface Article {
  title: string;
  url: string;
  domain: string;
  publishedAt: string;
  tone?: number;
}

/**
 * Generate deduplication key from content
 * Groups by: normalized title + source country + date
 */
function generateDedupKey(article: Article): string {
  const normalizedTitle = article.title
    .toLowerCase()
    .replace(/[^\w\s]/g, '')
    .trim()
    .slice(0, 50);

  const dateStr = article.publishedAt?.slice(0, 10).replace(/-/g, '') || 'unknown';

  return `${normalizedTitle}|${dateStr}`;
}

/**
 * Generate unique ID from URL
 */
function generateEventId(url: string): string {
  return createHash('md5').update(url).digest('hex').slice(0, 12);
}

/**
 * Source reputation scoring
 */
function getReputationScore(domain: string): number {
  // Tier 1: Wire services and major international
  const tier1 = ['reuters.com', 'apnews.com', 'bbc.com', 'bbc.co.uk', 
                 'aljazeera.com', 'france24.com', 'dw.com'];
  if (tier1.some(r => domain.includes(r))) return 100;
  
  // Tier 2: Major newspapers
  const tier2 = ['nytimes.com', 'washingtonpost.com', 'theguardian.com', 
                 'ft.com', 'economist.com', 'wsj.com'];
  if (tier2.some(r => domain.includes(r))) return 75;
  
  // Tier 3: Regional/national
  const tier3 = ['cnn.com', 'foxnews.com', 'nbcnews.com', 'abcnews.go.com'];
  if (tier3.some(r => domain.includes(r))) return 50;
  
  return 10;
}

/**
 * Select canonical article from duplicate group
 */
function selectCanonical<T extends Article>(
  group: { item: T; source: string }[]
): { item: T; source: string } {
  return group.reduce((best, current) => {
    const bestScore = getReputationScore(best.item.domain) + 
                      Math.abs(best.item.tone || 0);
    const currentScore = getReputationScore(current.item.domain) + 
                         Math.abs(current.item.tone || 0);
    
    return currentScore > bestScore ? current : best;
  });
}

/**
 * Deduplicate articles from multiple sources
 */
function deduplicateArticles<T extends Article>(
  sourceResults: { sourceName: string; articles: T[] }[]
): DeduplicationResult<T & { source: string }> {
  const groups = new Map<string, { item: T; source: string }[]>();
  let totalArticles = 0;

  // Group articles by dedup key
  for (const { sourceName, articles } of sourceResults) {
    for (const article of articles) {
      totalArticles++;
      const key = generateDedupKey(article);
      
      if (!groups.has(key)) {
        groups.set(key, []);
      }
      groups.get(key)!.push({ item: article, source: sourceName });
    }
  }

  // Select canonical article from each group
  const items: (T & { source: string })[] = [];
  
  for (const group of groups.values()) {
    const canonical = selectCanonical(group);
    items.push({ ...canonical.item, source: canonical.source });
  }

  const reductionPercent = totalArticles > 0 
    ? Math.round((1 - items.length / totalArticles) * 100)
    : 0;

  console.log(`[Dedup] ${totalArticles} → ${items.length} (${reductionPercent}% reduction)`);

  return {
    items,
    originalCount: totalArticles,
    dedupedCount: items.length,
    reductionPercent,
    duplicateGroups: groups.size,
  };
}
```

## Usage Examples

### ID-Based Deduplication

```typescript
const events = await fetchEvents();

const result = deduplicateById(events, (existing, candidate) => {
  // Prefer events with coordinates
  if (!existing.lat && candidate.lat) return candidate;
  // Prefer higher sentiment magnitude
  if (Math.abs(candidate.sentiment) > Math.abs(existing.sentiment)) {
    return candidate;
  }
  return existing;
});

console.log(`Reduced ${result.reductionPercent}% duplicates`);
```

### Multi-Source Aggregation

```typescript
const results = await Promise.all([
  fetchFromSourceA(),
  fetchFromSourceB(),
  fetchFromSourceC(),
]);

const { items, reductionPercent } = deduplicateArticles([
  { sourceName: 'source-a', articles: results[0] },
  { sourceName: 'source-b', articles: results[1] },
  { sourceName: 'source-c', articles: results[2] },
]);

// items now contains canonical articles with source attribution
```

## Best Practices

1. Semantic grouping - Group by normalized content, not just URL
2. Reputation scoring - Prefer authoritative sources as canonical
3. Best version selection - When IDs collide, keep version with most data
4. Reduction tracking - Log how much deduplication helped
5. Source attribution - Track which source the canonical came from

## Common Mistakes

- Simple URL deduplication (misses same story from different outlets)
- Random selection from duplicates (lose quality signal)
- No normalization (case/punctuation differences create false negatives)
- Not tracking reduction metrics (can't optimize)
- Hardcoded source lists (make configurable)

## Related Patterns

- batch-processing - Process deduplicated items efficiently
- validation-quarantine - Validate before deduplication
- checkpoint-resume - Track which files have been deduplicated

Related Skills

tech-blog

159

from majiayu000/claude-skill-registry

Generates comprehensive technical blog posts, offering detailed explanations of system internals, architecture, and implementation, either through source code analysis or document-driven research.

Content & DocumentationClaude

thor-skills

159

from majiayu000/claude-skill-registry

An entry point and router for AI agents to manage various THOR-related cybersecurity tasks, including running scans, analyzing logs, troubleshooting, and maintenance.

SecurityClaude

ontopo

159

from majiayu000/claude-skill-registry

An AI agent skill to search for Israeli restaurants, check table availability, view menus, and retrieve booking links via the Ontopo platform, acting as an unofficial interface to its data.

General Utilities

grail-miner

159

from majiayu000/claude-skill-registry

This skill assists in setting up, managing, and optimizing Grail miners on Bittensor Subnet 81, handling tasks like environment configuration, R2 storage, model checkpoint management, and performance tuning.

DevOps & Infrastructure

chrome-debug

159

from majiayu000/claude-skill-registry

This skill empowers AI agents to debug web applications and inspect browser behavior using the Chrome DevTools Protocol (CDP), offering both collaborative (headful) and automated (headless) modes.

Coding & DevelopmentClaude

lets-go-rss

159

from majiayu000/claude-skill-registry

A lightweight, full-platform RSS subscription manager that aggregates content from YouTube, Vimeo, Behance, Twitter/X, and Chinese platforms like Bilibili, Weibo, and Douyin, featuring deduplication and AI smart classification.

Content & Documentation

vly-money

159

from majiayu000/claude-skill-registry

Generate crypto payment links for supported tokens and networks, manage access to X402 payment-protected content, and provide direct access to the vly.money wallet interface.

Fintech & CryptoClaude

whisper-transcribe

159

from majiayu000/claude-skill-registry

Transcribes audio and video files to text using OpenAI's Whisper CLI, enhanced with contextual grounding from local markdown files for improved accuracy.

Media Processing

ux

159

from majiayu000/claude-skill-registry

This AI agent skill provides comprehensive guidance for creating professional and insightful User Experience (UX) designs, covering user research, information architecture, interaction design, visual guidance, and usability evaluation. It aims to produce actionable, user-centered solutions that avoid generic AI aesthetics.

UX Design & StrategyClaude

astro

159

from majiayu000/claude-skill-registry

This skill provides essential Astro framework patterns, focusing on server-side rendering (SSR), static site generation (SSG), middleware, and TypeScript best practices. It helps AI agents implement secure authentication, manage API routes, and debug rendering behaviors within Astro projects.

Coding & Development

modal-deployment

159

from majiayu000/claude-skill-registry

Run Python code in the cloud with serverless containers, GPUs, and autoscaling using Modal. This skill enables agents to generate code for deploying ML models, running batch jobs, serving APIs, and scaling compute-intensive workloads.

DevOps & Infrastructure

advanced-skill-creator

181

from majiayu000/claude-skill-registry

Meta-skill that generates domain-specific skills using advanced reasoning techniques. PROACTIVELY activate for: (1) Create/build/make skills, (2) Generate expert panels for any domain, (3) Design evaluation frameworks, (4) Create research workflows, (5) Structure complex multi-step processes, (6) Instantiate templates with parameters. Triggers: "create a skill for", "build evaluation for", "design workflow for", "generate expert panel for", "how should I approach [complex task]", "create skill", "new skill for", "skill template", "generate skill"