fiftyone-find-duplicates

Find duplicate or near-duplicate images in FiftyOne datasets using brain similarity computation. Use when users want to deduplicate datasets, find similar images, cluster visually similar content, or remove redundant samples. Requires FiftyOne MCP server with @voxel51/brain plugin installed.

242 stars

by aiskillstore


Best use case

fiftyone-find-duplicates is best used when you need a repeatable AI agent workflow instead of a one-off prompt. It finds duplicate or near-duplicate images in FiftyOne datasets using brain similarity computation, which makes it suited to deduplicating datasets, finding similar images, clustering visually similar content, and removing redundant samples. It requires a FiftyOne MCP server with the @voxel51/brain plugin installed.

Users should expect a more consistent workflow output, faster repeated execution, and less time spent rewriting prompts from scratch.

Practical example

Example input

Use the "fiftyone-find-duplicates" skill to help with this workflow task. Context: Find duplicate or near-duplicate images in FiftyOne datasets using brain similarity computation. Use when users want to deduplicate datasets, find similar images, cluster visually similar content, or remove redundant samples. Requires FiftyOne MCP server with @voxel51/brain plugin installed.

Example output

A structured workflow result with clearer steps, more consistent formatting, and an output that is easier to reuse in the next run.

When to use this skill

Use this skill when you want a reusable workflow rather than writing the same prompt again and again.

When not to use this skill

Do not use this when you only need a one-off answer and do not need a reusable workflow.
Do not use it if you cannot install or maintain the related files, repository context, or supporting tools.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/fiftyone-find-duplicates/SKILL.md --create-dirs "https://raw.githubusercontent.com/aiskillstore/marketplace/main/skills/adonaivera/fiftyone-find-duplicates/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/fiftyone-find-duplicates/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How fiftyone-find-duplicates Compares

Feature / Agent	fiftyone-find-duplicates	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

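What does this skill do?

It finds duplicate or near-duplicate images in FiftyOne datasets using brain similarity computation, so you can deduplicate datasets, find similar images, cluster visually similar content, or remove redundant samples.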

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Find Duplicates in FiftyOne Datasets

## Overview

Find and remove duplicate or near-duplicate images using FiftyOne's brain similarity operators. Uses deep learning embeddings to identify visually similar images.

**Use this skill when:**
- Removing duplicate images from datasets
- Finding near-duplicate images (similar but not identical)
- Clustering visually similar images
- Cleaning datasets before training

## Prerequisites

- FiftyOne MCP server installed and running
- `@voxel51/brain` plugin installed and enabled
- Dataset with image samples loaded in FiftyOne

## Key Directives

**ALWAYS follow these rules:**

### 1. Set context first
```python
set_context(dataset_name="my-dataset")
```

### 2. Launch FiftyOne App
Brain operators run as delegated operations and require the App to be running:
```python
launch_app()
```
Wait 5-10 seconds for initialization.
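
Instead of sleeping for a fixed interval, a client script could poll until the App accepts connections. The following is a minimal sketch, not part of the MCP server: `wait_for_port` is a hypothetical helper, and it assumes the App listens on a local TCP port (for example 5151, the port in the App URL used later in this document). An open port only shows that something is listening, not that initialization has finished.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0, interval: float = 0.25) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; pause briefly and retry.
            time.sleep(interval)
    return False


# Example call (assumed host and port):
# ready = wait_for_port("localhost", 5151, timeout=10.0)
```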

### 3. Discover operators dynamically
```python
# List all brain operators
list_operators(builtin_only=False)

# Get schema for specific operator
get_operator_schema(operator_uri="@voxel51/brain/compute_similarity")
```

### 4. Compute embeddings before finding duplicates
```python
execute_operator(
    operator_uri="@voxel51/brain/compute_similarity",
    params={"brain_key": "img_sim", "model": "mobilenet-v2-imagenet-torch"}
)
```

### 5. Close app when done
```python
close_app()
```

## Complete Workflow

### Step 1: Setup
```python
# Set context
set_context(dataset_name="my-dataset")

# Launch app (required for brain operators)
launch_app()
```

### Step 2: Verify Brain Plugin
```python
# Check if brain plugin is available
list_plugins(enabled=True)

# If not installed:
download_plugin(
    url_or_repo="voxel51/fiftyone-plugins",
    plugin_names=["@voxel51/brain"]
)
enable_plugin(plugin_name="@voxel51/brain")
```

### Step 3: Discover Brain Operators
```python
# List all available operators
list_operators(builtin_only=False)

# Get schema for compute_similarity
get_operator_schema(operator_uri="@voxel51/brain/compute_similarity")

# Get schema for find_near_duplicates
get_operator_schema(operator_uri="@voxel51/brain/find_near_duplicates")
```

### Step 4: Compute Similarity
```python
# Execute operator to compute embeddings
execute_operator(
    operator_uri="@voxel51/brain/compute_similarity",
    params={
        "brain_key": "img_duplicates",
        "model": "mobilenet-v2-imagenet-torch"
    }
)
```

### Step 5: Find Near Duplicates
```python
execute_operator(
    operator_uri="@voxel51/brain/find_near_duplicates",
    params={
        "similarity_index": "img_duplicates",
        "threshold": 0.3
    }
)
```

**Threshold guidelines (distance-based, lower = more similar):**
- `0.1` = Very similar (near-exact duplicates)
- `0.3` = Near duplicates (recommended default)
- `0.5` = Similar images
- `0.7` = Loosely similar
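
To make "distance-based, lower = more similar" concrete, here is a small standalone sketch. It assumes cosine distance between embedding vectors; the source does not state which metric the plugin uses, so treat the metric, the helper names, and the toy vectors as illustrative only.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Cosine distance: 0.0 for identical directions, larger for dissimilar vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def near_duplicate_pairs(embeddings, threshold):
    """Return (id_a, id_b, distance) for every pair closer than threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        dist = cosine_distance(vec_a, vec_b)
        if dist < threshold:
            pairs.append((id_a, id_b, dist))
    return pairs


# Toy 2-D embeddings (hypothetical values; real embeddings have many more dimensions).
embeddings = {
    "img_a": [1.0, 0.0],
    "img_b": [0.98, 0.2],  # almost the same direction as img_a
    "img_c": [0.0, 1.0],   # orthogonal to img_a
}

strict = near_duplicate_pairs(embeddings, threshold=0.1)
loose = near_duplicate_pairs(embeddings, threshold=0.9)

print([(a, b) for a, b, _ in strict])
print([(a, b) for a, b, _ in loose])
```

With the strict threshold only the `img_a`/`img_b` pair is flagged; raising the threshold also pulls in the `img_b`/`img_c` pair, which is why higher values return looser matches.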

This operator creates two saved views automatically:
- `near duplicates`: all samples that are near duplicates
- `representatives of near duplicates`: one representative from each group

### Step 6: View Duplicates in App

After finding duplicates, use `set_view` to display them in the FiftyOne App:

**Option A: Filter by near_dup_id field**
```python
# Show all samples that have a near_dup_id (all duplicates)
set_view(exists=["near_dup_id"])
```

**Option B: Show specific duplicate group**
```python
# Show samples with a specific duplicate group ID
set_view(filters={"near_dup_id": 1})
```

**Option C: Load saved view (if available)**
```python
# Load the automatically created saved view
set_view(view_name="near duplicates")
```

**Option D: Clear filter to show all samples**
```python
clear_view()
```

The `find_near_duplicates` operator adds a `near_dup_id` field to samples. Samples with the same ID are duplicates of each other.
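
The grouping implied by `near_dup_id` can be sketched in plain Python. The records below are hypothetical stand-ins for dataset samples, and keeping the first sample of each group is an arbitrary choice for illustration; the source does not say how the operator selects its representatives.

```python
from collections import defaultdict


def group_by_near_dup_id(samples):
    """Map each near_dup_id to the list of sample IDs that share it."""
    groups = defaultdict(list)
    for sample in samples:
        dup_id = sample.get("near_dup_id")
        if dup_id is not None:  # samples without the field are not duplicates
            groups[dup_id].append(sample["id"])
    return dict(groups)


def split_representatives(groups):
    """Keep the first sample of each group; mark the rest as removable."""
    keep, remove = [], []
    for ids in groups.values():
        keep.append(ids[0])
        remove.extend(ids[1:])
    return keep, remove


# Hypothetical records standing in for dataset samples.
samples = [
    {"id": "s1", "near_dup_id": 1},
    {"id": "s2", "near_dup_id": 1},
    {"id": "s3", "near_dup_id": None},
    {"id": "s4", "near_dup_id": 2},
    {"id": "s5", "near_dup_id": 2},
    {"id": "s6", "near_dup_id": 2},
]

groups = group_by_near_dup_id(samples)
keep, remove = split_representatives(groups)
print(groups, keep, remove)
```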

### Step 7: Delete Duplicates

**Option A: Use deduplicate operator (keeps one representative per group)**
```python
execute_operator(
    operator_uri="@voxel51/brain/deduplicate_near_duplicates",
    params={}
)
```

**Option B: Manual deletion from App UI**
1. Use `set_view(exists=["near_dup_id"])` to show duplicates
2. Review samples in the App at http://localhost:5151/
3. Select samples to delete
4. Use the delete action in the App

### Step 8: Clean Up
```python
close_app()
```

## Available Tools

### Session View Tools

| Tool | Description |
|------|-------------|
| `set_view(exists=[...])` | Filter samples where field(s) have non-None values |
| `set_view(filters={...})` | Filter samples by exact field values |
| `set_view(tags=[...])` | Filter samples by tags |
| `set_view(sample_ids=[...])` | Select specific sample IDs |
| `set_view(view_name="...")` | Load a saved view by name |
| `clear_view()` | Clear filters, show all samples |

### Brain Operators for Duplicates

Use `list_operators()` to discover and `get_operator_schema()` to see parameters:

| Operator | Description |
|----------|-------------|
| `@voxel51/brain/compute_similarity` | Compute embeddings and similarity index |
| `@voxel51/brain/find_near_duplicates` | Find near-duplicate samples |
| `@voxel51/brain/deduplicate_near_duplicates` | Delete duplicates, keep representatives |
| `@voxel51/brain/find_exact_duplicates` | Find exact duplicate media files |
| `@voxel51/brain/deduplicate_exact_duplicates` | Delete exact duplicates |
| `@voxel51/brain/compute_uniqueness` | Compute uniqueness scores |

## Common Use Cases

### Use Case 1: Remove Exact Duplicates
For accidentally duplicated files (identical bytes):
```python
set_context(dataset_name="my-dataset")
launch_app()

execute_operator(
    operator_uri="@voxel51/brain/find_exact_duplicates",
    params={}
)

execute_operator(
    operator_uri="@voxel51/brain/deduplicate_exact_duplicates",
    params={}
)

close_app()
```
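
For intuition about what "identical bytes" means, the sketch below groups files by a content hash using only the standard library. It illustrates the concept and is not the plugin's implementation; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exact_duplicate_groups(paths):
    """Return lists of paths whose contents are byte-identical (groups of 2+)."""
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[file_digest(path)].append(Path(path))
    return [group for group in by_hash.values() if len(group) > 1]
```

Two files with different names but the same bytes land in the same group, while a re-encoded or resized copy does not; that second case is what the near-duplicate workflow is for.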

### Use Case 2: Find and Review Near Duplicates
For visually similar but not identical images:
```python
set_context(dataset_name="my-dataset")
launch_app()

# Compute embeddings
execute_operator(
    operator_uri="@voxel51/brain/compute_similarity",
    params={"brain_key": "near_dups", "model": "mobilenet-v2-imagenet-torch"}
)

# Find duplicates
execute_operator(
    operator_uri="@voxel51/brain/find_near_duplicates",
    params={"similarity_index": "near_dups", "threshold": 0.3}
)

# View duplicates in the App
set_view(exists=["near_dup_id"])

# After review, deduplicate
execute_operator(
    operator_uri="@voxel51/brain/deduplicate_near_duplicates",
    params={}
)

# Clear view and close
clear_view()
close_app()
```

### Use Case 3: Sort by Similarity
Find images similar to a specific sample:
```python
set_context(dataset_name="my-dataset")
launch_app()

execute_operator(
    operator_uri="@voxel51/brain/compute_similarity",
    params={"brain_key": "search"}
)

execute_operator(
    operator_uri="@voxel51/brain/sort_by_similarity",
    params={
        "brain_key": "search",
        "query_id": "sample_id_here",
        "k": 20
    }
)

close_app()
```
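
Conceptually, sorting by similarity ranks samples by their distance to the query embedding and keeps the closest `k`. The standalone sketch below assumes Euclidean distance and excludes the query from its own results; both are illustrative choices, and the operator's actual metric and handling of the query sample may differ.

```python
import heapq
import math


def sort_by_similarity(embeddings, query_id, k):
    """Return the k sample IDs closest to query_id by Euclidean distance."""
    query = embeddings[query_id]
    candidates = (
        (math.dist(query, vec), sample_id)
        for sample_id, vec in embeddings.items()
        if sample_id != query_id  # exclude the query itself
    )
    return [sample_id for _, sample_id in heapq.nsmallest(k, candidates)]


# Toy 2-D embeddings (hypothetical values).
embeddings = {
    "query": [0.0, 0.0],
    "near": [0.1, 0.0],
    "mid": [1.0, 1.0],
    "far": [5.0, 5.0],
}

top2 = sort_by_similarity(embeddings, "query", k=2)
print(top2)
```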

## Troubleshooting

**Error: "No executor available"**
- Cause: Delegated operators require the App executor for UI triggers
- Solution: Direct user to App UI to view results and complete deletion manually
- Affected operators: `find_near_duplicates`, `deduplicate_near_duplicates`

**Error: "Brain key not found"**
- Cause: Embeddings not computed
- Solution: Run `compute_similarity` first with a `brain_key`

**Error: "Operator not found"**
- Cause: Brain plugin not installed
- Solution: Install with `download_plugin()` and `enable_plugin()`

**Error: "Missing dependency" (e.g., torch, tensorflow)**
- The MCP server detects missing dependencies automatically
- Response includes `missing_package` and `install_command`
- Example response:
  ```json
  {
    "error_type": "missing_dependency",
    "missing_package": "torch",
    "install_command": "pip install torch"
  }
  ```
- Offer to run the install command for the user
- After installation, restart MCP server and retry
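
A client that receives this response could extract the install command before offering to run it. This is a minimal sketch that assumes the response arrives as a JSON string with exactly the fields shown above; `parse_missing_dependency` is a hypothetical helper, not part of the MCP server.

```python
import json
import shlex


def parse_missing_dependency(response_text):
    """Return the install command as an argument list, or None if not applicable."""
    response = json.loads(response_text)
    if response.get("error_type") != "missing_dependency":
        return None
    command = response.get("install_command")
    # Split into argv form so it can be shown to the user or run without a shell.
    return shlex.split(command) if command else None


raw = (
    '{"error_type": "missing_dependency", '
    '"missing_package": "torch", '
    '"install_command": "pip install torch"}'
)
argv = parse_missing_dependency(raw)
print(argv)
```

Returning an argument list rather than a raw string keeps the command easy to display for confirmation and avoids passing server-provided text through a shell.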

**Similarity computation is slow**
- Use a faster model: `mobilenet-v2-imagenet-torch`
- Use GPU if available
- Process large datasets in batches
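
One way to read "process in batches" is to split the sample IDs into fixed-size chunks and handle one chunk at a time. The helper below is a generic sketch; how each batch would be handed to the brain operators (for example through a view restricted to those IDs) is not specified in this document and is left as an assumption.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


sample_ids = [f"sample_{i}" for i in range(10)]  # hypothetical IDs
batches = list(batched(sample_ids, 4))
print([len(batch) for batch in batches])
```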

## Best Practices

1. **Discover dynamically** - Use `list_operators()` and `get_operator_schema()` to get current operator names and parameters
2. **Start with default threshold** (0.3) and adjust as needed
3. **Review before deleting** - Direct user to App to inspect duplicates
4. **Store embeddings** - Reuse for multiple operations via `brain_key`
5. **Handle executor errors gracefully** - Guide user to App UI when needed

## Performance Notes

**Embedding computation time:**
- 1,000 images: ~1-2 minutes
- 10,000 images: ~10-15 minutes
- 100,000 images: ~1-2 hours

**Memory requirements:**
- ~2KB per image for embeddings
- ~4-8KB per image for similarity index

## Resources

- [FiftyOne Brain Documentation](https://docs.voxel51.com/user_guide/brain.html)
- [Brain Plugin Source](https://github.com/voxel51/fiftyone-plugins/tree/main/plugins/brain)

## License

Copyright 2017-2025, Voxel51, Inc.
Apache 2.0 License

Related Skills

find-skills

242

from aiskillstore/marketplace

Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.

trade-show-finder

242

from aiskillstore/marketplace

Find, compare, and research trade shows, exhibitions, expos, and industry events by vertical, region, date, or audience. Use this skill whenever the user wants to discover which trade shows exist for their industry, compare multiple events side-by-side, decide which shows are worth attending or exhibiting at, look up event dates and venues, research exhibitor counts or visitor profiles, or plan an annual trade show calendar. Also triggers on questions like 'what are the best shows for [industry]', 'when is [show name]', 'should we go to [event] or [event]', 'find me exhibitions in Germany for packaging', 'trade show calendar 2026', 'exhibition calendar Europe', 'B2B trade shows', 'what industry events should I attend', 'upcoming trade fairs', or even vague requests like 'we need to get in front of more buyers — what events should we be at'. If the user mentions any specific trade show by name (CES, MEDICA, Hannover Messe, Interpack, SXSW, Bauma, etc.) and wants information about it, use this skill.

find-bugs

242

from aiskillstore/marketplace

Find bugs, security vulnerabilities, and code quality issues in local branch changes. Use when asked to review changes, find bugs, security review, or audit code on the current branch.

first-responder-program-finder

242

from aiskillstore/marketplace

Use when an agent needs to navigate the FirstResponderHomePrograms website UI to find statewide verified programs, under-review signals, free deeper-opportunity teasers, or paid Research Vault and workspace information.

finding-shelter

242

from aiskillstore/marketplace

Finding Shelter - Helps Stella get through her first night on the planet Gaia by finding or building a safe temporary shelter.

ffind

242

from aiskillstore/marketplace

Advanced file finder with type detection and filesystem extraction for analyzing firmware and extracting embedded filesystems. Use when you need to analyze firmware files, identify file types, or extract ext2/3/4 or F2FS filesystems.

fiftyone-pr-triage

242

from aiskillstore/marketplace

Triage FiftyOne GitHub issues by validating status, categorizing resolution, and generating standardized responses. Use when reviewing issues to determine if fixed, won't fix, not reproducible, no longer relevant, or still valid.

fiftyone-embeddings-visualization

242

from aiskillstore/marketplace

Visualize datasets in 2D using embeddings with UMAP or t-SNE dimensionality reduction. Use when users want to explore dataset structure, find clusters in images, identify outliers, color samples by class or metadata, or understand data distribution. Requires FiftyOne MCP server with @voxel51/brain plugin installed.

fiftyone-develop-plugin

242

from aiskillstore/marketplace

Develop custom FiftyOne plugins (operators and panels) from scratch. Use when user wants to create a new plugin, extend FiftyOne with custom operators, build interactive panels, or integrate external APIs into FiftyOne. Guides through requirements, design, coding, testing, and iteration.

fiftyone-dataset-inference

242

from aiskillstore/marketplace

Create a FiftyOne dataset from a directory of media files (images, videos, point clouds), optionally import labels in common formats (COCO, YOLO, VOC), run model inference, and store predictions. Use when users want to load local files into FiftyOne, apply ML models for detection, classification, or segmentation, or build end-to-end inference pipelines.

fiftyone-dataset-import

242

from aiskillstore/marketplace

Universal dataset import for FiftyOne supporting all media types (images, videos, point clouds, 3D scenes), all label formats (COCO, YOLO, VOC, CVAT, KITTI, etc.), and multimodal grouped datasets. Use when users want to import any dataset regardless of format, automatically detect folder structure, handle autonomous driving data with multiple cameras and LiDAR, or create grouped datasets from multimodal data. Requires FiftyOne MCP server.

fiftyone-code-style

242

from aiskillstore/marketplace

Write Python code following FiftyOne's official conventions. Use when contributing to FiftyOne, developing plugins, or writing code that integrates with FiftyOne's codebase.