semantic-similarity

Semantic similarity computation for content relationships and intelligent discovery

509 stars

Best use case

semantic-similarity is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using semantic-similarity can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/semantic-similarity/SKILL.md --create-dirs "https://raw.githubusercontent.com/a5c-ai/babysitter/main/library/specializations/domains/business/knowledge-management/skills/semantic-similarity/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/semantic-similarity/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How semantic-similarity Compares

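| Feature / Agent | semantic-similarity | Standard Approach |
|-----------------|---------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |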

Frequently Asked Questions

What does this skill do?

It computes semantic similarity between pieces of content to surface content relationships and support intelligent discovery.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Semantic Similarity Skill

## Overview

The Semantic Similarity skill provides advanced capabilities for computing and leveraging semantic relationships between content in knowledge management systems. Using modern embedding models and vector similarity techniques, this skill enables intelligent content discovery, recommendation, and organization beyond traditional keyword matching.

## Capabilities

### Document Embedding Generation
- Generate embeddings for documents and content
- Configure embedding models (OpenAI, Cohere, open-source)
- Implement batch embedding pipelines
- Manage embedding storage and retrieval
- Optimize embedding dimensions for use case

### Sentence Transformer Models
- Configure sentence-transformers models
- Fine-tune models for domain-specific content
- Implement multi-lingual embedding models
- Design model selection strategies

### Similarity Search and Clustering
- Implement vector similarity search (cosine, dot product)
- Configure approximate nearest neighbor (ANN) algorithms
- Design content clustering pipelines
- Implement hierarchical clustering for organization
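
The cosine and dot-product metrics listed above can be sketched in plain Python as a brute-force search. This is a minimal illustration under stated assumptions, not the skill's implementation: the function names (`dot`, `cosine`, `top_k`) and the three-dimensional vectors are hypothetical, and an exact scan like this only suits small collections, which is where ANN indexes come in.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


def top_k(query, vectors, k=10, metric=cosine):
    """Exact nearest neighbours: score every vector, keep the k best."""
    scored = [(doc_id, metric(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional embeddings, for illustration only.
vectors = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], vectors, k=2)
```

Passing `metric=dot` switches to dot-product scoring; the two metrics rank identically when all vectors are unit length.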

### Related Content Recommendation
- Build content recommendation systems
- Configure "More Like This" functionality
- Implement collaborative filtering with embeddings
- Design hybrid recommendation approaches

### Duplicate Detection
- Identify duplicate and near-duplicate content
- Configure similarity thresholds for detection
- Implement deduplication workflows
- Design merge and consolidation strategies

### Topic Modeling
- Implement LDA (Latent Dirichlet Allocation)
- Configure BERTopic for modern topic modeling
- Design topic hierarchies and taxonomies
- Enable dynamic topic tracking

### Semantic Search Integration
- Configure semantic search pipelines
- Implement hybrid search (keyword + semantic)
- Design query expansion using embeddings
- Enable cross-lingual semantic search

### Content Gap Analysis
- Identify missing content through similarity analysis
- Map content coverage using embeddings
- Detect underserved topics and areas
- Design content planning recommendations

### Concept Extraction
- Extract key concepts from documents
- Build concept graphs from embeddings
- Implement keyphrase extraction
- Design concept tagging pipelines

## Dependencies

- Sentence-transformers library
- OpenAI Embeddings API
- Cohere Embed API
- Pinecone vector database
- Weaviate
- Milvus
- FAISS (Facebook AI Similarity Search)
- scikit-learn for clustering

## Process Integration

This skill integrates with:

- **search-optimization.js**: Semantic search and related content features
- **knowledge-base-content.js**: Content recommendations and gap analysis
- **tacit-to-explicit-conversion.js**: Knowledge representation and concept extraction

## Usage

### Generate Document Embeddings

```yaml
task: Generate embeddings for knowledge base content
skill: semantic-similarity
parameters:
  source: knowledge-base
  model: text-embedding-3-small
  batch_size: 100
  output: vector-store
  dimensions: 1536
```
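
As a rough sketch of what a `batch_size: 100` pipeline might do, the code below splits a corpus into batches and embeds each batch in turn. The names `batched`, `embed_corpus`, and `fake_embed_batch` are hypothetical, and `fake_embed_batch` is a stand-in that returns toy two-dimensional vectors; a real pipeline would call an embedding model or API at that point.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_corpus(docs, embed_batch, batch_size=100):
    """Embed all docs batch by batch, preserving input order."""
    vectors = []
    for batch in batched(docs, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed_batch(texts):
    """Stand-in for a real embedding call (assumption, not a real API).

    Returns a toy vector [character count, word count] per text.
    """
    return [[float(len(t)), float(len(t.split()))] for t in texts]


docs = [f"document number {i}" for i in range(250)]
vectors = embed_corpus(docs, fake_embed_batch, batch_size=100)
batch_sizes = [len(b) for b in batched(docs, 100)]
```

Keeping the embedding call behind a single `embed_batch` argument makes it straightforward to swap providers without touching the batching logic.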

### Configure Similarity Search

```yaml
task: Set up semantic similarity search
skill: semantic-similarity
parameters:
  vector_store: pinecone
  index_name: kb-embeddings
  similarity_metric: cosine
  top_k: 10
  hybrid_search: true
  keyword_weight: 0.3
```
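
The source does not say how `keyword_weight` is applied. One plausible reading, shown here purely as an assumption, is a linear blend in which the semantic score receives the remaining weight (`1 - keyword_weight`) and both scores are already scaled to the range 0 to 1. The function names and scores are hypothetical.

```python
def hybrid_score(keyword_score, semantic_score, keyword_weight=0.3):
    """Linear blend of two scores (assumed to lie in [0, 1])."""
    if not 0.0 <= keyword_weight <= 1.0:
        raise ValueError("keyword_weight must be between 0 and 1")
    return keyword_weight * keyword_score + (1.0 - keyword_weight) * semantic_score


def rank_hybrid(keyword_scores, semantic_scores, keyword_weight=0.3, top_k=10):
    """Combine per-document scores; a document missing from one side scores 0 there."""
    doc_ids = set(keyword_scores) | set(semantic_scores)
    combined = {
        doc_id: hybrid_score(
            keyword_scores.get(doc_id, 0.0),
            semantic_scores.get(doc_id, 0.0),
            keyword_weight,
        )
        for doc_id in doc_ids
    }
    # Sort by descending score, breaking ties by document ID for determinism.
    return sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))[:top_k]


# Hypothetical scores, for illustration only.
keyword_scores = {"doc-a": 1.0, "doc-b": 0.2}
semantic_scores = {"doc-b": 0.9, "doc-c": 0.8}
ranked = rank_hybrid(keyword_scores, semantic_scores)
```

With a weight of 0.3, a strong semantic match outranks an exact keyword hit that has no semantic support, which is the behaviour this blend is meant to produce.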

### Duplicate Detection

```yaml
task: Identify duplicate content
skill: semantic-similarity
parameters:
  threshold: 0.92
  scope: all-documents
  output: duplicate-report.json
  action: flag_for_review
```
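
A threshold-based check like the one configured above could look roughly like this. It is a sketch under assumptions: the report fields, the function name `find_duplicates`, and the sample embeddings are hypothetical, and the pairwise comparison is quadratic in the number of documents, so a large corpus would need an ANN index or blocking step first.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def find_duplicates(embeddings, threshold=0.92):
    """Return every document pair whose cosine similarity meets the threshold."""
    report = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            report.append({
                "a": id_a,
                "b": id_b,
                "similarity": round(score, 4),
                "action": "flag_for_review",
            })
    return report


# Hypothetical embeddings, for illustration only.
embeddings = {
    "faq-1": [0.9, 0.1, 0.0],
    "faq-1-copy": [0.88, 0.12, 0.01],
    "guide-7": [0.0, 0.2, 0.95],
}
report = find_duplicates(embeddings)
```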

### Topic Modeling

```yaml
task: Generate topic model for knowledge base
skill: semantic-similarity
parameters:
  method: bertopic
  min_topic_size: 10
  nr_topics: auto
  output: topic-model
  visualizations: true
```

## Best Practices

1. **Choose appropriate embedding models** - Match model to content type and language
2. **Normalize embeddings** - Ensure consistent similarity scores across documents
3. **Set appropriate thresholds** - Tune similarity thresholds for your use case
4. **Implement hybrid search** - Combine semantic and keyword search for best results
5. **Monitor embedding drift** - Re-embed content periodically as models improve
6. **Consider latency** - Cache frequently used embeddings for performance
7. **Plan for scale** - Use ANN indexes for large document collections
8. **Handle long documents** - Implement chunking strategies for lengthy content
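
Two of the practices above, normalizing embeddings and chunking long documents, can be illustrated with a short sketch. The helper names and the word-based chunking with overlap are assumptions chosen for simplicity; production systems often chunk by tokens or by document structure instead.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(text, chunk_size=4, overlap=1)
unit = l2_normalize([3.0, 4.0])
```

Once vectors are unit length, cosine similarity and dot product give the same score, which simplifies threshold tuning.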

## Architecture Patterns

### Basic Semantic Search Pipeline

```
Document -> Chunking -> Embedding -> Vector Store -> Query -> Results
```

### Hybrid Search Architecture

```
Query -> [Keyword Search]  -> Keyword Results  --+
      -> [Semantic Search] -> Semantic Results --+--> [Reranking] -> Final Results
```
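
The source does not name a reranking method. One common option for merging a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the skill's approach. The constant `k=60` is the conventional default for RRF, and the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, best match first.
keyword_results = ["doc-a", "doc-b", "doc-c"]
semantic_results = ["doc-b", "doc-d", "doc-a"]
fused = reciprocal_rank_fusion([keyword_results, semantic_results])
```

RRF uses only rank positions, so it avoids having to put keyword scores and similarity scores on a common scale.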

### Recommendation Pipeline

```
User Context -> Find Similar Content -> Filter by Metadata -> Personalize -> Recommend
```
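
The stages of this pipeline could be sketched as follows. Everything here is illustrative: the catalog schema (`id`, `type`, `vector`), the use of a content-type filter as the metadata step, and the choice to personalize by hiding already-viewed items are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def recommend(context_vec, catalog, allowed_types, seen_ids, top_k=3):
    # 1. Find similar content: score every item against the user context.
    scored = sorted(
        ((cosine(context_vec, item["vector"]), item) for item in catalog),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # 2. Filter by metadata: keep only permitted content types.
    filtered = [(s, item) for s, item in scored if item["type"] in allowed_types]
    # 3. Personalize: drop items this user has already viewed.
    fresh = [(s, item) for s, item in filtered if item["id"] not in seen_ids]
    # 4. Recommend: return the IDs of the best remaining items.
    return [item["id"] for _, item in fresh[:top_k]]


# Hypothetical catalog, for illustration only.
catalog = [
    {"id": "kb-1", "type": "article", "vector": [1.0, 0.0]},
    {"id": "kb-2", "type": "article", "vector": [0.8, 0.6]},
    {"id": "kb-3", "type": "video", "vector": [0.9, 0.1]},
    {"id": "kb-4", "type": "article", "vector": [0.0, 1.0]},
]
recs = recommend([1.0, 0.0], catalog, allowed_types={"article"},
                 seen_ids={"kb-1"}, top_k=2)
```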

## Metrics

Key metrics for semantic similarity systems:

| Metric | Description | Target |
|--------|-------------|--------|
| Retrieval Precision | Relevant results in top-k | > 80% |
| Search Latency | Time for similarity search | < 200ms |
| Duplicate Detection F1 | Accuracy of duplicate finding | > 90% |
| Topic Coherence | Quality of topic models | > 0.5 |
| User Satisfaction | Relevance ratings | > 4.0/5.0 |
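
Two of the metrics in the table, retrieval precision at k and duplicate detection F1, can be computed as in the sketch below. The function names and sample data are hypothetical; the formulas are the standard definitions of precision@k and F1.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def f1_score(predicted, actual):
    """Harmonic mean of precision and recall over two sets of items."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical evaluation data, for illustration only.
retrieved = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "d", "e"}
p_at_5 = precision_at_k(retrieved, relevant, k=5)

# Duplicate pairs are stored as sorted tuples so (x, y) and (y, x) match.
predicted_pairs = {("d1", "d2"), ("d3", "d4"), ("d5", "d6")}
actual_pairs = {("d1", "d2"), ("d3", "d4"), ("d7", "d8")}
dup_f1 = f1_score(predicted_pairs, actual_pairs)
```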

## Related Skills

- **knowledge-graph** (SK-008): Graph-based semantic relationships
- **search-engine** (SK-005): Enterprise search integration
- **content-curation** (SK-010): Quality-based content management

## Related Agents

- **kg-specialist** (AG-008): Knowledge graph and semantic expertise
- **search-expert** (AG-004): Search optimization guidance
- **knowledge-architect** (AG-001): Overall KM strategy alignment

Related Skills

All of the following are from a5c-ai/babysitter (509 stars).

  • semantic-code-analyzer: LLM-powered semantic analysis of code diffs to detect business-logic trojans
  • semantic-scholar-search: Academic literature search using Semantic Scholar API for citation-aware paper discovery
  • operational-semantics-builder: Define and test operational semantics specifications for programming languages
  • BI Semantic Layer Generator: Generates semantic layer definitions for BI tools from dimensional models
  • semantic-kernel-setup: Microsoft Semantic Kernel planner and plugin setup for orchestrated AI
  • process-builder: Scaffold new babysitter process definitions following SDK patterns, proper structure, and best practices. Guides the 3-phase workflow from research to implementation.

Workflow & Productivity

  • babysitter: Orchestrate via @babysitter. Use this skill when asked to babysit a run, orchestrate a process or whenever it is called explicitly. (babysit, babysitter, orchestrate, orchestrate a run, workflow, etc.)
  • yolo: Run Babysitter autonomously with minimal manual interruption.
  • user-install: Install the user-level Babysitter Codex setup.
  • team-install: Install the team-pinned Babysitter Codex workspace setup.
  • retrospect: Summarize or retrospect on a completed Babysitter run.
  • resume: Resume an existing Babysitter run from Codex.