claw-semantic-sim

Semantic Similarity Index for disease research literature using PubMedBERT embeddings

25 stars

by ComeOnOliver

Best use case

claw-semantic-sim is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using claw-semantic-sim can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/claw-semantic-sim/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/ClawBio/ClawBio/claw-semantic-sim/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/claw-semantic-sim/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How claw-semantic-sim Compares

Feature / Agent	claw-semantic-sim	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

It builds a Semantic Similarity Index for disease research literature using PubMedBERT embeddings of PubMed abstracts.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# 🦖 Semantic Similarity Index

Measure how isolated or connected disease research is across the global biomedical literature, using PubMedBERT embeddings on PubMed abstracts spanning 175 GBD diseases.

## What it does

1. Takes a disease list (GBD taxonomy) as input
2. Retrieves PubMed abstracts (2000-2025) for each disease with quality filtering
3. Generates 768-dimensional PubMedBERT embeddings for every abstract
4. Computes four semantic equity metrics per disease:
   - **Semantic Isolation Index (SII)**: average cosine distance to k-nearest disease neighbours; higher = more isolated, less connected research
   - **Knowledge Transfer Potential (KTP)**: cross-disease centroid similarity; higher = more potential for research spillover
   - **Research Clustering Coefficient (RCC)**: within-disease embedding variance; higher = more diverse research approaches
   - **Temporal Semantic Drift**: cosine distance between yearly centroids; measures how research focus evolves
5. Generates publication-quality multi-panel figures:
   - **Panel A**: Semantic isolation by disease category (boxplot)
   - **Panel B**: Top 20 most semantically isolated diseases (bar chart, NTD/Global South colour-coded)
   - **Panel C**: Semantic isolation vs research volume (scatter with regression)
   - **Panel D**: NTD vs non-NTD significance test (Welch's t-test, Cohen's d)
6. Produces a markdown report with all metrics and rankings, plus a reproducibility bundle
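
The metric definitions in step 4 can be illustrated with a minimal sketch. It is not the skill's implementation: it assumes SII and KTP are computed on per-disease centroids, that RCC is the mean squared distance of abstracts to their centroid, and it uses tiny made-up 3-dimensional vectors instead of 768-dimensional PubMedBERT embeddings. All function names and the choice of `k` are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def isolation_index(centroids, disease, k=2):
    """SII sketch: mean cosine distance to the k nearest other centroids."""
    distances = sorted(
        cosine_distance(centroids[disease], other)
        for name, other in centroids.items()
        if name != disease
    )
    nearest = distances[:k]
    return sum(nearest) / len(nearest)

def transfer_potential(centroids, disease):
    """KTP sketch: mean cosine similarity to all other centroids."""
    sims = [
        1.0 - cosine_distance(centroids[disease], other)
        for name, other in centroids.items()
        if name != disease
    ]
    return sum(sims) / len(sims)

def clustering_coefficient(vectors):
    """RCC sketch: mean squared Euclidean distance to the centroid."""
    c = centroid(vectors)
    return sum(
        sum((a - b) ** 2 for a, b in zip(v, c)) for v in vectors
    ) / len(vectors)

# Toy embeddings (illustrative values only, not real data).
embeddings = {
    "malaria": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "dengue": [[0.8, 0.2, 0.0], [0.9, 0.3, 0.0]],
    "yaws": [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
}
centroids = {name: centroid(vs) for name, vs in embeddings.items()}
sii = {name: isolation_index(centroids, name, k=1) for name in centroids}
ktp = {name: transfer_potential(centroids, name) for name in centroids}
rcc = {name: clustering_coefficient(vs) for name, vs in embeddings.items()}
```

In this toy set, `yaws` points in a different direction from the other two diseases, so it receives the highest SII and the lowest KTP.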

## Why this exists

If you ask ChatGPT to "measure research neglect for diseases," it will:
- Not know which embedding model to use for biomedical text
- Hallucinate metrics that sound plausible but have no methodological grounding
- Skip quality filtering (year coverage, abstract coverage, minimum papers)
- Not handle MPS acceleration or checkpointed batch processing
- Produce a single scatter plot with no disease classification

This skill encodes the correct methodological decisions:
- Uses PubMedBERT (a BERT model pretrained from scratch on PubMed text, widely used for biomedical NLP)
- Fetches from PubMed with exponential backoff and NCBI rate limiting
- Quality filters: year coverage >= 70%, abstract coverage >= 95%, minimum 50 papers
- Batch embedding with Apple MPS acceleration and CPU fallback
- Checkpointed processing (resume after interruption)
- HDF5 storage with gzip compression and SHA-256 checksums
- Classification against WHO NTD list and Global South priority diseases
- Statistical significance testing (Welch's t-test, Cohen's d)
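
As an illustration of the quality filters listed above, here is a minimal sketch. The thresholds (70%, 95%, 50 papers) and the 2000-2025 window come from this document; everything else is an assumption. In particular, "year coverage" is read here as the fraction of years in the window with at least one paper, and the record layout (dicts with `year` and `abstract` keys) and function names are hypothetical.

```python
def quality_report(papers, start_year=2000, end_year=2025):
    """Summarise coverage for one disease.

    papers: list of dicts with 'year' (int or None) and 'abstract' (str or None).
    """
    window = range(start_year, end_year + 1)
    years_seen = {p.get("year") for p in papers if p.get("year") in window}
    with_abstract = sum(1 for p in papers if (p.get("abstract") or "").strip())
    return {
        "n_papers": len(papers),
        # Assumption: share of years in the window that have any paper.
        "year_coverage": len(years_seen) / len(window),
        # Share of records that carry a non-empty abstract.
        "abstract_coverage": with_abstract / len(papers) if papers else 0.0,
    }

def passes_quality_filters(papers, min_papers=50,
                           min_year_coverage=0.70,
                           min_abstract_coverage=0.95):
    """Apply the three thresholds stated in the skill description."""
    report = quality_report(papers)
    return (
        report["n_papers"] >= min_papers
        and report["year_coverage"] >= min_year_coverage
        and report["abstract_coverage"] >= min_abstract_coverage
    )

# Toy input: 60 records spread evenly over 2000-2025, all with abstracts.
good = [{"year": 2000 + i % 26, "abstract": "text"} for i in range(60)]
print(passes_quality_filters(good))
```

A disease that fails any one of the three checks would be excluded before embedding, which keeps sparse or poorly annotated literatures from distorting the metrics.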

## Key Finding

Neglected tropical diseases (NTDs) are significantly more semantically isolated than other conditions (P < 0.001, Cohen's d ≥ 0.8). They exist in knowledge silos with limited cross-disciplinary research bridges. The 25 most isolated diseases are disproportionately Global South priority conditions.
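
For readers unfamiliar with the two statistics behind this finding, the sketch below computes Welch's t statistic (with Welch-Satterthwaite degrees of freedom) and Cohen's d using only the standard library. The SII values are made up for illustration and are not results from this skill; the function names are hypothetical. A p-value needs the t-distribution, which `scipy.stats.ttest_ind(a, b, equal_var=False)` provides if SciPy is available.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled

# Illustrative SII values only (not output of the pipeline).
ntd_sii = [0.15, 0.16, 0.14, 0.17, 0.18]
other_sii = [0.08, 0.09, 0.10, 0.07, 0.11, 0.09]

t_stat, dof = welch_t(ntd_sii, other_sii)
effect = cohens_d(ntd_sii, other_sii)
print(f"t = {t_stat:.2f}, df = {dof:.1f}, d = {effect:.2f}")
```

Welch's test is the variant that does not assume equal variances in the two groups, which suits a comparison between a small NTD group and a larger, more heterogeneous remainder.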

## Pipeline

```
05-00-heim-sem-setup.py     # Validate environment, create directories
05-01-heim-sem-fetch.py     # Retrieve PubMed abstracts (checkpointed)
05-02-heim-sem-embed.py     # Generate PubMedBERT embeddings (MPS/CPU)
05-03-heim-sem-compute.py   # Compute SII, KTP, RCC, temporal drift
05-04-heim-sem-figures.py   # Generate publication figures
05-05-heim-sem-integrate.py # Merge with biobank + clinical trial dimensions
```
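
The fetch stage is described as checkpointed, meaning it can resume after an interruption. The pipeline scripts themselves are not shown here, so the following is only a generic sketch of that pattern: completed items are recorded in a JSON file after each step and skipped on the next run. The file layout and the names `load_checkpoint`, `save_checkpoint`, and `process_all` are hypothetical.

```python
import json
import os

def load_checkpoint(path):
    """Return the set of item names already processed (empty if no file)."""
    if os.path.exists(path):
        with open(path) as fh:
            return set(json.load(fh))
    return set()

def save_checkpoint(path, done):
    """Write the checkpoint atomically: temp file first, then rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, path)

def process_all(items, handler, checkpoint_path):
    """Run handler on every item not yet recorded in the checkpoint."""
    done = load_checkpoint(checkpoint_path)
    for item in items:
        if item in done:
            continue  # finished in an earlier run
        handler(item)  # may raise; progress so far is already on disk
        done.add(item)
        save_checkpoint(checkpoint_path, done)
    return done
```

Calling `process_all(diseases, fetch_one, "fetch_checkpoint.json")` twice, where `fetch_one` is whatever per-disease function does the work, would repeat only the diseases that did not finish the first time.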

### Demo (works out of the box)

```bash
python semantic_sim.py --demo --output demo_report
```

The demo uses pre-computed embeddings and metrics for 175 GBD diseases and generates the full 4-panel figure instantly.

## Example Output

```
Semantic Similarity Index
=========================
Diseases analysed: 175
Total PubMed abstracts: 13,100,000
Embedding model: PubMedBERT (768-dim)

Metric Ranges:
  SII: 0.0412 - 0.1893
  KTP: 0.6234 - 0.9187
  RCC: 0.0891 - 0.3421

Key Finding:
  NTDs show +38% higher semantic isolation
  P < 0.0001, Cohen's d = 0.84
  14/25 most isolated diseases are Global South priority

Figures saved to: demo_report/
  Fig5_Semantic_Structure.png (300 dpi)
  Fig5_Semantic_Structure.pdf (vector)

Reproducibility:
  commands.sh | environment.yml | checksums.sha256
```
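
The `checksums.sha256` file in the reproducibility bundle can be pictured with the sketch below, which hashes every file in an output directory and writes one `<digest>  <relative path>` line per file. That line format mirrors the `sha256sum` convention and is an assumption here, as are the function names; the skill's actual manifest may differ.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large outputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(directory, manifest_name="checksums.sha256"):
    """Write '<sha256>  <relative path>' for every file under directory."""
    directory = Path(directory)
    lines = []
    for path in sorted(directory.rglob("*")):
        if path.is_file() and path.name != manifest_name:
            rel = path.relative_to(directory).as_posix()
            lines.append(f"{sha256_of_file(path)}  {rel}")
    manifest = directory / manifest_name
    manifest.write_text("\n".join(lines) + "\n")
    return manifest
```

For example, `write_manifest("demo_report")` would cover the figure files listed in the output above, and anyone re-running the analysis can compare digests to confirm they produced identical files.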

## Interpretation Guide

- **High SII**: Disease research exists in a knowledge silo; limited cross-disciplinary bridges
- **Low KTP**: Research on this disease has few methodological overlaps with others
- **High RCC**: Diverse research approaches within the disease (many subtopics)
- **High Temporal Drift**: Research focus has shifted significantly over time
- NTDs shown in **red**, Global South diseases in **orange**, others in **grey**
- The scatter plot (Panel C) reveals the inverse relationship between research volume and isolation
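
To make the temporal drift reading concrete, here is a small sketch that computes the cosine distance between yearly centroids. It assumes drift is measured between consecutive available years, which the description above does not specify, and it uses toy 2-dimensional centroids. The function names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def temporal_drift(yearly_centroids):
    """Map each pair of consecutive years to the distance between centroids.

    yearly_centroids: dict of year -> centroid vector for one disease.
    """
    years = sorted(yearly_centroids)
    return {
        (a, b): cosine_distance(yearly_centroids[a], yearly_centroids[b])
        for a, b in zip(years, years[1:])
    }

# Toy centroids: no change from 2000 to 2001, a large shift in 2002.
toy = {2000: [1.0, 0.0], 2001: [1.0, 0.0], 2002: [0.0, 1.0]}
drift = temporal_drift(toy)
```

Here `drift[(2000, 2001)]` is zero because the centroid did not move, while `drift[(2001, 2002)]` is large because the two centroids are orthogonal.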

## Citation

If you use this skill in a publication, please cite:

- Corpas, M. et al. (2026). HEIM: Health Equity Index for Measuring structural bias in biomedical research. Under review.
- Corpas, M. (2026). ClawBio. https://github.com/ClawBio/ClawBio

Related Skills

Semantic Scholar API Skill

from ComeOnOliver/skillshub

## Function description

snowflake-semanticview

from ComeOnOliver/skillshub

Create, alter, and validate Snowflake semantic views using Snowflake CLI (snow). Use when asked to build or troubleshoot semantic views/semantic layer definitions with CREATE/ALTER SEMANTIC VIEW, to validate semantic-view DDL against Snowflake via CLI, or to guide Snowflake CLI installation and connection setup.

semantic-kernel

from ComeOnOliver/skillshub

Create, update, refactor, explain, or review Semantic Kernel solutions using shared guidance plus language-specific references for .NET and Python.

building-dbt-semantic-layer

from ComeOnOliver/skillshub

Use when creating or modifying dbt Semantic Layer components — semantic models, metrics, dimensions, entities, measures, or time spines. Covers MetricFlow configuration, metric types (simple, derived, cumulative, ratio, conversion), and validation for both latest and legacy YAML specs.

openclaw-secure-linux-cloud

from ComeOnOliver/skillshub

Use when self-hosting OpenClaw on a cloud server, hardening a remote OpenClaw gateway, choosing between SSH tunneling, Tailscale, or reverse-proxy exposure, or reviewing Podman, pairing, sandboxing, token auth, and tool-permission defaults for a secure personal deployment.

clawlabor

from ComeOnOliver/skillshub

The autonomous marketplace where AI agents discover, purchase, and sell specialized AI capabilities. Search for services, post tasks with escrow-protected payments, create listings, manage orders, and handle the full transaction lifecycle. Use when the user needs to find, hire, buy, or sell AI capabilities.

instaclaw

from ComeOnOliver/skillshub

Photo sharing platform for AI agents. Use this skill to share images, browse feeds, like posts, comment, and follow other agents. Requires ATXP authentication.

clawdirect

from ComeOnOliver/skillshub

Interact with ClawDirect, a directory of social web experiences for AI agents. Use this skill to browse the directory, like entries, or add new sites. Requires ATXP authentication for MCP tool calls. Triggers: browsing agent-oriented websites, discovering social platforms for agents, liking/voting on directory entries, or submitting new agent-facing sites to ClawDirect.

clawdirect-dev

from ComeOnOliver/skillshub

Build agent-facing web experiences with ATXP-based authentication, following the ClawDirect pattern. Use this skill when building websites that AI agents interact with via MCP tools, implementing cookie-based agent auth, or creating agent skills for web apps. Provides templates using @longrun/turtle, Express, SQLite, and ATXP.

openclaw-feishu-ops-assistant

from ComeOnOliver/skillshub

Feishu (Lark) workspace operations for OpenClaw agents. Provides document CRUD, cloud drive management, permission control, and knowledge-base navigation through a unified tool surface. Activate when user mentions Feishu docs, wiki, drive, permissions, or Lark cloud documents.

agentdb-semantic-vector-search

from ComeOnOliver/skillshub

Build semantic vector search systems with AgentDB for intelligent document retrieval, RAG applications, and knowledge bases using embedding-based similarity matching

semantic-code-hunter

from ComeOnOliver/skillshub

Use when you need to find code by concept (not just text). Uses Serena MCP for semantic code search across the codebase with minimal token usage. Ideal for understanding architecture, finding authentication flows, or multi-file refactoring.