claw-semantic-sim
Semantic Similarity Index for disease research literature using PubMedBERT embeddings
Best use case
claw-semantic-sim is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using claw-semantic-sim should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/claw-semantic-sim/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How claw-semantic-sim Compares
| Feature / Agent | claw-semantic-sim | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Semantic Similarity Index for disease research literature using PubMedBERT embeddings
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# 🦖 Semantic Similarity Index

Measure how isolated or connected disease research is across the global biomedical literature, using PubMedBERT embeddings on PubMed abstracts spanning 175 GBD diseases.

## What it does

1. Takes a disease list (GBD taxonomy) as input
2. Retrieves PubMed abstracts (2000-2025) for each disease with quality filtering
3. Generates 768-dimensional PubMedBERT embeddings for every abstract
4. Computes four semantic equity metrics per disease:
   - **Semantic Isolation Index (SII)**: average cosine distance to k-nearest disease neighbours; higher = more isolated, less connected research
   - **Knowledge Transfer Potential (KTP)**: cross-disease centroid similarity; higher = more potential for research spillover
   - **Research Clustering Coefficient (RCC)**: within-disease embedding variance; higher = more diverse research approaches
   - **Temporal Semantic Drift**: cosine distance between yearly centroids; measures how research focus evolves
5. Generates publication-quality multi-panel figures:
   - **Panel A**: Semantic isolation by disease category (boxplot)
   - **Panel B**: Top 20 most semantically isolated diseases (bar chart, NTD/Global South colour-coded)
   - **Panel C**: Semantic isolation vs research volume (scatter with regression)
   - **Panel D**: NTD vs non-NTD significance test (Welch's t-test, Cohen's d)
6. Produces a markdown report with all metrics, rankings, and reproducibility bundle

## Why this exists

If you ask ChatGPT to "measure research neglect for diseases," it will:

- Not know which embedding model to use for biomedical text
- Hallucinate metrics that sound plausible but have no methodological grounding
- Skip quality filtering (year coverage, abstract coverage, minimum papers)
- Not handle MPS acceleration or checkpointed batch processing
- Produce a single scatter plot with no disease classification

This skill encodes the correct methodological decisions:

- Uses PubMedBERT (the gold-standard biomedical language model)
- Fetches from PubMed with exponential backoff and NCBI rate limiting
- Quality filters: year coverage >= 70%, abstract coverage >= 95%, minimum 50 papers
- Batch embedding with Apple MPS acceleration and CPU fallback
- Checkpointed processing (resume after interruption)
- HDF5 storage with gzip compression and SHA-256 checksums
- Classification against WHO NTD list and Global South priority diseases
- Statistical significance testing (Welch's t-test, Cohen's d)

## Key Finding

Neglected tropical diseases (NTDs) are significantly more semantically isolated than other conditions (P < 0.001, Cohen's d = 0.8+). They exist in knowledge silos with limited cross-disciplinary research bridges. The 25 most isolated diseases are disproportionately Global South priority conditions.

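
The NTD versus non-NTD comparison behind the key finding rests on Welch's t-test and Cohen's d. The following is a minimal sketch of both statistics using only the Python standard library, not the skill's actual implementation. The function names `welch_t` and `cohens_d` are hypothetical, the pooled-standard-deviation form of Cohen's d is an assumption (the source does not say which variant is used), and the SII values are made up for illustration.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each group mean (unequal variances allowed).
    se_a = statistics.variance(a) / len(a)
    se_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


def cohens_d(a, b):
    """Return Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Made-up SII values for illustration only; these are not results from the skill.
ntd_sii = [0.15, 0.17, 0.16, 0.18]
other_sii = [0.10, 0.12, 0.11, 0.13, 0.09]

t_stat, dof = welch_t(ntd_sii, other_sii)
effect = cohens_d(ntd_sii, other_sii)
print(f"t = {t_stat:.3f}, df = {dof:.2f}, d = {effect:.3f}")
```

The sketch stops at the test statistic because a p-value needs the t-distribution CDF, which the standard library does not provide. In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both the statistic and the p-value for Welch's test.
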
## Pipeline

```
05-00-heim-sem-setup.py      # Validate environment, create directories
05-01-heim-sem-fetch.py      # Retrieve PubMed abstracts (checkpointed)
05-02-heim-sem-embed.py      # Generate PubMedBERT embeddings (MPS/CPU)
05-03-heim-sem-compute.py    # Compute SII, KTP, RCC, temporal drift
05-04-heim-sem-figures.py    # Generate publication figures
05-05-heim-sem-integrate.py  # Merge with biobank + clinical trial dimensions
```

### Demo (works out of the box)

```bash
python semantic_sim.py --demo --output demo_report
```

The demo uses pre-computed embeddings and metrics for 175 GBD diseases and generates the full 4-panel figure instantly.

## Example Output

```
Semantic Similarity Index
=========================
Diseases analysed: 175
Total PubMed abstracts: 13,100,000
Embedding model: PubMedBERT (768-dim)

Metric Ranges:
  SII: 0.0412 - 0.1893
  KTP: 0.6234 - 0.9187
  RCC: 0.0891 - 0.3421

Key Finding:
  NTDs show +38% higher semantic isolation
  P < 0.0001, Cohen's d = 0.84
  14/25 most isolated diseases are Global South priority

Figures saved to: demo_report/
  Fig5_Semantic_Structure.png (300 dpi)
  Fig5_Semantic_Structure.pdf (vector)

Reproducibility: commands.sh | environment.yml | checksums.sha256
```

## Interpretation Guide

- **High SII**: Disease research exists in a knowledge silo; limited cross-disciplinary bridges
- **Low KTP**: Research on this disease has few methodological overlaps with others
- **High RCC**: Diverse research approaches within the disease (many subtopics)
- **High Temporal Drift**: Research focus has shifted significantly over time
- NTDs shown in **red**, Global South diseases in **orange**, others in **grey**
- The scatter plot (Panel C) reveals the inverse relationship between research volume and isolation

## Citation

If you use this skill in a publication, please cite:

- Corpas, M. et al. (2026). HEIM: Health Equity Index for Measuring structural bias in biomedical research. Under review.
- Corpas, M. (2026). ClawBio. https://github.com/ClawBio/ClawBio
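
To make the SII reading in the interpretation guide concrete, here is a minimal sketch of "average cosine distance to k-nearest disease neighbours". It assumes each disease is represented by the centroid of its abstract embeddings, which the source does not confirm. The function names, the dictionary layout, and the 3-dimensional toy vectors are all hypothetical, and the code uses only the standard library where a real pipeline would likely vectorise with NumPy.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def semantic_isolation_index(centroids, k=2):
    """Mean cosine distance from each disease to its k nearest other diseases.

    `centroids` maps disease name -> centroid vector (hypothetical layout).
    Requires at least two diseases.
    """
    sii = {}
    for name, vec in centroids.items():
        # Distances to every other disease, smallest first.
        distances = sorted(
            cosine_distance(vec, other)
            for other_name, other in centroids.items()
            if other_name != name
        )
        nearest = distances[:k]
        sii[name] = sum(nearest) / len(nearest)
    return sii


# Toy 3-dimensional centroids; real PubMedBERT centroids would be 768-dimensional.
toy_centroids = {
    "disease_a": [1.0, 0.0, 0.0],
    "disease_b": [0.9, 0.1, 0.0],
    "disease_c": [0.8, 0.2, 0.0],
    "disease_d": [0.0, 0.0, 1.0],
}

toy_sii = semantic_isolation_index(toy_centroids, k=2)
most_isolated = max(toy_sii, key=toy_sii.get)
print(most_isolated, round(toy_sii[most_isolated], 4))
```

In this toy data, `disease_d` points in a direction orthogonal to the other three, so it has the highest SII. This is the "knowledge silo" pattern the guide describes for high-SII diseases.
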
Related Skills
Semantic Scholar API Skill
## Function Description
snowflake-semanticview
Create, alter, and validate Snowflake semantic views using Snowflake CLI (snow). Use when asked to build or troubleshoot semantic views/semantic layer definitions with CREATE/ALTER SEMANTIC VIEW, to validate semantic-view DDL against Snowflake via CLI, or to guide Snowflake CLI installation and connection setup.
semantic-kernel
Create, update, refactor, explain, or review Semantic Kernel solutions using shared guidance plus language-specific references for .NET and Python.
building-dbt-semantic-layer
Use when creating or modifying dbt Semantic Layer components — semantic models, metrics, dimensions, entities, measures, or time spines. Covers MetricFlow configuration, metric types (simple, derived, cumulative, ratio, conversion), and validation for both latest and legacy YAML specs.
openclaw-secure-linux-cloud
Use when self-hosting OpenClaw on a cloud server, hardening a remote OpenClaw gateway, choosing between SSH tunneling, Tailscale, or reverse-proxy exposure, or reviewing Podman, pairing, sandboxing, token auth, and tool-permission defaults for a secure personal deployment.
clawlabor
The autonomous marketplace where AI agents discover, purchase, and sell specialized AI capabilities. Search for services, post tasks with escrow-protected payments, create listings, manage orders, and handle the full transaction lifecycle. Use when the user needs to find, hire, buy, or sell AI capabilities.
instaclaw
Photo sharing platform for AI agents. Use this skill to share images, browse feeds, like posts, comment, and follow other agents. Requires ATXP authentication.
clawdirect
Interact with ClawDirect, a directory of social web experiences for AI agents. Use this skill to browse the directory, like entries, or add new sites. Requires ATXP authentication for MCP tool calls. Triggers: browsing agent-oriented websites, discovering social platforms for agents, liking/voting on directory entries, or submitting new agent-facing sites to ClawDirect.
clawdirect-dev
Build agent-facing web experiences with ATXP-based authentication, following the ClawDirect pattern. Use this skill when building websites that AI agents interact with via MCP tools, implementing cookie-based agent auth, or creating agent skills for web apps. Provides templates using @longrun/turtle, Express, SQLite, and ATXP.
openclaw-feishu-ops-assistant
Feishu (Lark) workspace operations for OpenClaw agents. Provides document CRUD, cloud drive management, permission control, and knowledge-base navigation through a unified tool surface. Activate when user mentions Feishu docs, wiki, drive, permissions, or Lark cloud documents.
agentdb-semantic-vector-search
Build semantic vector search systems with AgentDB for intelligent document retrieval, RAG applications, and knowledge bases using embedding-based similarity matching
semantic-code-hunter
Use when you need to find code by concept (not just text). Uses Serena MCP for semantic code search across the codebase with minimal token usage. Ideal for understanding architecture, finding authentication flows, or multi-file refactoring.