claw-semantic-sim
Semantic Similarity Index for disease research literature using PubMedBERT embeddings
Best use case
claw-semantic-sim is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using claw-semantic-sim should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/claw-semantic-sim/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How claw-semantic-sim Compares
| Feature / Agent | claw-semantic-sim | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Semantic Similarity Index for disease research literature using PubMedBERT embeddings
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# 🦖 Semantic Similarity Index

Measure how isolated or connected disease research is across the global biomedical literature, using PubMedBERT embeddings on PubMed abstracts spanning 175 GBD diseases.

## What it does

1. Takes a disease list (GBD taxonomy) as input
2. Retrieves PubMed abstracts (2000-2025) for each disease with quality filtering
3. Generates 768-dimensional PubMedBERT embeddings for every abstract
4. Computes four semantic equity metrics per disease:
   - **Semantic Isolation Index (SII)**: average cosine distance to k-nearest disease neighbours; higher = more isolated, less connected research
   - **Knowledge Transfer Potential (KTP)**: cross-disease centroid similarity; higher = more potential for research spillover
   - **Research Clustering Coefficient (RCC)**: within-disease embedding variance; higher = more diverse research approaches
   - **Temporal Semantic Drift**: cosine distance between yearly centroids; measures how research focus evolves
5. Generates publication-quality multi-panel figures:
   - **Panel A**: Semantic isolation by disease category (boxplot)
   - **Panel B**: Top 20 most semantically isolated diseases (bar chart, NTD/Global South colour-coded)
   - **Panel C**: Semantic isolation vs research volume (scatter with regression)
   - **Panel D**: NTD vs non-NTD significance test (Welch's t-test, Cohen's d)
6. Produces a markdown report with all metrics, rankings, and reproducibility bundle

## Why this exists

If you ask ChatGPT to "measure research neglect for diseases," it will:

- Not know which embedding model to use for biomedical text
- Hallucinate metrics that sound plausible but have no methodological grounding
- Skip quality filtering (year coverage, abstract coverage, minimum papers)
- Not handle MPS acceleration or checkpointed batch processing
- Produce a single scatter plot with no disease classification

This skill encodes the correct methodological decisions:

- Uses PubMedBERT (the gold-standard biomedical language model)
- Fetches from PubMed with exponential backoff and NCBI rate limiting
- Quality filters: year coverage >= 70%, abstract coverage >= 95%, minimum 50 papers
- Batch embedding with Apple MPS acceleration and CPU fallback
- Checkpointed processing (resume after interruption)
- HDF5 storage with gzip compression and SHA-256 checksums
- Classification against WHO NTD list and Global South priority diseases
- Statistical significance testing (Welch's t-test, Cohen's d)

## Key Finding

Neglected tropical diseases (NTDs) are significantly more semantically isolated than other conditions (P < 0.001, Cohen's d = 0.8+). They exist in knowledge silos with limited cross-disciplinary research bridges. The 25 most isolated diseases are disproportionately Global South priority conditions.
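The SII definition and the NTD vs non-NTD comparison described above can be illustrated with a minimal, standard-library-only sketch. It is not the skill's actual implementation. It assumes each disease is represented by one centroid vector (the mean of its abstract embeddings) and that `k` is a free parameter; the function names `semantic_isolation_index`, `welch_t`, and `cohens_d` are hypothetical, and all numbers are made-up toy values.

```python
import math
from statistics import mean, variance


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def semantic_isolation_index(centroids, k=2):
    """Mean cosine distance from each disease centroid to its k nearest others.

    `centroids` maps disease name -> centroid vector (an assumed representation).
    """
    sii = {}
    for name, vec in centroids.items():
        dists = sorted(
            cosine_distance(vec, other)
            for other_name, other in centroids.items()
            if other_name != name
        )
        nearest = dists[:k]
        sii[name] = sum(nearest) / len(nearest)
    return sii


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Toy 2-D centroids (made-up values, not real 768-dim embeddings).
toy_centroids = {
    "disease_a": [1.0, 0.0],
    "disease_b": [1.0, 0.1],
    "disease_c": [1.0, 0.2],
    "disease_d": [0.0, 1.0],  # points away from the other three
}
sii = semantic_isolation_index(toy_centroids, k=2)
most_isolated = max(sii, key=sii.get)

# Made-up SII values for two groups, only to exercise the functions.
group_ntd = [0.15, 0.16, 0.17]
group_other = [0.05, 0.06, 0.07]
t_stat, dof = welch_t(group_ntd, group_other)
effect = cohens_d(group_ntd, group_other)
print(most_isolated, round(t_stat, 2), round(dof, 2), round(effect, 2))
```

The sketch stops at the t statistic and degrees of freedom. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` runs Welch's test and also returns a p-value.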
## Pipeline

```
05-00-heim-sem-setup.py       # Validate environment, create directories
05-01-heim-sem-fetch.py       # Retrieve PubMed abstracts (checkpointed)
05-02-heim-sem-embed.py       # Generate PubMedBERT embeddings (MPS/CPU)
05-03-heim-sem-compute.py     # Compute SII, KTP, RCC, temporal drift
05-04-heim-sem-figures.py     # Generate publication figures
05-05-heim-sem-integrate.py   # Merge with biobank + clinical trial dimensions
```

### Demo (works out of the box)

```bash
python semantic_sim.py --demo --output demo_report
```

The demo uses pre-computed embeddings and metrics for 175 GBD diseases and generates the full 4-panel figure instantly.

## Example Output

```
Semantic Similarity Index
=========================
Diseases analysed: 175
Total PubMed abstracts: 13,100,000
Embedding model: PubMedBERT (768-dim)

Metric Ranges:
  SII: 0.0412 - 0.1893
  KTP: 0.6234 - 0.9187
  RCC: 0.0891 - 0.3421

Key Finding:
  NTDs show +38% higher semantic isolation
  P < 0.0001, Cohen's d = 0.84
  14/25 most isolated diseases are Global South priority

Figures saved to: demo_report/
  Fig5_Semantic_Structure.png (300 dpi)
  Fig5_Semantic_Structure.pdf (vector)

Reproducibility: commands.sh | environment.yml | checksums.sha256
```

## Interpretation Guide

- **High SII**: Disease research exists in a knowledge silo; limited cross-disciplinary bridges
- **Low KTP**: Research on this disease has few methodological overlaps with others
- **High RCC**: Diverse research approaches within the disease (many subtopics)
- **High Temporal Drift**: Research focus has shifted significantly over time
- NTDs shown in **red**, Global South diseases in **orange**, others in **grey**
- The scatter plot (Panel C) reveals the inverse relationship between research volume and isolation

## Citation

If you use this skill in a publication, please cite:

- Corpas, M. et al. (2026). HEIM: Health Equity Index for Measuring structural bias in biomedical research. Under review.
- Corpas, M. (2026). ClawBio. https://github.com/ClawBio/ClawBio
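As a supplementary note on the Temporal Semantic Drift metric (cosine distance between yearly centroids), here is a minimal sketch of one way it could be computed. It assumes drift is summarised as the mean cosine distance between centroids of consecutive years; the source does not say how yearly distances are aggregated, so that choice, the function names, and the toy vectors are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def yearly_centroids(records):
    """Group (year, embedding) pairs by year and average each group."""
    by_year = {}
    for year, vec in records:
        by_year.setdefault(year, []).append(vec)
    return {
        year: [sum(col) / len(vecs) for col in zip(*vecs)]
        for year, vecs in by_year.items()
    }


def temporal_drift(records):
    """Mean cosine distance between centroids of consecutive years (assumed aggregation)."""
    cents = yearly_centroids(records)
    years = sorted(cents)
    steps = [cosine_distance(cents[a], cents[b]) for a, b in zip(years, years[1:])]
    return sum(steps) / len(steps) if steps else 0.0


# Toy 2-D "embeddings" (made-up values): one disease whose focus never changes
# direction, and one whose focus rotates by 90 degrees between two years.
stable = [(2000, [1.0, 0.0]), (2001, [2.0, 0.0]), (2002, [3.0, 0.0])]
shifted = [(2000, [1.0, 0.0]), (2001, [0.0, 1.0])]

drift_stable = temporal_drift(stable)
drift_shifted = temporal_drift(shifted)
print(drift_stable, drift_shifted)
```

Averaging over consecutive years is only one option. Comparing the first and last yearly centroids would instead measure net drift over the whole 2000-2025 window.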
Related Skills
remote-openclaw-deploy
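General-purpose remote deployment of OpenClaw Agent projects. Supports arbitrarily customized agent teams, macOS/Linux, multiple channels (Feishu/Telegram/Discord), and declarative configuration injection via deploy.json. A single script covers the full workflow from a clean machine to a working deployment.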
polyclaw
Multi-strategy aggregated trading: a multi-strategy trade execution engine for Polymarket/CLOB
clawbio-pharmgx-reporter
Pharmacogenomic report from DTC genetic data (23andMe/AncestryDNA)
openclaw-master-skills
OpenClaw master skill set: core management skills such as team management, agent scheduling, and system configuration
openclaw-inter-instance
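Communication between OpenClaw instances. Use this skill when you need to pass messages, synchronize data, or run commands remotely across multiple OpenClaw instances. Covers agent-to-agent messaging, remote execution via nodes.run, file-level communication, and other methods.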
openclaw-config-helper
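OpenClaw configuration helper. Before modifying any OpenClaw configuration, you must consult the official documentation to make sure the format is correct and to avoid system crashes or malfunctions. Enforces the workflow: check schema → check docs → confirm → modify.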
openclaw-browser-chain-debug
Diagnose OpenClaw browser control failures including browser start timeouts, Chrome CDP startup failures, missing DISPLAY, browser profile launch issues, and gateway/browser environment mismatches. Use when browser automation, browser-based cron jobs, or profile openclaw fails to start, times out, or returns Request was aborted after browser steps. Also use when deciding whether a task should run with a profile browser versus an attach browser: prefer profile for unattended automation and recurring jobs; prefer attach when a human's already-open logged-in tab or manual cooperation is required.
moneyclaw
Financial analysis tool: aggregation and analysis of personal/business financial data
clawrouter
Smart LLM router — save 67% on inference costs. Routes every request to the cheapest capable model across 41 models from OpenAI, Anthropic, Google, DeepSeek, and xAI.
claw-metagenomics
Shotgun metagenomics profiling — taxonomy, resistome, and functional pathways
claw-ancestry-pca
Ancestry decomposition PCA against the Simons Genome Diversity Project
wemp-operator
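Full-featured WeChat Official Account operations: API wrappers for drafts, publishing, comments, users, media assets, mass messaging, statistics, menus, and QR codes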