tooluniverse-binder-discovery

Discover novel small molecule binders for protein targets using structure-based and ligand-based approaches. Creates actionable reports with candidate compounds, ADMET profiles, and synthesis feasibility. Use when users ask to find small molecules for a target, identify novel binders, perform virtual screening, or need hit-to-lead compound identification.

1,802 stars

Best use case

tooluniverse-binder-discovery is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Discover novel small molecule binders for protein targets using structure-based and ligand-based approaches. Creates actionable reports with candidate compounds, ADMET profiles, and synthesis feasibility. Use when users ask to find small molecules for a target, identify novel binders, perform virtual screening, or need hit-to-lead compound identification.

Teams using tooluniverse-binder-discovery should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/tooluniverse-binder-discovery/SKILL.md --create-dirs "https://raw.githubusercontent.com/FreedomIntelligence/OpenClaw-Medical-Skills/main/skills/tooluniverse-binder-discovery/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/tooluniverse-binder-discovery/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How tooluniverse-binder-discovery Compares

Feature / Agenttooluniverse-binder-discoveryStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Discover novel small molecule binders for protein targets using structure-based and ligand-based approaches. Creates actionable reports with candidate compounds, ADMET profiles, and synthesis feasibility. Use when users ask to find small molecules for a target, identify novel binders, perform virtual screening, or need hit-to-lead compound identification.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Small Molecule Binder Discovery Strategy

Systematic discovery of novel small molecule binders using 60+ ToolUniverse tools across druggability assessment, known ligand mining, similarity expansion, ADMET filtering, and synthesis feasibility.

**KEY PRINCIPLES**:
1. **Report-first approach** - Create report file FIRST, then populate progressively
2. **Target validation FIRST** - Confirm druggability before compound searching
3. **Multi-strategy approach** - Combine structure-based and ligand-based methods
4. **ADMET-aware filtering** - Eliminate poor compounds early
5. **Evidence grading** - Grade candidates by supporting evidence
6. **Actionable output** - Provide prioritized candidates with rationale
7. **English-first queries** - Always use English terms in tool calls, even if the user writes in another language. Only try original-language terms as a fallback. Respond in the user's language

---

## Critical Workflow Requirements

### 1. Report-First Approach (MANDATORY)

**DO NOT** show search process or tool outputs to the user. Instead:

1. **Create the report file FIRST** - Before any data collection:
   - File name: `[TARGET]_binder_discovery_report.md`
   - Initialize with all section headers from the template
   - Add placeholder text: `[Researching...]` in each section

2. **Progressively update the report** - As you gather data:
   - Update each section with findings immediately
   - The user sees the report growing, not the search process

3. **Output separate data files**:
   - `[TARGET]_candidate_compounds.csv` - Prioritized compounds with SMILES, scores
   - `[TARGET]_bibliography.json` - Literature references (optional)

### 2. Citation Requirements (MANDATORY)

Every piece of information MUST include its source:

```markdown
### 3.2 Known Inhibitors
| Compound | ChEMBL ID | IC50 (nM) | Selectivity | Source |
|----------|-----------|-----------|-------------|--------|
| Imatinib | CHEMBL941 | 38 | ABL-selective | ChEMBL |
| Dasatinib | CHEMBL1421 | 0.5 | Multi-kinase | ChEMBL |

*Source: ChEMBL via `ChEMBL_get_target_activities` (CHEMBL1862)*
```

---

## Workflow Overview

```
Phase 0: Tool Verification (check parameter names)
    ↓
Phase 1: Target Validation
    ├─ 1.1 Resolve identifiers (UniProt, Ensembl, ChEMBL target ID)
    ├─ 1.2 Assess druggability/tractability
    │   └─ 1.2.5 Check therapeutic antibodies (Thera-SAbDab) [NEW]
    ├─ 1.3 Identify binding sites
    └─ 1.4 Predict structure (NvidiaNIM_alphafold2/esmfold)
    ↓
Phase 2: Known Ligand Mining
    ├─ Extract ChEMBL bioactivity data
    ├─ Get GtoPdb interactions
    ├─ Identify chemical probes
    ├─ BindingDB affinity data (NEW - Ki/IC50/Kd)
    ├─ PubChem BioAssay HTS data (NEW - screening hits)
    └─ Analyze SAR from known actives
    ↓
Phase 3: Structure Analysis
    ├─ Get PDB structures with ligands
    ├─ Check EMDB for cryo-EM structures (NEW - for membrane targets)
    ├─ Analyze binding pocket
    └─ Identify key interactions
    ↓
Phase 3.5: Docking Validation (NvidiaNIM_diffdock/boltz2) [NEW]
    ├─ Dock reference inhibitor
    └─ Validate binding pocket geometry
    ↓
Phase 4: Compound Expansion
    ├─ 4.1-4.3 Similarity/substructure search
    └─ 4.4 De novo generation (NvidiaNIM_genmol/molmim) [NEW]
    ↓
Phase 5: ADMET Filtering
    ├─ Predict physicochemical properties
    ├─ Predict ADMET endpoints
    └─ Flag liabilities
    ↓
Phase 6: Candidate Docking & Prioritization
    ├─ Dock all candidates (NvidiaNIM_diffdock/boltz2) [UPDATED]
    ├─ Score by docking + ADMET + novelty
    ├─ Assess synthesis feasibility
    └─ Generate final ranked list
    ↓
Phase 7: Report Synthesis
```

---

## Phase 0: Tool Verification

**CRITICAL**: Verify tool parameters before calling unfamiliar tools.

```python
# Check tool params to prevent silent failures
tool_info = tu.tools.get_tool_info(tool_name="ChEMBL_get_target_activities")
```

### Known Parameter Corrections

| Tool | WRONG Parameter | CORRECT Parameter |
|------|-----------------|-------------------|
| `OpenTargets_get_target_tractability_by_ensemblID` | `ensembl_id` | `ensemblId` |
| `ChEMBL_get_target_activities` | `chembl_target_id` | `target_chembl_id` |
| `ChEMBL_search_similar_molecules` | `smiles` | `molecule` (accepts SMILES, ChEMBL ID, or name) |
| `alphafold_get_prediction` | `uniprot` | `accession` |

---

## Phase 1: Target Validation

### 1.1 Identifier Resolution Chain

```
1. UniProt_search(query=target_name, organism="human")
   └─ Extract: UniProt accession, gene name, protein name

2. MyGene_query_genes(q=gene_symbol, species="human")
   └─ Extract: Ensembl gene ID, NCBI gene ID

3. ChEMBL_search_targets(query=target_name, organism="Homo sapiens")
   └─ Extract: ChEMBL target ID, target type

4. GtoPdb_get_targets(query=target_name)
   └─ Extract: GtoPdb target ID (if GPCR/ion channel/enzyme)
```

**Store all IDs for downstream queries**:
```
ids = {
    'uniprot': 'P00533',
    'ensembl': 'ENSG00000146648',
    'chembl_target': 'CHEMBL203',
    'gene_symbol': 'EGFR',
    'gtopdb': '1797'  # if available
}
```

### 1.2 Druggability Assessment

**Multi-Source Triangulation**:

```
1. OpenTargets_get_target_tractability_by_ensemblID(ensemblId)
   └─ Extract: Small molecule tractability score, bucket
   
2. DGIdb_get_gene_druggability(genes=[gene_symbol])
   └─ Extract: Druggability categories, known drug count
   
3. OpenTargets_get_target_classes_by_ensemblID(ensemblId)
   └─ Extract: Target class (kinase, GPCR, etc.)

4. GPCRdb_get_protein(protein=entry_name)  # NEW - for GPCRs
   └─ Extract: GPCR family, receptor state, ligand binding data
```

### 1.2a GPCRdb Integration (NEW - for GPCR Targets)

~35% of all approved drugs target GPCRs. For GPCR targets, use specialized data:

```python
def check_if_gpcr_and_enrich(tu, target_name, uniprot_id):
    """Check if target is GPCR and get specialized data."""
    
    # Build GPCRdb entry name (e.g., "adrb2_human")
    entry_name = f"{target_name.lower()}_human"
    
    # Check if it's a GPCR
    gpcr_info = tu.tools.GPCRdb_get_protein(
        operation="get_protein",
        protein=entry_name
    )
    
    if gpcr_info.get('status') == 'success':
        # It's a GPCR - get specialized data
        
        # Get known structures (active/inactive states)
        structures = tu.tools.GPCRdb_get_structures(
            operation="get_structures",
            protein=entry_name
        )
        
        # Get known ligands
        ligands = tu.tools.GPCRdb_get_ligands(
            operation="get_ligands",
            protein=entry_name
        )
        
        # Get mutation data (important for SAR)
        mutations = tu.tools.GPCRdb_get_mutations(
            operation="get_mutations",
            protein=entry_name
        )
        
        return {
            'is_gpcr': True,
            'gpcr_family': gpcr_info['data'].get('family'),
            'gpcr_class': gpcr_info['data'].get('receptor_class'),
            'structures': structures['data'].get('structures', []),
            'ligands': ligands['data'].get('ligands', []),
            'mutation_data': mutations['data'].get('mutations', [])
        }
    
    return {'is_gpcr': False}
```

**GPCRdb Advantages**:
- GPCR-specific sequence alignments (Ballesteros-Weinstein numbering)
- Active vs. inactive state structures
- Curated ligand binding data
- Experimental mutation effects on ligand binding

**Druggability Scorecard**:

| Factor | Assessment | Score |
|--------|------------|-------|
| Known small molecule drugs | Yes (3+) | ★★★ |
| Tractability bucket | 1-3 | ★★☆-★★★ |
| Target class | Enzyme/GPCR/Ion channel | ★★★ |
| Binding site known | Yes (X-ray) | ★★★ |
| GPCRdb ligands available | Yes (10+) | ★★★ (GPCR only) |
| Therapeutic antibodies exist | Check Thera-SAbDab | See 1.2.5 |

**Decision Point**: If druggability score < ★★☆, warn user about challenges.

### 1.2.5 Therapeutic Antibody Landscape (NEW)

Check if therapeutic antibodies already target this protein - important for:
- Understanding competitive landscape
- Validating target tractability (if antibodies work, target is validated)
- Identifying potential combination approaches

```python
def check_therapeutic_antibodies(tu, target_name):
    """
    Check Thera-SAbDab for therapeutic antibodies against target.
    """
    # Search by target name
    results = tu.tools.TheraSAbDab_search_by_target(
        target=target_name
    )
    
    if results.get('status') == 'success':
        antibodies = results['data'].get('therapeutics', [])
        
        # Categorize by clinical stage
        by_phase = {'Approved': [], 'Phase 3': [], 'Phase 2': [], 'Phase 1': [], 'Preclinical': []}
        for ab in antibodies:
            phase = ab.get('phase', 'Unknown')
            for key in by_phase.keys():
                if key.lower() in phase.lower():
                    by_phase[key].append(ab)
                    break
        
        return {
            'total_antibodies': len(antibodies),
            'by_phase': by_phase,
            'antibodies': antibodies[:10],  # Top 10
            'competitive_alert': len(by_phase.get('Approved', [])) > 0
        }
    return None

def get_antibody_landscape(tu, target_name, uniprot_id=None):
    """
    Comprehensive antibody competitive landscape.
    """
    # Thera-SAbDab search
    therasabdab = check_therapeutic_antibodies(tu, target_name)
    
    # Also search by common synonyms
    synonyms = [target_name]
    if target_name != uniprot_id:
        synonyms.append(uniprot_id)
    
    all_antibodies = []
    for synonym in synonyms:
        results = tu.tools.TheraSAbDab_search_therapeutics(query=synonym)
        if results.get('status') == 'success':
            all_antibodies.extend(results['data'].get('therapeutics', []))
    
    # Deduplicate
    seen = set()
    unique = []
    for ab in all_antibodies:
        inn = ab.get('inn_name')
        if inn and inn not in seen:
            seen.add(inn)
            unique.append(ab)
    
    return {
        'antibodies': unique,
        'count': len(unique),
        'has_approved': any(ab.get('phase', '').lower() == 'approved' for ab in unique),
        'source': 'Thera-SAbDab'
    }
```

**Report Output**:
```markdown
### 1.2.5 Therapeutic Antibody Landscape (NEW)

**Thera-SAbDab Search Results**:

| Antibody (INN) | Target | Format | Phase | PDB |
|----------------|--------|--------|-------|-----|
| Pembrolizumab | PD-1 | IgG4 | Approved | 5DK3 |
| Nivolumab | PD-1 | IgG4 | Approved | 5WT9 |
| Cemiplimab | PD-1 | IgG4 | Approved | 7WVM |

**Competitive Landscape**: ⚠️ 3 approved antibodies target this protein
**Strategic Implication**: Small molecule approach offers differentiation (oral dosing, CNS penetration, cost)

*Source: Thera-SAbDab via `TheraSAbDab_search_by_target`*
```

**Why Include Antibody Landscape**:
- **Validation**: Approved antibodies = validated target
- **Competition**: Understand what's already in market/clinic
- **Strategy**: Identify gaps (no oral, no CNS-penetrant)
- **Synergy**: Potential combination opportunities

### 1.3 Binding Site Analysis

```
1. ChEMBL_search_binding_sites(target_chembl_id)
   └─ Extract: Binding site names, types
   
2. get_binding_affinity_by_pdb_id(pdb_id)  # For each PDB with ligand
   └─ Extract: Kd, Ki, IC50 values for co-crystallized ligands
   
3. InterPro_get_protein_domains(uniprot_accession)
   └─ Extract: Domain architecture, active sites
```

**Output for Report**:
```markdown
### 1.3 Binding Site Assessment

**Known Binding Sites**: 
| Site | Type | Evidence | Key Residues | Source |
|------|------|----------|--------------|--------|
| ATP pocket | Orthosteric | X-ray (23 PDBs) | K745, E762, M793 | PDB/ChEMBL |
| Allosteric pocket | Allosteric | X-ray (3 PDBs) | T790, C797 | PDB |

**Binding Site Druggability**: ★★★ (well-defined pocket, multiple co-crystal structures)

*Source: ChEMBL via `ChEMBL_search_binding_sites`, PDB structures*
```

### 1.4 Structure Prediction (NVIDIA NIM)

When no experimental structure is available, or for custom domain predictions.

**Requires**: `NVIDIA_API_KEY` environment variable

**Option A: AlphaFold2 (High accuracy, async)**
```
NvidiaNIM_alphafold2(
    sequence=kinase_domain_sequence,
    algorithm="mmseqs2",
    relax_prediction=False
)
└─ Returns: PDB structure with pLDDT confidence scores
└─ Use when: Accuracy is critical, time is available (~5-15 min)
```

**Option B: ESMFold (Fast, synchronous)**
```
NvidiaNIM_esmfold(sequence=kinase_domain_sequence)
└─ Returns: PDB structure (max 1024 AA)
└─ Use when: Quick assessment needed (~30 sec)
```

**Report pLDDT Confidence**:
```markdown
### 1.4 Structure Prediction Quality

**Method**: AlphaFold2 via NVIDIA NIM
**Mean pLDDT**: 90.94 (very high confidence)

| Confidence Level | Range | Fraction | Interpretation |
|------------------|-------|----------|----------------|
| Very High | ≥90 | 74.3% | Highly reliable |
| Confident | 70-90 | 16.0% | Reliable |
| Low | 50-70 | 9.0% | Use caution |
| Very Low | <50 | 0.7% | Unreliable |

**Key Binding Residue Confidence**:
| Residue | Function | pLDDT |
|---------|----------|-------|
| K745 | ATP binding | 90.0 |
| T790 | Gatekeeper | 92.3 |
| M793 | Hinge region | 95.3 |
| D855 | DFG motif | 89.5 |

*Source: NVIDIA NIM via `NvidiaNIM_alphafold2`*
```

---

## Phase 2: Known Ligand Mining

### 2.1 ChEMBL Bioactivity Data

```
1. ChEMBL_get_target_activities(target_chembl_id, limit=500)
   └─ Filter: standard_type in ["IC50", "Ki", "Kd", "EC50"]
   └─ Filter: standard_value < 10000 nM
   └─ Extract: ChEMBL molecule IDs, SMILES, potency values

2. ChEMBL_get_molecule(molecule_chembl_id)  # For top actives
   └─ Extract: Full molecular data, max_phase, oral flag
```

**Activity Summary Table**:
```markdown
### 2.1 Known Active Compounds (ChEMBL)

**Total Bioactivity Points**: 2,847 (IC50: 1,234 | Ki: 892 | Kd: 456 | EC50: 265)
**Compounds with IC50 < 100 nM**: 156
**Approved Drugs for This Target**: 5

| Compound | ChEMBL ID | IC50 (nM) | Max Phase | SMILES (truncated) |
|----------|-----------|-----------|-----------|-------------------|
| Erlotinib | CHEMBL553 | 2 | 4 | COc1cc2ncnc(Nc3ccc... |
| Gefitinib | CHEMBL939 | 5 | 4 | COc1cc2ncnc(Nc3ccc... |
| [Novel] | CHEMBL123 | 12 | 0 | c1ccc(NC(=O)c2ccc... |

*Source: ChEMBL via `ChEMBL_get_target_activities` (CHEMBL203)*
```

### 2.2 GtoPdb Interactions

```
GtoPdb_get_target_interactions(target_id)
└─ Extract: Ligands with pKi/pIC50, selectivity data
```

### 2.3 Chemical Probes

```
OpenTargets_get_chemical_probes_by_target_ensemblID(ensemblId)
└─ Extract: Validated chemical probes with ratings
```

**Output for Report**:
```markdown
### 2.3 Chemical Probes

| Probe | Target | Rating | Use | Caveat | Source |
|-------|--------|--------|-----|--------|--------|
| Probe-X | EGFR | ★★★★ | In vivo | None | Chemical Probes Portal |
| Probe-Y | EGFR | ★★★☆ | In vitro | Off-target kinase activity | Open Targets |

**Recommended Probe for Target Validation**: Probe-X (highest rating, validated in vivo)
```

### 2.4 SAR Analysis from Actives

Identify common scaffolds and SAR trends:

```markdown
### 2.4 Structure-Activity Relationships

**Core Scaffolds Identified**:
1. **4-Anilinoquinazoline** (34 compounds, IC50 range: 2-500 nM)
   - N1 position: Aryl preferred
   - C6/C7: Methoxy groups improve potency
   
2. **Pyrimidine-amine** (12 compounds, IC50 range: 15-800 nM)
   - Less potent than quinazolines
   - Better selectivity profile

**Key SAR Insights**:
- Halogen at meta position of aniline increases potency 3-5x
- C7 ethoxy group critical for binding (H-bond to M793)
```

### 2.5 BindingDB Affinity Data (NEW)

BindingDB provides experimental binding affinity data complementary to ChEMBL:

```python
def get_bindingdb_ligands(tu, uniprot_id, affinity_cutoff=10000):
    """
    Get ligands from BindingDB with measured affinities.
    
    BindingDB advantages:
    - May have compounds not in ChEMBL
    - Different affinity types (Ki, IC50, Kd)
    - Direct literature links
    """
    
    result = tu.tools.BindingDB_get_ligands_by_uniprot(
        uniprot=uniprot_id,
        affinity_cutoff=affinity_cutoff  # nM
    )
    
    if result:
        ligands = []
        for entry in result:
            ligands.append({
                'smiles': entry.get('smile'),
                'affinity_type': entry.get('affinity_type'),
                'affinity_nM': entry.get('affinity'),
                'pmid': entry.get('pmid'),
                'monomer_id': entry.get('monomerid')
            })
        
        # Sort by potency
        ligands.sort(key=lambda x: float(x['affinity_nM']) if x['affinity_nM'] else 1e6)
        return ligands[:50]  # Top 50
    
    return []

def find_compound_polypharmacology(tu, smiles, similarity_cutoff=0.85):
    """Find off-target interactions for selectivity analysis."""
    
    targets = tu.tools.BindingDB_get_targets_by_compound(
        smiles=smiles,
        similarity_cutoff=similarity_cutoff
    )
    
    return targets  # Other proteins this compound may bind
```

**BindingDB Output for Report**:
```markdown
### 2.5 Additional Ligands (BindingDB) (NEW)

**Total Unique Ligands**: 89 (non-overlapping with ChEMBL)
**Most Potent**: 0.3 nM Ki

| SMILES | Affinity Type | Value (nM) | PMID | BindingDB ID |
|--------|---------------|------------|------|--------------|
| CC(C)Cc1ccc... | Ki | 0.3 | 15737014 | 12345 |
| COc1cc2ncnc... | IC50 | 2.1 | 16460808 | 12346 |

**Novel Scaffolds from BindingDB**: 3 scaffolds not seen in ChEMBL data

*Source: BindingDB via `BindingDB_get_ligands_by_uniprot`*
```

### 2.6 PubChem BioAssay Screening Data (NEW)

PubChem BioAssay provides HTS screening results and dose-response data:

```python
def get_pubchem_assays_for_target(tu, gene_symbol):
    """
    Get bioassays and active compounds from PubChem.
    
    Advantages:
    - HTS data not in ChEMBL
    - NIH-funded screening programs (MLPCN)
    - Dose-response curves for IC50 calculation
    """
    
    # Search assays targeting this gene
    assays = tu.tools.PubChem_search_assays_by_target_gene(
        gene_symbol=gene_symbol
    )
    
    results = {
        'assays': [],
        'total_active_compounds': 0
    }
    
    if assays.get('data', {}).get('aids'):
        for aid in assays['data']['aids'][:10]:  # Top 10 assays
            # Get assay summary
            summary = tu.tools.PubChem_get_assay_summary(aid=aid)
            
            # Get active compounds
            actives = tu.tools.PubChem_get_assay_active_compounds(aid=aid)
            active_cids = actives.get('data', {}).get('cids', [])
            
            results['assays'].append({
                'aid': aid,
                'summary': summary.get('data', {}),
                'active_count': len(active_cids)
            })
            results['total_active_compounds'] += len(active_cids)
    
    return results

def get_dose_response_data(tu, aid):
    """Get dose-response curves for IC50/EC50 determination."""
    
    dr_data = tu.tools.PubChem_get_assay_dose_response(aid=aid)
    return dr_data

def get_compound_bioactivity_profile(tu, cid):
    """Get all bioactivity data for a compound."""
    
    profile = tu.tools.PubChem_get_compound_bioactivity(cid=cid)
    return profile
```

**PubChem BioAssay Output for Report**:
```markdown
### 2.6 PubChem HTS Screening Data (NEW)

**Assays Found**: 45
**Total Active Compounds Across Assays**: ~1,200

| AID | Assay Type | Active Compounds | Target | Description |
|-----|------------|------------------|--------|-------------|
| 504526 | HTS | 234 | EGFR | qHTS inhibition screen |
| 1053104 | Dose-response | 12 | EGFR kinase | Confirmatory IC50 |
| 651564 | Cellular | 8 | EGFR | Cell proliferation assay |

**Novel Actives** (not in ChEMBL/BindingDB):
- CID 12345678: Active in AID 504526, IC50 = 45 nM
- CID 23456789: Active in AID 1053104, IC50 = 120 nM

*Source: PubChem via `PubChem_search_assays_by_target_gene`, `PubChem_get_assay_active_compounds`*
```

**Why Use Both BindingDB and PubChem**:
| Source | Strengths | Best For |
|--------|-----------|----------|
| **ChEMBL** | Curated, standardized, SAR data | Primary ligand source |
| **BindingDB** | Direct affinity measurements | Ki/Kd values, PMIDs |
| **PubChem BioAssay** | HTS data, NIH screens | Novel scaffolds, broad coverage |

---

## Phase 3: Structure Analysis

### 3.1 PDB Structure Retrieval

```
1. PDB_search_similar_structures(query=uniprot_accession, type="sequence")
   └─ Extract: PDB IDs with ligands
   
2. get_protein_metadata_by_pdb_id(pdb_id)
   └─ Extract: Resolution, method, ligand codes
   
3. alphafold_get_prediction(accession=uniprot_accession)
   └─ Extract: Predicted structure (if no experimental)
```

### 3.1b EMDB Cryo-EM Structures (NEW)

**Prioritize EMDB for**: Membrane proteins (GPCRs, ion channels), large complexes, targets with multiple conformational states.

```python
def get_cryoem_structures(tu, target_name, uniprot_accession):
    """Get cryo-EM structures for membrane targets."""
    
    # Search EMDB
    emdb_results = tu.tools.emdb_search(
        query=f"{target_name} membrane receptor"
    )
    
    structures = []
    for entry in emdb_results[:5]:
        details = tu.tools.emdb_get_entry(entry_id=entry['emdb_id'])
        
        # Get associated PDB model (essential for docking)
        pdb_models = details.get('pdb_ids', [])
        
        structures.append({
            'emdb_id': entry['emdb_id'],
            'resolution': entry.get('resolution', 'N/A'),
            'title': entry.get('title', 'N/A'),
            'conformational_state': details.get('state', 'Unknown'),
            'pdb_models': pdb_models
        })
    
    return structures
```

**When to use cryo-EM over X-ray**:
| Target Type | Prefer cryo-EM? | Reason |
|-------------|-----------------|--------|
| GPCR | Yes | Native membrane conformation |
| Ion channel | Yes | Multiple functional states |
| Receptor-ligand complex | Yes | Physiological state |
| Kinase | Usually X-ray | Higher resolution typically |

**Structure Summary**:
```markdown
### 3.1 Available Structures

| PDB ID | Resolution | Method | Ligand | Affinity | State |
|--------|------------|--------|--------|----------|-------|
| 1M17 | 2.6 Å | X-ray | Erlotinib | Ki=0.4 nM | Active |
| 4HJO | 2.1 Å | X-ray | Lapatinib | Ki=3 nM | Inactive |
| AF-P00533 | - | Predicted | None | - | - |

### 3.1b Cryo-EM Structures (EMDB)

| EMDB ID | Resolution | PDB Model | Conformation | Ligand |
|---------|------------|-----------|--------------|--------|
| EMD-12345 | 3.2 Å | 7ABC | Active | Agonist |
| EMD-23456 | 3.5 Å | 8DEF | Inactive | Antagonist |

**Best Structure for Docking**: 1M17 (high resolution, relevant ligand)
*Source: RCSB PDB, EMDB, AlphaFold DB*
```

### 3.2 Binding Pocket Analysis

```
get_binding_affinity_by_pdb_id(pdb_id)
└─ Extract: Binding affinities for co-crystallized ligands
```

**Output for Report**:
```markdown
### 3.2 Binding Pocket Characterization

**Pocket Volume**: ~850 ų (well-defined)
**Key Interaction Residues**:
- **Hinge region**: M793 (backbone H-bond donor/acceptor)
- **Gatekeeper**: T790 (small residue, allows access)
- **DFG motif**: D855 (active conformation)
- **Selectivity pocket**: L788, G796 (unique to EGFR)

**Druggability Assessment**: High (enclosed pocket, conserved interactions)
```

---

## Phase 3.5: Docking Validation (NVIDIA NIM)

Validate structure and score compounds using molecular docking.

**Requires**: `NVIDIA_API_KEY` environment variable

### 3.5.1 Reference Compound Docking

Dock a known inhibitor to validate the structure captures the binding pocket correctly.

**Option A: DiffDock (Blind docking, PDB + SDF input)**
```
NvidiaNIM_diffdock(
    protein=pdb_content,        # PDB text content
    ligand=reference_sdf,       # SDF/MOL2 content
    num_poses=10
)
└─ Returns: Docking poses with confidence scores
└─ Use: When you have PDB structure and ligand SDF file
```

**Option B: Boltz2 (From sequence + SMILES)**
```
NvidiaNIM_boltz2(
    polymers=[{"molecule_type": "protein", "sequence": kinase_sequence}],
    ligands=[{"smiles": "COc1cc2ncnc(Nc3ccc(C#C)cc3)c2cc1OCCOC"}],
    sampling_steps=50,
    diffusion_samples=1
)
└─ Returns: Protein-ligand complex structure
└─ Use: When starting from SMILES, no SDF needed
```

### 3.5.2 Docking Score Interpretation

| Score vs Reference | Priority | Symbol |
|--------------------|----------|--------|
| Higher than reference | Top priority | ★★★★ |
| Within 5% of reference | High priority | ★★★ |
| Within 20% of reference | Moderate priority | ★★☆ |
| >20% lower | Low priority | ★☆☆ |

**Report Format**:
```markdown
### 3.5 Docking Validation Results

**Reference Compound**: Erlotinib
**Method**: DiffDock via NVIDIA NIM

| Metric | Value | Interpretation |
|--------|-------|----------------|
| Best Pose Confidence | 0.906 | Excellent |
| Steric Clashes | None | Clean binding pose |

**Validation Status**: ✓ Structure captures binding pocket correctly

*Source: NVIDIA NIM via `NvidiaNIM_diffdock`*
```

---

## Phase 4: Compound Expansion

### 4.1 Similarity Search

Starting from top actives, expand chemical space:

```
1. ChEMBL_search_similar_molecules(molecule=top_active_smiles, similarity=70)
   └─ Extract: Similar compounds not yet tested on target
   
2. PubChem_search_compounds_by_similarity(smiles, threshold=0.7)
   └─ Extract: PubChem CIDs with similar structures
```

**Strategy**:
- Use 3-5 diverse actives as seeds
- Similarity threshold: 70-85% (balance novelty vs. activity)
- Prioritize compounds NOT in ChEMBL bioactivity for target

### 4.2 Substructure Search

```
1. ChEMBL_search_substructure(smiles=core_scaffold)
   └─ Extract: Compounds containing scaffold
   
2. PubChem_search_compounds_by_substructure(smiles=core_scaffold)
   └─ Extract: Additional scaffold-containing compounds
```

### 4.3 Cross-Database Mining

```
1. STITCH_get_chemical_protein_interactions(identifier=target_gene)
   └─ Extract: Additional chemical-protein links
   
2. DGIdb_get_drug_gene_interactions(genes=[gene_symbol])
   └─ Extract: Approved/investigational drugs
```

**Output for Report**:
```markdown
### 4. Compound Expansion Results

**Starting Seeds**: 5 diverse actives (IC50 < 100 nM)
**Similarity Expansion**: 847 compounds (70% threshold)
**Substructure Search**: 234 scaffold matches
**Cross-Database**: 45 additional hits

**After Deduplication**: 923 unique candidate compounds

| Source | Compounds | Already Tested | Novel Candidates |
|--------|-----------|----------------|------------------|
| ChEMBL similarity | 456 | 234 | 222 |
| PubChem similarity | 391 | 156 | 235 |
| ChEMBL substructure | 178 | 89 | 89 |
| STITCH | 45 | 23 | 22 |
| **Total Unique** | **923** | **355** | **568** |
```

### 4.4 De Novo Molecule Generation (NVIDIA NIM)

When database mining yields insufficient candidates, generate novel molecules.

**Requires**: `NVIDIA_API_KEY` environment variable

**Option A: GenMol (Scaffold Hopping with Masked Regions)**
```
NvidiaNIM_genmol(
    smiles="COc1cc2ncnc(Nc3ccc([*{3-8}])c([*{1-3}])c3)c2cc1OCCCN1CCOCC1",
    num_molecules=100,
    temperature=2.0,
    scoring="QED"
)
└─ Input: SMILES with [*{min-max}] masked regions
└─ Output: Generated molecules with QED/LogP scores
└─ Use: Explore specific positions while keeping scaffold
```

**Mask Design Strategy**:
| Position | Mask | Purpose |
|----------|------|---------|
| Aniline substituent | `[*{1-3}]` | Small groups (halogen, methyl) |
| Solubilizing group | `[*{5-10}]` | Morpholine, piperazine variants |
| Linker region | `[*{3-6}]` | Spacer variations |

**Example Masked SMILES for EGFR**:
```
# Keep quinazoline core, vary aniline and tail
COc1cc2ncnc(Nc3ccc([*{1-3}])c([*{1-3}])c3)c2cc1[*{5-12}]
```

**Option B: MolMIM (Controlled Generation from Reference)**
```
NvidiaNIM_molmim(
    smi="COc1cc2ncnc(Nc3ccc(Cl)cc3)c2cc1OCCN1CCOCC1",
    num_molecules=50,
    algorithm="CMA-ES"
)
└─ Input: Reference SMILES (known active)
└─ Output: Optimized analogs with property scores
└─ Use: Generate close analogs of top actives
```

**Generation Workflow**:
1. Identify top 3-5 actives from Phase 2
2. Design masked SMILES for GenMol OR use as reference for MolMIM
3. Generate 50-100 molecules per seed
4. Pass generated molecules to Phase 5 (ADMET filtering)
5. Dock survivors in Phase 6 for final ranking

**Report Format**:
```markdown
### 4.4 De Novo Generation Results

**Method**: GenMol via NVIDIA NIM
**Seed Scaffold**: 4-anilinoquinazoline (from erlotinib)
**Masked Positions**: Aniline (3,4), solubilizing tail

| Metric | Value |
|--------|-------|
| Molecules Generated | 100 |
| Passing Lipinski | 95 (95%) |
| Mean QED Score | 0.72 |
| Unique Scaffolds | 12 |

**Top Generated Compounds**:
| ID | SMILES | QED | LogP | Novelty |
|----|--------|-----|------|---------|
| GEN-001 | COc1cc2ncnc(Nc3ccc(Cl)c(Cl)c3)c2cc1OCCCN1CCOCC1 | 0.81 | 4.2 | Novel substitution |
| GEN-002 | COc1cc2ncnc(Nc3ccc(C#N)c(F)c3)c2cc1OCCCN1CCOCC1 | 0.78 | 3.8 | Novel substitution |

*Source: NVIDIA NIM via `NvidiaNIM_genmol`*
```

---

## Phase 5: ADMET Filtering

### 5.1 Physicochemical Properties

```
ADMETAI_predict_physicochemical_properties(smiles=[compound_list])
└─ Filter: Lipinski violations ≤ 1
└─ Filter: QED > 0.3
└─ Filter: MW 200-600
```

### 5.2 ADMET Endpoints

```
1. ADMETAI_predict_bioavailability(smiles=[compound_list])
   └─ Filter: Oral bioavailability > 0.3
   
2. ADMETAI_predict_toxicity(smiles=[compound_list])
   └─ Filter: AMES < 0.5, hERG < 0.5, DILI < 0.5
   
3. ADMETAI_predict_CYP_interactions(smiles=[compound_list])
   └─ Flag: CYP3A4 inhibitors (drug interaction risk)
```

### 5.3 Structural Alerts

```
ChEMBL_search_compound_structural_alerts(smiles=compound_smiles)
└─ Flag: PAINS, reactive groups, toxicophores
```

**ADMET Filter Summary**:
```markdown
### 5. ADMET Filtering Results

| Filter Stage | Input | Passed | Failed | Pass Rate |
|--------------|-------|--------|--------|-----------|
| Physicochemical (Lipinski) | 568 | 456 | 112 | 80% |
| Drug-likeness (QED > 0.3) | 456 | 398 | 58 | 87% |
| Bioavailability (> 0.3) | 398 | 312 | 86 | 78% |
| Toxicity filters | 312 | 267 | 45 | 86% |
| Structural alerts | 267 | 234 | 33 | 88% |
| **Final Candidates** | **568** | **234** | **334** | **41%** |

**Common Failure Reasons**:
1. High molecular weight (>600): 67 compounds
2. Low predicted bioavailability: 86 compounds
3. hERG liability: 28 compounds
4. PAINS alerts: 18 compounds
```

---

## Phase 6: Candidate Prioritization

### 6.1 Scoring Framework

Score each candidate on multiple dimensions:

| Dimension | Weight | Scoring Criteria |
|-----------|--------|------------------|
| **Structural Similarity** | 25% | Tanimoto to actives (0.7-1.0 → 1-5) |
| **Novelty** | 20% | Not in ChEMBL bioactivity = +2; Novel scaffold = +3 |
| **ADMET Score** | 25% | Composite of property predictions |
| **Synthesis Feasibility** | 15% | SA score (1-10), commercial availability |
| **Scaffold Diversity** | 15% | Cluster representative bonus |

### 6.2 Synthesis Feasibility

```markdown
### 6.2 Synthesis Feasibility Assessment

| Candidate | SA Score | Commercial | Estimated Steps | Flag |
|-----------|----------|------------|-----------------|------|
| Compound-1 | 2.3 | Yes (Enamine) | 0 | ★★★ |
| Compound-2 | 3.5 | Building block | 2-3 | ★★☆ |
| Compound-3 | 5.8 | No | 6-8 | ★☆☆ |

**SA Score Interpretation**:
- 1-3: Easy synthesis
- 3-5: Moderate complexity
- 5-10: Challenging synthesis
```

### 6.3 Final Prioritized List

```markdown
### 6.3 Top 20 Candidate Compounds

| Rank | ID | SMILES | Sim. Score | ADMET | Novelty | Overall | Rationale |
|------|-----|--------|------------|-------|---------|---------|-----------|
| 1 | CPD-001 | Cc1ccc... | 0.82 | 4.5 | Novel scaffold | 4.2 | High similarity, clean ADMET |
| 2 | CPD-002 | COc1cc... | 0.78 | 4.3 | Not tested | 4.0 | Quinazoline analog |
| 3 | CPD-003 | Nc1ccc... | 0.75 | 4.1 | Novel core | 3.9 | New chemotype |
| ... | ... | ... | ... | ... | ... | ... | ... |

**Scaffold Diversity**: 7 distinct scaffolds in top 20
**Commercial Availability**: 12/20 available for purchase
**Estimated Hit Rate**: 15-25% (based on similarity to actives)
```

---

## Phase 6.5: Literature Evidence (NEW)

### 6.5.1 Literature Search for Validation

Search literature to validate candidate compounds and understand target context.

```python
def search_binder_literature(tu, target_name, compound_scaffolds):
    """Search literature for compound and target evidence."""
    
    # PubMed: Published SAR studies
    sar_papers = tu.tools.PubMed_search_articles(
        query=f"{target_name} inhibitor SAR structure-activity",
        limit=30
    )
    
    # BioRxiv: Latest unpublished findings
    preprints = tu.tools.BioRxiv_search_preprints(
        query=f"{target_name} small molecule discovery",
        limit=15
    )
    
    # MedRxiv: Clinical data on inhibitors
    clinical = tu.tools.MedRxiv_search_preprints(
        query=f"{target_name} inhibitor clinical trial",
        limit=10
    )
    
    # Citation analysis for key papers
    key_papers = sar_papers[:10]
    for paper in key_papers:
        citation = tu.tools.openalex_search_works(
            query=paper['title'],
            limit=1
        )
        paper['citations'] = citation[0].get('cited_by_count', 0) if citation else 0
    
    return {
        'published_sar': sar_papers,
        'preprints': preprints,
        'clinical_preprints': clinical,
        'high_impact_papers': sorted(key_papers, key=lambda x: x.get('citations', 0), reverse=True)
    }
```

### 6.5.2 Output for Report

```markdown
## 6.5 Literature Evidence

### Published SAR Studies

| PMID | Title | Year | Key Insight |
|------|-------|------|-------------|
| 34567890 | Discovery of novel EGFR inhibitors... | 2024 | C7 substitution critical |
| 33456789 | Structure-activity relationship of... | 2023 | Fluorine improves potency |

### Recent Preprints (⚠️ Not Peer-Reviewed)

| Source | Title | Posted | Relevance |
|--------|-------|--------|-----------|
| BioRxiv | Novel scaffolds for EGFR... | 2024-02 | New chemotype discovery |
| MedRxiv | Clinical activity of... | 2024-01 | Phase 2 results |

### High-Impact References

| PMID | Citations | Title |
|------|-----------|-------|
| 32123456 | 523 | Landmark EGFR inhibitor study... |
| 31234567 | 312 | Comprehensive SAR analysis... |

*Source: PubMed, BioRxiv, MedRxiv, OpenAlex*
```

---

## Report Template

**File**: `[TARGET]_binder_discovery_report.md`

```markdown
# Small Molecule Binder Discovery: [TARGET]

**Generated**: [Date] | **Query**: [Original query] | **Status**: In Progress

---

## Executive Summary
[Researching...]

---

## 1. Target Validation
### 1.1 Target Identifiers
[Researching...]
### 1.2 Druggability Assessment
[Researching...]
### 1.3 Binding Site Analysis
[Researching...]

---

## 2. Known Ligand Landscape
### 2.1 ChEMBL Bioactivity Summary
[Researching...]
### 2.2 Approved Drugs & Clinical Compounds
[Researching...]
### 2.3 Chemical Probes
[Researching...]
### 2.4 SAR Insights
[Researching...]

---

## 3. Structural Information
### 3.1 Available Structures
[Researching...]
### 3.2 Binding Pocket Analysis
[Researching...]
### 3.3 Key Interactions
[Researching...]

---

## 4. Compound Expansion
### 4.1 Similarity Search Results
[Researching...]
### 4.2 Substructure Search Results
[Researching...]
### 4.3 Cross-Database Mining
[Researching...]

---

## 5. ADMET Filtering
### 5.1 Physicochemical Filters
[Researching...]
### 5.2 ADMET Predictions
[Researching...]
### 5.3 Structural Alerts
[Researching...]
### 5.4 Filter Summary
[Researching...]

---

## 6. Candidate Prioritization
### 6.1 Scoring Methodology
[Researching...]
### 6.2 Synthesis Feasibility
[Researching...]
### 6.3 Top 20 Candidates
[Researching...]

---

## 7. Recommendations
### 7.1 Immediate Actions
[Researching...]
### 7.2 Experimental Validation Plan
[Researching...]
### 7.3 Backup Strategies
[Researching...]

---

## 8. Data Gaps & Limitations
[Researching...]

---

## 9. Data Sources
[Will be populated as research progresses...]

---

## 10. Methods Summary

| Step | Tool | Purpose |
|------|------|---------|
| Sequence retrieval | UniProt_search | Get protein sequence |
| Structure prediction | NvidiaNIM_alphafold2 / NvidiaNIM_esmfold | 3D structure with pLDDT |
| Docking validation | NvidiaNIM_diffdock / NvidiaNIM_boltz2 | Validate binding pocket |
| Known ligands | ChEMBL_get_target_activities | Bioactivity data |
| Similarity search | ChEMBL_search_similar_molecules | Expand chemical space |
| De novo generation | NvidiaNIM_genmol / NvidiaNIM_molmim | Novel molecule design |
| ADMET filtering | ADMETAI_predict_* | Drug-likeness assessment |
| Candidate docking | NvidiaNIM_diffdock / NvidiaNIM_boltz2 | Final scoring |
```

---

## Evidence Grading

| Tier | Symbol | Description | Example |
|------|--------|-------------|---------|
| **T0** | ★★★★ | Docking score > reference inhibitor | Better than erlotinib |
| **T1** | ★★★ | Experimental IC50/Ki < 100 nM | ChEMBL bioactivity |
| **T2** | ★★☆ | Docking within 5% of reference OR IC50 100-1000 nM | High priority |
| **T3** | ★☆☆ | Structural similarity > 80% to T1 | Predicted active |
| **T4** | ☆☆☆ | Similarity 70-80%, scaffold match | Lower confidence |
| **T5** | ○○○ | Generated molecule, ADMET-passed, no docking | Speculative |

**Docking-Enhanced Grading**: When NVIDIA NIM docking is available, compounds gain evidence:
- Docking > reference → upgrade to T0 (★★★★)
- Docking within 5% → upgrade to T2 (★★☆)
- Docking within 20% → maintain current tier
- Docking >20% worse → downgrade one tier

Apply to all candidate compounds:
```markdown
| Compound | Evidence | Docking vs Ref | Rationale |
|----------|----------|----------------|-----------|
| CPD-001 | ★★★★ | +8.3% | 85% similar, docking > erlotinib |
| CPD-002 | ★★★ | -2.1% | IC50=45nM, validated by docking |
| CPD-003 | ★★☆ | -4.5% | 78% similar, good docking |
| GEN-001 | ★☆☆ | -15% | Generated, ADMET-passed |
```

---

## Mandatory Completeness Checklist

### Phase 1: Target Validation
- [ ] UniProt accession resolved
- [ ] ChEMBL target ID obtained
- [ ] Druggability assessed (≥2 sources)
- [ ] Target class identified
- [ ] Binding site information (or "No structural data")

### Phase 2: Known Ligands
- [ ] ChEMBL activities queried (≥100 or all available)
- [ ] Activity statistics (count, potency range)
- [ ] Top 10 actives listed with IC50
- [ ] Chemical probes identified (or "None available")
- [ ] SAR insights summarized

### Phase 3: Structure
- [ ] PDB structures listed (or "No experimental structure")
- [ ] Best structure for docking identified
- [ ] Binding pocket described (or "Predicted from AlphaFold")

### Phase 4: Expansion
- [ ] ≥3 seed compounds used
- [ ] Similarity search completed (≥100 results or exhausted)
- [ ] Substructure search completed
- [ ] Deduplicated candidate count reported

### Phase 5: ADMET
- [ ] Physicochemical filters applied
- [ ] Toxicity predictions run
- [ ] Structural alerts checked
- [ ] Filter funnel table included

### Phase 6: Prioritization
- [ ] ≥20 candidates ranked (or all if fewer)
- [ ] Scoring methodology explained
- [ ] Synthesis feasibility assessed
- [ ] Scaffold diversity noted

### Phase 7: Recommendations
- [ ] ≥3 immediate actions listed
- [ ] Experimental validation plan outlined
- [ ] Data gaps aggregated

---

## Tool Reference by Phase

### Phase 1: Target Validation
| Tool | Purpose |
|------|---------|
| `UniProt_search` | Resolve UniProt accession |
| `MyGene_query_genes` | Get Ensembl/NCBI IDs |
| `ChEMBL_search_targets` | Get ChEMBL target ID |
| `OpenTargets_get_target_tractability_by_ensemblID` | Tractability assessment |
| `DGIdb_get_gene_druggability` | Druggability categories |
| `ChEMBL_search_binding_sites` | Binding site info |
| `InterPro_get_protein_domains` | Domain architecture |

### Phase 2: Known Ligands
| Tool | Purpose |
|------|---------|
| `ChEMBL_get_target_activities` | Bioactivity data |
| `ChEMBL_get_molecule` | Molecule details |
| `GtoPdb_get_target_interactions` | Pharmacology data |
| `OpenTargets_get_chemical_probes_by_target_ensemblID` | Chemical probes |
| `OpenTargets_get_associated_drugs_by_target_ensemblID` | Known drugs |

### Phase 1.4: Structure Prediction (NVIDIA NIM)
| Tool | Purpose |
|------|---------|
| `NvidiaNIM_alphafold2` | High-accuracy structure prediction with pLDDT |
| `NvidiaNIM_esmfold` | Fast structure prediction (max 1024 AA) |
| `NvidiaNIM_msa_search` | MSA generation for AlphaFold |

### Phase 3: Structure
| Tool | Purpose |
|------|---------|
| `PDB_search_similar_structures` | Find PDB structures |
| `get_protein_metadata_by_pdb_id` | Structure metadata |
| `get_binding_affinity_by_pdb_id` | Ligand affinities |
| `alphafold_get_prediction` | Predicted structure (AlphaFold DB) |
| `get_ligand_smiles_by_chem_comp_id` | Ligand structures |
| `emdb_search` | Search cryo-EM structures (NEW) |
| `emdb_get_entry` | Get EMDB entry details (NEW) |

### Phase 3.5: Docking Validation (NVIDIA NIM)
| Tool | Purpose |
|------|---------|
| `NvidiaNIM_diffdock` | Blind molecular docking (PDB + SDF) |
| `NvidiaNIM_boltz2` | Protein-ligand complex (sequence + SMILES) |

### Phase 4: Expansion
| Tool | Purpose |
|------|---------|
| `ChEMBL_search_similar_molecules` | Similarity search |
| `PubChem_search_compounds_by_similarity` | PubChem similarity |
| `ChEMBL_search_substructure` | Substructure search |
| `PubChem_search_compounds_by_substructure` | PubChem substructure |
| `STITCH_get_chemical_protein_interactions` | Cross-database |

### Phase 4.4: De Novo Generation (NVIDIA NIM)
| Tool | Purpose |
|------|---------|
| `NvidiaNIM_genmol` | Scaffold hopping with masked regions |
| `NvidiaNIM_molmim` | Controlled generation from reference |

### Phase 5: ADMET
| Tool | Purpose |
|------|---------|
| `ADMETAI_predict_physicochemical_properties` | Drug-likeness |
| `ADMETAI_predict_bioavailability` | Oral absorption |
| `ADMETAI_predict_toxicity` | Toxicity flags |
| `ADMETAI_predict_CYP_interactions` | CYP liabilities |
| `ChEMBL_search_compound_structural_alerts` | PAINS, alerts |

### Phase 6: Candidate Docking (NVIDIA NIM)
| Tool | Purpose |
|------|---------|
| `NvidiaNIM_diffdock` | Score all candidates by docking |
| `NvidiaNIM_boltz2` | Alternative docking from SMILES |

### Phase 6.5: Literature Evidence (NEW)
| Tool | Purpose |
|------|---------|
| `PubMed_search_articles` | Published SAR studies |
| `BioRxiv_search_preprints` | Latest biology preprints |
| `MedRxiv_search_preprints` | Clinical preprints |
| `openalex_search_works` | Citation analysis |
| `SemanticScholar_search` | AI-ranked papers |

---

## Fallback Chains

| Primary Tool | Fallback 1 | Fallback 2 | Use When |
|--------------|------------|------------|----------|
| `ChEMBL_get_target_activities` | `GtoPdb_get_target_interactions` | `PubChem_search_assays` | No ChEMBL data |
| `ChEMBL_search_similar_molecules` | `PubChem_search_compounds_by_similarity` | `STITCH_get_chemical_protein_interactions` | ChEMBL exhausted |
| `PDB_search_similar_structures` | `NvidiaNIM_alphafold2` | `alphafold_get_prediction` | No PDB structure |
| `alphafold_get_prediction` | `NvidiaNIM_alphafold2` | `NvidiaNIM_esmfold` | AlphaFold DB unavailable |
| `NvidiaNIM_alphafold2` | `NvidiaNIM_esmfold` | `alphafold_get_prediction` | AlphaFold2 NIM error |
| `NvidiaNIM_diffdock` | `NvidiaNIM_boltz2` | Skip docking, use similarity | Docking error |
| `NvidiaNIM_genmol` | `NvidiaNIM_molmim` | Manual scaffold hopping | Generation error |
| `OpenTargets_get_target_tractability` | `DGIdb_get_gene_druggability` | Document "Unknown" | Open Targets error |
| `ADMETAI_*` | SwissADME tools | Basic Lipinski | Invalid SMILES |
| `PDB_search_similar_structures` | `emdb_search` + PDB | `NvidiaNIM_alphafold2` | Membrane proteins |
| `PubMed_search_articles` | `openalex_search_works` | `SemanticScholar_search` | Literature search |
| `BioRxiv_search_preprints` | `MedRxiv_search_preprints` | Skip preprints | Preprint sources |

**NVIDIA NIM API Key Required**: Tools with `NvidiaNIM_` prefix require `NVIDIA_API_KEY` environment variable. Check availability at start:
```python
import os
nvidia_available = bool(os.environ.get("NVIDIA_API_KEY"))
# If not available, fall back to non-NIM alternatives
```

---

## Common Use Cases

### Well-Characterized Target
User: "Find novel binders for EGFR"
→ Rich ChEMBL data; focus on novel scaffolds, selectivity, ADMET

### Novel Target
User: "Find small molecules for [new target with no known ligands]"
→ Limited bioactivity; rely on structure-based assessment, similar target ligands

### Lead Optimization
User: "Find analogs of compound X for target Y"
→ Deep similarity search around specific compound; focus on SAR

### Selectivity Challenge
User: "Find selective inhibitors for kinase X vs kinase Y"
→ Include selectivity analysis; filter by off-target predictions

---

## When NOT to Use This Skill

- **Drug research** → Use tooluniverse-drug-research (existing drug profiling)
- **Target research only** → Use tooluniverse-target-research
- **Single compound ADMET** → Call ADMET tools directly
- **Literature search** → Use tooluniverse-literature-deep-research
- **Protein structure only** → Use tooluniverse-protein-structure-retrieval

Use this skill for **discovering new compounds** for a protein target.

---

## Additional Resources

- **Checklist**: [CHECKLIST.md](CHECKLIST.md) - Pre-delivery verification
- **Examples**: [EXAMPLES.md](EXAMPLES.md) - Detailed workflow examples
- **Tool corrections**: [TOOLS_REFERENCE.md](TOOLS_REFERENCE.md) - Parameter corrections

Related Skills

tooluniverse-variant-interpretation

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Systematic clinical variant interpretation from raw variant calls to ACMG-classified recommendations with structural impact analysis. Aggregates evidence from ClinVar, gnomAD, CIViC, UniProt, and PDB across ACMG criteria. Produces pathogenicity scores (0-100), clinical recommendations, and treatment implications. Use when interpreting genetic variants, classifying variants of uncertain significance (VUS), performing ACMG variant classification, or translating variant calls to clinical actionability.

tooluniverse-variant-analysis

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Production-ready VCF processing, variant annotation, mutation analysis, and structural variant (SV/CNV) interpretation for bioinformatics questions. Parses VCF files (streaming, large files), classifies mutation types (missense, nonsense, synonymous, frameshift, splice, intronic, intergenic) and structural variants (deletions, duplications, inversions, translocations), applies VAF/depth/quality/consequence filters, annotates with ClinVar/dbSNP/gnomAD/CADD via ToolUniverse, interprets SV/CNV clinical significance using ClinGen dosage sensitivity scores, computes variant statistics, and generates reports. Solves questions like "What fraction of variants with VAF < 0.3 are missense?", "How many non-reference variants remain after filtering intronic/intergenic?", "What is the pathogenicity of this deletion affecting BRCA1?", or "Which dosage-sensitive genes overlap this CNV?". Use when processing VCF files, annotating variants, filtering by VAF/depth/consequence, classifying mutations, interpreting structural variants, assessing CNV pathogenicity, comparing cohorts, or answering variant analysis questions.

tooluniverse-target-research

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Gather comprehensive biological target intelligence from 9 parallel research paths covering protein info, structure, interactions, pathways, expression, variants, drug interactions, and literature. Features collision-aware searches, evidence grading (T1-T4), explicit Open Targets coverage, and mandatory completeness auditing. Use when users ask about drug targets, proteins, genes, or need target validation, druggability assessment, or comprehensive target profiling.

tooluniverse-systems-biology

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Comprehensive systems biology and pathway analysis using multiple pathway databases (Reactome, KEGG, WikiPathways, Pathway Commons, BioModels). Performs pathway enrichment, protein-pathway mapping, keyword searches, and systems-level analysis. Use when analyzing gene sets, exploring biological pathways, or investigating systems-level biology.

tooluniverse-structural-variant-analysis

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Comprehensive structural variant (SV) analysis skill for clinical genomics. Classifies SVs (deletions, duplications, inversions, translocations), assesses pathogenicity using ACMG-adapted criteria, evaluates gene disruption and dosage sensitivity, and provides clinical interpretation with evidence grading. Use when analyzing CNVs, large deletions/duplications, chromosomal rearrangements, or any structural variants requiring clinical interpretation.

tooluniverse-statistical-modeling

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Perform statistical modeling and regression analysis on biomedical datasets. Supports linear regression, logistic regression (binary/ordinal/multinomial), mixed-effects models, Cox proportional hazards survival analysis, Kaplan-Meier estimation, and comprehensive model diagnostics. Extracts odds ratios, hazard ratios, confidence intervals, p-values, and effect sizes. Designed to solve BixBench statistical reasoning questions involving clinical/experimental data. Use when asked to fit regression models, compute odds ratios, perform survival analysis, run statistical tests, or interpret model coefficients from provided data.

tooluniverse-spatial-transcriptomics

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Analyze spatial transcriptomics data to map gene expression in tissue architecture. Supports 10x Visium, MERFISH, seqFISH, Slide-seq, and imaging-based platforms. Performs spatial clustering, domain identification, cell-cell proximity analysis, spatial gene expression patterns, tissue architecture mapping, and integration with single-cell data. Use when analyzing spatial transcriptomics datasets, studying tissue organization, identifying spatial expression patterns, mapping cell-cell interactions in tissue context, characterizing tumor microenvironment spatial structure, or integrating spatial and single-cell RNA-seq data for comprehensive tissue analysis.

tooluniverse-spatial-omics-analysis

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Computational analysis framework for spatial multi-omics data integration. Given spatially variable genes (SVGs), spatial domain annotations, tissue type, and disease context from spatial transcriptomics/proteomics experiments (10x Visium, MERFISH, DBiTplus, SLIDE-seq, etc.), performs comprehensive biological interpretation including pathway enrichment, cell-cell interaction inference, druggable target identification, immune microenvironment characterization, and multi-modal integration. Produces a detailed markdown report with Spatial Omics Integration Score (0-100), domain-by-domain characterization, and validation recommendations. Uses 70+ ToolUniverse tools across 9 analysis phases. Use when users ask about spatial transcriptomics analysis, spatial omics interpretation, tissue heterogeneity, spatial gene expression patterns, tumor microenvironment mapping, tissue zonation, or cell-cell communication from spatial data.

tooluniverse-single-cell

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Production-ready single-cell and expression matrix analysis using scanpy, anndata, and scipy. Performs scRNA-seq QC, normalization, PCA, UMAP, Leiden/Louvain clustering, differential expression (Wilcoxon, t-test, DESeq2), cell type annotation, per-cell-type statistical analysis, gene-expression correlation, batch correction (Harmony), trajectory inference, and cell-cell communication analysis. NEW: Analyzes ligand-receptor interactions between cell types using OmniPath (CellPhoneDB, CellChatDB), scores communication strength, identifies signaling cascades, and handles multi-subunit receptor complexes. Integrates with ToolUniverse gene annotation tools (HPA, Ensembl, MyGene, UniProt) and enrichment tools (gseapy, PANTHER, STRING). Supports h5ad, 10X, CSV/TSV count matrices, and pre-annotated datasets. Use when analyzing single-cell RNA-seq data, studying cell-cell interactions, performing cell type differential expression, computing gene-expression correlations by cell type, analyzing tumor-immune communication, or answering questions about scRNA-seq datasets.

tooluniverse-sequence-retrieval

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Retrieves biological sequences (DNA, RNA, protein) from NCBI and ENA with gene disambiguation, accession type handling, and comprehensive sequence profiles. Creates detailed reports with sequence metadata, cross-database references, and download options. Use when users need nucleotide sequences, protein sequences, genome data, or mention GenBank, RefSeq, EMBL accessions.

tooluniverse-rnaseq-deseq2

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Production-ready RNA-seq differential expression analysis using PyDESeq2. Performs DESeq2 normalization, dispersion estimation, Wald testing, LFC shrinkage, and result filtering. Handles multi-factor designs, multiple contrasts, batch effects, and integrates with gene enrichment (gseapy) and ToolUniverse annotation tools (UniProt, Ensembl, OpenTargets). Supports CSV/TSV/H5AD input formats and any organism. Use when analyzing RNA-seq count matrices, identifying DEGs, performing differential expression with statistical rigor, or answering questions about gene expression changes.

tooluniverse-rare-disease-diagnosis

1802
from FreedomIntelligence/OpenClaw-Medical-Skills

Provide differential diagnosis for patients with suspected rare diseases based on phenotype and genetic data. Matches symptoms to HPO terms, identifies candidate diseases from Orphanet/OMIM, prioritizes genes for testing, interprets variants of uncertain significance. Use when clinician asks about rare disease diagnosis, unexplained phenotypes, or genetic testing interpretation.