Protein Interaction Network Analysis

Analyze protein-protein interaction networks using STRING, BioGRID, and SASBDB databases. Maps protein identifiers, retrieves interaction networks with confidence scores, performs functional enrichment analysis (GO/KEGG/Reactome), and optionally includes structural data. No API key required for core functionality (STRING). Use when analyzing protein networks, discovering interaction partners, identifying functional modules, or studying protein complexes.

by Zaoqu-Liu

Best use case

Protein Interaction Network Analysis is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using Protein Interaction Network Analysis can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/tooluniverse-protein-interactions/SKILL.md --create-dirs "https://raw.githubusercontent.com/Zaoqu-Liu/ScienceClaw/main/skills/tooluniverse-protein-interactions/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/tooluniverse-protein-interactions/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How Protein Interaction Network Analysis Compares

Feature / Agent	Protein Interaction Network Analysis	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

SKILL.md Source

# Protein Interaction Network Analysis

Comprehensive protein interaction network analysis using ToolUniverse tools. Analyzes protein networks through a 4-phase workflow: identifier mapping, network retrieval, enrichment analysis, and optional structural data.

## Features

✅ **Identifier Mapping** - Convert protein names to database IDs (STRING, UniProt, Ensembl)
✅ **Network Retrieval** - Get interaction networks with confidence scores (0-1.0)
✅ **Functional Enrichment** - GO terms, KEGG pathways, Reactome pathways
✅ **PPI Enrichment** - Test if proteins form functional modules
✅ **Structural Data** - Optional SAXS/SANS solution structures (SASBDB)
✅ **Fallback Strategy** - STRING primary (no API key) → BioGRID secondary (if key available)

## Databases Used

| Database | Coverage | API Key | Purpose |
|----------|----------|---------|---------|
| **STRING** | 14M+ proteins, 5,000+ organisms | ❌ Not required | Primary interaction source |
| **BioGRID** | 2.3M+ interactions, 80+ organisms | ✅ Required | Fallback, curated data |
| **SASBDB** | 2,000+ SAXS/SANS entries | ❌ Not required | Solution structures |

## Quick Start

### Basic Usage

```python
from tooluniverse import ToolUniverse
from python_implementation import analyze_protein_network

# Initialize ToolUniverse
tu = ToolUniverse()

# Analyze protein network
result = analyze_protein_network(
    tu=tu,
    proteins=["TP53", "MDM2", "ATM", "CHEK2"],
    species=9606,  # Human
    confidence_score=0.7  # High confidence
)

# Access results
print(f"Mapped: {len(result.mapped_proteins)} proteins")
print(f"Network: {result.total_interactions} interactions")
print(f"Enrichment: {len(result.enriched_terms)} GO terms")
print(f"PPI p-value: {result.ppi_enrichment.get('p_value', 1.0):.2e}")
```

### Expected Output

```
🔍 Phase 1: Mapping 4 protein identifiers...
✅ Mapped 4/4 proteins (100.0%)

🕸️  Phase 2: Retrieving interaction network...
✅ STRING: Retrieved 6 interactions

🧬 Phase 3: Performing enrichment analysis...
✅ Found 245 enriched GO terms (FDR < 0.05)
✅ PPI enrichment significant (p=3.45e-05)

✅ Analysis complete!
```

## Use Cases

### 1. Single Protein Analysis

Discover interaction partners for a protein of interest:

```python
result = analyze_protein_network(
    tu=tu,
    proteins=["TP53"],  # Single protein
    species=9606,
    confidence_score=0.7
)

# Print the first five edges returned (the list is not necessarily sorted by score)
for edge in result.network_edges[:5]:
    print(f"{edge['preferredName_A']} ↔ {edge['preferredName_B']} "
          f"(score: {edge['score']})")
```
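
To rank partners explicitly rather than rely on the order in which edges are returned, the edge list can be sorted by its combined score. The following is a minimal sketch that assumes edges follow the STRING edge format documented under Results Structure; `top_partners` and `sample_edges` are hypothetical names introduced here for illustration, not part of the skill.

```python
from typing import Any, Dict, List, Tuple


def top_partners(
    edges: List[Dict[str, Any]], query: str, n: int = 5
) -> List[Tuple[str, float]]:
    """Return up to n (partner, score) pairs for `query`, highest score first."""
    best: Dict[str, float] = {}
    for edge in edges:
        a, b = edge["preferredName_A"], edge["preferredName_B"]
        if query not in (a, b) or a == b:
            continue  # skip edges that do not touch the query, and self-loops
        partner = b if a == query else a
        # Keep the highest score if the same pair appears more than once
        best[partner] = max(best.get(partner, 0.0), float(edge["score"]))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical sample edges, for illustration only
sample_edges = [
    {"preferredName_A": "TP53", "preferredName_B": "ATM", "score": 0.995},
    {"preferredName_A": "MDM2", "preferredName_B": "TP53", "score": 0.999},
    {"preferredName_A": "ATM", "preferredName_B": "CHEK2", "score": 0.998},
]

for partner, score in top_partners(sample_edges, "TP53"):
    print(f"TP53 <-> {partner} (score: {score:.3f})")
```

With real results, `result.network_edges` would take the place of `sample_edges`.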

### 2. Protein Complex Validation

Test if proteins form a functional complex:

```python
# DNA damage response proteins
proteins = ["TP53", "ATM", "CHEK2", "BRCA1", "BRCA2"]

result = analyze_protein_network(tu=tu, proteins=proteins)

# Check PPI enrichment
if result.ppi_enrichment.get("p_value", 1.0) < 0.05:
    print("✅ Proteins form functional module!")
    print(f"   Expected edges: {result.ppi_enrichment['expected_number_of_edges']:.1f}")
    print(f"   Observed edges: {result.ppi_enrichment['number_of_edges']}")
else:
    print("⚠️  Proteins may be unrelated")
```
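
The p-value alone does not show how strong the effect is, so it can help to report the ratio of observed to expected edges alongside it. This sketch assumes the `ppi_enrichment` dictionary carries the three keys used in the example above; `summarize_ppi_enrichment` is a hypothetical helper, and the sample values are made up.

```python
from typing import Any, Dict


def summarize_ppi_enrichment(ppi: Dict[str, Any], alpha: float = 0.05) -> Dict[str, Any]:
    """Condense a PPI enrichment result into a significance flag and an edge ratio."""
    observed = ppi.get("number_of_edges", 0)
    expected = ppi.get("expected_number_of_edges", 0.0)
    p_value = ppi.get("p_value", 1.0)
    # Guard against division by zero when no edges are expected
    ratio = observed / expected if expected > 0 else None
    return {"significant": p_value < alpha, "observed_over_expected": ratio}


# Hypothetical values, for illustration only
example = {"number_of_edges": 10, "expected_number_of_edges": 2.0, "p_value": 1e-4}
print(summarize_ppi_enrichment(example))
```

A ratio well above 1 together with a small p-value suggests the proteins are more connected than a random set of similar size.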

### 3. Pathway Discovery

Find enriched pathways for a protein set:

```python
result = analyze_protein_network(
    tu=tu,
    proteins=["MAPK1", "MAPK3", "RAF1", "MAP2K1"],  # MAPK pathway
    confidence_score=0.7
)

# Show top enriched processes
print("\nTop Enriched Pathways:")
for term in result.enriched_terms[:10]:
    print(f"  {term['term']}: p={term['p_value']:.2e}, FDR={term['fdr']:.2e}")
```
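
Enrichment results mix several annotation sources, so a common next step is to keep only selected categories below an FDR cutoff. The sketch below assumes terms follow the enrichment term format documented under Results Structure; `filter_terms` is a hypothetical helper, and the `"KEGG"` category label in the sample data is an assumption (only `"Process"` appears in the source).

```python
from typing import Any, Dict, Iterable, List, Optional


def filter_terms(
    terms: List[Dict[str, Any]],
    categories: Optional[Iterable[str]] = None,
    max_fdr: float = 0.05,
) -> List[Dict[str, Any]]:
    """Keep terms at or below max_fdr (optionally by category), sorted by FDR."""
    wanted = set(categories) if categories is not None else None
    kept = [
        term
        for term in terms
        if term.get("fdr", 1.0) <= max_fdr
        and (wanted is None or term.get("category") in wanted)
    ]
    return sorted(kept, key=lambda term: term.get("fdr", 1.0))


# Hypothetical sample terms, for illustration only
sample_terms = [
    {"category": "Process", "term": "GO:0000165", "description": "MAPK cascade", "fdr": 0.0004},
    {"category": "KEGG", "term": "hsa04010", "description": "MAPK signaling pathway", "fdr": 0.00001},
    {"category": "Process", "term": "GO:0008283", "description": "cell proliferation", "fdr": 0.2},
]

for term in filter_terms(sample_terms, categories=["Process"]):
    print(f"{term['term']} {term['description']}: FDR={term['fdr']:.2e}")
```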

### 4. Multi-Protein Network Analysis

Build a complete interaction network for multiple proteins:

```python
# Apoptosis regulators
proteins = ["TP53", "BCL2", "BAX", "CASP3", "CASP9"]

result = analyze_protein_network(
    tu=tu,
    proteins=proteins,
    confidence_score=0.7
)

# Export network for Cytoscape
import pandas as pd
df = pd.DataFrame(result.network_edges)
df.to_csv("apoptosis_network.tsv", sep="\t", index=False)
```
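
Before moving to Cytoscape, a quick look at node degrees and connected components can show whether the proteins fall into separate modules. This is a standard-library sketch under the assumption that edges use the `preferredName_A` / `preferredName_B` keys; `build_adjacency` and `connected_components` are hypothetical helpers, and the sample edges are invented.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set


def build_adjacency(edges: List[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from an edge list, ignoring self-loops."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for edge in edges:
        a, b = edge["preferredName_A"], edge["preferredName_B"]
        if a == b:
            continue
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def connected_components(adjacency: Dict[str, Set[str]]) -> List[List[str]]:
    """Return connected components as sorted name lists (depth-first search)."""
    seen: Set[str] = set()
    components: List[List[str]] = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Hypothetical sample edges, for illustration only
sample_edges = [
    {"preferredName_A": "TP53", "preferredName_B": "BAX", "score": 0.99},
    {"preferredName_A": "BAX", "preferredName_B": "BCL2", "score": 0.99},
    {"preferredName_A": "CASP3", "preferredName_B": "CASP9", "score": 0.99},
]

adjacency = build_adjacency(sample_edges)
degrees = {name: len(neighbors) for name, neighbors in adjacency.items()}
print(degrees)
print(connected_components(adjacency))
```

Proteins with no edges at the chosen confidence threshold do not appear in the edge list, so they are absent from both outputs.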

### 5. With BioGRID Validation

Use BioGRID for experimentally validated interactions:

```python
# Requires BIOGRID_API_KEY in environment
result = analyze_protein_network(
    tu=tu,
    proteins=["TP53", "MDM2"],
    include_biogrid=True  # Enable BioGRID fallback
)

print(f"Primary source: {result.primary_source}")  # "STRING" or "BioGRID"
```

### 6. Including Structural Data

Add SAXS/SANS solution structures:

```python
result = analyze_protein_network(
    tu=tu,
    proteins=["TP53"],
    include_structure=True  # Query SASBDB
)

if result.structural_data:
    print(f"\nFound {len(result.structural_data)} SAXS/SANS entries:")
    for entry in result.structural_data:
        print(f"  {entry.get('sasbdb_id')}: {entry.get('title')}")
```

## Parameters

### `analyze_protein_network()` Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `tu` | ToolUniverse | Required | ToolUniverse instance |
| `proteins` | list[str] | Required | Protein identifiers (gene symbols, UniProt IDs) |
| `species` | int | 9606 | NCBI taxonomy ID (9606=human, 10090=mouse) |
| `confidence_score` | float | 0.7 | Min interaction confidence (0-1). STRING's cutoffs: 0.4=medium, 0.7=high, 0.9=highest |
| `include_biogrid` | bool | False | Use BioGRID if STRING fails (requires API key) |
| `include_structure` | bool | False | Include SASBDB structural data (slower) |
| `suppress_warnings` | bool | True | Suppress ToolUniverse loading warnings |

### Species IDs (Common)

- `9606` - Homo sapiens (human)
- `10090` - Mus musculus (mouse)
- `10116` - Rattus norvegicus (rat)
- `7227` - Drosophila melanogaster (fruit fly)
- `6239` - Caenorhabditis elegans (worm)
- `7955` - Danio rerio (zebrafish)
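- `4932` - Saccharomyces cerevisiae (yeast); STRING indexes yeast under `4932`, while `559292` is the NCBI ID of strain S288C and may not resolve in STRING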

### Confidence Score Guidelines

| Score | Level | Description | Use Case |
|-------|-------|-------------|----------|
| 0.15 | Low | All evidence | Exploratory, hypothesis generation |
| 0.4 | Medium | Moderate evidence | Default STRING threshold |
| 0.7 | High | Strong evidence | **Recommended** - reliable interactions |
| 0.9 | Highest | Strongest evidence | Core interactions only |
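
To see how the cutoff affects network size before committing to one, an already retrieved edge list can be re-filtered locally at each threshold from the table. This is a minimal sketch assuming each edge carries a `score` between 0 and 1; `edges_at_threshold` and `threshold_profile` are hypothetical helpers, and the sample scores are invented.

```python
from typing import Any, Dict, List, Sequence


def edges_at_threshold(edges: List[Dict[str, Any]], min_score: float) -> List[Dict[str, Any]]:
    """Keep only edges whose combined score is at least min_score."""
    return [edge for edge in edges if float(edge["score"]) >= min_score]


def threshold_profile(
    edges: List[Dict[str, Any]],
    thresholds: Sequence[float] = (0.15, 0.4, 0.7, 0.9),
) -> Dict[float, int]:
    """Count the surviving edges at each confidence threshold."""
    return {t: len(edges_at_threshold(edges, t)) for t in thresholds}


# Hypothetical scores, for illustration only
sample_edges = [{"score": 0.999}, {"score": 0.82}, {"score": 0.55}, {"score": 0.21}]
print(threshold_profile(sample_edges))
```

Local filtering only works downward from the threshold used at retrieval time: edges below the original `confidence_score` were never fetched.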

## Results Structure

### `ProteinNetworkResult` Object

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class ProteinNetworkResult:
    # Phase 1: Identifier mapping
    mapped_proteins: List[Dict[str, Any]]
    mapping_success_rate: float

    # Phase 2: Network retrieval
    network_edges: List[Dict[str, Any]]
    total_interactions: int

    # Phase 3: Enrichment analysis
    enriched_terms: List[Dict[str, Any]]
    ppi_enrichment: Dict[str, Any]

    # Phase 4: Structural data (optional)
    structural_data: Optional[List[Dict[str, Any]]]

    # Metadata
    primary_source: str  # "STRING" or "BioGRID"
    warnings: List[str]
```

### Network Edge Format (STRING)

```python
{
    "stringId_A": "9606.ENSP00000269305",  # Protein A STRING ID
    "stringId_B": "9606.ENSP00000258149",  # Protein B STRING ID
    "preferredName_A": "TP53",             # Protein A name
    "preferredName_B": "MDM2",             # Protein B name
    "ncbiTaxonId": 9606,                   # Species
    "score": 0.999,                        # Combined confidence (0-1)
    "nscore": 0.0,                         # Neighborhood score
    "fscore": 0.0,                         # Gene fusion score
    "pscore": 0.0,                         # Phylogenetic profile score
    "ascore": 0.947,                       # Coexpression score
    "escore": 0.951,                       # Experimental score
    "dscore": 0.9,                         # Database score
    "tscore": 0.994                        # Text mining score
}
```

### Enrichment Term Format

```python
{
    "category": "Process",                  # Annotation category (here: GO Biological Process)
    "term": "GO:0006915",                   # Term ID
    "description": "apoptotic process",     # Term description
    "number_of_genes": 4,                   # Input genes annotated with this term
    "number_of_genes_in_background": 1234,  # Background genes annotated with this term
    "p_value": 1.23e-05,                    # Enrichment p-value
    "fdr": 0.0012,                          # FDR-adjusted p-value
    "inputGenes": "TP53,MDM2,BAX,CASP3"     # Matching input genes
}
```
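
The source does not state which procedure produces the `fdr` field. As an illustration of what an FDR-adjusted p-value means, the sketch below implements the Benjamini-Hochberg adjustment, which is a common choice for enrichment results; treating it as the method used here is an assumption, and `benjamini_hochberg` is a hypothetical helper.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values, for illustration only
raw = [0.01, 0.04, 0.03, 0.005]
print([round(q, 4) for q in benjamini_hochberg(raw)])
```

Each adjusted value is at least as large as its raw p-value, which is why a term can pass `p_value < 0.05` yet fail `fdr < 0.05`.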

## Workflow Details

### 4-Phase Analysis Pipeline

```
┌─────────────────────────────────────────────────────────────┐
│ Phase 1: Identifier Mapping                                 │
│ ─────────────────────────────────────────────────────────── │
│ STRING_map_identifiers()                                    │
│   • Validates protein names exist in database              │
│   • Converts to STRING IDs for consistency                 │
│   • Returns mapping success rate                           │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ Phase 2: Network Retrieval                                  │
│ ─────────────────────────────────────────────────────────── │
│ PRIMARY: STRING_get_network() (no API key needed)          │
│   • Retrieves all pairwise interactions                    │
│   • Returns confidence scores by evidence type             │
│                                                             │
│ FALLBACK: BioGRID_get_interactions() (if enabled)          │
│   • Used if STRING fails or for validation                 │
│   • Requires BIOGRID_API_KEY                               │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ Phase 3: Enrichment Analysis                                │
│ ─────────────────────────────────────────────────────────── │
│ STRING_functional_enrichment()                              │
│   • GO terms (Process, Component, Function)                │
│   • KEGG pathways                                           │
│   • Reactome pathways                                       │
│   • FDR-corrected p-values                                  │
│                                                             │
│ STRING_ppi_enrichment()                                     │
│   • Tests if proteins interact more than random            │
│   • Returns p-value for functional coherence               │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ Phase 4: Structural Data (Optional)                         │
│ ─────────────────────────────────────────────────────────── │
│ SASBDB_search_entries()                                     │
│   • SAXS/SANS solution structures                           │
│   • Protein flexibility and conformations                   │
│   • Complements crystal/cryo-EM data                       │
└─────────────────────────────────────────────────────────────┘
```
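
The primary-then-fallback step in Phase 2 can be sketched independently of the actual tool calls. The code below is not the skill's implementation: `retrieve_with_fallback` is a hypothetical function, the two fetchers are stubs standing in for the STRING and BioGRID calls, and treating an empty STRING result as a reason to fall back is an assumption.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Edge = Dict[str, Any]
Fetcher = Callable[[List[str]], List[Edge]]


def retrieve_with_fallback(
    proteins: List[str],
    primary: Fetcher,
    fallback: Optional[Fetcher] = None,
) -> Tuple[List[Edge], str, List[str]]:
    """Try the primary source first, then the optional fallback.

    Returns (edges, source_label, warnings).
    """
    warnings: List[str] = []
    try:
        edges = primary(proteins)
        if edges:
            return edges, "STRING", warnings
        warnings.append("STRING returned no interactions")
    except Exception as exc:  # sketch only: real code would catch narrower errors
        warnings.append(f"STRING failed: {exc}")

    if fallback is None:
        return [], "STRING", warnings

    try:
        return fallback(proteins), "BioGRID", warnings
    except Exception as exc:
        warnings.append(f"BioGRID failed: {exc}")
        return [], "BioGRID", warnings


# Stub fetchers standing in for the real tool calls (hypothetical)
def failing_string(proteins: List[str]) -> List[Edge]:
    raise RuntimeError("service unavailable")


def stub_biogrid(proteins: List[str]) -> List[Edge]:
    return [{"preferredName_A": proteins[0], "preferredName_B": proteins[1]}]


edges, source, notes = retrieve_with_fallback(["TP53", "MDM2"], failing_string, stub_biogrid)
print(source, len(edges), notes)
```

Passing the fetchers in as arguments keeps the control flow testable without network access or an API key.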

## Installation & Setup

### Prerequisites

```bash
# Install ToolUniverse (if not already installed)
pip install tooluniverse

# Or with extras
pip install tooluniverse[all]
```

### Optional: BioGRID API Key

For BioGRID fallback functionality:

1. Register for a free API key: https://webservice.thebiogrid.org/
2. Add to `.env` file:
   ```bash
   BIOGRID_API_KEY=your_key_here
   ```
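
A script can check for the key before requesting the BioGRID fallback. This sketch only inspects environment variables; whether the `.env` file is loaded into the environment automatically is not stated in the source, so that step is assumed to have happened already. `biogrid_enabled` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def biogrid_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if a non-empty BIOGRID_API_KEY is present in the environment."""
    env = os.environ if env is None else env
    return bool(env.get("BIOGRID_API_KEY", "").strip())


# The mapping argument makes the check easy to exercise without touching os.environ
print(biogrid_enabled({"BIOGRID_API_KEY": "your_key_here"}))
print(biogrid_enabled({}))
```

The result could then be passed on, for example as `include_biogrid=biogrid_enabled()`.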

### Skill Files

```
tooluniverse-protein-interactions/
├── SKILL.md                    # This file
├── python_implementation.py    # Main implementation
├── QUICK_START.md             # Quick reference
├── DOMAIN_ANALYSIS.md         # Design rationale
├── PHASE2_COMPLETE.md         # Tool testing results
├── PHASE4_IMPLEMENTATION_COMPLETE.md
└── KNOWN_ISSUES.md            # ToolUniverse limitations
```

## Known Limitations

### 1. ToolUniverse Verbose Output

**Issue**: ToolUniverse prints 40+ warning messages during analysis.

**Workaround**: Filter output when running:
```bash
python your_script.py 2>&1 | grep -v "Error loading tools"
```

See `KNOWN_ISSUES.md` for details.

### 2. BioGRID Requires API Key

The BioGRID fallback requires a free API key. STRING works without one.

### 3. SASBDB May Have API Issues

SASBDB endpoints occasionally return errors. Structural data is optional.

## Performance

### Typical Execution Times

| Operation | Time | Notes |
|-----------|------|-------|
| Identifier mapping | 1-2 sec | For 5 proteins |
| Network retrieval | 2-3 sec | Depends on network size |
| Enrichment analysis | 3-5 sec | For 374 terms |
| Full 4-phase analysis | 6-10 sec | Excluding ToolUniverse overhead |

**Note**: Add 4-8 seconds per tool call for ToolUniverse loading (framework limitation).

### Optimization Tips

1. **Disable structural data** if not needed: `include_structure=False`
2. **Use higher confidence scores** to reduce network size: `confidence_score=0.9`
3. **Filter output** to avoid processing warning messages
4. **Reuse ToolUniverse instance** across multiple analyses

## Troubleshooting

### "Error: 'protein_ids' is a required property"

✅ **Fixed in this skill** - All parameter names verified in Phase 2 testing.

### No interactions found

- Check protein names are correct (case-sensitive)
- Try lower confidence score: `confidence_score=0.4`
- Verify species ID is correct
- Check if proteins actually interact (not all proteins have known interactions)
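
Lowering the threshold step by step can be automated. The sketch below tries progressively lower cutoffs until some edges come back; `relax_until_found` is a hypothetical helper, and `stub_fetch` stands in for a real call so the example runs offline.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple

Edge = Dict[str, Any]


def relax_until_found(
    fetch_edges: Callable[[float], List[Edge]],
    thresholds: Sequence[float] = (0.9, 0.7, 0.4, 0.15),
) -> Tuple[Optional[float], List[Edge]]:
    """Return the first threshold that yields edges, along with those edges."""
    for threshold in thresholds:
        edges = fetch_edges(threshold)
        if edges:
            return threshold, edges
    return None, []  # nothing found even at the lowest threshold


# Stub with one invented interaction, for illustration only
def stub_fetch(threshold: float) -> List[Edge]:
    known = [{"preferredName_A": "GENE1", "preferredName_B": "GENE2", "score": 0.52}]
    return [edge for edge in known if edge["score"] >= threshold]


threshold, edges = relax_until_found(stub_fetch)
print(threshold, len(edges))
```

With the real function, `fetch_edges` could be a lambda that calls `analyze_protein_network(..., confidence_score=t)` and returns `result.network_edges`. Interactions found only at low thresholds should be treated as exploratory.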

### BioGRID not working

- Ensure `BIOGRID_API_KEY` is set in environment
- Check API key is valid at https://webservice.thebiogrid.org/
- BioGRID is optional - STRING works without it

### Slow performance

- This is expected (see KNOWN_ISSUES.md)
- ToolUniverse framework reloads tools on every call
- Use output filtering to reduce processing time

## Examples

See `python_implementation.py` for:
- `example_tp53_analysis()` - Complete TP53 network analysis
- `analyze_protein_network()` - Main function with all options
- `ProteinNetworkResult` - Result data structure

## References

- **STRING**: https://string-db.org/ (14M+ proteins, 5,000+ organisms)
- **BioGRID**: https://thebiogrid.org/ (2.3M+ interactions, experimentally validated)
- **SASBDB**: https://www.sasbdb.org/ (2,000+ SAXS/SANS entries)
- **ToolUniverse**: https://github.com/mims-harvard/ToolUniverse

## Support

For issues with:
- **This skill**: Check KNOWN_ISSUES.md and troubleshooting section
- **ToolUniverse framework**: See TOOLUNIVERSE_BUG_REPORT.md
- **API errors**: Check database status pages (STRING, BioGRID, SASBDB)

## License

Same as ToolUniverse framework license.
