tooluniverse-proteomics-analysis

Analyze mass spectrometry proteomics data including protein quantification, differential expression, post-translational modifications (PTMs), and protein-protein interactions. Processes MaxQuant, Spectronaut, DIA-NN, and other MS platform outputs. Performs normalization, statistical analysis, pathway enrichment, and integration with transcriptomics. Use when analyzing proteomics data, comparing protein abundance between conditions, identifying PTM changes, studying protein complexes, integrating protein and RNA data, discovering protein biomarkers, or conducting quantitative proteomics experiments.

42 stars

byZaoqu-Liu

View on GitHub Installation ↓

Best use case

tooluniverse-proteomics-analysis is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using tooluniverse-proteomics-analysis should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/tooluniverse-proteomics-analysis/SKILL.md --create-dirs "https://raw.githubusercontent.com/Zaoqu-Liu/ScienceClaw/main/skills/tooluniverse-proteomics-analysis/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/tooluniverse-proteomics-analysis/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How tooluniverse-proteomics-analysis Compares

Feature / Agent	tooluniverse-proteomics-analysis	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Proteomics Analysis

Comprehensive analysis of mass spectrometry-based proteomics data from protein identification through quantification, differential expression, post-translational modifications, and systems-level interpretation.

## When to Use This Skill

**Triggers**:
- User has proteomics data (MS output files)
- Questions about protein abundance or expression
- Differential protein expression analysis requests
- PTM analysis (phosphorylation, acetylation, ubiquitination)
- Protein-RNA correlation analysis
- Multi-omics integration involving proteomics
- Protein complex or interaction analysis
- Proteomics biomarker discovery

**Example Questions This Skill Solves**:
1. "Analyze this MaxQuant output for differential protein expression"
2. "Which proteins are significantly upregulated in disease vs control?"
3. "Correlate protein abundance with mRNA expression"
4. "What post-translational modifications change between conditions?"
5. "Identify protein complexes in my co-IP MS data"
6. "Which pathways are enriched in differentially expressed proteins?"
7. "Find protein biomarkers for disease classification"
8. "Compare protein and RNA levels to identify translation-regulated genes"

---

## Core Capabilities

| Capability | Description |
|-----------|-------------|
| **Data Import** | MaxQuant, Spectronaut, DIA-NN, Proteome Discoverer, FragPipe outputs |
| **Quality Control** | Missing value analysis, intensity distributions, sample clustering |
| **Normalization** | Median, quantile, TMM, VSN normalization methods |
| **Imputation** | MinProb, KNN, QRILC for missing values |
| **Differential Expression** | Limma, DEP, MSstats for statistical testing |
| **PTM Analysis** | Phospho-site localization, PTM enrichment, kinase prediction |
| **Protein-RNA Integration** | Correlation analysis, translation efficiency |
| **Pathway Enrichment** | Over-representation and GSEA for protein sets |
| **PPI Analysis** | Protein complex detection, interaction networks via STRING/IntAct |
| **Reporting** | Comprehensive reports with volcano plots, heatmaps, pathway diagrams |

---

## Workflow Overview

```
Input: MS Proteomics Data
    |
    v
Phase 1: Data Import & QC
    |-- Load MaxQuant/Spectronaut/DIA-NN output
    |-- Parse protein groups, intensities, modifications
    |-- Quality control plots (missing values, intensity distributions)
    |-- Sample correlation and PCA
    |
    v
Phase 2: Preprocessing
    |-- Filter low-confidence proteins
    |-- Handle missing values (imputation or filtering)
    |-- Log-transform intensities
    |-- Normalize across samples
    |
    v
Phase 3: Differential Expression Analysis
    |-- Statistical testing (limma, t-test, ANOVA)
    |-- Multiple testing correction (BH, Bonferroni)
    |-- Fold change calculation
    |-- Significance thresholds (p < 0.05, |log2FC| > 1)
    |
    v
Phase 4: PTM Analysis (if applicable)
    |-- Identify modified peptides
    |-- Localization probability filtering
    |-- PTM site quantification
    |-- Kinase-substrate prediction
    |-- PTM enrichment analysis
    |
    v
Phase 5: Functional Enrichment
    |-- Gene Ontology enrichment
    |-- KEGG/Reactome pathway enrichment
    |-- Protein complex enrichment (CORUM)
    |-- Tissue-specific enrichment
    |
    v
Phase 6: Protein-Protein Interactions
    |-- Query STRING for interaction networks
    |-- Identify protein complexes
    |-- Network clustering and modules
    |-- Hub protein identification
    |
    v
Phase 7: Multi-Omics Integration (optional)
    |-- Correlate with RNA-seq data
    |-- Identify translation-regulated proteins
    |-- Compare with variant/CNV data
    |-- Integrate with metabolomics
    |
    v
Phase 8: Generate Report
    |-- Summary statistics
    |-- Volcano plots and heatmaps
    |-- Pathway diagrams
    |-- Protein network visualizations
    |-- Multi-omics integration plots
```

---

## Phase Details

### Phase 1: Data Import & Quality Control

**Objective**: Load proteomics data and assess data quality.

**Supported input formats**:

**MaxQuant (most common)**:
- `proteinGroups.txt` - Protein-level quantification
- `evidence.txt` - Peptide-level data
- `Phospho (STY)Sites.txt` - Phosphorylation sites
- `modificationSpecificPeptides.txt` - Other PTMs

**Spectronaut**:
- `*_Report.tsv` - Protein/peptide quantification
- DIA-based quantification

**DIA-NN**:
- `report.tsv` - Protein groups
- `report.pr_matrix.tsv` - Protein matrix

**Proteome Discoverer**:
- `*_Proteins.txt`
- `*_PSMs.txt`

**Data loading**:
```python
def load_maxquant_proteins(protein_groups_file):
    """
    Load MaxQuant proteinGroups.txt file.

    Returns:
    - DataFrame with proteins as rows, samples as columns
    - Metadata (protein names, gene names, sequence coverage)
    """
    import pandas as pd

    # Read file
    df = pd.read_csv(protein_groups_file, sep='\t')

    # Extract intensity columns (LFQ or raw)
    intensity_cols = [col for col in df.columns if 'LFQ intensity' in col or 'Intensity ' in col]

    # Create intensity matrix
    intensity_matrix = df[intensity_cols].copy()
    intensity_matrix.columns = [col.replace('LFQ intensity ', '').replace('Intensity ', '')
                                 for col in intensity_cols]

    # Add protein metadata
    metadata = df[['Protein IDs', 'Gene names', 'Fasta headers',
                   'Peptides', 'Sequence coverage [%]']].copy()

    return intensity_matrix, metadata
```

**Quality Control**:

1. **Missing value assessment**:
```python
def assess_missing_values(intensity_matrix):
    """
    Calculate percentage of missing values per protein and sample.
    """
    # Per protein
    missing_per_protein = (intensity_matrix == 0).sum(axis=1) / intensity_matrix.shape[1]

    # Per sample
    missing_per_sample = (intensity_matrix == 0).sum(axis=0) / intensity_matrix.shape[0]

    # Visualize
    plot_missing_value_heatmap(intensity_matrix)

    return missing_per_protein, missing_per_sample
```

2. **Intensity distribution**:
```python
def plot_intensity_distributions(intensity_matrix):
    """
    Plot log10 intensity distributions per sample.
    Check for consistent distributions across samples.
    """
    import matplotlib.pyplot as plt
    import numpy as np

    log_intensities = np.log10(intensity_matrix.replace(0, np.nan))

    # Boxplot per sample
    log_intensities.plot(kind='box')
    plt.ylabel('log10 Intensity')
    plt.title('Intensity Distribution per Sample')

    # Should see similar median and spread across samples
```

3. **Sample correlation**:
```python
def plot_sample_correlation(intensity_matrix):
    """
    Calculate and visualize sample-sample correlation.
    Expect: High correlation within replicates, lower between conditions.
    """
    # Log-transform and remove zeros
    log_data = np.log2(intensity_matrix.replace(0, np.nan))

    # Correlation matrix
    corr_matrix = log_data.corr(method='pearson')

    # Heatmap
    import seaborn as sns
    sns.heatmap(corr_matrix, annot=True, cmap='RdYlBu_r', vmin=0.8, vmax=1.0)
```

4. **PCA**:
```python
def perform_pca(intensity_matrix, sample_groups):
    """
    Principal component analysis for sample clustering.
    """
    from sklearn.decomposition import PCA

    # Prepare data (log, impute, scale)
    log_data = np.log2(intensity_matrix.replace(0, np.nan))
    # Simple imputation with minimum value
    imputed = log_data.fillna(log_data.min().min())

    # PCA
    pca = PCA(n_components=2)
    pca_result = pca.fit_transform(imputed.T)

    # Plot with group colors
    plt.scatter(pca_result[:, 0], pca_result[:, 1], c=sample_groups)
    plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
    plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
```

### Phase 2: Preprocessing & Normalization

**Objective**: Clean data and normalize across samples for fair comparison.

**Filtering**:
```python
def filter_proteins(intensity_matrix, metadata, min_valid=3):
    """
    Filter out low-confidence proteins.

    Criteria:
    - At least 2 unique peptides (from metadata)
    - At least min_valid samples with detected intensity
    - Remove contaminants and reverse sequences
    """
    # Filter by peptide count
    valid_proteins = metadata['Peptides'] >= 2

    # Filter by detection in samples
    n_detected = (intensity_matrix > 0).sum(axis=1)
    valid_detection = n_detected >= min_valid

    # Remove contaminants (from MaxQuant)
    is_contaminant = metadata['Protein IDs'].str.contains('CON__', na=False)
    is_reverse = metadata['Protein IDs'].str.contains('REV__', na=False)

    # Combined filter
    keep = valid_proteins & valid_detection & ~is_contaminant & ~is_reverse

    return intensity_matrix[keep], metadata[keep]
```

**Missing value imputation**:
```python
def impute_missing_values(intensity_matrix, method='MinProb'):
    """
    Impute missing protein intensities.

    Methods:
    - MinProb: Random from minimum observed + normal noise (for MNAR)
    - KNN: K-nearest neighbors imputation
    - QRILC: Quantile regression-based imputation
    """
    if method == 'MinProb':
        # Assume missing = low abundance (MNAR assumption)
        min_val = intensity_matrix[intensity_matrix > 0].min().min()
        width = 0.3  # Standard deviation of noise
        shift = 1.8  # Downshift from minimum

        # Replace zeros with random low values
        imputed = intensity_matrix.copy()
        missing_mask = imputed == 0
        n_missing = missing_mask.sum().sum()

        random_vals = np.random.normal(
            loc=min_val - shift,
            scale=width,
            size=n_missing
        )
        imputed.values[missing_mask.values] = random_vals

        return imputed

    elif method == 'KNN':
        from sklearn.impute import KNNImputer
        imputer = KNNImputer(n_neighbors=5)
        imputed = pd.DataFrame(
            imputer.fit_transform(intensity_matrix.replace(0, np.nan)),
            index=intensity_matrix.index,
            columns=intensity_matrix.columns
        )
        return imputed
```

**Normalization**:
```python
def normalize_intensities(intensity_matrix, method='median'):
    """
    Normalize protein intensities across samples.

    Methods:
    - median: Divide by median intensity per sample
    - quantile: Quantile normalization (same distribution)
    - TMM: Trimmed mean of M-values (from edgeR)
    - VSN: Variance-stabilizing normalization
    """
    if method == 'median':
        # Median normalization
        medians = intensity_matrix.median(axis=0)
        global_median = medians.median()
        norm_factors = global_median / medians
        normalized = intensity_matrix * norm_factors
        return normalized

    elif method == 'quantile':
        # Quantile normalization
        from sklearn.preprocessing import quantile_transform
        normalized = pd.DataFrame(
            quantile_transform(intensity_matrix, axis=1),
            index=intensity_matrix.index,
            columns=intensity_matrix.columns
        )
        return normalized
```

### Phase 3: Differential Expression Analysis

**Objective**: Identify proteins with significant abundance changes between conditions.

**Statistical testing with limma**:
```python
def differential_expression_limma(log2_intensities, group1_samples, group2_samples):
    """
    Perform differential expression using limma-like approach.

    Returns:
    - log2 fold changes
    - p-values
    - adjusted p-values (BH)
    """
    from scipy import stats

    results = []

    for protein in log2_intensities.index:
        # Extract intensities for each group
        group1 = log2_intensities.loc[protein, group1_samples]
        group2 = log2_intensities.loc[protein, group2_samples]

        # Calculate statistics
        mean1 = group1.mean()
        mean2 = group2.mean()
        log2fc = mean2 - mean1

        # t-test
        t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)

        results.append({
            'protein': protein,
            'log2FC': log2fc,
            'mean_group1': mean1,
            'mean_group2': mean2,
            'p_value': p_value,
            't_statistic': t_stat
        })

    results_df = pd.DataFrame(results)

    # Multiple testing correction (Benjamini-Hochberg)
    from statsmodels.stats.multitest import multipletests
    results_df['adj_p_value'] = multipletests(results_df['p_value'], method='fdr_bh')[1]

    # Classify significance
    results_df['significant'] = (
        (results_df['adj_p_value'] < 0.05) &
        (np.abs(results_df['log2FC']) > 1.0)
    )

    return results_df
```

**Volcano plot**:
```python
def plot_volcano(de_results, title='Volcano Plot'):
    """
    Visualize differential expression results.
    """
    import matplotlib.pyplot as plt

    plt.figure(figsize=(8, 6))

    # Non-significant
    non_sig = de_results[~de_results['significant']]
    plt.scatter(non_sig['log2FC'], -np.log10(non_sig['p_value']),
                c='gray', alpha=0.5, s=10)

    # Significant
    sig = de_results[de_results['significant']]
    plt.scatter(sig['log2FC'], -np.log10(sig['p_value']),
                c='red', alpha=0.7, s=20)

    # Thresholds
    plt.axhline(-np.log10(0.05), color='blue', linestyle='--', label='p=0.05')
    plt.axvline(-1, color='blue', linestyle='--')
    plt.axvline(1, color='blue', linestyle='--', label='|log2FC|=1')

    plt.xlabel('log2 Fold Change')
    plt.ylabel('-log10(p-value)')
    plt.title(title)
    plt.legend()
```

### Phase 4: PTM Analysis

**Objective**: Analyze post-translational modifications (phosphorylation, acetylation, etc.)

**Phosphoproteomics workflow**:
```python
def analyze_phosphosites(phospho_sites_file, intensity_matrix):
    """
    Analyze phosphorylation site changes.

    Input: MaxQuant Phospho (STY)Sites.txt
    Output: Differential phosphorylation per site
    """
    # Load phospho data
    phospho = pd.read_csv(phospho_sites_file, sep='\t')

    # Filter by localization probability
    phospho_confident = phospho[phospho['Localization prob'] > 0.75]

    # Extract site information
    phospho_confident['site'] = (
        phospho_confident['Gene names'] + '_' +
        phospho_confident['Amino acid'] +
        phospho_confident['Position'].astype(str)
    )

    # Quantification (similar to protein-level analysis)
    # ... perform differential analysis ...

    return phospho_results
```

**Kinase-substrate prediction**:
```python
def predict_kinases(phospho_sites):
    """
    Predict upstream kinases for phosphorylation sites.

    Uses ToolUniverse PhosphoSitePlus or KEA3 tools.
    """
    from tooluniverse import ToolUniverse
    tu = ToolUniverse()

    # For each significant phosphosite
    kinase_predictions = []
    for site in phospho_sites:
        # Query kinase-substrate databases
        # (would use actual ToolUniverse tool here)
        result = tu.run_one_function({
            "name": "phosphosite_plus_query",  # hypothetical
            "arguments": {"site": site}
        })
        kinase_predictions.append(result)

    return kinase_predictions
```

### Phase 5: Functional Enrichment

**Objective**: Interpret biological meaning of protein changes via pathway analysis.

**Gene Ontology enrichment**:
```python
def pathway_enrichment_proteins(de_proteins, organism='human'):
    """
    Perform pathway enrichment for differentially expressed proteins.

    Uses ToolUniverse gene-enrichment skill.
    """
    from tooluniverse import ToolUniverse
    tu = ToolUniverse()

    # Extract gene names for significant proteins
    sig_proteins = de_proteins[de_proteins['significant']]
    gene_list = sig_proteins['gene_name'].tolist()

    # Run enrichment via ToolUniverse
    enrichment = tu.run_one_function({
        "name": "enrichr_enrich",
        "arguments": {
            "gene_list": ",".join(gene_list),
            "library": "KEGG_2021_Human"
        }
    })

    return enrichment
```

**Protein complex enrichment**:
```python
def protein_complex_enrichment(protein_list):
    """
    Test for enrichment of known protein complexes (CORUM database).
    """
    # Query CORUM or use ToolUniverse
    # Identify if proteins are part of known complexes
    pass
```

### Phase 6: Protein-Protein Interactions

**Objective**: Identify interaction networks and protein complexes.

**STRING network analysis**:
```python
def build_protein_network(protein_list, confidence=0.7):
    """
    Build PPI network using STRING database.

    Uses ToolUniverse STRING tools.
    """
    from tooluniverse import ToolUniverse
    tu = ToolUniverse()

    # Get interactions
    interactions = tu.run_one_function({
        "name": "string_get_interactions",
        "arguments": {
            "proteins": ",".join(protein_list),
            "species": 9606,  # human
            "score_threshold": int(confidence * 1000)
        }
    })

    # Build network graph
    import networkx as nx
    G = nx.Graph()

    for interaction in interactions['data']:
        G.add_edge(
            interaction['protein1'],
            interaction['protein2'],
            score=interaction['score']
        )

    return G
```

**Module detection**:
```python
def detect_protein_modules(network_graph):
    """
    Identify tightly connected protein modules (complexes).
    """
    from networkx.algorithms import community

    # Detect communities
    communities = community.greedy_modularity_communities(network_graph)

    # Annotate modules with enriched functions
    modules = []
    for i, comm in enumerate(communities):
        module_proteins = list(comm)
        # Run enrichment for this module
        enrichment = pathway_enrichment_proteins(module_proteins)
        modules.append({
            'module_id': i,
            'proteins': module_proteins,
            'size': len(module_proteins),
            'top_function': enrichment['top_terms'][0]
        })

    return modules
```

### Phase 7: Multi-Omics Integration

**Objective**: Integrate proteomics with transcriptomics and other omics.

**Protein-RNA correlation**:
```python
def correlate_protein_rna(protein_data, rna_data, common_samples):
    """
    Correlate protein and mRNA levels for each gene.

    Expected: r ~ 0.4-0.6 (moderate correlation)
    Discordance indicates post-transcriptional regulation
    """
    from scipy.stats import spearmanr

    # Find common genes
    common_genes = set(protein_data.index) & set(rna_data.index)

    correlations = {}
    for gene in common_genes:
        protein = protein_data.loc[gene, common_samples]
        rna = rna_data.loc[gene, common_samples]

        r, p = spearmanr(protein, rna)
        correlations[gene] = {
            'r': r,
            'p': p,
            'regulation': classify_regulation(r, protein.mean(), rna.mean())
        }

    return correlations

def classify_regulation(r, protein_level, rna_level):
    """
    Classify regulatory mechanism based on correlation and levels.
    """
    if r > 0.6 and protein_level > 0 and rna_level > 0:
        return 'transcriptional_upregulation'
    elif r > 0.6 and protein_level < 0 and rna_level < 0:
        return 'transcriptional_downregulation'
    elif r < 0.2 and protein_level > 0 and rna_level < 0:
        return 'translational_upregulation'
    elif r < 0.2 and protein_level < 0 and rna_level > 0:
        return 'protein_degradation'
    else:
        return 'mixed_regulation'
```

**Integration with multi-omics skill**:
```python
def integrate_with_multiomics(protein_data, rna_data, methylation_data):
    """
    Pass proteomics data to multi-omics integration skill.

    Enables comprehensive analysis across all molecular layers.
    """
    # Prepare for multi-omics skill
    omics_data = {
        'proteomics': protein_data,
        'rnaseq': rna_data,
        'methylation': methylation_data
    }

    # Invoke multi-omics integration skill
    from tooluniverse import ToolUniverse
    # (Would use Skill tool to invoke tooluniverse-multi-omics-integration)

    return integrated_analysis
```

### Phase 8: Report Generation

**Generate comprehensive proteomics report**:

```markdown
# Proteomics Analysis Report

## Dataset Summary
- **Samples**: 20 (10 disease, 10 control)
- **Proteins Identified**: 5,432
- **Proteins Quantified**: 4,987 (at least 3 samples)
- **Platform**: Orbitrap Fusion Lumos, MaxQuant 2.0

## Quality Control
- **Missing Values**: 15% average per protein
- **Sample Correlation**: 0.92-0.98 within groups
- **PCA**: Clear separation between disease and control (PC1: 35% variance)

## Differential Expression
- **Significant Proteins**: 432 (adj. p < 0.05, |log2FC| > 1)
  - Upregulated: 245 proteins
  - Downregulated: 187 proteins
- **Top upregulated**: MYC (log2FC=3.2), EGFR (log2FC=2.8)
- **Top downregulated**: TP53 (log2FC=-2.5), BRCA1 (log2FC=-2.1)

## Phosphoproteomics
- **Phosphosites Quantified**: 8,543
- **Differentially Phosphorylated**: 234 sites (p < 0.05)
- **Top Predicted Kinases**: CDK1, MAPK1, AKT1

## Pathway Enrichment
### Top Pathways (Upregulated)
1. **Cell Cycle** (p=1e-15) - 45 proteins, including cyclins, CDKs
2. **DNA Replication** (p=1e-12) - 23 proteins
3. **Glycolysis** (p=1e-10) - 18 proteins

### Top Pathways (Downregulated)
1. **Apoptosis** (p=1e-14) - 32 proteins, including caspases
2. **DNA Repair** (p=1e-11) - 28 proteins
3. **Oxidative Phosphorylation** (p=1e-9) - 25 proteins

## Protein Network Analysis
- **Network**: 432 nodes, 1,245 edges (STRING confidence > 0.7)
- **Modules Detected**: 8 functional modules
  - Module 1: Cell cycle (85 proteins)
  - Module 2: Metabolism (62 proteins)
  - Module 3: Translation (48 proteins)

## Protein-RNA Correlation
- **Overall Correlation**: r = 0.54 (moderate, expected)
- **High Correlation**: 2,134 genes (r > 0.6) - transcriptional regulation
- **Low Correlation**: 456 genes (r < 0.2) - post-transcriptional regulation
- **Translation-Regulated**: 89 proteins (high protein, low RNA)

## Biological Interpretation
Disease state shows increased proliferation (MYC, cyclins) with concurrent
suppression of apoptosis and DNA repair (TP53, BRCA1). Metabolic shift toward
glycolysis evident at protein level. Post-transcriptional upregulation of
translation machinery suggests adaptation to proliferative demands.

## Potential Biomarkers
Top 10 proteins for disease classification (Random Forest AUC=0.95):
1. MYC (protein)
2. EGFR (protein)
3. CDK1 (phospho-T161)
4. TP53 (protein)
5. BRCA1 (protein)
```

---

## Integration with ToolUniverse

**Skills Coordinated**:
| Skill | Used For | Phase |
|-------|----------|-------|
| `tooluniverse-gene-enrichment` | Pathway enrichment | Phase 5 |
| `tooluniverse-protein-interactions` | PPI networks | Phase 6 |
| `tooluniverse-rnaseq-deseq2` | RNA-seq for integration | Phase 7 |
| `tooluniverse-multi-omics-integration` | Cross-omics analysis | Phase 7 |
| `tooluniverse-target-research` | Protein annotation | Phase 8 |

---

## Example Use Cases

### Use Case 1: Cancer Proteomics

**Question**: "Analyze proteomics data from breast cancer vs normal tissue"

**Workflow**:
1. Load MaxQuant proteinGroups.txt
2. QC and filter (keep proteins with 2+ peptides, detected in 3+ samples)
3. Impute missing, normalize by median
4. Differential expression (limma): 432 significant proteins
5. Pathway enrichment: Cell cycle, metabolism upregulated
6. STRING network: Identify hub proteins (MYC, EGFR)
7. Integrate with TCGA RNA-seq: Find translation-regulated genes
8. Report: Comprehensive analysis with biomarkers

### Use Case 2: Phosphoproteomics Signaling

**Question**: "What kinase signaling is activated in response to drug treatment?"

**Workflow**:
1. Load Phospho (STY)Sites.txt from MaxQuant
2. Filter by localization probability > 0.75
3. Differential phosphorylation analysis
4. Kinase prediction for significant sites
5. Identify MAPK1, CDK1, AKT1 as top kinases
6. Pathway enrichment: MAPK, PI3K/AKT pathways
7. Report: Drug activates growth signaling

### Use Case 3: Protein-RNA Integration

**Question**: "Which proteins are regulated post-transcriptionally?"

**Workflow**:
1. Load proteomics (MaxQuant) and RNA-seq (DESeq2) data
2. Match samples, extract common genes
3. Correlate protein and RNA for each gene
4. Identify low-correlation genes (r < 0.2)
5. Classify: translation upregulation, protein degradation
6. Enrichment: Find pathways enriched in post-transcriptional regulation
7. Report: 89 translation-regulated proteins, RNA-binding proteins enriched

---

## Quantified Minimums

| Component | Requirement |
|-----------|-------------|
| Proteins quantified | At least 500 proteins |
| Replicates | At least 3 per condition |
| Filtering | 2+ unique peptides per protein |
| Statistical test | limma or t-test with multiple testing correction |
| Pathway enrichment | At least one method (GO, KEGG, or Reactome) |
| Report | Summary, QC, DE results, pathways, visualizations |

---

## Limitations

- **Platform-specific**: Optimized for MS-based proteomics (not Western blot quantification)
- **Missing values**: High missing rate (>50% per protein) limits statistical power
- **PTM analysis**: Requires enrichment protocols for comprehensive PTM profiling
- **Absolute quantification**: Relative abundance only (unless TMT/SILAC used)
- **Protein isoforms**: Typically collapsed to gene level
- **Dynamic range**: MS has limited dynamic range vs mRNA sequencing

---

## References

**Methods**:
- MaxQuant: https://doi.org/10.1038/nbt.1511
- Limma for proteomics: https://doi.org/10.1093/nar/gkv007
- DEP workflow: https://doi.org/10.1038/nprot.2018.107

**Databases**:
- STRING: https://string-db.org
- PhosphoSitePlus: https://www.phosphosite.org
- CORUM: https://mips.helmholtz-muenchen.de/corum

Related Skills

tooluniverse

from Zaoqu-Liu/ScienceClaw

Router skill for ToolUniverse tasks. First checks if specialized tooluniverse skills (34+ skills covering disease/drug/target research, clinical decision support, genomics, epigenomics, chemical safety, systems biology, and more) can solve the problem, then falls back to general strategies for using 1400+ scientific tools. Covers tool discovery, multi-hop queries, comprehensive research workflows, disambiguation, evidence grading, and report generation. Use when users need to research any scientific topic, find biological data, or explore drug/target/disease relationships.

tooluniverse-variant-interpretation

from Zaoqu-Liu/ScienceClaw

Systematic clinical variant interpretation from raw variant calls to ACMG-classified recommendations with structural impact analysis. Aggregates evidence from ClinVar, gnomAD, CIViC, UniProt, and PDB across ACMG criteria. Produces pathogenicity scores (0-100), clinical recommendations, and treatment implications. Use when interpreting genetic variants, classifying variants of uncertain significance (VUS), performing ACMG variant classification, or translating variant calls to clinical actionability.

tooluniverse-variant-analysis

from Zaoqu-Liu/ScienceClaw

Production-ready VCF processing, variant annotation, mutation analysis, and structural variant (SV/CNV) interpretation for bioinformatics questions. Parses VCF files (streaming, large files), classifies mutation types (missense, nonsense, synonymous, frameshift, splice, intronic, intergenic) and structural variants (deletions, duplications, inversions, translocations), applies VAF/depth/quality/consequence filters, annotates with ClinVar/dbSNP/gnomAD/CADD via ToolUniverse, interprets SV/CNV clinical significance using ClinGen dosage sensitivity scores, computes variant statistics, and generates reports. Solves questions like "What fraction of variants with VAF < 0.3 are missense?", "How many non-reference variants remain after filtering intronic/intergenic?", "What is the pathogenicity of this deletion affecting BRCA1?", or "Which dosage-sensitive genes overlap this CNV?". Use when processing VCF files, annotating variants, filtering by VAF/depth/consequence, classifying mutations, interpreting structural variants, assessing CNV pathogenicity, comparing cohorts, or answering variant analysis questions.

tooluniverse-target-research

from Zaoqu-Liu/ScienceClaw

Gather comprehensive biological target intelligence from 9 parallel research paths covering protein info, structure, interactions, pathways, expression, variants, drug interactions, and literature. Features collision-aware searches, evidence grading (T1-T4), explicit Open Targets coverage, and mandatory completeness auditing. Use when users ask about drug targets, proteins, genes, or need target validation, druggability assessment, or comprehensive target profiling.

tooluniverse-systems-biology

from Zaoqu-Liu/ScienceClaw

Comprehensive systems biology and pathway analysis using multiple pathway databases (Reactome, KEGG, WikiPathways, Pathway Commons, BioModels). Performs pathway enrichment, protein-pathway mapping, keyword searches, and systems-level analysis. Use when analyzing gene sets, exploring biological pathways, or investigating systems-level biology.

tooluniverse-structural-variant-analysis

from Zaoqu-Liu/ScienceClaw

Comprehensive structural variant (SV) analysis skill for clinical genomics. Classifies SVs (deletions, duplications, inversions, translocations), assesses pathogenicity using ACMG-adapted criteria, evaluates gene disruption and dosage sensitivity, and provides clinical interpretation with evidence grading. Use when analyzing CNVs, large deletions/duplications, chromosomal rearrangements, or any structural variants requiring clinical interpretation.

tooluniverse-statistical-modeling

from Zaoqu-Liu/ScienceClaw

Perform statistical modeling and regression analysis on biomedical datasets. Supports linear regression, logistic regression (binary/ordinal/multinomial), mixed-effects models, Cox proportional hazards survival analysis, Kaplan-Meier estimation, and comprehensive model diagnostics. Extracts odds ratios, hazard ratios, confidence intervals, p-values, and effect sizes. Designed to solve BixBench statistical reasoning questions involving clinical/experimental data. Use when asked to fit regression models, compute odds ratios, perform survival analysis, run statistical tests, or interpret model coefficients from provided data.

tooluniverse-spatial-transcriptomics

from Zaoqu-Liu/ScienceClaw

Analyze spatial transcriptomics data to map gene expression in tissue architecture. Supports 10x Visium, MERFISH, seqFISH, Slide-seq, and imaging-based platforms. Performs spatial clustering, domain identification, cell-cell proximity analysis, spatial gene expression patterns, tissue architecture mapping, and integration with single-cell data. Use when analyzing spatial transcriptomics datasets, studying tissue organization, identifying spatial expression patterns, mapping cell-cell interactions in tissue context, characterizing tumor microenvironment spatial structure, or integrating spatial and single-cell RNA-seq data for comprehensive tissue analysis.

tooluniverse-spatial-omics-analysis

from Zaoqu-Liu/ScienceClaw

Computational analysis framework for spatial multi-omics data integration. Given spatially variable genes (SVGs), spatial domain annotations, tissue type, and disease context from spatial transcriptomics/proteomics experiments (10x Visium, MERFISH, DBiTplus, SLIDE-seq, etc.), performs comprehensive biological interpretation including pathway enrichment, cell-cell interaction inference, druggable target identification, immune microenvironment characterization, and multi-modal integration. Produces a detailed markdown report with Spatial Omics Integration Score (0-100), domain-by-domain characterization, and validation recommendations. Uses 70+ ToolUniverse tools across 9 analysis phases. Use when users ask about spatial transcriptomics analysis, spatial omics interpretation, tissue heterogeneity, spatial gene expression patterns, tumor microenvironment mapping, tissue zonation, or cell-cell communication from spatial data.

tooluniverse-single-cell

from Zaoqu-Liu/ScienceClaw

Production-ready single-cell and expression matrix analysis using scanpy, anndata, and scipy. Performs scRNA-seq QC, normalization, PCA, UMAP, Leiden/Louvain clustering, differential expression (Wilcoxon, t-test, DESeq2), cell type annotation, per-cell-type statistical analysis, gene-expression correlation, batch correction (Harmony), trajectory inference, and cell-cell communication analysis. NEW: Analyzes ligand-receptor interactions between cell types using OmniPath (CellPhoneDB, CellChatDB), scores communication strength, identifies signaling cascades, and handles multi-subunit receptor complexes. Integrates with ToolUniverse gene annotation tools (HPA, Ensembl, MyGene, UniProt) and enrichment tools (gseapy, PANTHER, STRING). Supports h5ad, 10X, CSV/TSV count matrices, and pre-annotated datasets. Use when analyzing single-cell RNA-seq data, studying cell-cell interactions, performing cell type differential expression, computing gene-expression correlations by cell type, analyzing tumor-immune communication, or answering questions about scRNA-seq datasets.

tooluniverse-sequence-retrieval

from Zaoqu-Liu/ScienceClaw

Retrieves biological sequences (DNA, RNA, protein) from NCBI and ENA with gene disambiguation, accession type handling, and comprehensive sequence profiles. Creates detailed reports with sequence metadata, cross-database references, and download options. Use when users need nucleotide sequences, protein sequences, genome data, or mention GenBank, RefSeq, EMBL accessions.

tooluniverse-sdk

from Zaoqu-Liu/ScienceClaw

Build AI scientist systems using ToolUniverse Python SDK for scientific research. Use when users need to access 1000++ scientific tools through Python code, create scientific workflows, perform drug discovery, protein analysis, genomics analysis, literature research, or any computational biology task. Triggers include requests to use scientific tools programmatically, build research pipelines, analyze biological data, search literature, predict drug properties, or create AI-powered scientific workflows.