proteomics-de

Differential expression analysis for label-free quantitative (LFQ) intensity data with standard MaxQuant and DIA-NN output. Workflow includes preprocessing, imputation, and statistical testing.

658 stars

by ClawBio

Best use case

proteomics-de is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using proteomics-de should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/proteomics-de/SKILL.md --create-dirs "https://raw.githubusercontent.com/ClawBio/ClawBio/main/skills/proteomics-de/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/proteomics-de/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How proteomics-de Compares

Feature / Agent	proteomics-de	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

Where can I find the source code?

The source code is in the ClawBio/ClawBio repository on GitHub, under skills/proteomics-de.

SKILL.md Source

# 🥚 Proteomics Differential Expression

This skill performs differential expression analysis on label-free quantitative (LFQ) intensity data from MaxQuant and DIA-NN outputs, including preprocessing, imputation, statistical testing, and visualization.

---

## Domain Decisions

### 1. Multi-format Input Support
- Supports **MaxQuant `proteinGroups.txt`**
  - Automatic filtering of reverse hits, contaminants, and site-only identifications
- Supports **DIA-NN output**
  - Automatically extracts protein IDs and the per-sample intensity columns (those named after `.raw` files)

---

### 2. Preprocessing Strategy
- MaxQuant:
  - Filters:
    - `Reverse`
    - `Potential contaminant` / `Contaminant`
    - `Only identified by site`
- DIA-NN:
  - Extracts protein identifiers and intensity matrix directly
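
The MaxQuant filters listed above can be sketched as follows. This is a minimal illustration, not the skill's actual code: the helper name `filter_protein_groups` and the example rows are made up, and it relies on the MaxQuant convention of marking flagged rows with `+`.

```python
import csv
import io

# Flag columns from the filter list above; MaxQuant marks flagged rows with "+".
FLAG_COLUMNS = (
    "Reverse",
    "Potential contaminant",
    "Contaminant",
    "Only identified by site",
)


def filter_protein_groups(rows):
    """Keep only rows where no flag column that is present contains "+"."""
    kept = []
    for row in rows:
        flagged = any((row.get(col) or "").strip() == "+" for col in FLAG_COLUMNS)
        if not flagged:
            kept.append(row)
    return kept


# Made-up rows in the tab-separated layout of proteinGroups.txt.
example = (
    "Protein IDs\tReverse\tPotential contaminant\tOnly identified by site\n"
    "P12345\t\t\t\n"
    "REV__Q99999\t+\t\t\n"
    "CON__P00761\t\t+\t\n"
    "P67890\t\t\t+\n"
)
rows = list(csv.DictReader(io.StringIO(example), delimiter="\t"))
kept_ids = [row["Protein IDs"] for row in filter_protein_groups(rows)]
print(kept_ids)  # ['P12345']
```

Checking both `Potential contaminant` and `Contaminant` mirrors the list above, since the column name differs between files.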

---

### 3. Intensity Transformation
- LFQ intensities are **log2-transformed**
- This brings the intensity distribution closer to normal, which the downstream t-test assumes
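
A minimal sketch of the transformation, assuming that zero or absent intensities are treated as missing values rather than passed to the logarithm (the skill does not state how it handles them). The function name `log2_intensity` is hypothetical.

```python
import math


def log2_intensity(value):
    """Return log2 of a positive intensity; zero or missing becomes NaN."""
    if value is None or value <= 0:
        return math.nan
    return math.log2(value)


raw = [1024.0, 0.0, 4096.0, None]
logged = [log2_intensity(v) for v in raw]
print(logged)  # [10.0, nan, 12.0, nan]
```

The NaN entries are what the imputation step would later fill in.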

---

### 4. Missing Value Imputation
- Uses **down-shifted Gaussian imputation**
  - Replacement values are drawn from a Gaussian whose mean is `median - shift × std`
  - Default:
    - `shift = 1.8`
    - `scale = 0.3`
- Assumption:
  - Missing values represent **low-abundance proteins**
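
The imputation rule can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the skill's implementation: it assumes statistics are computed per sample over the observed log2 intensities, and that `scale` multiplies the observed standard deviation to give the width of the imputation distribution. The function name `impute_downshifted` and the `seed` argument are hypothetical.

```python
import math
import random
import statistics


def impute_downshifted(values, shift=1.8, scale=0.3, seed=0):
    """Fill NaNs in one sample's log2 intensities from a down-shifted Gaussian."""
    observed = [v for v in values if not math.isnan(v)]
    center = statistics.median(observed)
    spread = statistics.stdev(observed)
    mu = center - shift * spread  # mean: median - shift x std
    sigma = scale * spread        # assumed meaning of `scale`
    rng = random.Random(seed)     # fixed seed so reruns give the same values
    return [rng.gauss(mu, sigma) if math.isnan(v) else v for v in values]


sample = [20.0, 22.0, math.nan, 24.0, 26.0, math.nan]
filled = impute_downshifted(sample)
print(filled)
```

Observed values pass through unchanged; only the NaN positions receive draws, which land well below the observed median because of the down-shift.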

---

### 5. Statistical Testing
- Two-sample **t-test** between treatment and control groups
- Default degrees of freedom:
  - `df = 4` (`n1 + n2 - 2` for 3 vs 3 replicates)
  - Set `--ttest-df` when the design has a different number of replicates
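
The default `df = 4` matches a pooled-variance (Student) t-test on 3 vs 3 replicates. The sketch below computes that statistic by hand; whether the skill uses the pooled or the Welch variant is not stated, so treat this as an assumption. The function name `pooled_t` and the intensities are made up.

```python
import math
import statistics


def pooled_t(treated, control):
    """Student's two-sample t statistic with pooled variance.

    Returns (t, df, difference of means, standard error of the difference).
    """
    n1, n2 = len(treated), len(control)
    df = n1 + n2 - 2
    pooled_var = (
        (n1 - 1) * statistics.variance(treated)
        + (n2 - 1) * statistics.variance(control)
    ) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = statistics.mean(treated) - statistics.mean(control)
    return diff / se, df, diff, se


# log2 intensities of one protein in 3 treated and 3 control samples
t_stat, df, log2_fc, se = pooled_t([25.0, 26.0, 27.0], [22.0, 23.0, 24.0])
print(df, log2_fc, round(t_stat, 3))  # 4 3.0 3.674
```

Because the inputs are log2 intensities, the difference of means is the log2 fold change. A p-value would follow from the t distribution with `df` degrees of freedom, for instance via `scipy.stats.ttest_ind` with its default `equal_var=True`.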

---

### 6. s0-based FDR Correction
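- Uses **s0-based thresholding** so that proteins with very small fold changes are not called significant on the strength of low variance alone
- Combines into a single significance criterion:
  - log2 fold change
  - p-value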
- Based on:
  - Giai Gianetto et al. (2016)
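
For intuition, the fudge factor discussed by Giai Gianetto et al. is commonly written as a constant added to the denominator of the t statistic (the SAM-style form). The sketch below shows that form only; it is an assumption that the skill applies `s0` in exactly this way, and the function name `s0_statistic` is hypothetical.

```python
def s0_statistic(diff, se, s0=0.1):
    """SAM-style statistic: s0 is added to the standard error of the difference."""
    return diff / (se + s0)


# Two proteins with the same ordinary t statistic (diff / se = 5)
# but very different log2 fold changes.
small_effect = s0_statistic(diff=0.1, se=0.02)
large_effect = s0_statistic(diff=2.0, se=0.4)
print(small_effect, large_effect)
```

With `s0 = 0.1` the small-effect protein drops to roughly 0.83 while the large-effect protein stays at about 4, which is how the criterion penalises tiny fold changes that look significant only because their variance is low.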

---

### 7. Significance Thresholding
- Default:
  - `FDR = 0.05`
  - `s0 = 0.1`
- Produces:
  - Adjusted significance boundary (used in volcano plot)
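
One way to picture the boundary: if a protein counts as significant when `|diff| / (se + s0) >= t_threshold`, then rearranging for the ordinary statistic `t = |diff| / se` gives `t >= t_threshold * |diff| / (|diff| - t_threshold * s0)`, valid only when `|diff| > t_threshold * s0`. The sketch below evaluates that curve. It assumes the SAM-style form of `s0`; `t_threshold` is a hypothetical cutoff, and how the skill derives it from the FDR setting is not documented here.

```python
import math


def boundary_t(abs_log2_fc, t_threshold, s0=0.1):
    """Smallest ordinary |t| that is significant at a given |log2 fold change|."""
    margin = abs_log2_fc - t_threshold * s0
    if margin <= 0:
        return math.inf  # fold change too small to pass at any variance
    return t_threshold * abs_log2_fc / margin


for fc in (0.1, 0.5, 1.0, 4.0):
    print(fc, boundary_t(fc, t_threshold=2.0))
```

The required `t` is infinite near zero fold change and approaches `t_threshold` as the fold change grows, which produces the curved cutoff typical of volcano plots. Converting the required `t` to a p-value with the chosen degrees of freedom gives the y-axis position of the curve.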

---

### 8. Visualization Outputs
- PCA plot
- Volcano plot (with s0 curve)
- Imputation distribution comparison

---

## Safety Rules

- **Local-first**
  - No data upload without explicit user consent

- **Statistical caution**
  - Interpret statistical results with caution
  - Do not draw conclusions beyond what the data supports

- **Missing data assumptions**
  - Imputation assumes missing values correspond to low abundance
  - May not hold in all experimental designs

- **Small sample limitations**
  - t-test reliability depends on sufficient replicates

- **Reproducibility**
  - All parameters and commands are logged

- **No hallucinated science**
  - All methods are based on established proteomics workflows

---

## Agent Boundary

### This skill DOES:
- Perform differential expression analysis on LFQ proteomics data
- Handle MaxQuant and DIA-NN outputs
- Generate statistical results and visualizations
- Produce reproducible reports

---

### This skill DOES NOT:
- Process raw mass spectrometry data (e.g. RAW files)
- Perform peptide identification or database search
- Conduct pathway or functional enrichment analysis
- Provide biological interpretation of results

---

## Input Contract

### Supported Input Formats
1. MaxQuant `proteinGroups.txt`
2. DIA-NN output (`.tsv` / `.txt`)

---

### Metadata Requirements
- `.csv` or `.tsv`
- Must include:
  - `sample_id`
  - `group`

`sample_id` values may be given as:
- raw file names
- full paths (e.g. `/path/sample.raw`)
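
A sketch of how a metadata file could be validated before a run. The required column names come from the list above. Everything else is an assumption for illustration: the helper name `load_metadata`, the example rows, and the idea that names and paths are matched by reducing both to a bare file stem.

```python
import csv
import io
import os

REQUIRED = {"sample_id", "group"}


def load_metadata(handle, delimiter=","):
    """Read metadata and return a {sample stem: group} mapping."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"metadata is missing columns: {sorted(missing)}")
    groups = {}
    for row in reader:
        # Reduce "/path/sample.raw" and "sample.raw" to the same key.
        stem = os.path.splitext(os.path.basename(row["sample_id"]))[0]
        groups[stem] = row["group"]
    return groups


example = "sample_id,group\n/data/ctrl_1.raw,control\ntreat_1.raw,treated\n"
meta = load_metadata(io.StringIO(example))
print(meta)  # {'ctrl_1': 'control', 'treat_1': 'treated'}
```

For a `.tsv` metadata file, pass `delimiter="\t"`.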

---

## Output Structure
```
proteomics_de_report/
├── report.md
├── figures/
│   ├── imputation_distribution.png
│   ├── pca.png
│   └── volcano.png
├── tables/
│   ├── imputed_proteinGroups.csv
│   └── de_results.csv
└── reproducibility/
    ├── commands.sh
    ├── environment.yml
    └── checksums.sha256
```
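
To check a report file against `reproducibility/checksums.sha256`, its digest can be recomputed locally. The sketch below assumes the file lists SHA-256 digests of the report files; its exact line layout is not documented here, so the comparison is left to the reader. The helper name `sha256_of` and the demo file contents are made up.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Self-contained demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "de_results.csv")
    with open(demo_path, "wb") as handle:
        handle.write(b"protein,log2fc\nP12345,1.5\n")
    demo_digest = sha256_of(demo_path)
print(demo_digest)
```

In practice, call `sha256_of` on a file such as `tables/de_results.csv` and compare the result with the recorded digest.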

---

## Usage

### Demo
```bash
python proteomics_de.py \
  --demo \
  --output report_dir
```

### MaxQuant Input
```bash
python proteomics_de.py \
  --input proteinGroups.txt \
  --input-type maxquant \
  --metadata metadata.csv \
  --contrast "treated,control" \
  --output report_dir
```

### DIA-NN Input
```bash
python proteomics_de.py \
  --input diann_output.tsv \
  --input-type diann \
  --metadata metadata.csv \
  --contrast "treated,control" \
  --output report_dir
```

### Parameters

| Parameter            | Description                                      | Default         |
| -------------------- | ------------------------------------------------ | --------------- |
| `--demo`             | Run the demo (used without `--input`, see Demo)  | -               |
| `--input`            | Input file path                                  | -               |
| `--input-type`       | Input format: `maxquant` or `diann`              | maxquant        |
| `--metadata`         | Metadata file (`.csv` or `.tsv`)                 | -               |
| `--contrast`         | Groups to compare, as `treatment,control`        | treated,control |
| `--s0`               | s0 parameter                                     | 0.1             |
| `--fdr`              | FDR threshold                                    | 0.05            |
| `--ttest-df`         | t-test degrees of freedom                        | 4               |
| `--imputation-shift` | Imputation `shift` (down-shift of the mean)      | 1.8             |
| `--imputation-scale` | Imputation `scale`                               | 0.3             |
| `--output`           | Output directory                                 | -               |

## References
- test_proteinGroups.txt is from: Keilhauer EC, Hein MY, Mann M. Accurate protein complex retrieval by affinity enrichment mass spectrometry (AE-MS) rather than affinity purification mass spectrometry (AP-MS). Mol Cell Proteomics. 2015 Jan;14(1):120-35. doi: 10.1074/mcp.M114.041012. Epub 2014 Nov 2. PMID: 25363814; PMCID: PMC4288248. 
- s0 correction algorithm is from: Giai Gianetto Q, Couté Y, Bruley C, Burger T. Uses and misuses of the fudge factor in quantitative discovery proteomics. Proteomics. 2016 Jul;16(14):1955-60. doi: 10.1002/pmic.201600132. PMID: 27272648.
- s0 correction algorithm is cited by: Michaelis AC, Brunner AD, Zwiebel M, Meier F, Strauss MT, Bludau I, Mann M. The social and structural architecture of the yeast protein interactome. Nature. 2023 Dec;624(7990):192-200. doi: 10.1038/s41586-023-06739-5. Epub 2023 Nov 15. PMID: 37968396; PMCID: PMC10700138.

Related Skills

All from ClawBio/ClawBio (658 stars).

- wes-clinical-report-es: Generates professional clinical PDF reports in Spanish from WES (Whole Exome Sequencing) data with clinical interpretation, pharmacogenomic alerts, and follow-up recommendations.
- wes-clinical-report-en: Generates professional clinical PDF reports in English from WES (Whole Exome Sequencing) data with clinical interpretation summary, pharmacogenomic alerts, and follow-up recommendations.
- vcf-annotator: Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.
- variant-annotation: Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.
- ukb-navigator: Semantic search across UK Biobank's 12,000+ data fields and publications to find the right variables for your research question.
- target-validation-scorer: Evidence-grounded target validation scoring with GO/NO-GO decisions for drug discovery campaigns.
- struct-predictor: Protein structure prediction with Boltz-2. Accepts YAML inputs (single protein or multi-chain complex), runs boltz predict, extracts per-residue pLDDT and PAE confidence, and writes a markdown report with figures.
- soul2dna: Compile SOUL.md character profiles into synthetic diploid genomes (.genome.json) via trait-to-allele mapping.
- seq-wrangler: Sequence QC, alignment, and BAM processing. Wraps FastQC, BWA/Bowtie2, SAMtools for automated read-to-BAM pipelines.
- scrna-orchestrator: Local Scanpy pipeline for single-cell RNA-seq QC, optional doublet detection, clustering, marker discovery, optional CellTypist annotation, optional latent downstream mode from integrated.h5ad/X_scvi, and optional dataset-level plus within-cluster contrastive marker analysis from raw-count .h5ad or 10x Matrix Market input.
- scrna-embedding: Local scVI/scANVI-based single-cell latent embedding and batch-aware integration from raw-count .h5ad or 10x Matrix Market input, with stable integrated AnnData export for downstream latent analysis.
- rnaseq-de: Differential expression analysis for bulk RNA-seq and pseudo-bulk count matrices with QC, PCA, and contrast testing.