icml-reviewer
Paper reviewer that evaluates machine learning research projects following official ICML reviewer guidelines. Provides comprehensive reviews with actionable feedback across all key dimensions: claims/evidence, relation to prior work, originality, significance, clarity, and reproducibility. Also provides formative feedback on incomplete drafts, proposals, and research code repositories. MANDATORY TRIGGERS: review paper, ICML review, paper review, evaluate paper, research paper feedback, ML paper review, conference review, academic review, paper critique, NeurIPS review, ICLR review, project proposal, research proposal, paper draft, early feedback, incomplete paper, work in progress, WIP review, review repo, review codebase, research project review
Best use case
icml-reviewer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using icml-reviewer should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/icml-reviewer/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How icml-reviewer Compares
| Feature / Agent | icml-reviewer | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Paper reviewer that evaluates machine learning research projects following official ICML reviewer guidelines. Provides comprehensive reviews with actionable feedback across all key dimensions: claims/evidence, relation to prior work, originality, significance, clarity, and reproducibility. Also provides formative feedback on incomplete drafts, proposals, and research code repositories. MANDATORY TRIGGERS: review paper, ICML review, paper review, evaluate paper, research paper feedback, ML paper review, conference review, academic review, paper critique, NeurIPS review, ICLR review, project proposal, research proposal, paper draft, early feedback, incomplete paper, work in progress, WIP review, review repo, review codebase, research project review
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
Cursor vs Codex for AI Workflows
Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.
SKILL.md Source
# ICML Paper Reviewer

Enables rigorous review of ML research papers following official ICML guidelines.

## Workflow

### Step 1: Input Analysis & Mode Selection

**Determine input type:**
- **Complete paper**: PDF/text with abstract, methodology, experiments, results → Full Review Mode
- **Incomplete document**: Missing major sections, labeled draft/proposal, or user indicates early stage → Early-Stage Feedback Mode
- **Code repository**: User points to folder/repo path → Repository Review Mode

**For complete papers**, extract: title, abstract, main claims, methodology, experiments, results. Identify paper type: theoretical, methodological, algorithmic, empirical, bridge paper, or application-driven.

**For code repositories**, first explore: read README, scan code structure, find experiment scripts/results, identify the research question and what's implemented.

### Step 2: Prior Work Grounding (Critical - All Modes)

This step applies to ALL input types. Grounding in reality is essential for any meaningful feedback.

1. Generate 3-5 search queries based on the research topic: benchmarks/baselines, same problem, related techniques
2. Use WebSearch to find recent arXiv papers and published work
3. Fetch abstracts of 5-10 most relevant papers
4. **Critically synthesize**:
   - What specific claims in this paper are already addressed by prior work?
   - What are the actual quantitative improvements over recent baselines?
   - Are claimed "novelties" actually novel given the literature?
   - What gaps truly exist vs. what the authors claim exists?
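
As a rough illustration, the mode selection from Step 1 and the three query angles from Step 2 could be sketched in Python as follows. The `ReviewInput` fields, the `REQUIRED_SECTIONS` set, and the query wording are all assumptions made for this sketch; the skill itself leaves these decisions to the agent's judgment.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class ReviewMode(Enum):
    # Mode names as they appear in Step 1.
    FULL_REVIEW = "Full Review Mode"
    EARLY_STAGE = "Early-Stage Feedback Mode"
    REPOSITORY = "Repository Review Mode"


# Sections Step 1 lists for a complete paper (hypothetical encoding).
REQUIRED_SECTIONS = frozenset({"abstract", "methodology", "experiments", "results"})


@dataclass
class ReviewInput:
    # Hypothetical description of what the user handed over.
    is_repo_path: bool = False             # user pointed at a folder/repo
    sections: frozenset = frozenset()      # section names detected in the document
    labeled_draft: bool = False            # marked draft/proposal or "early stage"


def select_mode(item: ReviewInput) -> ReviewMode:
    """Pick a review mode following the three cases in Step 1."""
    if item.is_repo_path:
        return ReviewMode.REPOSITORY
    # Missing major sections or an explicit draft label means early-stage feedback.
    if item.labeled_draft or not REQUIRED_SECTIONS <= item.sections:
        return ReviewMode.EARLY_STAGE
    return ReviewMode.FULL_REVIEW


def build_queries(topic: str, technique: str) -> list[str]:
    """One query per angle named in Step 2; add more to reach the 3-5 range."""
    return [
        f"{topic} benchmark baselines",   # benchmarks/baselines
        f"{topic} recent arXiv",          # same problem
        f"{technique} {topic}",           # related techniques
    ]
```

A document that contains all four sections and carries no draft label maps to `ReviewMode.FULL_REVIEW`; anything else that is not a repository path falls back to early-stage feedback.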

**Critical mindset**:
- Your job is to verify claims against reality, not accept them at face value
- Most papers overclaim; your review should ground their contributions in what the literature actually shows
- Default to skepticism: Assume claims are overstated until proven otherwise by evidence
- Authors have selection bias toward their own work; you represent the community's interests
- Be the critical voice that ensures published work actually advances the field

Then proceed to mode-specific evaluation.

---

## Full Review Mode (Complete Papers)

### Step 3: Systematic Evaluation

Evaluate across 7 dimensions (see `references/evaluation-criteria.md`). **Default to skepticism: require strong evidence to score highly.**

| Dimension | Key Questions (Answer with Literature Evidence) |
|-----------|---------------|
| Originality | Is this truly novel given recent work X, Y, Z? What specific aspects are incremental vs. novel? |
| Importance | Why does this problem matter? What's the real-world impact? Who will care? |
| Claims Support | Do experiments actually prove the claims? What alternative explanations exist? |
| Experimental Soundness | Are baselines from 2023+? Are comparisons fair? What's missing? |
| Clarity | Can I reproduce this from the paper? Are claims precisely stated? |
| Community Value | Will this change how people work? Or just add noise? |
| Prior Work Context | Are comparisons accurate? What recent work (last 2 years) is missing? |

**Evaluation mindset**:
- Start from neutral and require evidence to move up or down
- Compare every claim against what you found in the literature search
- Most papers are incremental; high originality scores are rare
- Weak baselines or missing comparisons are critical flaws, not minor issues

### Step 4: Critical Cross-Check Against Literature

Before writing the review, explicitly verify:

1. **Baselines check**: List baselines used in paper. List baselines from your literature search of adjacent papers. What's missing?
2. **Methodology check**: How do 2-3 adjacent papers approach this problem? Does this paper follow similar methodology? If not, why not?
3. **Claims check**: List main claims. For each, cite specific evidence from experiments or proofs. If insufficient, note it.
4. **Citations check**: Which papers from your search are cited? Which are missing? Why?
5. **Novelty check**: List claimed novelties. For each, cite specific prior work that does or doesn't do this.

This step is not optional. Your review must reference specific findings from your literature search.

### Step 5: Generate Review

Follow the ICML review form (see `references/review-template.md`):

1. **Summary** - Neutral, factual (should not be disputed by authors)
2. **Claims and Evidence** - Are claims supported? **Compare to what literature shows**
3. **Relation to Prior Work** - Proper context? Missing citations? **List specific missing papers**
4. **Strengths** - Specific and substantive, **compared to standards in adjacent work**
5. **Weaknesses** - Constructive, explain severity, **cite specific literature for comparison**
6. **Questions for Authors** - Numbered, explain impact on evaluation
7. **Minor Issues** - Typos, suggestions
8. **Overall Recommendation** - 1-5 scale with justification **grounded in literature comparison**
9. **Confidence Score** - 1-5 scale

### Step 6: Quality Check

- Verify all claims in review are substantiated
- Ensure constructive tone
- Check specificity of strengths/weaknesses
- Confirm questions are actionable

## Key Principles

### Be Rigorous AND Constructive

Your primary duty is to the research community: publishing weak papers dilutes the literature.

- **Be honest**: Don't inflate scores to be nice. If baselines are weak, say so clearly.
- **Be specific**: Always cite which literature contradicts or supports claims.
- **Be fair**: Criticism should be substantiated by evidence or literature.
- **Be actionable**: Tell authors exactly what would fix the issues.

"Review the papers of others as you would wish your own to be reviewed", with rigor, honesty, and specific feedback grounded in the literature.

### Be Specific

Bad: "The experiments are weak"

Good: "Experiments compare only against [X] from 2019, but recent baselines [Y] (2024) and [Z] (2024) should be included."

### Fair Novelty Assessment

Originality may arise from: creative combinations, new domains, removing restrictive assumptions, novel datasets, new problem formulations.

**But**: Most claimed novelty is actually incremental. Verify against literature before accepting novelty claims.

### Score Calibration

Use this reference frame:
- **5s are rare**: Reserve for papers that will clearly influence the field
- **4s are uncommon**: Solid papers with rigorous execution and clear contributions
- **3s are common**: Papers with merit but significant limitations
- **2s are common**: Incremental work or work with major methodological issues
- **1s indicate fundamental problems**: Wrong results, no contribution, or severe ethical issues

If you find yourself giving mostly 4s and 5s, you're likely being too generous. Re-calibrate against what the literature shows is standard.

### Application-Driven Papers

For application-driven ML: methods should fit real-world constraints, non-standard datasets acceptable if documented, compare against domain baselines.

## Rating Scales

**Overall (1-5):** Use the full range. Most papers should be 2-3.
- **5 (Strong Accept)**: Significant contribution, will be influential, no major flaws
- **4 (Accept)**: Solid contribution, rigorous execution, minor issues only
- **3 (Weak Accept)**: Contribution exists but limited; or good idea with execution flaws
- **2 (Weak Reject)**: Incremental contribution insufficient for venue; or significant methodological issues
- **1 (Reject)**: Fundamental flaws, not ready, or no meaningful contribution

**Red flags that should lower scores**:
- Baselines older than 2 years (unless explicitly justified)
- Missing comparisons to obvious related work from literature search
- Claims not directly supported by presented experiments
- Novelty claims contradicted by prior work

**Confidence (1-5):** 5=Expert/certain, 4=Confident, 3=Fairly confident, 2=Uncertain, 1=Not in area

---

## Early-Stage Feedback Mode

Use this mode for incomplete drafts, research proposals, or code repositories. Focus shifts from "accept/reject evaluation" to "constructive guidance on how to make this publishable."

After completing Steps 1-2 (input analysis and prior work grounding), proceed here.

### Step 3: Generate Formative Feedback

Use the Early-Stage Feedback Template (see `references/review-template.md`). No numerical scores; focus on constructive guidance.

**For code repositories**, additionally address:
- Code quality and organization
- Experiment design and reproducibility
- What's missing for a paper (baselines, ablations, analysis)

## References

- `references/evaluation-criteria.md` - Detailed criteria for each dimension
- `references/review-template.md` - Full template with examples
- `references/common-issues.md` - Common paper issues to identify
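
Two of the red flags listed under Rating Scales (baselines older than 2 years, and comparisons missing relative to the literature search) are mechanical enough to sketch in code, as is the Step 4 baselines check they correspond to. The Python below is only an illustration: the function names, the name-normalization rule, and the placeholder method names are assumptions, and deciding whether a missing comparison is "obvious" remains a reviewer judgment.

```python
from __future__ import annotations

# Labels copied from the Overall (1-5) scale.
OVERALL_LABELS = {
    5: "Strong Accept",
    4: "Accept",
    3: "Weak Accept",
    2: "Weak Reject",
    1: "Reject",
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so minor spelling variants match."""
    return " ".join(name.lower().split())


def stale_baselines(baseline_years: dict[str, int], review_year: int,
                    max_age: int = 2) -> list[str]:
    """Baselines whose publication year is more than max_age years back."""
    return sorted(
        name for name, year in baseline_years.items()
        if review_year - year > max_age
    )


def missing_comparisons(in_paper: list[str], in_search: list[str]) -> list[str]:
    """Methods found in the literature search that the paper never compares to."""
    compared = {normalize(n) for n in in_paper}
    return sorted({normalize(n) for n in in_search} - compared)


def format_recommendation(score: int) -> str:
    """Render an overall score with its label, rejecting out-of-range values."""
    if score not in OVERALL_LABELS:
        raise ValueError(f"overall score must be 1-5, got {score}")
    return f"{score} ({OVERALL_LABELS[score]})"


if __name__ == "__main__":
    # Placeholder method names, not real papers.
    years = {"Method X": 2019, "Method Y": 2024}
    print(stale_baselines(years, review_year=2025))                     # ['Method X']
    print(missing_comparisons(["Method X"], ["method x", "Method Z"]))  # ['method z']
    print(format_recommendation(2))                                     # 2 (Weak Reject)
```

The output of such checks is raw material for the Weaknesses section, not a score: the "unless explicitly justified" exception for old baselines still has to be assessed by reading the paper.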
Related Skills
cs448b-visualization
Data visualization design based on Stanford CS448B. Use for: (1) choosing chart types, (2) selecting visual encodings, (3) critiquing visualizations, (4) building D3.js visualizations, (5) designing interactions/animations, (6) choosing colors, (7) visualizing networks, (8) visualizing text. Covers Bertin, Mackinlay, Cleveland & McGill.
training-data-curation
Guidelines for creating high-quality datasets for LLM post-training (SFT/DPO/RLHF). Use when preparing data for fine-tuning, evaluating data quality, or designing data collection strategies.
tinker
Fine-tune LLMs using the Tinker API. Covers supervised fine-tuning, reinforcement learning, LoRA training, vision-language models, and both high-level Cookbook patterns and low-level API usage.
tinker-training-cost
Calculate training costs for Tinker fine-tuning jobs. Use when estimating costs for Tinker LLM training, counting tokens in datasets, or comparing Tinker model training prices. Tokenizes datasets using the correct model tokenizer and provides accurate cost estimates.
skill
Find, install, create, improve, and publish AI agent skills through the Sundial ecosystem. Use when the user wants to find or search for skills, install a skill, create a new skill, improve or evaluate an existing skill, or publish a skill to Sundial Hub. Trigger phrases include "find a skill", "install skill", "create a skill", "make a skill", "improve this skill", "evaluate skill", "publish skill", "push skill", "search for skills".
skill-to-card
End-to-end workflow that creates a skill from a description and attached files, publishes it to Sundial as a private skill, generates a trading card (front + back with QR code), and sends it to a printer. Use when the user wants to create a skill and get a printed trading card, or says "skill to card", "create and print a skill card", "make me a skill with a card".
project-referee
Critiques ML conference papers with reviewer-style feedback. Use when users want to anticipate reviewer concerns, identify weaknesses, check claim-evidence gaps, or find missing citations.
neuro-symbolic-reasoning
Neuro-symbolic AI combining LLMs with symbolic solvers. Use when exploring neuro-symbolic approaches (ideation, no code) or implementing solver integrations (code).
cs-research-methodology
Conduct a literature review and develop a CS research proposal. Use when asked to review a research area, find gaps in existing work, and propose a novel research contribution. The output is a research proposal identifying an assumption to challenge (the "bit flip") and how to validate it.
commit-splitter
Split large sets of uncommitted changes into logical, well-organized commits. Use when the user has many uncommitted changes and wants structured commits, or proactively suggest when detecting a large diff that would benefit from splitting.
codex
Run OpenAI's Codex CLI agent in non-interactive mode using `codex exec`. Use when delegating coding tasks to Codex, running Codex in scripts/automation, or when needing a second agent to work on a task in parallel.
ai-co-scientist
Transform Claude Code into an AI Scientist that orchestrates research workflows using tree-based hypothesis exploration. Triggers on "research project", "scientific experiment", "run experiments", "AI scientist", "tree search experimentation", "systematic study".