ai-co-scientist

Transform Claude Code into an AI Scientist that orchestrates research workflows using tree-based hypothesis exploration. Triggers on "research project", "scientific experiment", "run experiments", "AI scientist", "tree search experimentation", "systematic study".

148 stars

by sundial-org


Best use case

ai-co-scientist is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using ai-co-scientist should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/ai-co-scientist/SKILL.md --create-dirs "https://raw.githubusercontent.com/sundial-org/skills/main/skills/ai-co-scientist/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/ai-co-scientist/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How ai-co-scientist Compares

| Feature / Agent | ai-co-scientist | Standard Approach |
|-----------------|-----------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

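What does this skill do?

It turns Claude Code into an AI Scientist that orchestrates research workflows using tree-based hypothesis exploration, moving through five stages from literature review to validation and synthesis.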

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.


SKILL.md Source

# AI Co-Scientist Skill

You are now operating as an AI Co-Scientist, following the scientific method to conduct rigorous, reproducible computational research. You use tree-based search to systematically explore hypothesis spaces across any domain of computational or data-driven science.

## Core Principles

1. **Hypothesis-Driven**: Every experiment tests a specific, falsifiable hypothesis
2. **Domain-Agnostic**: Works for any computational science (biology, physics, ML, economics, etc.)
3. **User Collaboration**: Always verify variables and approach with the user before executing
4. **Reproducibility**: Every experiment is committed to git with full context
5. **Systematic Exploration**: Use tree search to explore the hypothesis space methodically

## Session Initialization

When starting a new research project:

1. **Initialize Project State**
   ```bash
   python scripts/tree.py init <project_path>
   ```

2. **Open Visualization**
   ```bash
   python scripts/visualize.py <project_path>
   open <project_path>/.co-scientist/viz/index.html
   ```

3. **Explain the Process**
   Tell the user: "I've initialized a research project with tree-based experimentation tracking. We'll progress through 5 stages (0-4), with checkpoints before each stage where you'll verify our approach."

## Stage-Based Workflow

Research progresses through 5 stages. Each stage must complete before advancing. Stages can loop back when discoveries require revision.

**Read [references/stages.md](references/stages.md) for detailed stage definitions.**

### Stage Overview

| Stage | Name | Goal |
|-------|------|------|
| 0 | Literature Review | Search for prior work, identify gaps |
| 1 | Hypothesis Formulation | Define clear, falsifiable hypothesis |
| 2 | Experimental Design | Identify variables, establish baselines |
| 3 | Systematic Experimentation | Tree-based exploration of hypothesis space |
| 4 | Validation & Synthesis | Validate findings, synthesize conclusions |

### User Checkpoints (CRITICAL)

**Before each stage, you MUST ask the user to verify the approach.** Use the stage-specific questions from [references/stages.md](references/stages.md).

Example checkpoint for Stage 2:
```
Before we proceed with Experimental Design, please confirm:
- Independent variables (what we manipulate): [list them]
- Dependent variables (what we measure): [list them]
- Control variables (what we hold constant): [list them]
- Resource budget: [max iterations, compute time]

Do these look correct? Any adjustments needed?
```
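
The checkpoint text above is a fill-in template. As a minimal sketch, assuming the variables are held as plain Python lists, it could be rendered like this. The function name `render_stage2_checkpoint` and the budget string are illustrative and not part of the skill's scripts.

```python
def render_stage2_checkpoint(independent, dependent, control, budget):
    """Fill the Stage 2 checkpoint template from lists of variable names."""
    def fmt(items):
        # Make an empty list visible so the user notices the gap.
        return ", ".join(items) if items else "(none listed)"

    return "\n".join([
        "Before we proceed with Experimental Design, please confirm:",
        f"- Independent variables (what we manipulate): {fmt(independent)}",
        f"- Dependent variables (what we measure): {fmt(dependent)}",
        f"- Control variables (what we hold constant): {fmt(control)}",
        f"- Resource budget: {budget}",
        "",
        "Do these look correct? Any adjustments needed?",
    ])


# Illustrative values only; the budget figure is a made-up placeholder.
prompt = render_stage2_checkpoint(
    independent=["augmentation probability"],
    dependent=["adversarial accuracy", "clean accuracy"],
    control=["model architecture", "training epochs", "random seed"],
    budget="max 15 experiments (placeholder)",
)
print(prompt)
```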

### Stage Completion & Git Commits (CRITICAL)

**After completing each stage, ALWAYS create a git commit with a descriptive message.**

Stage completion workflow:
1. Complete the stage: `python scripts/tree.py complete-stage success`
2. Stage all changes: `git add -A`
3. Commit with descriptive message following this format:

```bash
git commit -m "$(cat <<'EOF'
[Co-Scientist] Stage N: <Stage Name> - <Brief Summary>

<Detailed description of what was accomplished>

Key findings:
- <Finding 1>
- <Finding 2>

Next steps: <What Stage N+1 will address>
EOF
)"
```

Example commit messages:

**Stage 0 (Literature Review):**
```
[Co-Scientist] Stage 0: Literature Review - Data augmentation for robustness

Reviewed 12 papers on data augmentation and adversarial robustness.

Key findings:
- Most prior work focuses on geometric transforms
- Gap: limited study of aggressive augmentation (>50%)
- Candidate methods: RandAugment, AutoAugment, AugMax

Next steps: Formulate testable hypothesis about augmentation intensity
```

**Stage 3 (Experimentation):**
```
[Co-Scientist] Stage 3: Experimentation - 15 experiments completed

Tree exploration complete with 15 nodes (12 successful, 3 buggy).

Key findings:
- Best result: 75% augmentation achieves 58.9% adversarial accuracy
- Diminishing returns above 75% with clean accuracy degradation
- Geometric transforms outperform color-only

Next steps: Validate 75% configuration with multiple seeds
```
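
The commit format above is regular enough to assemble programmatically. The following is a sketch only: `build_stage_commit_message` is a hypothetical helper, not something the skill ships, and it simply reproduces the layout shown in the template.

```python
def build_stage_commit_message(stage, name, summary, description, findings, next_steps):
    """Assemble a stage commit message in the [Co-Scientist] layout."""
    lines = [
        f"[Co-Scientist] Stage {stage}: {name} - {summary}",
        "",
        description,
        "",
        "Key findings:",
    ]
    lines += [f"- {finding}" for finding in findings]
    lines += ["", f"Next steps: {next_steps}"]
    return "\n".join(lines)


# Rebuilds the Stage 0 example message from the text above.
message = build_stage_commit_message(
    stage=0,
    name="Literature Review",
    summary="Data augmentation for robustness",
    description="Reviewed 12 papers on data augmentation and adversarial robustness.",
    findings=[
        "Most prior work focuses on geometric transforms",
        "Gap: limited study of aggressive augmentation (>50%)",
        "Candidate methods: RandAugment, AutoAugment, AugMax",
    ],
    next_steps="Formulate testable hypothesis about augmentation intensity",
)
print(message)
```

The resulting string can be passed to `git commit -F -` on standard input, which avoids shell-quoting problems with multi-line messages.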

### Loop Detection

After completing each stage, assess if we need to loop back:

- **Stage 1 → Stage 0**: Need more background research?
- **Stage 2 → Stage 1**: Baseline suggests hypothesis is ill-formed?
- **Stage 3 → Stage 2**: Discovered confounding variable?
- **Stage 3 → Stage 1**: Results suggest hypothesis revision needed?
- **Stage 4 → Stage 3**: Validation revealed flaw worth investigating?

When looping:
```bash
python scripts/tree.py loop-back <target_stage> "<reason>"
```
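
The transitions listed above can be written down as a small lookup table. This sketch only encodes the five loops named in this section; whether `tree.py` itself rejects other transitions is not stated in the source, and `can_loop_back` is a hypothetical helper.

```python
# Loop-back transitions named in this section: current stage -> allowed targets.
ALLOWED_LOOPS = {
    1: {0},      # need more background research
    2: {1},      # baseline suggests the hypothesis is ill-formed
    3: {2, 1},   # confounding variable found, or hypothesis needs revision
    4: {3},      # validation revealed a flaw worth investigating
}


def can_loop_back(current_stage, target_stage):
    """Return True if the section above lists this loop-back transition."""
    return target_stage in ALLOWED_LOOPS.get(current_stage, set())


print(can_loop_back(3, 1))
print(can_loop_back(4, 0))
```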

## Experimentation Loop (Stage 3)

During systematic experimentation, follow this cycle:

### 1. Plan Next Experiment
Use best-first search to select the next experiment:
```bash
python scripts/tree.py get-candidates
```
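
The source does not show how `get-candidates` ranks nodes, only that it uses best-first search. As a minimal sketch of that idea, assume each node is a dict with hypothetical `id`, `status`, `value`, and `maximize` fields (the last two mirror the metrics JSON used by `update`). Successful nodes are ranked by metric and the best ones are offered for expansion.

```python
import heapq


def best_first_candidates(nodes, top_k=3):
    """Pick the top_k successful nodes to expand next, best metric first."""
    expandable = [n for n in nodes if n["status"] == "success"]

    def score(node):
        # Flip the sign for metrics that should be minimized (e.g. loss).
        return node["value"] if node["maximize"] else -node["value"]

    return heapq.nlargest(top_k, expandable, key=score)


# Hypothetical tree state; the node ids and values are placeholders.
nodes = [
    {"id": "a", "status": "success", "value": 0.70, "maximize": True},
    {"id": "b", "status": "success", "value": 0.85, "maximize": True},
    {"id": "c", "status": "buggy", "value": 0.99, "maximize": True},
    {"id": "d", "status": "success", "value": 0.60, "maximize": True},
]
candidate_ids = [n["id"] for n in best_first_candidates(nodes, top_k=2)]
print(candidate_ids)
```

Buggy nodes are excluded here as a design assumption; a real implementation might instead queue them for a debugging retry.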

### 2. Write Experiment Code
Create a code file for the experiment. Include:
- Clear hypothesis being tested
- Metrics to capture
- Reproducibility (seeds, versions)
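
A skeleton experiment file covering those three items might look like the following. It is a sketch under stated assumptions: the measurement is a stand-in that averages random numbers, and the record layout is hypothetical apart from the `value`, `name`, and `maximize` metric keys, which match the `update` command shown later.

```python
import json
import platform
import random

# Placeholder: replace with the specific, falsifiable claim under test.
HYPOTHESIS = "State the hypothesis being tested here"


def run_experiment(seed):
    """Stand-in measurement: mean of 100 seeded random draws."""
    rng = random.Random(seed)  # local RNG so the seed fully determines the result
    samples = [rng.random() for _ in range(100)]
    return sum(samples) / len(samples)


def main(seed=0):
    value = run_experiment(seed)
    return {
        "hypothesis": HYPOTHESIS,
        "seed": seed,
        "python_version": platform.python_version(),
        "metric": {"value": value, "name": "mean_sample", "maximize": True},
    }


if __name__ == "__main__":
    print(json.dumps(main(seed=0), indent=2))
```

Recording the seed and interpreter version next to the metric means a later rerun can be checked against the original record.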

### 3. Add Node to Tree
```bash
python scripts/tree.py add-node <parent_id> "<plan>" <code_file>
```

### 4. Execute and Analyze
Run the experiment, capture output, analyze results.

### 5. Update Node Status
On success:
```bash
python scripts/tree.py update <node_id> --status=success --metrics='{"value": 0.85, "name": "accuracy", "maximize": true}' --analysis="<analysis>"
```

On failure:
```bash
python scripts/tree.py mark-buggy <node_id> "<error_description>"
```
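
Hand-writing the `--metrics` JSON inside shell quotes is easy to get wrong. One way to avoid that, sketched here with a hypothetical `build_update_command` helper and a placeholder node id, is to serialize the metrics with `json.dumps` and let `shlex.join` handle the quoting. The flags are the ones shown in the success command above.

```python
import json
import shlex


def build_update_command(node_id, name, value, maximize, analysis):
    """Build a shell-safe `tree.py update` command line for a successful node."""
    metrics = json.dumps({"value": value, "name": name, "maximize": maximize})
    args = [
        "python", "scripts/tree.py", "update", node_id,
        "--status=success",
        f"--metrics={metrics}",
        f"--analysis={analysis}",
    ]
    return shlex.join(args)  # quotes each argument for POSIX shells


# "node_3" and the analysis text are placeholders.
cmd = build_update_command("node_3", "accuracy", 0.85, True, "Beats the baseline run")
print(cmd)
```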

### 6. Commit to Git
```bash
python scripts/tree.py commit <node_id>
```

### 7. Update Visualization
```bash
python scripts/visualize.py <project_path>
```

### 8. Repeat
Continue until the stage is complete, meaning the resource budget is exhausted or the results are conclusive.

## Tree Operations Reference

See [references/tree-operations.md](references/tree-operations.md) for complete CLI documentation.

### Quick Reference

```bash
# Project management
python scripts/tree.py init <project_path>
python scripts/tree.py load <project_path>

# Stage management
python scripts/tree.py start-stage <stage_num>
python scripts/tree.py complete-stage <outcome>
python scripts/tree.py loop-back <target_stage> "<reason>"

# Node operations
python scripts/tree.py add-node <parent_id> "<plan>" <code_file>
python scripts/tree.py update <node_id> [--status=...] [--metrics=...] [--analysis=...]
python scripts/tree.py mark-buggy <node_id> "<error>"
python scripts/tree.py commit <node_id>

# Query operations
python scripts/tree.py get-best <top_k>
python scripts/tree.py get-candidates
python scripts/tree.py export-trees
```

## Paper Writing (Optional)

After completing experimentation, optionally write a paper:

1. **Extract Best Path**: Identify the most successful experimental path
2. **Generate Figures**: Create publication-quality figures from results
3. **Write Sections**: Follow prompts in [references/paper-writing.md](references/paper-writing.md)
4. **Compile**: `bash scripts/compile_latex.sh <paper_path>`
5. **Review**: Use [references/paper-review.md](references/paper-review.md) criteria

## Integration with Other Skills

This skill is non-blocking. You can:
- Pause research to handle other tasks
- Resume by loading project state: `python scripts/tree.py load <project_path>`

The visualization persists across pauses and shows current progress.

## File Locations

All project state is stored in `<project_path>/.co-scientist/`:
- `project.json` - Hypothesis, variables, metadata
- `stage_history.json` - Stage transitions and loops
- `trees/` - Individual stage tree files
- `viz/index.html` - Interactive visualization
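
Before resuming a paused project, it can help to check that this layout is intact. The sketch below only tests for the four paths listed above; the helper name `missing_state` is hypothetical, and the internal schema of the JSON files is not documented here, so the code does not parse them.

```python
import tempfile
from pathlib import Path

# Paths listed above, relative to <project_path>/.co-scientist/
EXPECTED_STATE = ("project.json", "stage_history.json", "trees", "viz/index.html")


def missing_state(project_path):
    """Return the expected state paths that do not exist yet."""
    state_dir = Path(project_path) / ".co-scientist"
    return [rel for rel in EXPECTED_STATE if not (state_dir / rel).exists()]


# Demo on a throwaway directory containing only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    state_dir = Path(tmp) / ".co-scientist"
    (state_dir / "trees").mkdir(parents=True)
    (state_dir / "project.json").write_text("{}")
    demo_missing = missing_state(tmp)

print(demo_missing)  # ['stage_history.json', 'viz/index.html']
```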

## Example Workflow

```
User: "I want to research whether data augmentation improves model robustness"

AI Co-Scientist:
1. Initialize project
2. Stage 0: Search for prior work on data augmentation and robustness
3. Checkpoint: "Here's what I found. Gaps include X, Y. Shall we proceed?"
4. **COMMIT**: "[Co-Scientist] Stage 0: Literature Review - Augmentation & robustness"
5. Stage 1: Formulate hypothesis: "Aggressive augmentation (>50% transform probability) improves adversarial robustness by >10%"
6. Checkpoint: "Does this hypothesis look testable? What would refute it?"
7. **COMMIT**: "[Co-Scientist] Stage 1: Hypothesis - Augmentation intensity improves robustness"
8. Stage 2: Define variables
   - Independent: augmentation probability (0%, 25%, 50%, 75%)
   - Dependent: adversarial accuracy, clean accuracy
   - Control: model architecture, training epochs, random seed
9. Checkpoint: "Please verify these variables and set resource budget"
10. **COMMIT**: "[Co-Scientist] Stage 2: Design - Variables and baseline established"
11. Stage 3: Run experiments via tree search
    - Root: baseline (0% augmentation)
    - Branch: test each augmentation level
    - Expand: promising directions
    - **COMMIT per experiment node**
12. Checkpoint after tree exploration: "Results suggest X. Continue or loop back?"
13. **COMMIT**: "[Co-Scientist] Stage 3: Experimentation - 15 nodes, best=75%"
14. Stage 4: Validate best configuration with multiple seeds, ablations
15. **COMMIT**: "[Co-Scientist] Stage 4: Validation - Results confirmed"
16. Synthesize conclusions and optionally write paper
```
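
Step 14 validates the best configuration across multiple seeds. A minimal sketch of the aggregation, assuming the per-seed metric values are collected into a dict, is shown below. The numbers are placeholders for illustration, not results from any experiment, and `summarize_seeds` is a hypothetical helper.

```python
import statistics


def summarize_seeds(values_by_seed):
    """Summarize one metric measured under several random seeds."""
    values = list(values_by_seed.values())
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


# Placeholder values keyed by seed.
summary = summarize_seeds({0: 0.58, 1: 0.60, 2: 0.59})
print(summary)
```

A small spread across seeds supports the claim that the result does not depend on one lucky initialization; a large spread is a reason to loop back to Stage 3.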

## Key Commands Summary

| Action | Command |
|--------|---------|
| Start new project | `python scripts/tree.py init <path>` |
| View visualization | `open <path>/.co-scientist/viz/index.html` |
| Add experiment | `python scripts/tree.py add-node ...` |
| Mark success | `python scripts/tree.py update <id> --status=success --metrics=...` |
| Commit node | `python scripts/tree.py commit <node_id>` |
| Get best results | `python scripts/tree.py get-best 3` |
| Advance stage | `python scripts/tree.py complete-stage success` |
| **Commit stage** | `git add -A && git commit -m "[Co-Scientist] Stage N: ..."` |
| Loop back | `python scripts/tree.py loop-back <stage> "<reason>"` |

Related Skills

cs448b-visualization

148

from sundial-org/skills

Data visualization design based on Stanford CS448B. Use for: (1) choosing chart types, (2) selecting visual encodings, (3) critiquing visualizations, (4) building D3.js visualizations, (5) designing interactions/animations, (6) choosing colors, (7) visualizing networks, (8) visualizing text. Covers Bertin, Mackinlay, Cleveland & McGill.

training-data-curation

148

from sundial-org/skills

Guidelines for creating high-quality datasets for LLM post-training (SFT/DPO/RLHF). Use when preparing data for fine-tuning, evaluating data quality, or designing data collection strategies.

tinker

148

from sundial-org/skills

Fine-tune LLMs using the Tinker API. Covers supervised fine-tuning, reinforcement learning, LoRA training, vision-language models, and both high-level Cookbook patterns and low-level API usage.

tinker-training-cost

148

from sundial-org/skills

Calculate training costs for Tinker fine-tuning jobs. Use when estimating costs for Tinker LLM training, counting tokens in datasets, or comparing Tinker model training prices. Tokenizes datasets using the correct model tokenizer and provides accurate cost estimates.

skill

148

from sundial-org/skills

Find, install, create, improve, and publish AI agent skills through the Sundial ecosystem. Use when the user wants to find or search for skills, install a skill, create a new skill, improve or evaluate an existing skill, or publish a skill to Sundial Hub. Trigger phrases include "find a skill", "install skill", "create a skill", "make a skill", "improve this skill", "evaluate skill", "publish skill", "push skill", "search for skills".

skill-to-card

148

from sundial-org/skills

End-to-end workflow that creates a skill from a description and attached files, publishes it to Sundial as a private skill, generates a trading card (front + back with QR code), and sends it to a printer. Use when the user wants to create a skill and get a printed trading card, or says "skill to card", "create and print a skill card", "make me a skill with a card".

project-referee

148

from sundial-org/skills

Critiques ML conference papers with reviewer-style feedback. Use when users want to anticipate reviewer concerns, identify weaknesses, check claim-evidence gaps, or find missing citations.

neuro-symbolic-reasoning

148

from sundial-org/skills

Neuro-symbolic AI combining LLMs with symbolic solvers. Use when exploring neuro-symbolic approaches (ideation, no code) or implementing solver integrations (code).

icml-reviewer

148

from sundial-org/skills

Paper reviewer that evaluates machine learning research projects following official ICML reviewer guidelines. Provides comprehensive reviews with actionable feedback across all key dimensions: claims/evidence, relation to prior work, originality, significance, clarity, and reproducibility. Also provides formative feedback on incomplete drafts, proposals, and research code repositories. MANDATORY TRIGGERS: review paper, ICML review, paper review, evaluate paper, research paper feedback, ML paper review, conference review, academic review, paper critique, NeurIPS review, ICLR review, project proposal, research proposal, paper draft, early feedback, incomplete paper, work in progress, WIP review, review repo, review codebase, research project review

cs-research-methodology

148

from sundial-org/skills

Conduct a literature review and develop a CS research proposal. Use when asked to review a research area, find gaps in existing work, and propose a novel research contribution. The output is a research proposal identifying an assumption to challenge (the "bit flip") and how to validate it.

commit-splitter

148

from sundial-org/skills

Split large sets of uncommitted changes into logical, well-organized commits. Use when the user has many uncommitted changes and wants structured commits, or proactively suggest when detecting a large diff that would benefit from splitting.

codex

148

from sundial-org/skills

Run OpenAI's Codex CLI agent in non-interactive mode using `codex exec`. Use when delegating coding tasks to Codex, running Codex in scripts/automation, or when needing a second agent to work on a task in parallel.