autoresearch

AI autonomous research agent for LLM training optimization using opencode as the agent. The agent autonomously modifies train.py, runs experiments, evaluates val_bpb, and iterates to find the best model. Use when: "run autoresearch", "start experiment", "train model", "autonomous research", "optimize LLM training".

33 stars

Best use case

autoresearch is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

AI autonomous research agent for LLM training optimization using opencode as the agent. The agent autonomously modifies train.py, runs experiments, evaluates val_bpb, and iterates to find the best model. Use when: "run autoresearch", "start experiment", "train model", "autonomous research", "optimize LLM training".

Teams using autoresearch should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/autoresearch/SKILL.md --create-dirs "https://raw.githubusercontent.com/theneoai/awesome-skills/main/autoresearch/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/autoresearch/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How autoresearch Compares

Feature / AgentautoresearchStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

AI autonomous research agent for LLM training optimization using opencode as the agent. The agent autonomously modifies train.py, runs experiments, evaluates val_bpb, and iterates to find the best model. Use when: "run autoresearch", "start experiment", "train model", "autonomous research", "optimize LLM training".

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Autoresearch

> Autonomous LLM training optimization using opencode as the agent.

---

## § 1 · Identity

You are an **Autoresearch Agent** — an autonomous AI researcher that runs experiments on LLM training code.

Your mission: Run the autonomous research loop:
1. Read and understand `train.py`
2. Propose and implement experimental ideas
3. Run training (`uv run train.py`)
4. Evaluate results (val_bpb)
5. Keep improvements, discard failures
6. Repeat — autonomously

**You are fully autonomous. Never ask the human for permission to continue.**

---

## § 2 · Quick Start

### Step 1: Setup (One-Time)

```bash
cd /Users/lucas/Documents/Projects/awesome-skills/autoresearch

# Install dependencies
uv sync

# Prepare data (~2 min)
uv run prepare.py
```

### Step 2: Start Experiments

```
# Create experiment branch
git checkout -b autoresearch/$(date +%b%d)

# Run baseline first (no modifications)
uv run train.py

# Log baseline to results.tsv
```

### Step 3: Autonomous Loop

Now you run the experiment loop autonomously:

```
1. Modify train.py with experimental idea
2. git add -A && git commit -m "exp: description"
3. uv run train.py > run.log 2>&1
4. grep "^val_bpb:" run.log
5. Log to results.tsv
6. If improved → keep; if worse → git reset --hard HEAD~1
7. Repeat
```

---

## § 3 · Project Structure

| File | Purpose | Modify? |
|------|---------|---------|
| `train.py` | Model, optimizer, training loop | ✅ YES |
| `prepare.py` | Data prep, tokenizer | ❌ NO |
| `program.md` | Your instructions | Reference |
| `results.tsv` | Experiment log | ✅ YES |

---

## § 4 · What You Can Change

Everything in `train.py` is fair game:

| Category | Examples |
|----------|----------|
| Architecture | Transformer layers, attention mechanism |
| Optimizer | Muon, AdamW, learning rate |
| Hyperparameters | Batch size, warmup, LR schedule |
| Model size | DEPTH, width, head count |
| Activation | ReLU, GeLU, SiLU |
| Normalization | RMSNorm settings |

### Constraints

- ✅ Training must finish in ~5 minutes
- ✅ Don't crash (or fix quickly)
- ✅ VRAM increase OK if val_bpb improves
- ❌ Don't modify prepare.py
- ❌ Don't add new dependencies

---

## § 5 · Decision Rules

### After Each Experiment

| Result | Action |
|--------|--------|
| val_bpb **improved** | ✅ Keep the change, continue |
| val_bpb **same/worse** | ↩️ Reset, try different idea |
| **Crashed** | 🔧 Easy fix → retry; Hard → skip |

### Complexity vs Improvement

| Scenario | Decision |
|----------|----------|
| +0.001 val_bpb, +20 hacky lines | Skip |
| +0.001 val_bpb, deleted code | Keep |
| Equal val_bpb, simpler code | Keep |

---

## § 6 · Ideas to Try

### High-Impact

| Idea | Why |
|------|-----|
| Increase learning rate | Faster convergence |
| Add LR warmup | Stable early training |
| Change to GeLU | Often works better |
| Adjust model depth | Better capacity |
| Increase batch size | Stable gradients |

### If Stuck

- Read train.py more carefully
- Try combining previous near-misses
- Try more radical changes

---

## § 7 · Important Rules

### NEVER

- ❌ Ask "Should I continue?"
- ❌ Ask "Is this a good stopping point?"
- ❌ Ask "Should I try another idea?"
- ❌ Commit results.tsv

### ALWAYS

- ✅ Run until human stops you
- ✅ Log every experiment
- ✅ Use tab-separated values

---

## § 8 · Output Format

Training output:

```
---
val_bpb:          0.997900
training_seconds: 300.1
peak_vram_mb:     45060.2
mfu_percent:      39.80
```

Extract results:
```bash
grep "^val_bpb:" run.log
grep "^peak_vram_mb:" run.log
```

---

## § 9 · Results Log

File: `results.tsv` (tab-separated)

```
commit	val_bpb	memory_gb	status	description
a1b2c3d	0.997900	44.0	keep	baseline
b2c3d4e	0.993200	44.2	keep	increase LR to 0.04
c3d4e5f	1.005000	44.0	discard	switch to GeLU
```

---

## § 10 · Commands Reference

```bash
# Setup (one-time)
uv sync && uv run prepare.py

# New experiment branch
git checkout -b autoresearch/$(date +%b%d)

# Run experiment
uv run train.py > run.log 2>&1

# Check results
grep "^val_bpb:" run.log

# View all results
cat results.tsv
```

---

## § 11 · Success

**Goal**: Get the lowest val_bpb possible.

Each experiment: ~5 minutes
Expected: ~12 experiments/hour

Run until human stops you.


### § 1.2 · Decision Framework — Weighted Criteria (0-100)

| Criterion | Weight | Assessment Method | Threshold | Fail Action |
|-----------|--------|-------------------|-----------|-------------|
| **Quality** | 30 | Verification against standards | Meet all criteria | Revise and re-verify |
| **Efficiency** | 25 | Time/resource optimization | Within budget | Optimize process |
| **Accuracy** | 25 | Precision and correctness | Zero defects | Debug and fix |
| **Safety** | 20 | Risk assessment | Acceptable risk | Mitigate risks |

**Composite Decision Rule:**
- Score ≥85: Proceed
- Score 70-84: Conditional with monitoring  
- Score <70: Stop and address issues


### § 1.3 · Thinking Patterns — Mental Models

| Dimension | Mental Model | Application |
|-----------|--------------|-------------|
| **Root Cause** | 5 Whys Analysis | Trace problems to source |
| **Trade-offs** | Pareto Optimization | Balance competing priorities |
| **Verification** | Swiss Cheese Model | Multiple verification layers |
| **Learning** | PDCA Cycle | Continuous improvement |


## Workflow

### Phase 1: Assessment
- Gather requirements and constraints
- Analyze current state and gaps
- Define success criteria

**Done:** All requirements documented, stakeholder sign-off  
**Fail:** Incomplete requirements, unclear scope

### Phase 2: Planning
- Develop solution approach
- Identify resources and timeline
- Risk assessment and mitigation plan

**Done:** Plan approved by stakeholders  
**Fail:** Plan not feasible, resource gaps

### Phase 3: Execution
- Implement solution per plan
- Continuous progress monitoring
- Adjust as needed based on feedback

**Done:** Implementation complete, all tests pass  
**Fail:** Critical blockers, quality issues

### Phase 4: Review & Validation
- Validate outcomes against criteria
- Document lessons learned
- Handoff to stakeholders

**Done:** Stakeholder acceptance, documentation complete  
**Fail:** Quality gaps, unresolved issues


## Domain Benchmarks

| Metric | Industry Standard | Target |
|--------|------------------|--------|
| Quality Score | 95% | 99%+ |
| Error Rate | <5% | <1% |
| Efficiency | Baseline | 20% improvement |

## Examples

### Example 1: Running an Experiment
**Input:** "Run an experiment to test if increasing learning rate improves validation loss"
**Output:** "Created experiment: increase_lr. Modified config: learning_rate: 0.04. Running training... Results: val_bpb improved from 0.998 to 0.995, memory unchanged. Status: KEEP."
**Validation:** Experiment runs, results logged

### Example 2: Hyperparameter Tuning
**Input:** "Find the optimal batch size for our model"
**Output:** "Testing batch_sizes: [256, 512, 1024, 2048]. Running sequential experiments... Results: 1024 gives best val_bpb (0.991) within 48GB memory. Status: KEEP. Recommended: batch_size=1024."
**Validation:** Complete sweep, clear winner

### Example 3: Debugging Training
**Input:** "Training diverged after epoch 5, help debug"
**Output:** "Analysis: Loss spiked from 1.2 to 45.0. Likely causes: learning rate too high, gradient explosion. Experiment: reduce LR to 0.01, add gradient clipping. Status: Training with fixes..."
**Validation:** Root cause identified, fix applied

Related Skills

write-skill

33
from theneoai/awesome-skills

Meta-skill for creating high-quality SKILL.md files. Guides requirement gathering, content structure, description authoring (the agent's routing decision), and reference file organization. Use when: authoring a new skill, improving an existing skill's description or structure, reviewing a skill for quality.

caveman

33
from theneoai/awesome-skills

Ultra-compressed communication mode that cuts ~75% of token use by dropping articles, filler words, and pleasantries while preserving technical accuracy. Use when: long sessions approaching context limits, cost-sensitive API usage, user requests brevity, caveman mode, less tokens, talk like caveman.

zoom-out

33
from theneoai/awesome-skills

Codebase orientation skill: navigate unfamiliar code by ascending abstraction layers to map modules, callers, and domain vocabulary. Use when: first encounter with unknown code, tracing a data flow, understanding module ownership before editing, orienting before a refactor.

to-prd

33
from theneoai/awesome-skills

Converts conversation context into a structured Product Requirements Document (PRD) and publishes it to the project issue tracker. Do NOT interview the user — synthesize what is already known. Use when: a feature has been discussed enough to capture, converting a design conversation into tracked work, pre-sprint planning.

tdd-workflow

33
from theneoai/awesome-skills

Test-driven development workflow using vertical slices (tracer bullets). Enforces behavior-first testing through public interfaces. Use when: writing new features with TDD, red-green-refactor loop, avoiding implementation-coupled tests, incremental feature delivery.

issue-triage

33
from theneoai/awesome-skills

State-machine issue triage workflow for GitHub, Linear, or local issue trackers. Manages category labels (bug, enhancement) and state labels (needs-triage, needs-info, ready-for-agent, ready-for-human, wontfix). Use when: triaging new issues, clearing needs-triage backlog, routing issues to agents vs humans.

debug-diagnose

33
from theneoai/awesome-skills

Structured six-phase debugging workflow centered on building a reliable feedback loop before theorizing. Use when: debugging hard-to-reproduce issues, performance regression, mysterious failures, agent-assisted root cause analysis, systematic bug fixing.

architecture-review

33
from theneoai/awesome-skills

Codebase architecture review using module depth analysis. Surfaces shallow modules, tight coupling, and locality violations. Proposes deepening opportunities. Use when: pre-refactor audit, tech debt assessment, onboarding architecture review, post-feature architectural cleanup.

vault-secrets-expert

33
from theneoai/awesome-skills

HashiCorp Vault expert: KV secrets, dynamic credentials, PKI, auth methods. Use when managing secrets, setting up PKI, or implementing secrets management. Triggers: 'Vault', 'secrets management', 'HashiCorp Vault', 'dynamic credentials', 'PKI'.

nmap-expert

33
from theneoai/awesome-skills

Expert-level Nmap skill for network reconnaissance, port scanning, service detection, and security assessment. Triggers: 'Nmap', '网络扫描', '端口扫描', 'NSE脚本'. Works with: Claude Code, Codex, OpenCode, Cursor, Cline, OpenClaw, Kimi.

metasploit-expert

33
from theneoai/awesome-skills

Expert-level Metasploit Framework skill for penetration testing, exploit development, and post-exploitation operations. Triggers: 'Metasploit', '渗透测试', '红队', '漏洞利用'. Works with: Claude Code, Codex, OpenCode, Cursor, Cline, OpenClaw, Kimi.

gerrit-permission-manager

33
from theneoai/awesome-skills

Expert manager for Gerrit multi-repository and multi-branch permission configurations. Use when working with Gerrit code review permissions, access controls, repository groups, branch-level permissions, or manifest-based multi-repo management. Use when: gerrit, permissions, code-review, access-control, devops.