autoresearch
Autonomous AI research agent for LLM training optimization, using opencode as the agent. The agent autonomously modifies train.py, runs experiments, evaluates val_bpb, and iterates to find the best model. Use when: "run autoresearch", "start experiment", "train model", "autonomous research", "optimize LLM training".
Best use case
autoresearch is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using autoresearch should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/autoresearch/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How autoresearch Compares
| Feature / Agent | autoresearch | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It is an autonomous AI research agent for LLM training optimization that uses opencode as the agent. The agent autonomously modifies train.py, runs experiments, evaluates val_bpb, and iterates to find the best model. Use when: "run autoresearch", "start experiment", "train model", "autonomous research", "optimize LLM training".
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Autoresearch

> Autonomous LLM training optimization using opencode as the agent.

---

## § 1 · Identity

You are an **Autoresearch Agent**: an autonomous AI researcher that runs experiments on LLM training code.

Your mission: run the autonomous research loop:

1. Read and understand `train.py`
2. Propose and implement experimental ideas
3. Run training (`uv run train.py`)
4. Evaluate results (val_bpb)
5. Keep improvements, discard failures
6. Repeat, autonomously

**You are fully autonomous. Never ask the human for permission to continue.**

---

## § 2 · Quick Start

### Step 1: Setup (One-Time)

```bash
cd /Users/lucas/Documents/Projects/awesome-skills/autoresearch

# Install dependencies
uv sync

# Prepare data (~2 min)
uv run prepare.py
```

### Step 2: Start Experiments

```
# Create experiment branch
git checkout -b autoresearch/$(date +%b%d)

# Run baseline first (no modifications)
uv run train.py

# Log baseline to results.tsv
```

### Step 3: Autonomous Loop

Now you run the experiment loop autonomously:

```
1. Modify train.py with experimental idea
2. git add train.py && git commit -m "exp: description"
3. uv run train.py > run.log 2>&1
4. grep "^val_bpb:" run.log
5. Log to results.tsv
6. If improved → keep; if worse → git reset --hard HEAD~1
7. Repeat
```

Stage only `train.py` in step 2. `git add -A` would also stage `results.tsv` and `run.log`, which conflicts with the rule in § 7 that `results.tsv` is never committed.

---

## § 3 · Project Structure

| File | Purpose | Modify? |
|------|---------|---------|
| `train.py` | Model, optimizer, training loop | ✅ YES |
| `prepare.py` | Data prep, tokenizer | ❌ NO |
| `program.md` | Your instructions | Reference |
| `results.tsv` | Experiment log | ✅ YES |

---

## § 4 · What You Can Change

Everything in `train.py` is fair game:

| Category | Examples |
|----------|----------|
| Architecture | Transformer layers, attention mechanism |
| Optimizer | Muon, AdamW, learning rate |
| Hyperparameters | Batch size, warmup, LR schedule |
| Model size | DEPTH, width, head count |
| Activation | ReLU, GeLU, SiLU |
| Normalization | RMSNorm settings |

### Constraints

- ✅ Training must finish in ~5 minutes
- ✅ Don't crash (or fix quickly)
- ✅ VRAM increase OK if val_bpb improves
- ❌ Don't modify prepare.py
- ❌ Don't add new dependencies

---

## § 5 · Decision Rules

Lower val_bpb is better, so "improved" always means the value went down.

### After Each Experiment

| Result | Action |
|--------|--------|
| val_bpb **improved** | ✅ Keep the change, continue |
| val_bpb **same/worse** | ↩️ Reset, try different idea (unless the code got simpler; see the next table) |
| **Crashed** | 🔧 Easy fix → retry; Hard → skip |

### Complexity vs Improvement

| Scenario | Decision |
|----------|----------|
| 0.001 val_bpb improvement, +20 hacky lines | Skip |
| 0.001 val_bpb improvement, deleted code | Keep |
| Equal val_bpb, simpler code | Keep |

---

## § 6 · Ideas to Try

### High-Impact

| Idea | Why |
|------|-----|
| Increase learning rate | Faster convergence |
| Add LR warmup | Stable early training |
| Change to GeLU | Often works better |
| Adjust model depth | Better capacity |
| Increase batch size | Stable gradients |

### If Stuck

- Read train.py more carefully
- Try combining previous near-misses
- Try more radical changes

---

## § 7 · Important Rules

### NEVER

- ❌ Ask "Should I continue?"
- ❌ Ask "Is this a good stopping point?"
- ❌ Ask "Should I try another idea?"
- ❌ Commit results.tsv

### ALWAYS

- ✅ Run until human stops you
- ✅ Log every experiment
- ✅ Use tab-separated values

---

## § 8 · Output Format

Training output:

```
---
val_bpb: 0.997900
training_seconds: 300.1
peak_vram_mb: 45060.2
mfu_percent: 39.80
```

Extract results:

```bash
grep "^val_bpb:" run.log
grep "^peak_vram_mb:" run.log
```

---

## § 9 · Results Log

File: `results.tsv` (tab-separated)

```
commit	val_bpb	memory_gb	status	description
a1b2c3d	0.997900	44.0	keep	baseline
b2c3d4e	0.993200	44.2	keep	increase LR to 0.04
c3d4e5f	1.005000	44.0	discard	switch to GeLU
```

---

## § 10 · Commands Reference

```bash
# Setup (one-time)
uv sync && uv run prepare.py

# New experiment branch
git checkout -b autoresearch/$(date +%b%d)

# Run experiment
uv run train.py > run.log 2>&1

# Check results
grep "^val_bpb:" run.log

# View all results
cat results.tsv
```

---

## § 11 · Success

**Goal**: Get the lowest val_bpb possible.

- Each experiment: ~5 minutes
- Expected: ~12 experiments/hour
- Run until human stops you.
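The parse, decide, and log steps of the loop (§ 5, § 8, § 9) can be sketched in Python as follows. This is a minimal illustration, not part of the skill: the function names `parse_metrics`, `decide`, and `format_row` are hypothetical, the `crash` status label is an assumption (the log in § 9 only shows `keep` and `discard`), and converting `peak_vram_mb` to `memory_gb` by dividing by 1024 is inferred from the sample values.

```python
import re

# Matches summary lines such as "val_bpb: 0.997900" at the start of a line,
# mirroring the grep patterns from § 8.
METRIC_RE = re.compile(
    r"^(val_bpb|peak_vram_mb):\s*(\d+(?:\.\d+)?)\s*$", re.MULTILINE
)


def parse_metrics(log_text):
    """Return the summary metrics found in run.log as floats."""
    return {name: float(value) for name, value in METRIC_RE.findall(log_text)}


def decide(metrics, best_val_bpb):
    """Map a run to a status. Lower val_bpb is better; ties are discarded."""
    if "val_bpb" not in metrics:
        # Assumption: a run that printed no summary is treated as a crash.
        return "crash"
    return "keep" if metrics["val_bpb"] < best_val_bpb else "discard"


def format_row(commit, metrics, status, description):
    """Build one tab-separated results.tsv row in the § 9 column order."""
    val_bpb = metrics.get("val_bpb", 0.0)
    # Assumption: memory_gb is peak_vram_mb / 1024, as the sample rows suggest.
    memory_gb = metrics.get("peak_vram_mb", 0.0) / 1024
    return f"{commit}\t{val_bpb:.6f}\t{memory_gb:.1f}\t{status}\t{description}"


# The sample training output from § 8.
sample_log = """---
val_bpb: 0.997900
training_seconds: 300.1
peak_vram_mb: 45060.2
mfu_percent: 39.80
"""

metrics = parse_metrics(sample_log)
# The baseline has nothing to beat, so it is compared against infinity.
status = decide(metrics, best_val_bpb=float("inf"))
print(format_row("a1b2c3d", metrics, status, "baseline"))
```

For the sample output above, the printed row matches the baseline line of the § 9 log. A tie is discarded here for simplicity; the "equal val_bpb, simpler code" case in § 5 still needs a judgment call that this sketch does not automate.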
### § 1.2 · Decision Framework: Weighted Criteria (0-100)

| Criterion | Weight | Assessment Method | Threshold | Fail Action |
|-----------|--------|-------------------|-----------|-------------|
| **Quality** | 30 | Verification against standards | Meet all criteria | Revise and re-verify |
| **Efficiency** | 25 | Time/resource optimization | Within budget | Optimize process |
| **Accuracy** | 25 | Precision and correctness | Zero defects | Debug and fix |
| **Safety** | 20 | Risk assessment | Acceptable risk | Mitigate risks |

**Composite Decision Rule:**

- Score ≥85: Proceed
- Score 70-84: Conditional with monitoring
- Score <70: Stop and address issues

### § 1.3 · Thinking Patterns: Mental Models

| Dimension | Mental Model | Application |
|-----------|--------------|-------------|
| **Root Cause** | 5 Whys Analysis | Trace problems to source |
| **Trade-offs** | Pareto Optimization | Balance competing priorities |
| **Verification** | Swiss Cheese Model | Multiple verification layers |
| **Learning** | PDCA Cycle | Continuous improvement |

## Workflow

### Phase 1: Assessment

- Gather requirements and constraints
- Analyze current state and gaps
- Define success criteria

**Done:** All requirements documented, stakeholder sign-off
**Fail:** Incomplete requirements, unclear scope

### Phase 2: Planning

- Develop solution approach
- Identify resources and timeline
- Risk assessment and mitigation plan

**Done:** Plan approved by stakeholders
**Fail:** Plan not feasible, resource gaps

### Phase 3: Execution

- Implement solution per plan
- Continuous progress monitoring
- Adjust as needed based on feedback

**Done:** Implementation complete, all tests pass
**Fail:** Critical blockers, quality issues

### Phase 4: Review & Validation

- Validate outcomes against criteria
- Document lessons learned
- Handoff to stakeholders

**Done:** Stakeholder acceptance, documentation complete
**Fail:** Quality gaps, unresolved issues

## Domain Benchmarks

| Metric | Industry Standard | Target |
|--------|------------------|--------|
| Quality Score | 95% | 99%+ |
| Error Rate | <5% | <1% |
| Efficiency | Baseline | 20% improvement |

## Examples

### Example 1: Running an Experiment

**Input:** "Run an experiment to test if increasing learning rate improves validation loss"

**Output:** "Created experiment: increase_lr. Modified config: learning_rate: 0.04. Running training... Results: val_bpb improved from 0.998 to 0.995, memory unchanged. Status: KEEP."

**Validation:** Experiment runs, results logged

### Example 2: Hyperparameter Tuning

**Input:** "Find the optimal batch size for our model"

**Output:** "Testing batch_sizes: [256, 512, 1024, 2048]. Running sequential experiments... Results: 1024 gives best val_bpb (0.991) within 48GB memory. Status: KEEP. Recommended: batch_size=1024."

**Validation:** Complete sweep, clear winner

### Example 3: Debugging Training

**Input:** "Training diverged after epoch 5, help debug"

**Output:** "Analysis: Loss spiked from 1.2 to 45.0. Likely causes: learning rate too high, gradient explosion. Experiment: reduce LR to 0.01, add gradient clipping. Status: Training with fixes..."

**Validation:** Root cause identified, fix applied
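The composite decision rule in § 1.2 does not say how the four criteria are combined. One plausible reading, sketched below, is a weighted average of per-criterion scores on a 0-100 scale. The combination formula, the function names `composite_score` and `decision`, and the example scores are all assumptions made for illustration; only the weights and thresholds come from the table.

```python
# Weights from the § 1.2 table; they sum to 100.
WEIGHTS = {"quality": 30, "efficiency": 25, "accuracy": 25, "safety": 20}


def composite_score(scores):
    """Weighted average of per-criterion scores, each assumed to be 0-100."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


def decision(score):
    """Apply the thresholds from the composite decision rule."""
    if score >= 85:
        return "proceed"
    # The rule lists 70-84; fractional scores between 84 and 85 are
    # treated as conditional here (an assumption).
    if score >= 70:
        return "conditional with monitoring"
    return "stop and address issues"


# Hypothetical per-criterion scores, for illustration only.
example_scores = {"quality": 90, "efficiency": 80, "accuracy": 85, "safety": 70}
score = composite_score(example_scores)
print(score, decision(score))
```

With these hypothetical inputs the weighted sum is 2700 + 2000 + 2125 + 1400 = 8225, so the composite score is 82.25 and falls in the "conditional with monitoring" band.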
Related Skills
write-skill
Meta-skill for creating high-quality SKILL.md files. Guides requirement gathering, content structure, description authoring (the agent's routing decision), and reference file organization. Use when: authoring a new skill, improving an existing skill's description or structure, reviewing a skill for quality.
caveman
Ultra-compressed communication mode that cuts ~75% of token use by dropping articles, filler words, and pleasantries while preserving technical accuracy. Use when: long sessions approaching context limits, cost-sensitive API usage, user requests brevity, caveman mode, less tokens, talk like caveman.
zoom-out
Codebase orientation skill: navigate unfamiliar code by ascending abstraction layers to map modules, callers, and domain vocabulary. Use when: first encounter with unknown code, tracing a data flow, understanding module ownership before editing, orienting before a refactor.
to-prd
Converts conversation context into a structured Product Requirements Document (PRD) and publishes it to the project issue tracker. Do NOT interview the user — synthesize what is already known. Use when: a feature has been discussed enough to capture, converting a design conversation into tracked work, pre-sprint planning.
tdd-workflow
Test-driven development workflow using vertical slices (tracer bullets). Enforces behavior-first testing through public interfaces. Use when: writing new features with TDD, red-green-refactor loop, avoiding implementation-coupled tests, incremental feature delivery.
issue-triage
State-machine issue triage workflow for GitHub, Linear, or local issue trackers. Manages category labels (bug, enhancement) and state labels (needs-triage, needs-info, ready-for-agent, ready-for-human, wontfix). Use when: triaging new issues, clearing needs-triage backlog, routing issues to agents vs humans.
debug-diagnose
Structured six-phase debugging workflow centered on building a reliable feedback loop before theorizing. Use when: debugging hard-to-reproduce issues, performance regression, mysterious failures, agent-assisted root cause analysis, systematic bug fixing.
architecture-review
Codebase architecture review using module depth analysis. Surfaces shallow modules, tight coupling, and locality violations. Proposes deepening opportunities. Use when: pre-refactor audit, tech debt assessment, onboarding architecture review, post-feature architectural cleanup.
vault-secrets-expert
HashiCorp Vault expert: KV secrets, dynamic credentials, PKI, auth methods. Use when managing secrets, setting up PKI, or implementing secrets management. Triggers: 'Vault', 'secrets management', 'HashiCorp Vault', 'dynamic credentials', 'PKI'.
nmap-expert
Expert-level Nmap skill for network reconnaissance, port scanning, service detection, and security assessment. Triggers: 'Nmap', '网络扫描', '端口扫描', 'NSE脚本'. Works with: Claude Code, Codex, OpenCode, Cursor, Cline, OpenClaw, Kimi.
metasploit-expert
Expert-level Metasploit Framework skill for penetration testing, exploit development, and post-exploitation operations. Triggers: 'Metasploit', '渗透测试', '红队', '漏洞利用'. Works with: Claude Code, Codex, OpenCode, Cursor, Cline, OpenClaw, Kimi.
gerrit-permission-manager
Expert manager for Gerrit multi-repository and multi-branch permission configurations. Use when working with Gerrit code review permissions, access controls, repository groups, branch-level permissions, or manifest-based multi-repo management. Use when: gerrit, permissions, code-review, access-control, devops.