training-check
Periodically check WandB metrics during training to catch problems early (NaN, loss divergence, idle GPUs). Avoids wasting GPU hours on broken runs. Use when training is running and you want automated health checks.
Best use case
training-check is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using training-check can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/training-check/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
SKILL.md Source
# Training Check
Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.
## Context: $ARGUMENTS
## Constants
- **WANDB_RUN** - Read from project notes or pass as `entity/project/run_id`.
- **CHECK_INTERVAL** - Starts at 10 minutes, then gradually increases if consistently healthy: 10 min -> 20 min -> 30 min -> 60 min (cap).
- **REVIEWER_MODEL = `gpt-5.4`** - Used via a secondary Codex agent for ambiguous cases only.
## When to Use
- After training is confirmed running (session alive, loss decreasing for the first few steps)
- When the user wants recurring health checks during training
- **This skill checks training QUALITY, not process HEALTH.** Process health (session alive, GPU utilization) belongs to watchdog-style monitoring.
## Workflow
### Step 1: Read WandB Metrics
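```python
import wandb

# Public API client; uses the locally configured WandB credentials.
api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
# history() returns a sampled pandas DataFrame of the logged metrics.
history = run.history()
```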
If WandB is unreachable (API error, network issue), fall back to reading the log file directly via SSH:
```bash
ssh server "tail -100 /path/to/training.log"
```
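The fallback order above can be sketched as a small wrapper. This is a minimal sketch rather than the skill's actual implementation: `tail_remote_log`, `read_metrics`, and the `"unreachable"` status are hypothetical names, and the host and log path remain the placeholders used above.

```python
import subprocess


def tail_remote_log(host, log_path, lines=100):
    """Return the last `lines` lines of a remote log via SSH (hypothetical helper)."""
    result = subprocess.run(
        ["ssh", host, f"tail -n {lines} {log_path}"],
        capture_output=True, text=True, timeout=60, check=True,
    )
    return result.stdout


def read_metrics(fetch_wandb, fetch_log):
    """Try WandB first, then the log file; report 'unreachable' if both fail.

    Both arguments are zero-argument callables, so either source can be swapped out.
    """
    for source, fetch in (("wandb", fetch_wandb), ("log", fetch_log)):
        try:
            return source, fetch()
        except Exception as exc:  # API error, network issue, SSH failure, ...
            print(f"{source} unavailable: {exc}")
    # Neither source answered: a connectivity problem, not evidence of a broken run.
    return "unreachable", None
```

Passing the sources as callables keeps the fallback logic independent of how each source is actually read, for example `read_metrics(lambda: run.history(), lambda: tail_remote_log("server", "/path/to/training.log"))`.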
Check these signals:
- **Loss trend** - Is training loss decreasing over the last N steps?
- **Eval metrics** - Are evaluation metrics improving (or at least not degrading)?
- **NaN / Inf** - Any NaN or Inf values in loss or gradients?
- **Spikes** - Sudden large jumps in loss (>10x normal variance)?
- **Learning rate** - Is the schedule behaving as expected?
- **Gradient norm** - Exploding or vanishing?
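Three of the signals above (NaN / Inf, loss trend, spikes) can be computed from a plain list of loss values. The sketch below is illustrative only: `check_signals`, its `window` and `spike_factor` parameters, and the choice to compare each upward jump against the median step-to-step change are assumptions, not definitions taken from the skill. Learning rate and gradient norm are not covered.

```python
import math
from statistics import mean, median


def check_signals(losses, window=10, spike_factor=10.0):
    """Compute basic health signals from a list of loss values (hypothetical helper)."""
    # NaN / Inf: any non-finite value anywhere in the series.
    nonfinite = any(not math.isfinite(x) for x in losses)
    finite = [x for x in losses if math.isfinite(x)]

    # Trend: compare the older half of the recent window with the newer half.
    recent = finite[-window:]
    half = len(recent) // 2
    decreasing = None  # None means "not enough data to tell"
    if half >= 1:
        decreasing = mean(recent[half:]) < mean(recent[:half])

    # Spikes: upward jumps larger than spike_factor times the typical change.
    deltas = [b - a for a, b in zip(finite, finite[1:])]
    typical = median(abs(d) for d in deltas) if deltas else 0.0
    spike_steps = [
        i + 1  # index into the finite-only series
        for i, d in enumerate(deltas)
        if typical > 0 and d > spike_factor * typical
    ]
    return {"nonfinite": nonfinite, "decreasing": decreasing, "spike_steps": spike_steps}
```

A single flagged spike is not a reason to stop on its own; the result is meant to feed the judgment step rather than replace it.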
### Step 2: Judgment
| Signal | Judgment | Action |
|--------|----------|--------|
| NaN/Inf in loss | **Clearly bad** | Stop training, investigate |
| Loss diverging (increasing for >N steps) | **Clearly bad** | Stop training, investigate |
| Eval metrics significantly worse than baseline | **Clearly bad** | Stop training, investigate |
| Loss decreasing, metrics improving | **Clearly fine** | Continue, increase check interval |
| Loss flat but not diverging | **Unsure** | -> Step 3 (secondary review) |
| Metrics noisy, can't tell trend | **Unsure** | -> Step 3 (secondary review) |
| Slightly worse than baseline but still early | **Unsure** | -> Step 3 (secondary review) |
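The table above reduces to a three-way decision. A minimal sketch follows, assuming the individual signals have already been turned into booleans; the function name `judge`, its parameter names, and the returned strings are hypothetical.

```python
def judge(*, nonfinite, diverging, eval_much_worse, loss_decreasing, metrics_improving):
    """Map boolean signals to one of the three judgments (hypothetical helper)."""
    if nonfinite or diverging or eval_much_worse:
        return "clearly_bad"   # stop training, investigate
    if loss_decreasing and metrics_improving:
        return "clearly_fine"  # continue, increase check interval
    # Flat loss, noisy metrics, or slightly worse than baseline but still early.
    return "unsure"            # escalate to Step 3
```

Hard failures are checked first, so a run with NaN in the loss is never classified as fine even if other metrics look good.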
### Step 3: Secondary Codex Judgment (only when unsure)
Only escalate when the signal is ambiguous. For clearly good or clearly bad signals, act directly.
```text
spawn_agent:
  model: REVIEWER_MODEL
  reasoning_effort: high
  message: |
    TRAINING HEALTH CHECK - need your judgment on ambiguous metrics.
    Run: <entity>/<project>/<run_id>
    Current epoch/step: X / Y total
    Training loss (last 10 checkpoints): [values]
    Eval metrics (last 3 evals): [values]
    Baseline reference: [numbers from paper/reproduction]
    What I'm unsure about: [specific concern]
    Please respond with exactly one of:
    - STOP: clearly problematic, should kill training
    - CONTINUE: looks fine, check again next interval
    - WAIT: not enough data to judge, check again sooner
```
If delegation is unavailable, make a local judgment using the same rubric and mark the decision `[pending external review]`. In ambiguous cases with no hard failure, prefer `WAIT` over `STOP`.
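The reviewer's free-text reply still has to be turned into one of the three decisions. One possible sketch is shown here; `parse_decision` is a hypothetical helper, and falling back to `WAIT` when the reply names zero or several keywords is an assumption that follows the "prefer `WAIT` over `STOP`" guidance above.

```python
import re


def parse_decision(reply):
    """Extract STOP / CONTINUE / WAIT from a reviewer reply (hypothetical helper).

    Returns "WAIT" when the reply does not name exactly one decision.
    """
    found = set(re.findall(r"\b(STOP|CONTINUE|WAIT)\b", reply))
    return found.pop() if len(found) == 1 else "WAIT"
```

The match is case-sensitive on purpose, so an incidental lowercase "stop" in the explanation does not count as a decision.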
### Step 4: Act
| Decision | Action |
|----------|--------|
| **Stop** | Kill the training session. Save the WandB run URL, key metrics, and reason for stopping. Log to project notes for debugging. |
| **Continue** | Do nothing. Re-run at the next interval (increase interval if consistently healthy). |
| **Wait** | Do nothing but keep the current short interval (do not increase). |
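For the **Stop** row, the evidence can be appended to the project notes as one JSON line per stop. This is a sketch under assumptions: `record_stop`, the notes file format, and the field names are hypothetical, and the caller supplies the run URL and metrics.

```python
import json
from datetime import datetime, timezone


def record_stop(notes_path, run_url, metrics, reason):
    """Append one JSON line describing why training was stopped (hypothetical helper)."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "run_url": run_url,
        "metrics": metrics,
        "reason": reason,
    }
    # Append mode keeps earlier stop records intact.
    with open(notes_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Writing the record before killing the session means the evidence survives even if the kill step fails.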
## Integration with Watchdog
`training-check` and watchdog-style monitoring operate at different levels:
| Layer | Tool | What it checks | Frequency |
|-------|------|----------------|-----------|
| Process health | watchdog | Session alive? GPU active? | Every 60s (continuous) |
| Training quality | training-check | Loss trend? Metrics improving? | Every 10-60 min (periodic) |
Use both together:
- Watchdog catches crashes and idle GPUs immediately
- `training-check` catches subtle quality issues (loss plateau, metric degradation)
## Rules
- Do not stop training on the first sign of noise - some loss spikes are normal. Look at **trends over multiple checkpoints**.
- When stopping training, always save the WandB run URL and key metrics as evidence.
- If both WandB and log files are unreachable, report the connectivity issue and try again next interval. Do not assume training is broken.
- Gradually increase check interval when healthy (10 -> 20 -> 30 -> 60 min). Reset to 10 min after any anomaly.
- This skill is meant to be automated via a recurring scheduler. If the user wants ongoing monitoring, set up the best local mechanism available instead of waiting for manual reruns.
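The interval rule above (10 -> 20 -> 30 -> 60 min, reset to 10 min after any anomaly) can be sketched as a small function. `SCHEDULE_MIN` and `next_interval` are hypothetical names, and stepping up after every single `CONTINUE` is a simplifying assumption; the rule itself only says to increase when "consistently healthy".

```python
SCHEDULE_MIN = [10, 20, 30, 60]  # check intervals in minutes, capped at 60


def next_interval(current, decision, anomaly=False):
    """Return the next check interval in minutes (hypothetical helper)."""
    if anomaly:
        return SCHEDULE_MIN[0]  # reset to 10 min after any anomaly
    if decision == "CONTINUE":
        # Step up to the next larger interval, staying at the cap once reached.
        larger = [m for m in SCHEDULE_MIN if m > current]
        return larger[0] if larger else SCHEDULE_MIN[-1]
    return current  # WAIT: keep the current interval, do not increase
```

For example, `next_interval(20, "CONTINUE")` moves to 30 minutes, while `next_interval(20, "WAIT")` stays at 20.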
## Recurring Setup Example
```text
After training is confirmed stable:
Create a recurring job (cron, task scheduler, tmux loop, etc.)
that runs `/training-check <entity>/<project>/<run_id>` every 10 minutes.
```
As the check interval increases, update the old recurring job to match the new interval.
Related Skills
novelty-check
Verify research idea novelty against recent literature. Use when user says "查新", "novelty check", "有没有人做过", "check novelty", or wants to verify a research idea is novel before implementing.
vast-gpu
Rent, manage, and destroy GPU instances on vast.ai. Use when user says "rent gpu", "vast.ai", "rent a server", "cloud gpu", or needs on-demand GPU without owning hardware.
system-profile
Profile a target (script, process, GPU, memory, interconnect) using external tools and code instrumentation. Produces structured performance reports with actionable recommendations. Use when user says "profile", "benchmark", "bottleneck", or wants performance analysis.
serverless-modal
Run GPU workloads on Modal — training, fine-tuning, inference, batch processing. Zero-config serverless: no SSH, no Docker, auto scale-to-zero. Use when user says "modal run", "modal training", "modal inference", "deploy to modal", "need a GPU", "run on modal", "serverless GPU", or needs remote GPU compute.
semantic-scholar
Search published venue papers (IEEE, ACM, Springer, etc.) via Semantic Scholar API. Complements /arxiv (preprints) with citation counts, venue metadata, and TLDR. Use when user says "search semantic scholar", "find IEEE papers", "find journal papers", "venue papers", "citation search", or wants published literature beyond arXiv preprints.
run-experiment
Deploy and run ML experiments on local, remote, Vast.ai, or Modal serverless GPU. Use when user says "run experiment", "deploy to server", "跑实验", or needs to launch training jobs.
result-to-claim
Use when experiments complete to judge what claims the results support, what they don't, and what evidence is still missing. Codex MCP evaluates results against intended claims and routes to next action (pivot, supplement, or confirm). Use after experiments finish — before writing the paper or running ablations.
research-review
Get a deep critical review of research from GPT via Codex MCP. Use when user says "review my research", "help me review", "get external review", or wants critical feedback on research ideas, papers, or experimental results.
research-refine
Turn a vague research direction into a problem-anchored, elegant, frontier-aware, implementation-oriented method plan via iterative GPT-5.4 review. Use when the user says "refine my approach", "帮我细化方案", "decompose this problem", "打磨idea", "refine research plan", "细化研究方案", or wants a concrete research method that stays simple, focused, and top-venue ready instead of a vague or overbuilt idea.
research-refine-pipeline
Run an end-to-end workflow that chains `research-refine` and `experiment-plan`. Use when the user wants a one-shot pipeline from vague research direction to focused final proposal plus detailed experiment roadmap, or asks to "串起来", build a pipeline, do it end-to-end, or generate both the method and experiment plan together.
research-pipeline
Full research pipeline: Workflow 1 (idea discovery) → implementation → Workflow 2 (auto review loop). Goes from a broad research direction all the way to a submission-ready paper. Use when user says "全流程", "full pipeline", "从找idea到投稿", "end-to-end research", or wants the complete autonomous research lifecycle.
research-lit
Search and analyze research papers, find related work, summarize key ideas. Use when user says "find papers", "related work", "literature review", "what does this paper say", or needs to understand academic papers.