veomni-debug
Use this skill for ANY bug, error, crash, wrong output, loss divergence, gradient explosion, test failure, CUDA error, distributed training hang, checkpoint load failure, or unexpected behavior. Covers both quick fixes (clear root cause) and complex debugging (unclear cause). Trigger: 'fix bug', 'fix error', 'broken', 'crash', 'doesn't work', 'fails with', 'loss NaN', 'training hangs', 'FSDP error', 'OOM'.
Best use case
veomni-debug is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using veomni-debug can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/veomni-debug/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How veomni-debug Compares
| Feature / Agent | veomni-debug | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It gives an AI agent a structured debugging workflow for VeOmni. A Quick Path handles bugs with a clear root cause, and a five-phase Full Protocol (investigation, pattern analysis, hypothesis testing, implementation, knowledge capture) handles unclear causes, distributed training issues, and numerical problems such as loss divergence.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Cursor vs Codex for AI Workflows
Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.
SKILL.md Source
## Quick Path vs Full Protocol

| Situation | Path |
|-----------|------|
| Clear error, obvious root cause, fix in <15 min | **Quick Path** (below) |
| Root cause unclear, multiple hypotheses | **Full Protocol** (Phase 1–5) |
| Distributed training issue (hang, wrong loss, sharding) | **Full Protocol** |
| Numerical accuracy / loss divergence | **Full Protocol** |
| 2+ failed fix attempts | **Full Protocol** |

### Quick Path

1. Reproduce the error. Read the full traceback.
2. Check `.agents/knowledge/constraints.md` for known pitfalls.
3. Write a reproducer test if feasible.
4. Minimal fix: root cause only, don't touch surrounding code.
5. Verify: reproducer passes, `pytest tests/<module>/` passes, no regressions across modalities.
6. Run `/veomni-review`, `make quality`, commit.

If not resolved in 15 min → switch to Full Protocol.

---

## Full Protocol

### Before You Start

Use TodoWrite to track all phases:

```
Phase 1: Investigate <symptom> -> in_progress
Phase 2: Pattern analysis      -> pending
Phase 3: Hypothesis & test     -> pending
Phase 4: Implement fix         -> pending
Phase 5: Knowledge capture     -> pending
```

### Phase 1: Root Cause Investigation

1. Read the FULL error message / symptom. Don't skim. Extract 2-3 keywords.
2. **Check constraints first**: Read `.agents/knowledge/constraints.md`; many issues are known constraint violations.
3. Reproduce consistently. If you can't reproduce, you don't understand it.
4. `git log --oneline -10`: what changed recently?
5. Trace data flow backward through the call stack.
6. **Distributed training specifics**:
   - Check if error appears on all ranks or just rank 0.
   - FSDP2: verify sharding plan matches model structure (`veomni/distributed/parallel_plan.py`).
   - Sequence parallel: check that attention inputs are properly split/gathered (`veomni/distributed/sequence_parallel/`).
   - MoE: verify expert routing and load balancing (`veomni/distributed/moe/`).

### Phase 2: Pattern Analysis

1. Find a **working** example (previous commit, different config, reference implementation).
2. Compare **completely**: diff line by line, not skim. Include config YAML, environment vars, and launcher scripts.
3. Identify ALL differences between working and broken code.
4. Check dependencies: different transformers version? Different PyTorch version?
5. **If a package version upgrade is suspected**, create isolated uv environments to bisect:

   ```bash
   # Create a separate env with the old version
   uv venv .venv-old
   VIRTUAL_ENV=.venv-old uv sync --extra gpu --dev
   VIRTUAL_ENV=.venv-old uv pip install transformers==4.57.3  # or whichever old version

   # Create a separate env with the new version
   uv venv .venv-new
   VIRTUAL_ENV=.venv-new uv sync --no-group transformers-stable --extra transformers5-exp --extra gpu --dev
   ```

   Run the same reproducer in both envs to confirm the version is the root cause. This avoids polluting the main `.venv/`.

### Phase 3: Hypothesis and Testing

1. Form ONE specific, falsifiable hypothesis.
2. Design a MINIMAL experiment (change one thing only).
3. Run the experiment. Record the result.
4. If wrong, update understanding and form new hypothesis. No random guess-and-check.

**Red flags (STOP and restart from Phase 1):**

- "Let me just try changing X and see what happens"
- "Quick fix for now, clean up later"
- "It probably works, let me move on"

**Verification gate**: before acting on a conclusion, check:

- Does the evidence actually support this cause, or just correlate?
- Could a different root cause produce the same symptoms?
- What observation would disprove this hypothesis? Have you looked for it?
- If confidence < 80% or the evidence is ambiguous, launch a verification subagent (see Appendix).

### Phase 4: Implementation

1. Write a failing test that demonstrates the bug (if feasible).
2. Implement a SINGLE targeted fix addressing the root cause.
3. Verify: test passes, training runs correctly, no regressions.
4. Check for collateral: did the fix break other modalities or trainers?
5. Before committing: run `/veomni-review` skill.

### Phase 5: Knowledge Capture (mandatory)

**Do this immediately after the fix is verified.** Knowledge decays fast.

- [ ] **New hard constraint?** → add to `.agents/knowledge/constraints.md`
- [ ] **Architecture insight?** → add to `.agents/knowledge/architecture.md`
- [ ] **New test needed?** → add to `tests/` for regression prevention
- [ ] **Docs outdated?** → update `docs/` if the fix changes API behavior, config semantics, or usage patterns

If none apply, explicitly note "no new knowledge to capture."

---

## Three-Strike Rule

If 3 consecutive fix attempts fail:

- **STOP fixing symptoms.**
- Question whether the underlying approach/architecture is wrong.
- Step back and re-examine: are you solving the right problem?
- Report to user with analysis before continuing.

## Common Pitfalls

- **FSDP2 + gradient accumulation**: gradients must be accumulated in the unsharded space; accumulating sharded gradients produces wrong results.
- **DCP checkpoint format**: model state dict keys must match exactly between save and load; renamed parameters break checkpoint loading silently.
- **Multi-modality data collators**: text-only collators crash on multimodal data and vice versa; always check `data_collator` type matches the dataset.
- **Sequence parallel**: attention outputs must be gathered before loss computation; partial outputs produce incorrect loss values.
- **Patchgen**: model patches in `veomni/models/transformers/*/` are auto-generated; editing generated files directly will be overwritten.

## Domain-Specific Checklists

Include the relevant checklist when investigating.

### Distributed Training Correctness

- [ ] Is the loss identical (within tolerance) between 1-GPU and multi-GPU runs?
- [ ] Are ALL model parameters sharded correctly? (check parallel_plan)
- [ ] Is gradient clipping applied in the correct coordinate space?
- [ ] For sequence parallel: are attention masks split consistently across ranks?
- [ ] For MoE: are expert assignments deterministic across runs with the same seed?

### Numerical Correctness

- [ ] Is there a reference implementation showing the SAME numbers?
- [ ] Are ALL weights loaded? (check logs for missing/unexpected keys)
- [ ] Is the comparison fair? (same inputs, same dtype, same parallelism)
- [ ] Could there be a dtype mismatch? (float32 vs bfloat16 in computation)
- [ ] Are there NaN/Inf values being silently masked or replaced?

## Appendix: Verification Subagent

When confidence is low or evidence is ambiguous, launch a subagent to challenge your conclusion:

```
You are a critical reviewer. Your job is to find flaws in the following conclusion.

## Conclusion Under Review
<the specific claim or decision>

## Evidence Presented
<the data, logs, experiments supporting the conclusion>

## Your Task
1. Does the evidence actually support the conclusion, or just correlate?
2. Generate 2+ alternative explanations consistent with the same evidence.
3. What specific observation would DISPROVE this conclusion? Has it been checked?
4. Was the experiment controlled (one variable changed at a time)?

## Output
Verdict: CONFIRMED / CHALLENGED / INSUFFICIENT_EVIDENCE
Findings: [issues found, counter-hypotheses, missing evidence]
```
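
Two of the checks above are easy to script: the DCP pitfall (state dict keys must match exactly between save and load) and the checklist item asking whether 1-GPU and multi-GPU losses agree within tolerance. The block below is a minimal sketch using only the Python standard library. It is not part of VeOmni: the helper names `diff_state_dict_keys` and `first_loss_divergence`, the example parameter names, and the default tolerances are all hypothetical and should be adapted to the run being debugged.

```python
import math
from typing import Dict, Iterable, List, Optional, Sequence


def diff_state_dict_keys(saved: Iterable[str], expected: Iterable[str]) -> Dict[str, List[str]]:
    """Report keys that appear on only one side of a checkpoint save/load pair.

    Hypothetical helper: pass the key names of the saved checkpoint and of the
    model's current state dict, however those are obtained in your setup.
    """
    saved_set, expected_set = set(saved), set(expected)
    return {
        # Keys the model expects but the checkpoint does not contain.
        "missing": sorted(expected_set - saved_set),
        # Keys the checkpoint contains but the model does not expect.
        "unexpected": sorted(saved_set - expected_set),
    }


def first_loss_divergence(
    reference: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 1e-3,   # assumed tolerance, tune per dtype and model
    abs_tol: float = 1e-6,   # assumed tolerance, tune per dtype and model
) -> Optional[int]:
    """Return the first step where two loss curves disagree, or None if they match.

    A NaN or Inf on either side counts as a divergence at that step, so
    non-finite losses are surfaced rather than silently compared.
    """
    if len(reference) != len(candidate):
        raise ValueError("loss curves must cover the same number of steps")
    for step, (ref, cand) in enumerate(zip(reference, candidate)):
        if not (math.isfinite(ref) and math.isfinite(cand)):
            return step
        if not math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol):
            return step
    return None


if __name__ == "__main__":
    # Illustrative parameter names only; a renamed projection shows up on both sides.
    print(diff_state_dict_keys(
        saved=["model.embed.weight", "model.layers.0.attn.q_proj.weight"],
        expected=["model.embed.weight", "model.layers.0.attn.wq.weight"],
    ))
    # Illustrative losses from a 1-GPU run and a multi-GPU run of the same config.
    print(first_loss_divergence([2.31, 2.10, 1.95], [2.31, 2.10, 2.40]))
```

A non-empty `missing` or `unexpected` list points at a renamed parameter before the load fails silently, and the step index from `first_loss_divergence` narrows down where to start tracing in Phase 1.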
Related Skills
veomni-uv-update
Use this skill when updating dependencies managed by uv: bumping a package version, upgrading the uv tool itself, updating torch/CUDA stack, switching transformers version, or regenerating the lockfile. Trigger: 'update dependency', 'bump version', 'upgrade uv', 'update torch', 'update lockfile', 'uv sync fails'.
veomni-review
Use this skill before committing ANY code change — this is a mandatory gate in the commit flow. Also trigger proactively when: you've made changes across multiple files and want to check consistency, you're unsure if a fix is safe, a change touches shared infrastructure (BaseTrainer, distributed, model loading, data pipeline), or a change is larger than a few lines. The review launches a subagent that checks implementation quality, multi-file consistency, and known constraint violations, then rates the change as safe/needs-attention/risky.
veomni-new-op
Use this skill when adding a new optimized kernel or operator to veomni/ops/. Covers the full lifecycle: understanding VeOmni's ops architecture (monkey-patch + global function pointer pattern), implementing the kernel, registering it, adding tests, and documenting it. Trigger: 'add op', 'new kernel', 'add attention variant', 'new fused op', 'add triton kernel', 'optimize operator'.
veomni-new-model
Use this skill when adding support for a new model to VeOmni. Covers the full lifecycle: analyzing the HuggingFace model, creating model patches, defining parallel plans, writing configs, integrating with the trainer, and testing. Trigger: 'add model', 'support new model', 'integrate <model_name>', 'new model support'.
veomni-develop
VeOmni-specific checklist for feature development and refactoring. Covers impact analysis across modalities, trainer hierarchy, data pipeline, and distributed code. Use before implementing any non-trivial change. For model-specific or ops-specific work, use veomni-new-model or veomni-new-op instead. Trigger: 'add feature', 'implement', 'refactor', 'reorganize', 'new capability'.
create-pr
Create a pull request for the current branch. Handles uncommitted changes, generates a PR title matching the `[{modules}] {type}: {description}` format enforced by CI, and fills in the PR description template. Trigger: 'create pr', 'open pr', 'submit pr', 'make pr'.
debugging-streamlit
Debug Streamlit frontend and backend changes using make debug with hot-reload. Use when testing code changes, investigating bugs, checking UI behavior, or needing screenshots of the running app.
ios-debugger-agent
Debug the current iOS project on a booted simulator with XcodeBuildMCP.
error-diagnostics-smart-debug
Use when working with error diagnostics smart debug
error-debugging-multi-agent-review
Use when working with error debugging multi agent review
error-debugging-error-trace
You are an error tracking and observability expert specializing in implementing comprehensive error monitoring solutions. Set up error tracking systems, configure alerts, implement structured logging, and ensure teams can quickly identify and resolve production issues.
error-debugging-error-analysis
You are an expert error analysis specialist with deep expertise in debugging distributed systems, analyzing production incidents, and implementing comprehensive observability solutions.