troubleshooting

Common ComfyUI errors and fixes — OOM, missing nodes, dtype mismatches, black images, and debugging strategies

Best use case

troubleshooting is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using troubleshooting should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/troubleshooting/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/development/troubleshooting/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/troubleshooting/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How troubleshooting Compares

| Feature / Agent | troubleshooting | Standard Approach |
|-----------------|-----------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

SKILL.md Source

# ComfyUI Troubleshooting Guide

## Error Diagnosis Strategy

When a workflow fails, follow this systematic approach:

1. **Get the error**: Use `get_history` to retrieve the execution result with full traceback
2. **Check logs**: Use `get_logs` with keyword filters like `"error"`, `"warning"`, `"traceback"`
3. **Identify the failing node**: The history response includes the `node_id` and `node_type` that failed
4. **Cross-reference inputs**: Use `get_node_info` to verify the failing node's expected input schema
5. **Check models**: Use `list_local_models` to verify all referenced model files exist

## Out of Memory (OOM)

### Error Pattern

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X MiB.
GPU 0 has a total capacity of 24.00 GiB of which X MiB is free.
```

Or:

```
RuntimeError: CUDA error: out of memory
```

### Root Cause

The GPU does not have enough VRAM to hold the model weights, intermediate tensors, and latent images simultaneously. Common triggers:
- High resolution images (2048x2048+)
- Multiple models loaded simultaneously
- FP32 precision models on limited VRAM
- Video generation (LTXV, AnimateDiff) with many frames
- Large batch sizes

### Fixes (in order of preference)

1. **Reduce resolution**: Drop to the model's native resolution (512 for SD 1.5, 1024 for SDXL/Flux)
2. **Use lower-precision models (FP16 or FP8)**: FP8 Flux models need ~12GB for weights vs ~24GB for FP16
   - Search for FP8 variants: `search_models("flux fp8")` or `search_models("sdxl fp8")`
3. **Use `--lowvram` flag**: ComfyUI CLI flag that offloads model parts to CPU during inference
4. **Free VRAM between generations**: ComfyUI normally manages VRAM automatically, but if memory has leaked, restarting the server clears it
5. **Use tiled VAE decoding**: For high-resolution images, tile the VAE decode step
   - Node: `VAEDecodeTiled` instead of `VAEDecode`
   - Breaks the image into tiles, decodes each separately, and stitches them together
6. **Reduce batch size**: Set batch_size to 1 in `EmptyLatentImage`
7. **Avoid multiple models**: Don't load two full checkpoints simultaneously — use one checkpoint and LoRAs instead
8. **For LTXV/video**: Always use FP8 quantized video models on 24GB cards

### VRAM Estimates

| Model | FP32 | FP16 | FP8 |
|-------|------|------|-----|
| SD 1.5 | ~4GB | ~2GB | ~1GB |
| SDXL | ~12GB | ~6GB | ~3GB |
| Flux Dev | ~48GB | ~24GB | ~12GB |
| Flux Schnell | ~48GB | ~24GB | ~12GB |
| LTXV | ~20GB+ | ~10GB+ | ~6GB |
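
The Flux rows in the table are consistent with a simple rule of thumb: weight memory is roughly the parameter count times the bytes per parameter. The sketch below shows that arithmetic, assuming a parameter count of about 12 billion for Flux Dev. It covers weights only; activations, latents, and text encoders add to the total.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    # 1e9 parameters * N bytes each = N gigabytes per billion parameters.
    return float(params_billions * BYTES_PER_PARAM[precision])


# Assumed parameter count (in billions) for illustration.
for precision in ("fp32", "fp16", "fp8"):
    print(precision, weight_memory_gb(12, precision))
```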

## Device Mismatch

### Error Pattern

```
RuntimeError: Expected all tensors to be on the same device, but found at least
two devices, cuda:0 and cpu!
```

### Root Cause

A tensor on the CPU is being combined with a tensor on the GPU. This usually happens when:
- A custom node doesn't properly move tensors to the correct device
- Model offloading placed parts of the model on CPU
- A node produces CPU tensors while downstream expects GPU tensors

### Fixes

1. Check whether the error occurs with a specific custom node; if so, update it to a version that fixes device handling, or replace it
2. If running with `--lowvram` or `--cpu`, keep in mind that some custom nodes do not support offloading; test without the flag if VRAM allows
3. Restart ComfyUI to reset device state

## Missing Nodes

### Error Pattern

```
Cannot find node class 'NodeClassName'
```

Or in the execution response:
```
"error": {"type": "node_not_found", "message": "Cannot find node class 'X'"}
```

### Root Cause

The workflow references a node type that is not installed. This happens when:
- A custom node pack is not installed
- A custom node pack is installed but failed to load (import error)
- The node was renamed or removed in a pack update

### Fixes

1. **Search for the node pack**:
   ```
   search_custom_nodes("NodeClassName")
   ```
2. **Install via ComfyUI Manager** or the registry
3. **Check logs for import errors**:
   ```
   get_logs(keyword="import")
   get_logs(keyword="error")
   ```
   Import errors often reveal missing Python dependencies
4. **Install missing Python dependencies**: If the custom node requires a pip package:
   ```bash
   pip install missing-package
   ```
5. **Restart ComfyUI** after installing any custom node — nodes are loaded at startup
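
Missing node classes can also be caught before a workflow is queued. This is a minimal sketch, assuming an API-format workflow dict and a set of node class names known to the server (for example, gathered through `get_node_info`). The function name and the `MyCustomUpscaler` class are hypothetical.

```python
def find_missing_nodes(workflow: dict, available: set) -> dict:
    """Map node_id -> class_type for every node whose class is not available."""
    return {
        node_id: node["class_type"]
        for node_id, node in workflow.items()
        if node["class_type"] not in available
    }


# Hypothetical inputs for illustration.
available = {"CheckpointLoaderSimple", "KSampler", "VAEDecode"}
workflow = {
    "4": {"class_type": "CheckpointLoaderSimple", "inputs": {}},
    "9": {"class_type": "MyCustomUpscaler", "inputs": {}},
}
missing = find_missing_nodes(workflow, available)
print(missing)
```

Each class name in the result is a candidate argument for `search_custom_nodes`.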

## NaN Tensor Errors

### Error Pattern

```
RuntimeError: Input contains NaN
```

Or images come out as solid gray/noise with NaN warnings in logs.

### Root Cause

Numerical instability during the diffusion process. Common triggers:
- **CFG scale too high**: Values above 15-20 can cause numerical overflow
- **Corrupted model weights**: Damaged download or incompatible merge
- **FP16 overflow**: Some operations overflow at half precision
- **Incompatible LoRA**: A LoRA trained for a different base model

### Fixes

1. **Lower CFG**: Try CFG 7.0 for SD 1.5/SDXL, 1.0 for Flux
2. **Use FP32 VAE**: Some VAEs produce NaN in FP16. Switch to `vae-ft-mse-840000-ema-pruned.safetensors` (FP32)
3. **Remove LoRAs**: Test without LoRAs to isolate the cause
4. **Re-download the model**: Hash verification can detect corrupted files
5. **Check LoRA compatibility**: Ensure the LoRA matches the base model family

## Dtype Mismatches

### Error Pattern

```
RuntimeError: expected scalar type Float but found Half
```

Or:

```
RuntimeError: expected scalar type Half but found Float
```

Or:

```
RuntimeError: Input type (float) and bias type (c10::Half) should be the same
```

### Root Cause

A model component expects one precision (FP32/FP16) but receives another. Most common with:
- VAE precision mismatch (FP16 model + FP32 VAE or vice versa)
- Mixed-precision LoRAs
- Custom nodes that force a specific dtype

### Fixes

1. **Use a separate VAE**: Load an explicit FP32 VAE instead of the checkpoint's built-in VAE
   - Node: `VAELoader` with `vae-ft-mse-840000-ema-pruned.safetensors`
2. **Match precision**: If the model is FP16, use FP16-compatible nodes throughout
3. **Force FP32 VAE decode**: Some node packs offer `VAEDecodeFP32` nodes
4. **Check ComfyUI settings**: `--force-fp32` flag forces everything to FP32 (uses more VRAM)

## CLIP Token Overflow

### Error Pattern

No explicit error — the prompt is silently truncated at 77 tokens, and details mentioned late in the prompt are ignored.

### Symptoms

- Later parts of long prompts have no effect on the image
- Adding more descriptive text doesn't change the output
- Removing early tokens suddenly makes later tokens work

### Fixes

1. **Use BREAK token**: Split the prompt at natural boundaries:
   ```
   subject description, pose, clothing, setting
   BREAK
   lighting, style, quality, camera angle
   ```
2. **Use CLIPTextEncodeSDXL**: For SDXL, this node exposes separate `text_g` and `text_l` inputs, one per text encoder, so a long prompt can be split across the two
3. **Prioritize important tokens**: Put the most important descriptors first
4. **Use fewer filler words**: Remove articles and prepositions where possible
5. **Use embeddings**: Condense complex concepts into single tokens with textual inversions

## Black Images

### Error Pattern

No error in the execution — the workflow "succeeds" but produces completely black or near-black images.

### Root Causes and Fixes

| Cause | Diagnosis | Fix |
|-------|-----------|-----|
| `denoise = 0` | Check KSampler inputs | Set denoise to 1.0 for txt2img, 0.5-0.8 for img2img |
| `cfg = 0` | Check KSampler inputs | Set CFG to 7.0 (SD 1.5), 1.0 (Flux) |
| `steps = 0` | Check KSampler inputs | Set steps to 20+ (standard) or 4+ (turbo) |
| Wrong VAE | VAE doesn't match model | Use the correct VAE for the model family |
| Empty prompt | CLIPTextEncode has empty text | Add a text prompt |
| Wrong scheduler | Incompatible scheduler/sampler combo | Try `"normal"` scheduler with `"euler"` sampler |
| Seed collision | Extremely rare | Change the seed value |
| FP16 VAE overflow | VAE decode produces black | Use FP32 VAE or VAEDecodeTiled |

### Quick Diagnostic Checklist

1. Check `denoise` > 0 (should be 1.0 for txt2img)
2. Check `cfg` > 0 (should be 7.0 for SD 1.5, 1.0 for Flux)
3. Check `steps` > 0 (should be 20 for standard, 4 for turbo)
4. Verify the positive prompt is not empty
5. Try a different seed
6. Try a known-working sampler/scheduler combo: `euler` + `normal`
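
The first four checklist items can be automated against an API-format workflow. The sketch below assumes the standard `KSampler` input names (`denoise`, `cfg`, `steps`, `positive`) and a `CLIPTextEncode` node with a `text` input; the function name is illustrative.

```python
def check_black_image_causes(workflow: dict) -> list:
    """Return human-readable problems for settings that commonly yield black images."""
    problems = []
    for node_id, node in workflow.items():
        if node.get("class_type") != "KSampler":
            continue
        inputs = node.get("inputs", {})
        for name in ("denoise", "cfg", "steps"):
            value = inputs.get(name)
            # Linked inputs look like ["5", 0]; only literal numbers are checked.
            if isinstance(value, (int, float)) and value <= 0:
                problems.append(f"node {node_id}: {name} is {value}, expected > 0")
        positive = inputs.get("positive")
        if isinstance(positive, list) and positive:
            source_id = str(positive[0])
            source = workflow.get(source_id, {})
            text = source.get("inputs", {}).get("text")
            if source.get("class_type") == "CLIPTextEncode" and isinstance(text, str):
                if not text.strip():
                    problems.append(
                        f"node {node_id}: positive prompt (node {source_id}) is empty"
                    )
    return problems


# Hypothetical workflow fragment with two deliberate mistakes.
workflow = {
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "steps": 20, "cfg": 7.0, "denoise": 0.0,
            "sampler_name": "euler", "scheduler": "normal",
            "positive": ["6", 0],
        },
    },
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": "  ", "clip": ["4", 1]}},
}
problems = check_black_image_causes(workflow)
for line in problems:
    print(line)
```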

## Connection Type Errors

### Error Pattern

```
Output type 'IMAGE' doesn't match input type 'LATENT'
```

Or:

```
Required input 'model' of type 'MODEL' but got connection of type 'CLIP'
```

### Root Cause

An output slot is connected to an input that expects a different type. This is often caused by using the wrong output index.

### Fixes

1. **Check output indices**: Use `get_node_info` to verify the exact output order
   - `CheckpointLoaderSimple` outputs: 0=MODEL, 1=CLIP, 2=VAE
   - A common mistake is an off-by-one index: `["1", 0]` is MODEL and `["1", 1]` is CLIP, so wiring `["1", 0]` into a `clip` input fails
2. **Verify connection format**: `["nodeId", outputIndex]` — node ID is a string, index is an integer
3. **Check data type flow**: Ensure the pipeline follows the correct type chain:
   ```
   MODEL → KSampler
   CLIP → CLIPTextEncode → CONDITIONING → KSampler
   LATENT → KSampler → LATENT → VAEDecode → IMAGE
   VAE → VAEDecode, VAEEncode
   ```
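
The type chain above can be checked mechanically. The following sketch uses small hand-written type tables for a few core nodes; in practice the output order and input types should come from `get_node_info`. The table contents and function name here are assumptions for illustration.

```python
# Hand-written subset for illustration; fetch real schemas with get_node_info.
OUTPUT_TYPES = {
    "CheckpointLoaderSimple": ["MODEL", "CLIP", "VAE"],
    "CLIPTextEncode": ["CONDITIONING"],
    "EmptyLatentImage": ["LATENT"],
    "KSampler": ["LATENT"],
    "VAEDecode": ["IMAGE"],
}
INPUT_TYPES = {
    "CLIPTextEncode": {"clip": "CLIP"},
    "KSampler": {
        "model": "MODEL", "positive": "CONDITIONING",
        "negative": "CONDITIONING", "latent_image": "LATENT",
    },
    "VAEDecode": {"samples": "LATENT", "vae": "VAE"},
}


def find_type_errors(workflow: dict) -> list:
    """Report connections whose source output type differs from the expected input type."""
    errors = []
    for node_id, node in workflow.items():
        expected = INPUT_TYPES.get(node["class_type"], {})
        for name, value in node.get("inputs", {}).items():
            # Connections are ["nodeId", outputIndex]; skip literals and unknown inputs.
            if name not in expected or not (isinstance(value, list) and len(value) == 2):
                continue
            source = workflow.get(str(value[0]))
            if source is None or source["class_type"] not in OUTPUT_TYPES:
                continue
            outputs = OUTPUT_TYPES[source["class_type"]]
            index = value[1]
            actual = outputs[index] if 0 <= index < len(outputs) else None
            if actual != expected[name]:
                errors.append(
                    f"node {node_id} input '{name}': expected {expected[name]}, got {actual}"
                )
    return errors


workflow = {
    "1": {"class_type": "CheckpointLoaderSimple", "inputs": {"ckpt_name": "model.safetensors"}},
    # Index 0 is MODEL, but the clip input needs index 1 (CLIP).
    "2": {"class_type": "CLIPTextEncode", "inputs": {"text": "a cat", "clip": ["1", 0]}},
}
errors = find_type_errors(workflow)
print(errors)
```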

## Model Loading Errors

### Error Pattern

```
FileNotFoundError: [Errno 2] No such file or directory: 'models/checkpoints/model.safetensors'
```

Or:

```
SafetensorError: Error reading file: invalid header
```

Or:

```
RuntimeError: PytorchStreamReader failed reading zip archive
```

### Root Causes

- **File not found**: Model file doesn't exist at the referenced path
- **Corrupted download**: Incomplete or damaged file
- **Wrong format**: File is not a valid safetensors/pickle/checkpoint format

### Fixes

1. **Verify the model exists**: `list_local_models(model_type="checkpoints")`
2. **Check the exact filename**: Model names in workflows must match the filename exactly (case-sensitive)
3. **Re-download**: If hash mismatch or corruption:
   ```
   download_model(url="...", target_subfolder="checkpoints")
   ```
4. **Check file size**: A 1KB safetensors file is clearly corrupted — re-download
5. **Verify subfolder**: Models must be in the correct subfolder (`checkpoints/`, `loras/`, `vae/`, etc.)
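
A quick structural check can separate a corrupted or mislabeled download from a genuine model before loading it. A safetensors file begins with an 8-byte little-endian header length followed by a JSON header of that size. The sketch below validates only that prefix, not the tensor data, and the function name is illustrative.

```python
import json
import struct
import tempfile
from pathlib import Path


def check_safetensors_header(path) -> str:
    """Return "ok", or a short reason the file cannot be valid safetensors."""
    path = Path(path)
    size = path.stat().st_size
    if size < 8:
        return "file is smaller than the 8-byte length prefix"
    with path.open("rb") as handle:
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", handle.read(8))
        if header_len > size - 8:
            return "declared header length exceeds file size"
        raw_header = handle.read(header_len)
    try:
        header = json.loads(raw_header)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return "header is not valid JSON"
    if not isinstance(header, dict):
        return "header is not a JSON object"
    return "ok"


# Demonstration with two throwaway files: one well-formed, one an HTML error page.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.safetensors"
    body = json.dumps({"__metadata__": {}}).encode("utf-8")
    good.write_bytes(struct.pack("<Q", len(body)) + body)

    bad = Path(tmp) / "bad.safetensors"
    bad.write_bytes(b"<html>404 Not Found</html>")

    good_result = check_safetensors_header(good)
    bad_result = check_safetensors_header(bad)

print(good_result)
print(bad_result)
```

A saved HTML error page is a typical cause of the tiny "model" files mentioned in item 4, and it fails this check at the length prefix.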

## Torch / CUDA Version Errors

### Error Pattern

```
RuntimeError: CUDA error: no kernel image is available for execution on the device
```

Or:

```
ImportError: cannot import name 'xxx' from 'torch'
```

Or:

```
AssertionError: Torch not compiled with CUDA enabled
```

### Root Cause

PyTorch and CUDA version incompatibility, usually after:
- Updating PyTorch without matching CUDA toolkit
- Installing a custom node that downgrades/changes PyTorch
- Running a `pip install` that pulls in a CPU-only PyTorch build

### Fixes

1. **Check current versions**:
   ```
   get_system_stats()  # Shows PyTorch version and CUDA version
   ```
2. **Verify CUDA availability**: In Python: `torch.cuda.is_available()`
3. **Reinstall PyTorch with CUDA**: Visit pytorch.org for the correct install command matching your CUDA version
4. **Pin PyTorch version**: After fixing, avoid running `pip install` commands that might change PyTorch
5. **Use ComfyUI's bundled venv**: ComfyUI Desktop ships with a pre-configured Python environment

## ComfyUI Desktop vs CLI Differences

### Key Differences

| Aspect | ComfyUI Desktop | ComfyUI CLI |
|--------|----------------|-------------|
| Default port | 8000 | 8188 |
| Python | Embedded (bundled) | System/venv Python |
| Install location | `AppData/Local/Programs/ComfyUI/` | Wherever you cloned it |
| Custom nodes | `Documents/ComfyUI/custom_nodes/` | `./custom_nodes/` in repo |
| Models | `Documents/ComfyUI/models/` | `./models/` in repo |
| Config | `extra_model_paths.yaml` for shared paths | Same |
| Updates | Auto-updater in the app | `git pull` |

### Common Issues

- **Wrong port**: MCP tools default to 8188 — if using Desktop, configure for port 8000
- **Path confusion**: Desktop separates user data from application files
- **Custom node pip installs**: Desktop's embedded Python may not be on PATH — install within the venv
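
When it is unclear which variant is running, probing both default ports settles the question. The sketch below only tests whether something accepts TCP connections on each port; it does not confirm that the listener is ComfyUI. The function name, host, and timeout are assumptions.

```python
import socket


def find_open_port(host: str = "127.0.0.1", ports=(8188, 8000), timeout: float = 0.5):
    """Return the first port in `ports` that accepts a TCP connection, else None."""
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return port
        except OSError:  # refused, timed out, or unreachable
            continue
    return None


port = find_open_port()
if port is None:
    print("No listener on 8188 or 8000; ComfyUI may not be running")
else:
    print(f"Something is listening on port {port}")
```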

## Error-Specific Debugging Commands

### Workflow Failed — Get Details

```
get_history()                           # Most recent execution
get_history(prompt_id="abc-123")        # Specific execution
```

The response includes:
- `status.status_str`: "success" or "error"
- `status.messages`: Timestamped execution messages
- `outputs`: Node outputs (images, etc.)
- Error traceback for failed nodes
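
Pulling the failing node out of a history entry might look like the sketch below. It assumes that `status.messages` is a list of `[event_name, data]` pairs and that a failed run includes an `execution_error` event carrying `node_id`, `node_type`, and `exception_message`. The exact field names may differ between ComfyUI versions, so treat them as assumptions and check a real response.

```python
def summarize_failure(entry: dict):
    """Return failing-node details from a history entry, or None if the run succeeded."""
    status = entry.get("status", {})
    if status.get("status_str") != "error":
        return None
    for event, data in status.get("messages", []):
        if event == "execution_error":  # assumed event name
            return {
                "node_id": data.get("node_id"),
                "node_type": data.get("node_type"),
                "message": data.get("exception_message"),
            }
    return {"node_id": None, "node_type": None, "message": "no execution_error event found"}


# Hypothetical history entry for illustration.
entry = {
    "status": {
        "status_str": "error",
        "messages": [
            ["execution_start", {"prompt_id": "abc-123"}],
            ["execution_error", {
                "node_id": "3",
                "node_type": "KSampler",
                "exception_message": "CUDA out of memory",
            }],
        ],
    },
    "outputs": {},
}
summary = summarize_failure(entry)
print(summary)
```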

### Check Server Health

```
get_system_stats()    # GPU info, VRAM, Python/PyTorch versions
get_queue()           # Running and pending jobs
get_logs(max_lines=50, keyword="error")  # Recent error logs
```

### Verify Node Availability

```
get_node_info(node_type="KSampler")              # Check specific node
get_node_info(node_type="ControlNetApply")        # Check another node; substitute a custom node class to verify it loaded
```

### Verify Models

```
list_local_models(model_type="checkpoints")       # Installed checkpoints
list_local_models(model_type="loras")             # Installed LoRAs
list_local_models(model_type="controlnet")        # Installed ControlNets
```

## Quick Reference: Error to Fix

| Error Message (partial) | Most Likely Fix |
|--------------------------|----------------|
| `CUDA out of memory` | Reduce resolution, use FP8 model, `--lowvram` |
| `Expected all tensors to be on the same device` | Update custom node, restart ComfyUI |
| `Cannot find node class` | Install the node pack, restart ComfyUI |
| `Input contains NaN` | Lower CFG, use FP32 VAE, remove LoRAs |
| `expected scalar type Float but found Half` | Use FP32 VAE, or `--force-fp32` |
| `No such file or directory` (model) | Check filename, re-download model |
| `invalid header` (safetensors) | Re-download — file is corrupted |
| `CUDA error: no kernel image` | Reinstall PyTorch with matching CUDA version |
| Black images, no error | Check denoise > 0, cfg > 0, steps > 0, prompt not empty |
| Image looks garbled/noisy | Wrong model+VAE combo, wrong sampler settings |
| `Connection refused` on port 8188 | ComfyUI not running, or using Desktop (port 8000) |
| `Prompt outputs failed validation` | Node inputs don't match schema — check `get_node_info` |
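
The table lends itself to a simple substring lookup over a traceback or log line. The sketch below encodes a subset of the rows; the function name is illustrative, and the patterns are matched case-insensitively in the order listed.

```python
# Subset of the quick-reference table: (substring of error message, suggested fix).
QUICK_FIXES = [
    ("CUDA out of memory", "Reduce resolution, use an FP8 model, or pass --lowvram"),
    ("Expected all tensors to be on the same device", "Update the custom node, restart ComfyUI"),
    ("Cannot find node class", "Install the node pack, restart ComfyUI"),
    ("Input contains NaN", "Lower CFG, use an FP32 VAE, remove LoRAs"),
    ("expected scalar type", "Use an FP32 VAE, or pass --force-fp32"),
    ("invalid header", "Re-download the file; it is corrupted"),
    ("no kernel image", "Reinstall PyTorch with a matching CUDA version"),
    ("Prompt outputs failed validation", "Compare node inputs with get_node_info"),
]


def suggest_fix(error_text: str):
    """Return the first matching fix for an error message, or None if nothing matches."""
    lowered = error_text.lower()
    for pattern, fix in QUICK_FIXES:
        if pattern.lower() in lowered:
            return fix
    return None


message = "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20 MiB."
print(suggest_fix(message))
```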
