debug-cuda-crash

Tutorial for debugging CUDA crashes using API logging

25 stars

Best use case

debug-cuda-crash is best used when you need a repeatable AI agent workflow instead of a one-off prompt.


Teams using debug-cuda-crash should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/debug-cuda-crash/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/aiskillstore/marketplace/flashinfer-ai/debug-cuda-crash/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/debug-cuda-crash/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How debug-cuda-crash Compares

Feature / Agent          | debug-cuda-crash | Standard Approach
Platform Support         | Not specified    | Limited / Varies
Context Awareness        | High             | Baseline
Installation Complexity  | Unknown          | N/A

Frequently Asked Questions

What does this skill do?

Tutorial for debugging CUDA crashes using API logging

Where can I find the source code?

The source code lives in the ComeOnOliver/skillshub repository on GitHub (the same repository referenced by the installation command above).

SKILL.md Source

# Tutorial: Debugging CUDA Crashes with API Logging

This tutorial shows you how to debug CUDA crashes and errors in FlashInfer using the `@flashinfer_api` logging decorator.

## Goal

When your code crashes with CUDA errors (illegal memory access, out-of-bounds, NaN/Inf), use API logging to:
- Capture input tensors BEFORE the crash occurs
- Understand what data caused the problem
- Track tensor shapes, dtypes, and values through your pipeline
- Detect numerical issues (NaN, Inf, wrong shapes)

## Why Use API Logging?

**Problem**: CUDA errors often crash the program, leaving no debugging information.

**Solution**: FlashInfer's `@flashinfer_api` decorator logs inputs BEFORE execution, so you can see what caused the crash even after the program terminates.
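The log-before-execute idea can be sketched in a few lines of plain Python. This is an illustrative toy, not FlashInfer's actual `@flashinfer_api` implementation: inputs are printed before the wrapped function runs, so they survive even if the call crashes the process.

```python
import datetime
import functools

def log_before_call(func):
    """Toy sketch of a log-before-execute decorator (illustrative only,
    not the real @flashinfer_api): arguments are logged BEFORE the wrapped
    function runs, so they are visible even if the call crashes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        ts = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        print(f"[{ts}] API Call: {func.__name__}")
        for i, a in enumerate(args):
            print(f"  arg[{i}]: {a!r}")
        for k, v in kwargs.items():
            print(f"  {k}={v!r}")
        return func(*args, **kwargs)  # a crash, if any, happens after logging
    return wrapper

@log_before_call
def scale(x, factor=2.0):
    return [v * factor for v in x]

result = scale([1.0, 2.0], factor=3.0)
```

If `scale` raised a CUDA error, the printed arguments would still be on screen (or in the log file), which is exactly the property the real decorator exploits.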

## Step 1: Enable API Logging

### Basic Logging (Function Names Only)

```bash
export FLASHINFER_LOGLEVEL=1        # Log function names
export FLASHINFER_LOGDEST=stdout    # Log to console

python my_script.py
```

Output:
```
[2025-12-18 10:30:45] FlashInfer API Call: batch_decode_with_padded_kv_cache
```

### Detailed Logging (Inputs/Outputs with Metadata)

```bash
export FLASHINFER_LOGLEVEL=3        # Log inputs/outputs with metadata
export FLASHINFER_LOGDEST=debug.log # Save to file

python my_script.py
```

Output in `debug.log`:
```
================================================================================
[2025-12-18 10:30:45] FlashInfer API Logging - System Information
================================================================================
FlashInfer version: 0.6.0
CUDA toolkit version: 12.1
GPU 0: NVIDIA H100 PCIe
  Compute capability: 9.0 (SM90)
PyTorch version: 2.1.0
================================================================================

================================================================================
[2025-12-18 10:30:46] FlashInfer API Call: batch_decode_with_padded_kv_cache
--------------------------------------------------------------------------------
Positional input arguments:
  arg[0]:
    Tensor(
      shape=(32, 8, 128)
      dtype=torch.bfloat16
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )
Keyword input arguments:
  kv_cache=
    Tensor(
      shape=(1024, 2, 8, 128)
      dtype=torch.bfloat16
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )
```

### Full Logging (With Tensor Statistics)

```bash
export FLASHINFER_LOGLEVEL=5        # Log with min/max/mean/nan/inf
export FLASHINFER_LOGDEST=debug.log

python my_script.py
```

Additional output:
```
  Tensor(
    shape=(32, 8, 128)
    dtype=torch.bfloat16
    device=cuda:0
    requires_grad=False
    is_contiguous=True
    min=-3.125000
    max=4.250000
    mean=0.015625
    nan_count=0
    inf_count=0
  )
```

## Step 2: Reproduce the Crash

### Example: Shape Mismatch

Your code crashes with:
```
RuntimeError: CUDA error: an illegal memory access was encountered
```

Enable logging and run again:

```bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=crash_log.txt

python my_script.py
```

The log shows inputs before the crash:
```
[2025-12-18 10:32:15] FlashInfer API Call: batch_decode_with_padded_kv_cache
Positional input arguments:
  arg[0]:
    Tensor(
      shape=(32, 8, 128)      # Query tensor
      ...
    )
Keyword input arguments:
  kv_cache=
    Tensor(
      shape=(1024, 2, 8, 64)  # ❌ Wrong! Should be (..., 128) not (..., 64)
      ...
    )
```

**Found the bug**: `head_dim` mismatch (64 vs 128)

## Step 3: Common CUDA Errors and How to Debug

### Error 1: Illegal Memory Access

**Error Message**:
```
RuntimeError: CUDA error: an illegal memory access was encountered
```

**Enable logging**:
```bash
export FLASHINFER_LOGLEVEL=3
python my_script.py
```

**What to check in logs**:
- ✅ Tensor shapes match expected dimensions
- ✅ All tensors are on CUDA (not CPU)
- ✅ Tensor strides are reasonable
- ✅ `is_contiguous=True` (if required)

**Common causes**:
- Wrong tensor dimensions
- CPU tensor passed to GPU kernel
- Incorrect stride patterns
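The checklist above can also be run proactively in your own code before calling a kernel. The helper below is an illustrative sketch, not a FlashInfer API; it duck-types against the `shape`, `device`, and `is_contiguous()` attributes that PyTorch tensors expose, and the `FakeTensor` stub exists only so the example runs without a GPU.

```python
def check_kernel_inputs(name, t, expected_shape=None):
    """Illustrative pre-call sanity check mirroring the log checklist above.
    Works with any tensor-like object exposing .shape, .device, and
    .is_contiguous() (PyTorch tensors do)."""
    problems = []
    if expected_shape is not None and tuple(t.shape) != tuple(expected_shape):
        problems.append(f"{name}: shape {tuple(t.shape)} != expected {tuple(expected_shape)}")
    if "cuda" not in str(t.device):
        problems.append(f"{name}: on device {t.device}, not CUDA")
    if not t.is_contiguous():
        problems.append(f"{name}: not contiguous")
    return problems

# Minimal stand-in so the sketch runs without a GPU:
class FakeTensor:
    def __init__(self, shape, device="cuda:0", contiguous=True):
        self.shape, self.device = shape, device
        self._c = contiguous
    def is_contiguous(self):
        return self._c

issues = check_kernel_inputs("kv_cache", FakeTensor((1024, 2, 8, 64)),
                             expected_shape=(1024, 2, 8, 128))
print(issues)  # flags the head_dim mismatch from the example above
```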

### Error 2: NaN or Inf Values

**Error Message**:
```
RuntimeError: Function ... returned nan or inf
```

**Enable statistics logging**:
```bash
export FLASHINFER_LOGLEVEL=5        # Level 5 shows nan_count, inf_count
python my_script.py
```

**What to check in logs**:
```
Tensor(
  ...
  min=-1234567.000000     # ❌ Suspiciously large
  max=9876543.000000      # ❌ Suspiciously large
  mean=nan                # ❌ NaN detected
  nan_count=128           # ❌ 128 NaN values!
  inf_count=0
)
```

**Common causes**:
- Division by zero in previous operation
- Numerical overflow/underflow
- Uninitialized memory
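The level-5 statistics can be reproduced by hand when you want to inspect a suspect buffer yourself. The sketch below computes the same min/max/mean/nan_count/inf_count summary in pure Python over a flat list of floats:

```python
import math

def tensor_stats(values):
    """Level-5-style statistics, sketched in pure Python: min/max/mean over
    finite values, plus NaN and Inf counts over the whole buffer."""
    nan_count = sum(1 for v in values if math.isnan(v))
    inf_count = sum(1 for v in values if math.isinf(v))
    finite = [v for v in values if math.isfinite(v)]
    return {
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
        "mean": sum(finite) / len(finite) if finite else None,
        "nan_count": nan_count,
        "inf_count": inf_count,
    }

stats = tensor_stats([1.0, float("nan"), 2.0, float("inf")])
print(stats)  # nan_count=1, inf_count=1 -> time to look upstream
```

On an actual PyTorch tensor the equivalent checks are `torch.isnan(t).sum()` and `torch.isinf(t).sum()`.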

### Error 3: Out of Memory

**Error Message**:
```
RuntimeError: CUDA out of memory
```

**Enable logging**:
```bash
export FLASHINFER_LOGLEVEL=3
python my_script.py
```

**What to check in logs**:
- ✅ Tensor shapes (are they unexpectedly large?)
- ✅ Batch size
- ✅ Sequence length

Example:
```
Tensor(
  shape=(1024, 8192, 128, 128)  # ❌ Way too large! Should be (1024, 128, 128)?
  ...
)
```
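A quick back-of-the-envelope check confirms why that shape is fatal: multiply the dimensions and the bytes per element (2 for bf16/fp16, 4 for fp32). This helper is illustrative arithmetic, not a FlashInfer utility:

```python
def tensor_bytes(shape, dtype_bytes=2):
    """Rough memory footprint: product of dims times bytes per element
    (2 for bf16/fp16, 4 for fp32)."""
    n = 1
    for d in shape:
        n *= d
    return n * dtype_bytes

gib = tensor_bytes((1024, 8192, 128, 128)) / 2**30
print(f"{gib:.0f} GiB")  # -> 256 GiB in bf16: no single GPU holds that
```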

### Error 4: Wrong Dtype

**Error Message**:
```
RuntimeError: expected scalar type BFloat16 but found Float16
```

**Enable logging**:
```bash
export FLASHINFER_LOGLEVEL=3
python my_script.py
```

**What to check in logs**:
```
Tensor(
  dtype=torch.float16     # ❌ Should be torch.bfloat16
  ...
)
```

## Step 4: Multi-Process Debugging

When running with multiple GPUs or processes, use the `%i` pattern so each process writes its own log:

```bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug_rank_%i.txt    # %i = process ID

torchrun --nproc_per_node=4 my_script.py
```

This creates separate logs:
- `debug_rank_12345.txt` (process 12345)
- `debug_rank_12346.txt` (process 12346)
- `debug_rank_12347.txt` (process 12347)
- `debug_rank_12348.txt` (process 12348)

Now you can debug each rank independently.
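The `%i` substitution presumably amounts to replacing the placeholder with the process ID, which you can mimic yourself when post-processing or generating log paths. This helper is an illustrative sketch of that mapping, not FlashInfer's internal code:

```python
import os

def expand_log_dest(pattern):
    """Sketch of how a '%i' destination pattern maps to a per-process file
    name via the process ID (illustrative; the real substitution happens
    inside FlashInfer's logging setup)."""
    return pattern.replace("%i", str(os.getpid()))

path = expand_log_dest("debug_rank_%i.txt")
print(path)  # e.g. debug_rank_12345.txt
```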

## Step 5: Advanced Debugging with compute-sanitizer

For harder bugs, combine API logging with CUDA tools:

### Use compute-sanitizer (Memory Checker)

```bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log

compute-sanitizer --tool memcheck python my_script.py
```

Output shows exact memory errors:
```
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
=========     at 0x1234 in ScaleKernel<float>
=========     by thread (256,0,0) in block (10,0,0)
=========     Address 0x7f1234567890 is out of bounds
```

Check `debug.log` to see what inputs caused this kernel to fail.

### Use cuda-gdb (Debugger)

```bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log

cuda-gdb --args python my_script.py
```

In gdb:
```
(cuda-gdb) run
(cuda-gdb) where     # Show stack trace when it crashes
```

Check `debug.log` for the inputs that led to the crash.

## Step 6: Kernel-Level Debugging with printf()

You can use `printf()` inside CUDA kernels for debugging:

### Basic Usage

```cpp
__global__ void MyKernel(const float* input, float* output, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  // Print from one thread to avoid spam
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    printf("n=%d, input[0]=%f\n", n, input[0]);
  }

  if (idx < n) {
    output[idx] = input[idx] * 2.0f;
  }
}
```

**Important**: Flush the `printf` buffer after the kernel launch, or buffered output may be lost:
```python
my_kernel(input, output)
torch.cuda.synchronize()  # ← Flushes printf output
```

### ⚠️ Warp-Specialized Kernels: Choosing the Right Print Thread

**Problem**: `threadIdx.x == 0` doesn't work for all warps (warp starting at thread 32 won't have thread 0).

**Solution**: Choose one representative thread per specialization group.

```cpp
__global__ void WarpSpecializedKernel(...) {
  // Pick one representative thread per specialization group, e.g. the
  // first thread of each warp; for a 4-warp group, use threadIdx.x % 128 == 0.
  const bool is_group_representative = (threadIdx.x % 32 == 0);
  const int group_id = threadIdx.x / 32;

  if (is_group_representative) {
    printf("Group %d processing\n", group_id);
  }
}
```

**Common mistake** ❌:
```cpp
// ❌ Only thread 0 satisfies this, so only warp 0 ever prints!
if (threadIdx.x == 0) {
  printf("Warp %d processing\n", threadIdx.x / 32);
}
```

### Quick Reference

| Kernel Type | Print Condition | Notes |
|-------------|-----------------|-------|
| Simple kernel | `threadIdx.x == 0` | One thread per block |
| Warp-specialized | One thread per group | Depends on kernel design |

### Other Kernel Debugging Tools

```cpp
// Assert for invariants
assert(value >= 0.0f && "Value must be non-negative");

// Compile-time checks
static_assert(BLOCK_SIZE % 32 == 0, "BLOCK_SIZE must be multiple of warp size");
```

## Environment Variables Reference

| Variable | Values | Description |
|----------|--------|-------------|
| `FLASHINFER_LOGLEVEL` | `0` | No logging (default) |
|  | `1` | Function names only |
|  | `3` | Inputs/outputs with metadata |
|  | `5` | + Tensor statistics (min/max/mean/nan/inf) |
| `FLASHINFER_LOGDEST` | `stdout` | Log to console (default) |
|  | `stderr` | Log to stderr |
|  | `<path>` | Log to file |
|  | `log_%i.txt` | Multi-process: %i = process ID |

## Best Practices

### 1. Always Start with Level 3

```bash
export FLASHINFER_LOGLEVEL=3
```

Level 3 provides tensor metadata (shape, dtype, device) without overwhelming output.

### 2. Use Level 5 for Numerical Issues

```bash
export FLASHINFER_LOGLEVEL=5
```

Only use level 5 when debugging NaN/Inf problems (adds statistics).

### 3. Log to File for Crashes

```bash
export FLASHINFER_LOGDEST=crash_log.txt
```

Console output may be lost when program crashes. File logs persist.

### 4. Compare Before/After

Enable logging and compare:
- Last successful API call (inputs logged, outputs logged) ✅
- First failed API call (inputs logged, no outputs) ❌ ← This is where it crashed!
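Finding that last logged call is a one-liner, assuming the `FlashInfer API Call:` line format shown earlier. The demo below builds a synthetic two-line log so it is self-contained; point `grep` at your real `crash_log.txt` instead:

```shell
# Demo on a synthetic log; point grep at your real crash_log.txt instead.
printf '%s\n' \
  '[10:30:45] FlashInfer API Call: plan' \
  '[10:30:46] FlashInfer API Call: batch_decode_with_padded_kv_cache' \
  > crash_log.txt

# The last match is the call that was in flight when the program died.
grep "FlashInfer API Call" crash_log.txt | tail -n 1
```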

### 5. Disable Logging in Production

```bash
unset FLASHINFER_LOGLEVEL   # or export FLASHINFER_LOGLEVEL=0
```

Logging has zero overhead when disabled (decorator returns original function).
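The zero-overhead-when-disabled claim follows a common Python pattern: if logging is off when the decorator is applied, it simply returns the original function, leaving nothing to execute per call. A minimal sketch of that pattern (illustrative, not FlashInfer's actual code):

```python
import os

def maybe_log(func):
    """Sketch of the zero-overhead-when-disabled pattern: when logging is
    off at decoration time, hand back the original function unchanged."""
    if int(os.environ.get("FLASHINFER_LOGLEVEL", "0") or "0") == 0:
        return func  # no wrapper, no per-call cost
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@maybe_log
def f(x):
    return x + 1
```

Note that with this pattern the log level is read once, at import time, which is why the tutorial sets the environment variables before launching Python.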

## Troubleshooting

### No Logs Appearing

**Problem**: Set `FLASHINFER_LOGLEVEL=3` but no logs appear

**Solutions**:
1. **Check if API has the decorator**: Not all FlashInfer APIs have `@flashinfer_api` yet (work in progress)

2. **Verify environment variable**:
   ```bash
   echo $FLASHINFER_LOGLEVEL    # Should print "3"
   ```

3. **Check log destination**:
   ```bash
   echo $FLASHINFER_LOGDEST     # Should print path or "stdout"
   ```

### Too Much Output

**Problem**: Level 5 produces too much output

**Solution**: Use level 3 instead:
```bash
export FLASHINFER_LOGLEVEL=3   # Skip tensor statistics
```

### Statistics Skipped in CUDA Graph

**Warning**: `[statistics skipped: CUDA graph capture in progress]`

**What it means**: Level 5 statistics are automatically skipped during CUDA graph capture (to avoid synchronization)

**This is normal**: The framework protects you from graph capture issues.

## Quick Examples

### Debug Shape Mismatch
```bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=stdout
python my_script.py
# Check tensor shapes in output
```

### Debug NaN/Inf
```bash
export FLASHINFER_LOGLEVEL=5         # Show statistics
export FLASHINFER_LOGDEST=debug.log
python my_script.py
# Check nan_count and inf_count in debug.log
```

### Debug Multi-GPU Training
```bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=rank_%i.log   # Separate log per rank
torchrun --nproc_per_node=8 train.py
# Check rank_*.log files
```

### Combine with Memory Checker
```bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=inputs.log
compute-sanitizer --tool memcheck python my_script.py
# inputs.log shows what data caused the memory error
```

## Example: Full Debug Session

### Your code crashes:
```python
import torch
from flashinfer import batch_decode_with_padded_kv_cache

q = torch.randn(32, 8, 128, dtype=torch.bfloat16, device="cuda")
kv = torch.randn(1024, 2, 8, 64, dtype=torch.bfloat16, device="cuda")  # Wrong dim!

output = batch_decode_with_padded_kv_cache(q, kv)  # ❌ Crashes
```

### Enable logging:
```bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log
python test.py
```

### Check debug.log:
```
[2025-12-18 10:45:23] FlashInfer API Call: batch_decode_with_padded_kv_cache
Positional input arguments:
  arg[0]:
    Tensor(
      shape=(32, 8, 128)
      dtype=torch.bfloat16
      ...
    )
  arg[1]:
    Tensor(
      shape=(1024, 2, 8, 64)    # ❌ Found it! Last dim should be 128
      dtype=torch.bfloat16
      ...
    )
```

### Fix the bug:
```python
kv = torch.randn(1024, 2, 8, 128, dtype=torch.bfloat16, device="cuda")  # ✅ Fixed
```

### Success!
```bash
python test.py
# No crash, outputs logged successfully
```

## Summary

1. **Enable logging** before the crash:
   ```bash
   export FLASHINFER_LOGLEVEL=3
   export FLASHINFER_LOGDEST=debug.log
   ```

2. **Run your code** - inputs are logged BEFORE crash

3. **Check the log** - last API call shows what caused the crash

4. **Fix the issue** based on logged input metadata

5. **Disable logging** when done:
   ```bash
   export FLASHINFER_LOGLEVEL=0
   ```

## Related Documentation

- See CLAUDE.md "API Logging with @flashinfer_api" for technical details
- See `flashinfer/api_logging.py` for implementation
- See CUDA documentation for compute-sanitizer and cuda-gdb

Related Skills

exa-debug-bundle

25
from ComeOnOliver/skillshub

Collect Exa debug evidence for support tickets and troubleshooting. Use when encountering persistent issues, preparing support tickets, or collecting diagnostic information for Exa problems. Trigger with phrases like "exa debug", "exa support bundle", "collect exa logs", "exa diagnostic".

evernote-debug-bundle

25
from ComeOnOliver/skillshub

Debug Evernote API issues with diagnostic tools and techniques. Use when troubleshooting API calls, inspecting requests/responses, or diagnosing integration problems. Trigger with phrases like "debug evernote", "evernote diagnostic", "troubleshoot evernote", "evernote logs", "inspect evernote".

documenso-debug-bundle

25
from ComeOnOliver/skillshub

Comprehensive debugging toolkit for Documenso integrations. Use when troubleshooting complex issues, gathering diagnostic information, or creating support tickets for Documenso problems. Trigger with phrases like "debug documenso", "documenso diagnostics", "troubleshoot documenso", "documenso support ticket".

deepgram-debug-bundle

25
from ComeOnOliver/skillshub

Collect Deepgram debug evidence for support and troubleshooting. Use when preparing support tickets, investigating issues, or collecting diagnostic information for Deepgram problems. Trigger: "deepgram debug", "deepgram support ticket", "collect deepgram logs", "deepgram diagnostic", "deepgram debug bundle".

databricks-debug-bundle

25
from ComeOnOliver/skillshub

Collect Databricks debug evidence for support tickets and troubleshooting. Use when encountering persistent issues, preparing support tickets, or collecting diagnostic information for Databricks problems. Trigger with phrases like "databricks debug", "databricks support bundle", "collect databricks logs", "databricks diagnostic".

customerio-debug-bundle

25
from ComeOnOliver/skillshub

Collect Customer.io debug evidence for support tickets. Use when creating support requests, investigating delivery failures, or documenting integration issues. Trigger: "customer.io debug", "customer.io support ticket", "collect customer.io logs", "customer.io diagnostics".

cursor-debug-bundle

25
from ComeOnOliver/skillshub

Debug AI suggestion quality, context issues, and code generation problems in Cursor. Triggers on "debug cursor ai", "cursor suggestions wrong", "bad cursor completion", "cursor ai debug", "cursor hallucination".

coreweave-debug-bundle

25
from ComeOnOliver/skillshub

Collect CoreWeave cluster diagnostics for support tickets. Use when preparing a support case, collecting GPU node status, or documenting pod failures. Trigger with phrases like "coreweave debug", "coreweave support", "coreweave diagnostics", "collect coreweave logs".

coderabbit-debug-bundle

25
from ComeOnOliver/skillshub

Collect CodeRabbit debug evidence for support tickets and troubleshooting. Use when encountering persistent issues, preparing support tickets, or collecting diagnostic information for CodeRabbit problems. Trigger with phrases like "coderabbit debug", "coderabbit support bundle", "coderabbit diagnostic", "coderabbit not working evidence".

clickup-debug-bundle

25
from ComeOnOliver/skillshub

Collect ClickUp API diagnostic information for troubleshooting and support. Use when encountering persistent issues, preparing support tickets, or collecting API connectivity and rate limit diagnostics. Trigger: "clickup debug", "clickup diagnostics", "clickup support bundle", "collect clickup logs", "clickup health check".

clickhouse-debug-bundle

25
from ComeOnOliver/skillshub

Collect ClickHouse diagnostic data — system tables, query logs, merge status, and server metrics for support tickets and troubleshooting. Use when investigating persistent issues, preparing debug artifacts, or collecting evidence for ClickHouse support. Trigger: "clickhouse debug", "clickhouse diagnostics", "clickhouse support bundle", "collect clickhouse logs", "clickhouse system tables".

clerk-debug-bundle

25
from ComeOnOliver/skillshub

Collect comprehensive debug information for Clerk issues. Use when troubleshooting complex problems, preparing support tickets, or diagnosing intermittent issues. Trigger with phrases like "clerk debug", "clerk diagnostics", "clerk support ticket", "clerk troubleshooting".