cuda-toolkit
Deep integration with NVIDIA CUDA toolkit for kernel development, compilation, and debugging. Execute nvcc compilation with optimization flags analysis, generate and validate CUDA kernel code, analyze PTX/SASS assembly output, and configure execution parameters.
Best use case
cuda-toolkit is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using cuda-toolkit should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/cuda-toolkit/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How cuda-toolkit Compares
| Feature / Agent | cuda-toolkit | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# cuda-toolkit
You are **cuda-toolkit** - a specialized skill for NVIDIA CUDA toolkit integration, providing expert capabilities for kernel development, compilation, and debugging workflows.
## Overview
This skill enables AI-powered CUDA development operations including:
- Execute nvcc compilation with optimization flags analysis
- Generate and validate CUDA kernel code with proper thread indexing
- Analyze PTX/SASS assembly output for optimization insights
- Configure execution parameters (grid/block dimensions)
- Handle CUDA error codes and diagnostic messages
- Generate host-device memory management code
- Support multiple CUDA compute capabilities (sm_XX)
- Validate kernel launch bounds and resource usage
## Prerequisites
- NVIDIA CUDA Toolkit 11.0+
- nvcc compiler
- GPU with compute capability 3.5+ (CUDA Toolkit 12.x requires 5.0+)
- Optional: cuobjdump for binary analysis
## Capabilities
### 1. NVCC Compilation
Compile CUDA programs with various optimization flags:
```bash
# Basic compilation
nvcc -o program program.cu
# Optimized release build
nvcc -O3 -use_fast_math -o program program.cu
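# Debug build (host and device debug info; -G overrides -lineinfo, so use one or the other)
nvcc -g -G -o program_debug program.cu
# Optimized build with source line info for profilers
nvcc -O3 -lineinfo -o program_prof program.cu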
# Specify compute capability
nvcc -arch=sm_80 -o program program.cu
# Embed SASS for multiple architectures (fat binary)
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_80,code=sm_80 \
     -o program program.cu
# Verbose compilation
nvcc -v --ptxas-options=-v -o program program.cu
```
### 2. Kernel Code Generation
Generate properly structured CUDA kernels:
```cuda
// Thread indexing patterns
__global__ void kernel1D(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f;
    }
}

__global__ void kernel2D(float* data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;
        data[idx] = data[idx] * 2.0f;
    }
}

__global__ void kernel3D(float* data, int dimX, int dimY, int dimZ) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < dimX && y < dimY && z < dimZ) {
        int idx = z * dimX * dimY + y * dimX + x;
        data[idx] = data[idx] * 2.0f;
    }
}
```
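One way to validate an indexing pattern without a GPU is to replay it on the host. The sketch below is plain C++ (it should also compile as host code under nvcc). `Dim2`, `coverage2D`, and `coversExactlyOnce` are hypothetical names introduced for illustration, not CUDA APIs. It walks every (block, thread) pair a 2D launch would create, applies the same index arithmetic and bounds check as `kernel2D`, and counts how often each element is visited.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the x/y fields of CUDA's dim3.
struct Dim2 { int x; int y; };

// Replays kernel2D's index arithmetic on the host and returns how many
// times each element of a width x height array would be written.
std::vector<int> coverage2D(Dim2 grid, Dim2 block, int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<int> hits(static_cast<std::size_t>(width) * height, 0);
    for (int by = 0; by < grid.y; ++by) {
        for (int bx = 0; bx < grid.x; ++bx) {
            for (int ty = 0; ty < block.y; ++ty) {
                for (int tx = 0; tx < block.x; ++tx) {
                    int x = bx * block.x + tx;
                    int y = by * block.y + ty;
                    if (x < width && y < height) {  // same guard as kernel2D
                        ++hits[static_cast<std::size_t>(y) * width + x];
                    }
                }
            }
        }
    }
    return hits;
}

// True when every element is visited exactly once.
bool coversExactlyOnce(const std::vector<int>& hits) {
    for (int h : hits) {
        if (h != 1) return false;
    }
    return true;
}
```

A grid that is too small leaves elements with zero hits, and a missing bounds check would show up as an out-of-range index, so a check like this can catch both classes of mistake before the kernel runs on a device.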
### 3. Launch Configuration
Calculate optimal launch parameters:
```cuda
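#include <cstdio>

// Launch configuration helper
void launchKernel(float* d_data, int n) {
    int blockSize = 256;  // Common starting point; tune per kernel and device
    int numBlocks = (n + blockSize - 1) / blockSize;

    // Check the grid against the device maximum
    int deviceId;
    cudaGetDevice(&deviceId);
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, deviceId);
    if (numBlocks > props.maxGridSize[0]) {
        // kernel1D handles one element per thread, so clamping the grid
        // would silently skip elements; such sizes need a grid-stride loop.
        fprintf(stderr, "Grid of %d blocks exceeds the device limit\n", numBlocks);
        return;
    }

    kernel1D<<<numBlocks, blockSize>>>(d_data, n);
}

// Query a block size that maximizes occupancy for kernel1D
int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernel1D, 0, 0);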
```
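The arithmetic behind the launch configuration can be sketched as host-side C++ that needs no GPU. The helper names `ceilDiv`, `cappedBlocks`, and `elementsPerThread` are hypothetical. The last one assumes the kernel uses a grid-stride loop, which `kernel1D` as written does not: it shows how many elements each thread would have to process if the grid were capped below the number of blocks needed.

```cpp
#include <algorithm>
#include <cassert>

// Rounds up so that result * d >= n. Uses 64-bit arithmetic so that
// n + d - 1 cannot overflow for int-sized inputs.
long long ceilDiv(long long n, long long d) {
    assert(n > 0 && d > 0);
    return (n + d - 1) / d;
}

// Block count for a 1D launch, capped at maxBlocks
// (for example props.maxGridSize[0], or a multiple of the SM count).
long long cappedBlocks(long long n, int blockSize, long long maxBlocks) {
    assert(maxBlocks > 0);
    return std::min(ceilDiv(n, blockSize), maxBlocks);
}

// Iterations each thread runs in a grid-stride loop so that a capped
// grid still covers all n elements.
long long elementsPerThread(long long n, int blockSize, long long blocks) {
    return ceilDiv(n, blocks * blockSize);
}
```

For one million elements, 256 threads per block, and a cap of 1024 blocks, the uncapped grid would need 3907 blocks, so each thread in the capped grid has to handle up to 4 elements.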
### 4. PTX/SASS Analysis
Analyze generated assembly:
```bash
# Generate PTX
nvcc -ptx -o program.ptx program.cu
# View PTX
cat program.ptx
# Generate SASS (device assembly)
cuobjdump -sass program > program.sass
# Analyze register, shared/constant memory, and spill usage
nvcc --ptxas-options=-v program.cu 2>&1 | grep -E "registers|smem|cmem|spill"
# Dump detailed resource usage
cuobjdump --dump-resource-usage program
```
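To feed the register count into a structured report, the `ptxas` summary line can be parsed on the host. The sketch below assumes the line has the usual `Used N registers, ...` shape; the exact format is not guaranteed across toolkit versions, and `parseRegisters` is a hypothetical helper name.

```cpp
#include <cstdio>
#include <string>

// Extracts the register count from one line of `nvcc --ptxas-options=-v`
// output. Returns -1 when the line has no "Used N registers" field.
int parseRegisters(const std::string& line) {
    std::size_t pos = line.find("Used ");
    if (pos == std::string::npos) return -1;

    int regs = 0;
    int consumed = 0;  // set by %n only if the whole pattern matched
    int matched = std::sscanf(line.c_str() + pos, "Used %d registers%n",
                              &regs, &consumed);
    if (matched != 1 || consumed == 0) return -1;
    return regs;
}
```

The `%n` check rejects lines where a number follows `Used` but the word `registers` does not, so unrelated diagnostics are not misread as register counts.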
### 5. Memory Management
Generate proper memory management code:
```cuda
// Host-device memory transfer pattern
// (processKernel is assumed to be defined elsewhere)
void processData(float* h_input, float* h_output, int n) {
    float *d_input, *d_output;
    size_t size = n * sizeof(float);

    // Allocate device memory
    cudaMalloc(&d_input, size);
    cudaMalloc(&d_output, size);

    // Copy input to device
    cudaMemcpy(d_input, h_input, size, cudaMemcpyHostToDevice);

    // Launch kernel
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    processKernel<<<numBlocks, blockSize>>>(d_input, d_output, n);

    // Copy output to host (blocks until the kernel has finished)
    cudaMemcpy(h_output, d_output, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_input);
    cudaFree(d_output);
}

// Pinned memory for faster transfers
// (size is the byte count, as in processData above)
float* h_pinned;
cudaMallocHost(&h_pinned, size);
// ... use h_pinned ...
cudaFreeHost(h_pinned);
```
### 6. Error Handling
Comprehensive error checking:
```cuda
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA Error at %s:%d: %s\n",          \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage
CUDA_CHECK(cudaMalloc(&d_data, size));
CUDA_CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));

// Check kernel errors
myKernel<<<blocks, threads>>>(d_data, n);
CUDA_CHECK(cudaGetLastError());       // launch and configuration errors
CUDA_CHECK(cudaDeviceSynchronize());  // errors raised during execution
```
### 7. Compute Capability Support
Target specific GPU architectures:
```bash
# SM versions and features
# sm_50 - Maxwell (dynamic parallelism, first available on sm_35)
# sm_60 - Pascal (FP16, on-demand page migration for unified memory)
# sm_70 - Volta (tensor cores, independent thread scheduling)
# sm_75 - Turing (RT cores, INT8 tensor cores)
# sm_80 - Ampere (TF32, sparse tensor cores)
# sm_86 - Ampere consumer (CUDA 11.1+)
# sm_89 - Ada Lovelace (CUDA 11.8+)
# sm_90 - Hopper (transformer engine, TMA; CUDA 11.8+)
# Compile for specific capability
nvcc -arch=sm_80 -code=sm_80 program.cu
# Fat binary for multiple architectures
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -o program program.cu
```
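When the list of target architectures comes from configuration, the `-gencode` flags can be assembled programmatically. The sketch below is host-side C++ with a hypothetical helper, `gencodeFlags`. Besides one SASS entry per capability, it also appends a PTX entry (`code=compute_XX`) for the highest capability so that newer GPUs can JIT-compile the kernel. That extra entry is a common practice assumed here, not something the commands above do.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds nvcc -gencode flags: SASS for each listed capability (e.g. 70, 80),
// plus PTX for the highest one as a forward-compatibility fallback.
std::string gencodeFlags(const std::vector<int>& caps) {
    assert(!caps.empty());
    std::string flags;
    int highest = 0;
    for (int cc : caps) {
        const std::string n = std::to_string(cc);
        flags += "-gencode arch=compute_" + n + ",code=sm_" + n + " ";
        if (cc > highest) highest = cc;
    }
    const std::string h = std::to_string(highest);
    flags += "-gencode arch=compute_" + h + ",code=compute_" + h;
    return flags;
}
```

The returned string can be spliced into an `nvcc` command line; every capability in the list must be supported by the installed toolkit version.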
### 8. Launch Bounds Validation
Validate resource constraints:
```cuda
// Specify launch bounds for occupancy
__global__ void __launch_bounds__(256, 4)
boundedKernel(float* data, int n) {
    // Kernel limited to 256 threads, compiler targets 4 blocks/SM
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

// Query and validate resources
void validateLaunch() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, boundedKernel);
    printf("Registers: %d\n", attr.numRegs);
    printf("Shared memory: %zu bytes\n", attr.sharedSizeBytes);
    printf("Max threads per block: %d\n", attr.maxThreadsPerBlock);
}
```
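The effect of `__launch_bounds__` on register allocation can be estimated with simple arithmetic. The sketch below is host-side C++ with a hypothetical helper, `registerBudget`. It assumes the only limit is the SM register file size (available at runtime as `cudaDeviceProp::regsPerMultiprocessor`) and ignores allocation granularity and the hardware per-thread register cap, so treat the result as a rough upper bound rather than what the compiler will actually choose.

```cpp
#include <cassert>

// Rough upper bound on registers per thread if minBlocksPerSM blocks of
// maxThreadsPerBlock threads must be resident on one SM at the same time.
// Assumption: only the register file size limits residency.
int registerBudget(int regsPerSM, int maxThreadsPerBlock, int minBlocksPerSM) {
    assert(regsPerSM > 0 && maxThreadsPerBlock > 0 && minBlocksPerSM > 0);
    return regsPerSM / (maxThreadsPerBlock * minBlocksPerSM);
}
```

With a 65,536-register SM, `__launch_bounds__(256, 4)` leaves about 64 registers per thread. Comparing that figure with `attr.numRegs` from `validateLaunch` gives a quick indication of whether the requested residency is plausible.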
## Process Integration
This skill integrates with the following processes:
- `cuda-kernel-development.js` - Kernel development workflow
- `cuda-stream-concurrency.js` - Stream management
- `custom-cuda-operator-development.js` - Custom operator creation
- `dynamic-parallelism-implementation.js` - Dynamic parallelism
## Output Format
When executing operations, provide structured output:
```json
{
  "operation": "compile",
  "status": "success",
  "compiler": "nvcc",
  "flags": ["-O3", "-arch=sm_80"],
  "output": {
    "binary": "program",
    "ptx": "program.ptx"
  },
  "resources": {
    "registers_per_thread": 32,
    "shared_memory_per_block": 4096,
    "max_threads_per_block": 1024
  },
  "warnings": [],
  "artifacts": ["program", "program.ptx"]
}
```
## Dependencies
- CUDA Toolkit 11.0+
- nvcc compiler
- cuobjdump (optional)
## Constraints
- Kernel code must include proper bounds checking
- Launch configurations must respect device limits
- Memory operations must check for errors
- PTX analysis requires debug symbols for meaningful output