coreweave-common-errors

Diagnose and fix CoreWeave GPU scheduling, pod, and networking errors. Use when pods are stuck Pending, GPUs are not allocated, or experiencing CUDA and NCCL errors. Trigger with phrases like "coreweave error", "coreweave pod pending", "coreweave gpu not found", "coreweave debug", "fix coreweave".

25 stars

Best use case

coreweave-common-errors is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Diagnose and fix CoreWeave GPU scheduling, pod, and networking errors. Use when pods are stuck Pending, GPUs are not allocated, or experiencing CUDA and NCCL errors. Trigger with phrases like "coreweave error", "coreweave pod pending", "coreweave gpu not found", "coreweave debug", "fix coreweave".

Teams using coreweave-common-errors should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/coreweave-common-errors/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/jeremylongshore/claude-code-plugins-plus-skills/coreweave-common-errors/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/coreweave-common-errors/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How coreweave-common-errors Compares

Feature / Agentcoreweave-common-errorsStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Diagnose and fix CoreWeave GPU scheduling, pod, and networking errors. Use when pods are stuck Pending, GPUs are not allocated, or experiencing CUDA and NCCL errors. Trigger with phrases like "coreweave error", "coreweave pod pending", "coreweave gpu not found", "coreweave debug", "fix coreweave".

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# CoreWeave Common Errors

## Error Reference

### 1. Pod Stuck Pending -- No GPU Available
```bash
kubectl describe pod <pod-name> | grep -A5 Events
# "0/N nodes are available: insufficient nvidia.com/gpu"
```
**Fix**: Check GPU availability: `kubectl get nodes -l gpu.nvidia.com/class=A100_PCIE_80GB`. Try a different GPU type or region.

### 2. CUDA Out of Memory
```
torch.cuda.OutOfMemoryError: CUDA out of memory
```
**Fix**: Reduce batch size, enable gradient checkpointing, or use a larger GPU (A100-80GB instead of 40GB).

### 3. Image Pull BackOff
**Fix**: Create an imagePullSecret:
```bash
kubectl create secret docker-registry regcred \
  --docker-server=ghcr.io \
  --docker-username=$GH_USER \
  --docker-password=$GH_TOKEN
```

### 4. NCCL Timeout (Multi-GPU)
```
NCCL error: unhandled system error
```
**Fix**: Ensure all GPUs are on the same node (NVLink). For multi-node, use InfiniBand-connected nodes.

### 5. PVC Not Mounting
**Fix**: Check storage class availability: `kubectl get sc`. Use CoreWeave storage classes like `shared-hdd-ord1` or `shared-ssd-ord1`.

### 6. Node Affinity Mismatch
**Fix**: List valid GPU class labels:
```bash
kubectl get nodes -o json | jq -r '.items[].metadata.labels["gpu.nvidia.com/class"]' | sort -u
```

### 7. Service Not Reachable
**Fix**: Check Service and Endpoints:
```bash
kubectl get svc,endpoints <service-name>
```

## Resources

- [CoreWeave Documentation](https://docs.coreweave.com)
- [GPU Instance Types](https://docs.coreweave.com/docs/platform/instances/gpu-instances)

## Next Steps

For diagnostics, see `coreweave-debug-bundle`.

Related Skills

fathom-common-errors

25
from ComeOnOliver/skillshub

Diagnose and fix Fathom API errors including auth failures and missing data. Use when API calls fail, transcripts are empty, or webhooks are not firing. Trigger with phrases like "fathom error", "fathom not working", "fathom api failure", "fix fathom".

exa-common-errors

25
from ComeOnOliver/skillshub

Diagnose and fix Exa API errors by HTTP code and error tag. Use when encountering Exa errors, debugging failed requests, or troubleshooting integration issues. Trigger with phrases like "exa error", "fix exa", "exa not working", "debug exa", "exa 429", "exa 401".

evernote-common-errors

25
from ComeOnOliver/skillshub

Diagnose and fix common Evernote API errors. Use when encountering Evernote API exceptions, debugging failures, or troubleshooting integration issues. Trigger with phrases like "evernote error", "evernote exception", "fix evernote issue", "debug evernote", "evernote troubleshooting".

elevenlabs-common-errors

25
from ComeOnOliver/skillshub

Diagnose and fix ElevenLabs API errors by HTTP status code. Use when encountering ElevenLabs errors, debugging failed TTS/STS requests, or troubleshooting voice cloning and streaming issues. Trigger: "elevenlabs error", "fix elevenlabs", "elevenlabs not working", "debug elevenlabs", "elevenlabs 401", "elevenlabs 429", "elevenlabs 400".

documenso-common-errors

25
from ComeOnOliver/skillshub

Diagnose and resolve common Documenso API errors and issues. Use when encountering Documenso errors, debugging integration issues, or troubleshooting failed operations. Trigger with phrases like "documenso error", "documenso 401", "documenso failed", "fix documenso", "documenso not working".

deepgram-common-errors

25
from ComeOnOliver/skillshub

Diagnose and fix common Deepgram errors and issues. Use when troubleshooting Deepgram API errors, debugging transcription failures, or resolving integration issues. Trigger: "deepgram error", "deepgram not working", "fix deepgram", "deepgram troubleshoot", "transcription failed", "deepgram 401".

cursor-common-errors

25
from ComeOnOliver/skillshub

Troubleshoot common Cursor IDE errors: authentication, completion, indexing, API, and performance issues. Triggers on "cursor error", "cursor not working", "cursor issue", "cursor problem", "fix cursor", "cursor crash".

coreweave-webhooks-events

25
from ComeOnOliver/skillshub

Monitor CoreWeave cluster events and GPU workload status. Use when tracking pod lifecycle events, monitoring GPU utilization, or alerting on inference service health changes. Trigger with phrases like "coreweave events", "coreweave monitoring", "coreweave pod alerts", "coreweave gpu monitoring".

coreweave-upgrade-migration

25
from ComeOnOliver/skillshub

Upgrade CoreWeave deployments and migrate between GPU types. Use when migrating from A100 to H100, upgrading CUDA versions, or updating inference server versions. Trigger with phrases like "upgrade coreweave", "coreweave gpu migration", "coreweave cuda upgrade", "migrate coreweave".

coreweave-security-basics

25
from ComeOnOliver/skillshub

Secure CoreWeave deployments with RBAC, network policies, and secrets management. Use when hardening GPU workloads, managing model access, or configuring namespace isolation. Trigger with phrases like "coreweave security", "coreweave rbac", "secure coreweave", "coreweave secrets".

coreweave-sdk-patterns

25
from ComeOnOliver/skillshub

Production-ready patterns for CoreWeave GPU workload management with kubectl and Python. Use when building inference clients, managing GPU deployments programmatically, or creating reusable CoreWeave deployment templates. Trigger with phrases like "coreweave patterns", "coreweave client", "coreweave Python", "coreweave deployment template".

coreweave-reference-architecture

25
from ComeOnOliver/skillshub

Reference architecture for CoreWeave GPU cloud deployments. Use when designing ML infrastructure, planning multi-model serving, or establishing CoreWeave deployment standards. Trigger with phrases like "coreweave architecture", "coreweave design", "coreweave infrastructure", "coreweave best practices".