castai-common-errors

Diagnose and fix CAST AI agent, API, and autoscaler errors. Use when the CAST AI agent is offline, nodes are not scaling, or API calls return errors. Trigger with phrases like "cast ai error", "cast ai not working", "cast ai agent offline", "cast ai debug", "fix cast ai".

Best use case

castai-common-errors is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using castai-common-errors should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/castai-common-errors/SKILL.md --create-dirs "https://raw.githubusercontent.com/jeremylongshore/claude-code-plugins-plus-skills/main/plugins/saas-packs/castai-pack/skills/castai-common-errors/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/castai-common-errors/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How castai-common-errors Compares

Feature / Agent | castai-common-errors | Standard Approach
Platform Support | Not specified | Limited / Varies
Context Awareness | High | Baseline
Installation Complexity | Unknown | N/A

Frequently Asked Questions

Where can I find the source code?

The source code is on GitHub in the jeremylongshore/claude-code-plugins-plus-skills repository, under plugins/saas-packs/castai-pack/skills/castai-common-errors.

SKILL.md Source

# CAST AI Common Errors

## Overview

Diagnostic guide for the 10 most common CAST AI issues, covering agent connectivity, API errors, autoscaler failures, and node provisioning problems.

## Prerequisites

- `kubectl` access to the cluster
- `helm`, `curl`, and `jq` installed for the commands below
- `CASTAI_API_KEY` and `CASTAI_CLUSTER_ID` set in the environment
- Access to CAST AI console for log correlation

## Error Reference

### 1. Agent Pod CrashLoopBackOff

```bash
kubectl get pods -n castai-agent
kubectl logs -n castai-agent deployment/castai-agent --tail=50
```

**Causes and fixes:**
- **Invalid API key**: Regenerate at console.cast.ai > API
- **Wrong provider**: Set the Helm value `provider` to match the cluster (`eks`, `gke`, or `aks`), for example `--set provider=eks`
- **RBAC missing**: Apply the required ClusterRole and ClusterRoleBinding
- **Network blocked**: Ensure outbound HTTPS to `api.cast.ai` is allowed
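
As a rough triage aid for the causes above, the log output can be matched against a few keywords. This is only a sketch: the function name `classify_agent_log` is hypothetical, and the matched strings are illustrative guesses rather than documented CAST AI agent log messages, so adjust them to what your logs actually contain. A wrong `provider` value is not covered here.

```bash
# Map agent log text (read from stdin) to the most likely cause.
# ASSUMPTION: the patterns below are illustrative, not official log formats.
classify_agent_log() {
  local logs
  logs="$(cat)"
  if grep -qiE '401|unauthorized|invalid api key' <<<"$logs"; then
    echo "invalid-api-key"
  elif grep -qiE '403|forbidden|cannot (list|get|watch)' <<<"$logs"; then
    echo "rbac-missing"
  elif grep -qiE 'timeout|no such host|connection refused' <<<"$logs"; then
    echo "network-blocked"
  else
    echo "unknown"
  fi
}

# Usage (requires cluster access):
#   kubectl logs -n castai-agent deployment/castai-agent --tail=50 | classify_agent_log
```

Treat the result as a hint for where to look first, not as a diagnosis.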

### 2. Agent Shows "Disconnected" in Console

```bash
# Check agent heartbeat and connection messages
kubectl logs -n castai-agent deployment/castai-agent | grep -iE "heartbeat|connect|error"

# Verify network connectivity from inside the cluster.
# Any HTTP status (even 401, since no API key is sent) means api.cast.ai is reachable.
kubectl run castai-debug --image=curlimages/curl --rm -it --restart=Never -- \
  curl -s -o /dev/null -w "%{http_code}" https://api.cast.ai/v1/kubernetes/external-clusters
```

**Fix**: Restart the agent pod: `kubectl rollout restart deployment/castai-agent -n castai-agent`
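
Before restarting, it may help to interpret the status code printed by the connectivity check. The sketch below relies on the fact that curl's `%{http_code}` prints `000` when no HTTP response was received. The function name `connectivity_verdict` is hypothetical.

```bash
# Interpret the code printed by: curl -s -o /dev/null -w "%{http_code}" <url>
# 000 (or empty output) means curl never received an HTTP response.
connectivity_verdict() {
  case "$1" in
    000|"") echo "unreachable" ;;            # DNS, firewall, proxy, or TLS problem
    [1-5][0-9][0-9]) echo "reachable" ;;     # any HTTP status proves network access
    *) echo "unexpected" ;;
  esac
}

# Usage:
#   connectivity_verdict "$(curl -s -o /dev/null -w '%{http_code}' https://api.cast.ai/v1/kubernetes/external-clusters)"
```

If the verdict is `unreachable`, a restart alone is unlikely to help; check egress rules and proxies first.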

### 3. API Returns 401 Unauthorized

```bash
# Test API key
curl -s -o /dev/null -w "%{http_code}" \
  -H "X-API-Key: ${CASTAI_API_KEY}" \
  https://api.cast.ai/v1/kubernetes/external-clusters
# Should return 200, not 401
```

**Fix**: Generate a new API key at console.cast.ai > API > API Access Keys.
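
Before regenerating the key, a quick local check can rule out an unset variable or stray whitespace (for example a trailing newline copied from a secrets file). This sketch only inspects the shell variable and does not validate the key against CAST AI. The function name `check_api_key_var` is hypothetical.

```bash
# Inspect the key value locally; prints one of: empty, contains-whitespace, looks-ok.
check_api_key_var() {
  local key="${1-}"
  if [ -z "$key" ]; then
    echo "empty"
  elif [[ "$key" =~ [[:space:]] ]]; then
    echo "contains-whitespace"
  else
    echo "looks-ok"
  fi
}

# Usage:
#   check_api_key_var "${CASTAI_API_KEY-}"
```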

### 4. Nodes Not Scaling Up (Unschedulable Pods)

```bash
# Check for pending pods
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Verify unschedulable pods policy is enabled
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '.unschedulablePods'
```

**Causes:**
- `unschedulablePods.enabled` is `false` -- enable it
- Cluster limits reached -- increase `clusterLimits.cpu.maxCores`
- No matching node template -- check constraints match pod requirements
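
The three causes above can be checked in order once the relevant values are known. The sketch below takes them as arguments rather than querying the API. It assumes you have already read `unschedulablePods.enabled` and `clusterLimits.cpu.maxCores` from the policies response and counted the cluster's current CPU cores yourself. The function name `scale_up_blocker` and its argument order are hypothetical.

```bash
# Args: <unschedulablePods.enabled> <clusterLimits.cpu.maxCores> <current cluster cores>
# Prints the first likely blocker, checked in the same order as the list above.
scale_up_blocker() {
  local enabled="$1" max_cores="$2" current_cores="$3"
  if [ "$enabled" != "true" ]; then
    echo "policy-disabled"
  elif [ "$current_cores" -ge "$max_cores" ]; then
    echo "cluster-limit-reached"
  else
    echo "check-node-templates"
  fi
}

# Usage with example values:
#   scale_up_blocker true 64 64
```

The last branch is a fallback, not proof: it only means the first two causes were ruled out.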

### 5. Nodes Not Scaling Down (Empty Nodes)

```bash
# Check node downscaler configuration
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '.nodeDownscaler'
```

**Causes:**
- `nodeDownscaler.enabled` is `false`
- Pods with `PodDisruptionBudget` blocking eviction
- DaemonSet-only nodes with system pods preventing drain
- Delay too high -- reduce `emptyNodes.delaySeconds`
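
To check the `PodDisruptionBudget` cause, one option is to list budgets that currently allow zero disruptions. The sketch below filters three-column text (namespace, name, `.status.disruptionsAllowed`). The function name `blocking_pdbs` is hypothetical, and the `kubectl` invocation in the usage comment needs cluster access.

```bash
# Reads "NAMESPACE NAME ALLOWED" rows on stdin and prints namespace/name for
# every PDB that currently allows zero voluntary disruptions.
blocking_pdbs() {
  # NF >= 3 skips blank or short rows; "<none>" compares as a string and is ignored.
  awk 'NF >= 3 && $3 == 0 { print $1 "/" $2 }'
}

# Usage:
#   kubectl get pdb -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#     | blocking_pdbs
```

A PDB in this list blocks eviction of the pods it selects, which can keep an otherwise idle node from draining.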

### 6. Spot Instance Fallback Not Working

```bash
# Check spot configuration
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '.spotInstances'
```

**Fix**: Set `spotDiversityEnabled` to `true` and set `spotDiversityPriceIncreaseLimitPercent` to a value between 20 and 30 for better availability.

### 7. Evictor Too Aggressive

Symptoms: Pods being evicted too frequently, service disruption.

```bash
kubectl get events --field-selector reason=Evicted -A --sort-by=.lastTimestamp | tail -20
```

**Fix**: Increase evictor cycle interval or switch to non-aggressive mode:
```bash
helm upgrade castai-evictor castai-helm/castai-evictor \
  -n castai-agent \
  --set castai.apiKey="${CASTAI_API_KEY}" \
  --set castai.clusterID="${CASTAI_CLUSTER_ID}" \
  --set evictor.aggressiveMode=false \
  --set evictor.cycleInterval=600
```
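
Before changing the evictor settings, it can help to see where evictions concentrate. The sketch below counts eviction events per namespace from two-column text. The function name `evictions_by_namespace` is hypothetical, and the `kubectl` invocation in the usage comment needs cluster access.

```bash
# Reads "NAMESPACE POD" rows on stdin; prints "COUNT NAMESPACE", busiest first.
evictions_by_namespace() {
  awk 'NF >= 1 { count[$1]++ } END { for (ns in count) print count[ns], ns }' \
    | sort -k1,1nr -k2,2
}

# Usage:
#   kubectl get events -A --field-selector reason=Evicted --no-headers \
#     -o custom-columns=NS:.involvedObject.namespace,POD:.involvedObject.name \
#     | evictions_by_namespace
```

If one namespace dominates, the disruption may be specific to those workloads rather than a global evictor setting.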

### 8. Terraform State Drift

```bash
terraform plan -var-file=environments/prod.tfvars
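# If drift is detected, reconcile state with real infrastructure
# (`terraform refresh` is deprecated in favor of -refresh-only):
terraform apply -refresh-only -var-file=environments/prod.tfvars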
```

**Fix**: Avoid mixing Terraform and console-based policy changes. Pick one source of truth.
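
For automated drift checks, `terraform plan -detailed-exitcode` reports the result through its exit code: 0 for no changes, 1 for an error, and 2 when changes are present. A small sketch that translates the code follows. The function name `plan_verdict` is hypothetical.

```bash
# Translate the exit code of `terraform plan -detailed-exitcode`.
plan_verdict() {
  case "$1" in
    0) echo "in-sync" ;;            # no changes
    1) echo "plan-error" ;;         # the plan itself failed
    2) echo "changes-detected" ;;   # changes present, possibly drift
    *) echo "unknown" ;;
  esac
}

# Usage:
#   terraform plan -detailed-exitcode -var-file=environments/prod.tfvars >/dev/null
#   plan_verdict "$?"
```

Note that exit code 2 also covers intentional, not yet applied configuration changes, so review the plan before treating it as drift.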

### 9. Helm Chart Version Mismatch

```bash
# Check installed versions
helm list -n castai-agent
helm search repo castai-helm --versions | head -10

# Update to latest
helm repo update
helm upgrade castai-agent castai-helm/castai-agent -n castai-agent \
  --reuse-values
```
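
To decide whether an upgrade is needed, the installed and latest chart versions can be compared. The sketch below assumes GNU `sort -V` (version sort) is available. The function name `is_older_version` and the version numbers in the usage comment are hypothetical.

```bash
# Succeeds (exit 0) when version $1 is older than version $2.
# Relies on `sort -V` from GNU coreutils.
is_older_version() {
  local installed="$1" latest="$2"
  [ "$installed" != "$latest" ] &&
    [ "$(printf '%s\n%s\n' "$installed" "$latest" | sort -V | head -n1)" = "$installed" ]
}

# Usage with example version numbers:
#   if is_older_version 0.50.1 0.52.0; then echo "upgrade available"; fi
```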

### 10. Workload Autoscaler Not Recommending

```bash
kubectl logs -n castai-agent deployment/castai-workload-autoscaler --tail=50
```

**Causes:**
- Insufficient metrics data (wait 24h)
- Missing annotation `autoscaling.cast.ai/enabled: "true"`
- Workload autoscaler pod not running

## Escalation Path

1. Collect debug info: Helm releases, agent logs, cluster events
2. Check https://status.cast.ai for platform issues
3. Contact support with cluster ID and screenshots
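
Step 1 can be scripted so the same files are collected every time. The following is a minimal sketch, not the `castai-debug-bundle` skill: the function name `collect_debug_info`, the output file names, and the `KUBECTL` and `HELM` override variables are all choices made for this example.

```bash
# Collect Helm releases, agent logs, and cluster events into a directory.
# KUBECTL and HELM can be overridden, for example to point at a wrapper.
collect_debug_info() {
  local out_dir="${1:-castai-debug}"
  local kubectl="${KUBECTL:-kubectl}" helm="${HELM:-helm}"
  mkdir -p "$out_dir" || return 1
  # Each command writes its own file; a failure is recorded there and does not abort.
  "$helm" list -n castai-agent > "$out_dir/helm-releases.txt" 2>&1 || true
  "$kubectl" logs -n castai-agent deployment/castai-agent --tail=200 \
    > "$out_dir/agent.log" 2>&1 || true
  "$kubectl" get events -A --sort-by=.lastTimestamp \
    > "$out_dir/events.txt" 2>&1 || true
  echo "$out_dir"
}

# Usage:
#   collect_debug_info ./castai-debug
```

Review the collected files before attaching them to a support request, since logs and events can include internal names.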

## Resources

- [CAST AI Troubleshooting](https://docs.cast.ai/docs/casti-ai-components)
- [CAST AI Status](https://status.cast.ai)
- [Autoscaler Checklist](https://docs.cast.ai/docs/autoscaler-checklist)

## Next Steps

For comprehensive diagnostics, see `castai-debug-bundle`.
