vastai-observability

Monitor Vast.ai GPU instance health, utilization, and costs. Use when setting up monitoring dashboards, configuring alerts, or tracking GPU utilization and spending. Trigger with phrases like "vastai monitoring", "vastai metrics", "vastai observability", "monitor vastai", "vastai alerts".

1,868 stars

Best use case

vastai-observability is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Monitor Vast.ai GPU instance health, utilization, and costs. Use when setting up monitoring dashboards, configuring alerts, or tracking GPU utilization and spending. Trigger with phrases like "vastai monitoring", "vastai metrics", "vastai observability", "monitor vastai", "vastai alerts".

Teams using vastai-observability should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/vastai-observability/SKILL.md --create-dirs "https://raw.githubusercontent.com/jeremylongshore/claude-code-plugins-plus-skills/main/plugins/saas-packs/vastai-pack/skills/vastai-observability/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/vastai-observability/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How vastai-observability Compares

Feature / Agentvastai-observabilityStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Monitor Vast.ai GPU instance health, utilization, and costs. Use when setting up monitoring dashboards, configuring alerts, or tracking GPU utilization and spending. Trigger with phrases like "vastai monitoring", "vastai metrics", "vastai observability", "monitor vastai", "vastai alerts".

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Vast.ai Observability

## Overview
Monitor Vast.ai GPU instance health, utilization, and costs. Key metrics: GPU utilization (idle GPUs waste $0.20-$4.00/hr), instance uptime, training progress, cost accumulation, and spot preemption events.

## Prerequisites
- Vast.ai account with active instances
- `vastai` CLI installed
- Optional: Prometheus, Grafana, or Datadog for dashboarding

## Instructions

### Step 1: Instance Metrics Collector

```python
import subprocess, json, time
from datetime import datetime

class VastMetricsCollector:
    def __init__(self, output_file="vast_metrics.jsonl"):
        self.output_file = output_file

    def collect(self):
        result = subprocess.run(
            ["vastai", "show", "instances", "--raw"],
            capture_output=True, text=True)
        instances = json.loads(result.stdout)

        metrics = {
            "timestamp": datetime.utcnow().isoformat(),
            "total_instances": len(instances),
            "running": 0, "total_hourly_cost": 0,
            "instances": [],
        }

        for inst in instances:
            status = inst.get("actual_status", "unknown")
            dph = inst.get("dph_total", 0)
            if status == "running":
                metrics["running"] += 1
                metrics["total_hourly_cost"] += dph

            metrics["instances"].append({
                "id": inst["id"],
                "gpu": inst.get("gpu_name"),
                "status": status,
                "dph": dph,
                "gpu_util": inst.get("gpu_util", 0),
                "gpu_temp": inst.get("gpu_temp", 0),
            })

        with open(self.output_file, "a") as f:
            f.write(json.dumps(metrics) + "\n")

        return metrics

    def run(self, interval=60):
        while True:
            m = self.collect()
            print(f"[{m['timestamp']}] Running: {m['running']} | "
                  f"Cost: ${m['total_hourly_cost']:.3f}/hr")
            time.sleep(interval)
```

### Step 2: Alert Conditions

```python
def check_alerts(metrics):
    alerts = []

    # Idle GPU alert (running but <10% utilization)
    for inst in metrics["instances"]:
        if inst["status"] == "running" and inst["gpu_util"] < 10:
            alerts.append(f"IDLE: Instance {inst['id']} GPU util={inst['gpu_util']}% "
                         f"(wasting ${inst['dph']:.3f}/hr)")

    # High temperature alert
    for inst in metrics["instances"]:
        if inst.get("gpu_temp", 0) > 85:
            alerts.append(f"HOT: Instance {inst['id']} GPU temp={inst['gpu_temp']}C")

    # Budget alert
    daily_projection = metrics["total_hourly_cost"] * 24
    if daily_projection > 100:
        alerts.append(f"BUDGET: Projected daily cost ${daily_projection:.2f}")

    return alerts
```

### Step 3: Remote GPU Monitoring

```bash
# SSH into instance and collect nvidia-smi metrics
ssh -p $PORT root@$HOST "nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw --format=csv,noheader,nounits"
# Output: 95, 20480, 24576, 72, 285
```

### Step 4: Prometheus Exporter (Optional)

```python
from prometheus_client import Gauge, start_http_server

gpu_util = Gauge("vastai_gpu_utilization", "GPU utilization %", ["instance_id", "gpu_name"])
hourly_cost = Gauge("vastai_hourly_cost", "Total hourly cost USD")
instance_count = Gauge("vastai_instance_count", "Running instances")

def export_metrics(metrics):
    instance_count.set(metrics["running"])
    hourly_cost.set(metrics["total_hourly_cost"])
    for inst in metrics["instances"]:
        if inst["status"] == "running":
            gpu_util.labels(inst["id"], inst["gpu"]).set(inst["gpu_util"])

start_http_server(9090)  # Prometheus scrape target
```

## Output
- Metrics collector with JSONL output
- Alert conditions (idle GPU, high temp, budget)
- Remote GPU monitoring via SSH + nvidia-smi
- Optional Prometheus exporter for Grafana dashboards

## Error Handling
| Alert | Threshold | Response |
|-------|-----------|----------|
| Idle GPU | util < 10% for > 10 min | Investigate or destroy instance |
| High temp | > 85C sustained | Reduce workload or report to host |
| Budget exceeded | Projected daily > $100 | Destroy non-critical instances |
| Instance offline | Status changed from running | Trigger auto-recovery |

## Resources
- [Vast.ai CLI](https://docs.vast.ai/cli/get-started)
- [NVIDIA nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface)

## Next Steps
For incident response procedures, see `vastai-incident-runbook`.

## Examples

**Quick dashboard**: Run `VastMetricsCollector().run(interval=30)` in tmux on a monitoring server. Pipe alerts to Slack via webhook.

**Cost tracking**: Parse `vast_metrics.jsonl` to plot hourly cost over time and identify spending patterns.

Related Skills

windsurf-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Monitor Windsurf AI adoption, feature usage, and team productivity metrics. Use when tracking AI feature usage, measuring ROI, setting up dashboards, or analyzing Cascade effectiveness across your team. Trigger with phrases like "windsurf monitoring", "windsurf metrics", "windsurf analytics", "windsurf usage", "windsurf adoption".

webflow-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Set up observability for Webflow integrations — Prometheus metrics for API calls, OpenTelemetry tracing, structured logging with pino, Grafana dashboards, and alerting for rate limits, errors, and latency. Trigger with phrases like "webflow monitoring", "webflow metrics", "webflow observability", "monitor webflow", "webflow alerts", "webflow tracing".

vercel-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Set up Vercel observability with runtime logs, analytics, log drains, and OpenTelemetry tracing. Use when implementing monitoring for Vercel deployments, setting up log drains, or configuring alerting for function errors and performance. Trigger with phrases like "vercel monitoring", "vercel metrics", "vercel observability", "vercel logs", "vercel alerts", "vercel tracing".

veeva-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Veeva Vault observability for enterprise operations. Use when implementing advanced Veeva Vault patterns. Trigger: "veeva observability".

vastai-webhooks-events

1868
from jeremylongshore/claude-code-plugins-plus-skills

Build event-driven workflows around Vast.ai instance lifecycle events. Use when monitoring instance status changes, implementing auto-recovery, or building event-driven GPU orchestration. Trigger with phrases like "vastai events", "vastai instance monitoring", "vastai status changes", "vastai lifecycle events".

vastai-upgrade-migration

1868
from jeremylongshore/claude-code-plugins-plus-skills

Upgrade Vast.ai CLI, migrate API versions, and handle breaking changes. Use when upgrading vastai CLI, detecting deprecations, or migrating between API versions. Trigger with phrases like "upgrade vastai", "vastai migration", "vastai breaking changes", "update vastai CLI".

vastai-security-basics

1868
from jeremylongshore/claude-code-plugins-plus-skills

Apply Vast.ai security best practices for API keys and instance access. Use when securing API keys, hardening SSH access to GPU instances, or auditing Vast.ai security configuration. Trigger with phrases like "vastai security", "vastai secrets", "secure vastai", "vastai API key security", "vastai ssh security".

vastai-sdk-patterns

1868
from jeremylongshore/claude-code-plugins-plus-skills

Apply production-ready Vast.ai SDK patterns for Python and REST API. Use when implementing Vast.ai integrations, refactoring SDK usage, or establishing coding standards for GPU cloud operations. Trigger with phrases like "vastai SDK patterns", "vastai best practices", "vastai code patterns", "idiomatic vastai".

vastai-reference-architecture

1868
from jeremylongshore/claude-code-plugins-plus-skills

Implement Vast.ai reference architecture for GPU compute workflows. Use when designing ML training pipelines, structuring GPU orchestration, or establishing architecture patterns for Vast.ai applications. Trigger with phrases like "vastai architecture", "vastai design pattern", "vastai project structure", "vastai ml pipeline".

vastai-rate-limits

1868
from jeremylongshore/claude-code-plugins-plus-skills

Handle Vast.ai API rate limits with backoff and request optimization. Use when encountering 429 errors, implementing retry logic, or optimizing API request throughput. Trigger with phrases like "vastai rate limit", "vastai throttling", "vastai 429", "vastai retry", "vastai backoff".

vastai-prod-checklist

1868
from jeremylongshore/claude-code-plugins-plus-skills

Execute Vast.ai production deployment checklist for GPU workloads. Use when deploying training pipelines to production, preparing for large-scale GPU jobs, or auditing production readiness. Trigger with phrases like "vastai production", "deploy vastai", "vastai go-live", "vastai launch checklist".

vastai-performance-tuning

1868
from jeremylongshore/claude-code-plugins-plus-skills

Optimize Vast.ai GPU instance selection, startup time, and training throughput. Use when optimizing instance selection, reducing startup latency, or maximizing GPU utilization on rented hardware. Trigger with phrases like "vastai performance", "optimize vastai", "vastai slow", "vastai gpu utilization", "vastai throughput".