coreweave-observability

Set up GPU monitoring and observability for CoreWeave workloads. Use when implementing GPU metrics dashboards, configuring alerts, or tracking inference latency and throughput. Trigger with phrases like "coreweave monitoring", "coreweave observability", "coreweave gpu metrics", "coreweave grafana".

25 stars

byComeOnOliver

View on GitHub Installation ↓

Best use case

coreweave-observability is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using coreweave-observability should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/coreweave-observability/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/jeremylongshore/claude-code-plugins-plus-skills/coreweave-observability/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/coreweave-observability/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How coreweave-observability Compares

Feature / Agent	coreweave-observability	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# CoreWeave Observability

## GPU Metrics (DCGM Exporter)

CKS clusters come with DCGM exporter pre-installed. Key metrics:

| Metric | Description |
|--------|-------------|
| `DCGM_FI_DEV_GPU_UTIL` | GPU core utilization % |
| `DCGM_FI_DEV_FB_USED` | GPU memory used (MB) |
| `DCGM_FI_DEV_FB_FREE` | GPU memory free (MB) |
| `DCGM_FI_DEV_POWER_USAGE` | Power consumption (W) |
| `DCGM_FI_DEV_GPU_TEMP` | GPU temperature (C) |

## Prometheus Alert Rules

```yaml
groups:
  - name: coreweave-gpu
    rules:
      - alert: GPUUtilizationLow
        expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "GPU utilization below 20% for 30min -- consider scaling down"

      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "GPU memory >95% -- risk of OOM"

      - alert: InferencePodDown
        expr: kube_deployment_status_replicas_available{deployment=~".*inference.*"} == 0
        for: 2m
        labels: { severity: critical }
```

## Resources

- [CoreWeave Observability](https://www.coreweave.com/observability)

## Next Steps

For incident response, see `coreweave-incident-runbook`.

Related Skills

exa-observability

from ComeOnOliver/skillshub

Set up monitoring, metrics, and alerting for Exa search integrations. Use when implementing monitoring for Exa operations, building dashboards, or configuring alerting for search quality and latency. Trigger with phrases like "exa monitoring", "exa metrics", "exa observability", "monitor exa", "exa alerts", "exa dashboard".

evernote-observability

from ComeOnOliver/skillshub

Implement observability for Evernote integrations. Use when setting up monitoring, logging, tracing, or alerting for Evernote applications. Trigger with phrases like "evernote monitoring", "evernote logging", "evernote metrics", "evernote observability".

documenso-observability

from ComeOnOliver/skillshub

Implement monitoring, logging, and tracing for Documenso integrations. Use when setting up observability, implementing metrics collection, or debugging production issues. Trigger with phrases like "documenso monitoring", "documenso metrics", "documenso logging", "documenso tracing", "documenso observability".

deepgram-observability

from ComeOnOliver/skillshub

Set up comprehensive observability for Deepgram integrations. Use when implementing monitoring, setting up dashboards, or configuring alerting for Deepgram integration health. Trigger: "deepgram monitoring", "deepgram metrics", "deepgram observability", "monitor deepgram", "deepgram alerts", "deepgram dashboard".

databricks-observability

from ComeOnOliver/skillshub

Set up comprehensive observability for Databricks with metrics, traces, and alerts. Use when implementing monitoring for Databricks jobs, setting up dashboards, or configuring alerting for pipeline health. Trigger with phrases like "databricks monitoring", "databricks metrics", "databricks observability", "monitor databricks", "databricks alerts", "databricks logging".

customerio-observability

from ComeOnOliver/skillshub

Set up Customer.io monitoring and observability. Use when implementing metrics, structured logging, alerting, or Grafana dashboards for Customer.io integrations. Trigger: "customer.io monitoring", "customer.io metrics", "customer.io dashboard", "customer.io alerts", "customer.io observability".

coreweave-webhooks-events

from ComeOnOliver/skillshub

Monitor CoreWeave cluster events and GPU workload status. Use when tracking pod lifecycle events, monitoring GPU utilization, or alerting on inference service health changes. Trigger with phrases like "coreweave events", "coreweave monitoring", "coreweave pod alerts", "coreweave gpu monitoring".

coreweave-upgrade-migration

from ComeOnOliver/skillshub

Upgrade CoreWeave deployments and migrate between GPU types. Use when migrating from A100 to H100, upgrading CUDA versions, or updating inference server versions. Trigger with phrases like "upgrade coreweave", "coreweave gpu migration", "coreweave cuda upgrade", "migrate coreweave".

coreweave-security-basics

from ComeOnOliver/skillshub

Secure CoreWeave deployments with RBAC, network policies, and secrets management. Use when hardening GPU workloads, managing model access, or configuring namespace isolation. Trigger with phrases like "coreweave security", "coreweave rbac", "secure coreweave", "coreweave secrets".

coreweave-sdk-patterns

from ComeOnOliver/skillshub

Production-ready patterns for CoreWeave GPU workload management with kubectl and Python. Use when building inference clients, managing GPU deployments programmatically, or creating reusable CoreWeave deployment templates. Trigger with phrases like "coreweave patterns", "coreweave client", "coreweave Python", "coreweave deployment template".

coreweave-reference-architecture

from ComeOnOliver/skillshub

Reference architecture for CoreWeave GPU cloud deployments. Use when designing ML infrastructure, planning multi-model serving, or establishing CoreWeave deployment standards. Trigger with phrases like "coreweave architecture", "coreweave design", "coreweave infrastructure", "coreweave best practices".

coreweave-rate-limits

from ComeOnOliver/skillshub

Handle CoreWeave API and GPU quota limits. Use when hitting quota limits, managing GPU resource allocation, or implementing request queuing for inference endpoints. Trigger with phrases like "coreweave quota", "coreweave limits", "coreweave gpu allocation", "coreweave throttle".