groq-observability

Set up observability for Groq integrations: latency histograms, token throughput, rate limit gauges, cost tracking, and Prometheus alerts. Trigger with phrases like "groq monitoring", "groq metrics", "groq observability", "monitor groq", "groq alerts", "groq dashboard".

1,868 stars

Best use case

groq-observability is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Set up observability for Groq integrations: latency histograms, token throughput, rate limit gauges, cost tracking, and Prometheus alerts. Trigger with phrases like "groq monitoring", "groq metrics", "groq observability", "monitor groq", "groq alerts", "groq dashboard".

Teams using groq-observability should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/groq-observability/SKILL.md --create-dirs "https://raw.githubusercontent.com/jeremylongshore/claude-code-plugins-plus-skills/main/plugins/saas-packs/groq-pack/skills/groq-observability/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/groq-observability/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How groq-observability Compares

Feature / Agentgroq-observabilityStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Set up observability for Groq integrations: latency histograms, token throughput, rate limit gauges, cost tracking, and Prometheus alerts. Trigger with phrases like "groq monitoring", "groq metrics", "groq observability", "monitor groq", "groq alerts", "groq dashboard".

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Groq Observability

## Overview
Monitor Groq LPU inference for latency, token throughput, rate limit utilization, and cost. Groq's defining advantage is speed (280-560 tok/s), so latency degradation is the highest-priority signal. The API returns rich timing metadata (`queue_time`, `prompt_time`, `completion_time`) and rate limit headers on every response.

## Key Metrics to Track

| Metric | Type | Source | Why |
|--------|------|--------|-----|
| TTFT (time to first token) | Histogram | Client-side timing | Groq's main value prop |
| Tokens/second | Gauge | `usage.completion_time` | Throughput degradation |
| Total latency | Histogram | Client-side timing | End-to-end performance |
| Rate limit remaining | Gauge | `x-ratelimit-remaining-*` headers | Prevent 429s |
| Token usage | Counter | `usage.total_tokens` | Cost attribution |
| Error rate by code | Counter | Error handler | Availability |
| Estimated cost | Counter | Tokens * model price | Budget tracking |

## Instructions

### Step 1: Instrumented Groq Client
```typescript
import Groq from "groq-sdk";

const groq = new Groq();

interface GroqMetrics {
  model: string;
  latencyMs: number;
  ttftMs: number;
  tokensPerSec: number;
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  queueTimeMs: number;
  estimatedCostUsd: number;
}

const PRICE_PER_1M: Record<string, { input: number; output: number }> = {
  "llama-3.1-8b-instant": { input: 0.05, output: 0.08 },
  "llama-3.3-70b-versatile": { input: 0.59, output: 0.79 },
  "llama-3.3-70b-specdec": { input: 0.59, output: 0.99 },
  "meta-llama/llama-4-scout-17b-16e-instruct": { input: 0.11, output: 0.34 },
};

async function trackedCompletion(
  model: string,
  messages: any[],
  options?: { maxTokens?: number; temperature?: number }
): Promise<{ result: any; metrics: GroqMetrics }> {
  const start = performance.now();

  const result = await groq.chat.completions.create({
    model,
    messages,
    max_tokens: options?.maxTokens ?? 1024,
    temperature: options?.temperature ?? 0.7,
  });

  const latencyMs = performance.now() - start;
  const usage = result.usage!;
  const pricing = PRICE_PER_1M[model] || { input: 0.10, output: 0.10 };

  const metrics: GroqMetrics = {
    model,
    latencyMs: Math.round(latencyMs),
    ttftMs: Math.round(((usage as any).prompt_time ?? 0) * 1000),
    tokensPerSec: Math.round(
      usage.completion_tokens / ((usage as any).completion_time || latencyMs / 1000)
    ),
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    totalTokens: usage.total_tokens,
    queueTimeMs: Math.round(((usage as any).queue_time ?? 0) * 1000),
    estimatedCostUsd:
      (usage.prompt_tokens / 1_000_000) * pricing.input +
      (usage.completion_tokens / 1_000_000) * pricing.output,
  };

  emitMetrics(metrics);
  return { result, metrics };
}
```

### Step 2: Prometheus Metrics
```typescript
import { Histogram, Counter, Gauge } from "prom-client";

const groqLatency = new Histogram({
  name: "groq_latency_ms",
  help: "Groq API latency in milliseconds",
  labelNames: ["model"],
  buckets: [50, 100, 200, 500, 1000, 2000, 5000],
});

const groqTokens = new Counter({
  name: "groq_tokens_total",
  help: "Total tokens processed",
  labelNames: ["model", "direction"],
});

const groqThroughput = new Gauge({
  name: "groq_tokens_per_second",
  help: "Current tokens per second",
  labelNames: ["model"],
});

const groqRateLimitRemaining = new Gauge({
  name: "groq_ratelimit_remaining",
  help: "Remaining rate limit quota",
  labelNames: ["type"],
});

const groqCost = new Counter({
  name: "groq_cost_usd",
  help: "Estimated cost in USD",
  labelNames: ["model"],
});

const groqErrors = new Counter({
  name: "groq_errors_total",
  help: "API errors by status code",
  labelNames: ["model", "status_code"],
});

function emitMetrics(m: GroqMetrics) {
  groqLatency.labels(m.model).observe(m.latencyMs);
  groqTokens.labels(m.model, "input").inc(m.promptTokens);
  groqTokens.labels(m.model, "output").inc(m.completionTokens);
  groqThroughput.labels(m.model).set(m.tokensPerSec);
  groqCost.labels(m.model).inc(m.estimatedCostUsd);
}
```

### Step 3: Rate Limit Header Tracking
```typescript
// Parse rate limit headers from any Groq response
function trackRateLimitHeaders(headers: Record<string, string>) {
  const remaining = {
    requests: parseInt(headers["x-ratelimit-remaining-requests"] || "0"),
    tokens: parseInt(headers["x-ratelimit-remaining-tokens"] || "0"),
  };

  groqRateLimitRemaining.labels("requests").set(remaining.requests);
  groqRateLimitRemaining.labels("tokens").set(remaining.tokens);

  return remaining;
}
```

### Step 4: Prometheus Alert Rules
```yaml
# prometheus/groq-alerts.yml
groups:
  - name: groq
    rules:
      - alert: GroqLatencyHigh
        expr: histogram_quantile(0.95, rate(groq_latency_ms_bucket[5m])) > 1000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Groq P95 latency > 1s (normally < 200ms)"

      - alert: GroqRateLimitCritical
        expr: groq_ratelimit_remaining{type="requests"} < 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Groq rate limit nearly exhausted (< 5 requests remaining)"

      - alert: GroqThroughputDrop
        expr: groq_tokens_per_second < 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Groq throughput dropped below 100 tok/s (expected 280+)"

      - alert: GroqErrorRateHigh
        expr: rate(groq_errors_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Groq API error rate elevated (> 5% of requests)"

      - alert: GroqCostSpike
        expr: increase(groq_cost_usd[1h]) > 10
        labels:
          severity: warning
        annotations:
          summary: "Groq spend exceeded $10 in the past hour"
```

### Step 5: Structured Request Logging
```typescript
// Structured JSON log for each Groq request
function logGroqRequest(metrics: GroqMetrics, requestId?: string) {
  const logEntry = {
    ts: new Date().toISOString(),
    service: "groq",
    model: metrics.model,
    latency_ms: metrics.latencyMs,
    ttft_ms: metrics.ttftMs,
    tokens_per_sec: metrics.tokensPerSec,
    prompt_tokens: metrics.promptTokens,
    completion_tokens: metrics.completionTokens,
    queue_time_ms: metrics.queueTimeMs,
    cost_usd: metrics.estimatedCostUsd.toFixed(6),
    request_id: requestId,
  };

  // Output as structured JSON for log aggregation
  console.log(JSON.stringify(logEntry));
}
```

### Step 6: Dashboard Panels
Key Grafana/dashboard panels for Groq monitoring:

1. **TTFT Distribution** (histogram) -- Groq's main value; alert if > 500ms
2. **Tokens/Second by Model** (time series) -- should be 280-560 range
3. **Rate Limit Utilization** (gauge, 0-100%) -- alert at 90%
4. **Request Volume** (counter rate) -- by model
5. **Error Rate** (counter rate) -- by status code (429, 5xx)
6. **Cumulative Cost** (counter) -- by model, daily/weekly/monthly
7. **Queue Time** (histogram) -- Groq-specific, should be < 50ms

## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| 429 with high retry-after | RPM or TPM exhausted | Implement request queuing |
| Latency spike > 2s | Model overloaded or large prompt | Reduce prompt size or switch to lighter model |
| 503 Service Unavailable | Groq capacity issue | Enable fallback to alternative provider |
| Tokens/sec drop | Streaming disabled or large prompts | Enable streaming for better perceived performance |

## Resources
- [Groq API Reference (usage fields)](https://console.groq.com/docs/api-reference)
- [Groq Rate Limits](https://console.groq.com/docs/rate-limits)
- [prom-client on npm](https://www.npmjs.com/package/prom-client)

## Next Steps
For incident response procedures, see `groq-incident-runbook`.

Related Skills

windsurf-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Monitor Windsurf AI adoption, feature usage, and team productivity metrics. Use when tracking AI feature usage, measuring ROI, setting up dashboards, or analyzing Cascade effectiveness across your team. Trigger with phrases like "windsurf monitoring", "windsurf metrics", "windsurf analytics", "windsurf usage", "windsurf adoption".

webflow-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Set up observability for Webflow integrations — Prometheus metrics for API calls, OpenTelemetry tracing, structured logging with pino, Grafana dashboards, and alerting for rate limits, errors, and latency. Trigger with phrases like "webflow monitoring", "webflow metrics", "webflow observability", "monitor webflow", "webflow alerts", "webflow tracing".

vercel-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Set up Vercel observability with runtime logs, analytics, log drains, and OpenTelemetry tracing. Use when implementing monitoring for Vercel deployments, setting up log drains, or configuring alerting for function errors and performance. Trigger with phrases like "vercel monitoring", "vercel metrics", "vercel observability", "vercel logs", "vercel alerts", "vercel tracing".

veeva-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Veeva Vault observability for enterprise operations. Use when implementing advanced Veeva Vault patterns. Trigger: "veeva observability".

vastai-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Monitor Vast.ai GPU instance health, utilization, and costs. Use when setting up monitoring dashboards, configuring alerts, or tracking GPU utilization and spending. Trigger with phrases like "vastai monitoring", "vastai metrics", "vastai observability", "monitor vastai", "vastai alerts".

twinmind-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Monitor TwinMind transcription quality, meeting coverage, action item extraction rates, and memory vault health. Use when implementing observability, or managing TwinMind meeting AI operations. Trigger with phrases like "twinmind observability", "twinmind observability".

speak-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Monitor Speak API health, assessment latency, session metrics, and pronunciation score distributions. Use when implementing observability, or managing Speak language learning platform operations. Trigger with phrases like "speak observability", "speak observability".

snowflake-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Set up Snowflake observability using ACCOUNT_USAGE views, alerts, and external monitoring. Use when implementing Snowflake monitoring dashboards, setting up query performance tracking, or configuring alerting for warehouse and pipeline health. Trigger with phrases like "snowflake monitoring", "snowflake metrics", "snowflake observability", "snowflake dashboard", "snowflake alerts".

shopify-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Set up observability for Shopify app integrations with query cost tracking, rate limit monitoring, webhook delivery metrics, and structured logging. Trigger with phrases like "shopify monitoring", "shopify metrics", "shopify observability", "monitor shopify API", "shopify alerts", "shopify dashboard".

salesforce-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Set up observability for Salesforce integrations with API limit monitoring, error tracking, and alerting. Use when implementing monitoring for Salesforce operations, tracking API consumption, or configuring alerting for Salesforce integration health. Trigger with phrases like "salesforce monitoring", "salesforce metrics", "salesforce observability", "monitor salesforce", "salesforce alerts", "salesforce API usage dashboard".

retellai-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Retell AI observability — AI voice agent and phone call automation. Use when working with Retell AI for voice agents, phone calls, or telephony. Trigger with phrases like "retell observability", "retellai-observability", "voice agent".

replit-observability

1868
from jeremylongshore/claude-code-plugins-plus-skills

Monitor Replit deployments with health checks, uptime tracking, resource usage, and alerting. Use when setting up monitoring for Replit apps, building health dashboards, or configuring alerting for deployment health and performance. Trigger with phrases like "replit monitoring", "replit metrics", "replit observability", "monitor replit", "replit alerts", "replit uptime".