llm-evaluate

Evaluates LLMs by cost/performance ratio. Fetches current pricing and recommends the optimal model for your use case. Use during project init or when optimizing costs.

Best use case

llm-evaluate is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using llm-evaluate should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/llm-evaluate/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/llm-evaluate/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/llm-evaluate/SKILL.md inside your project
  3. Restart your AI agent; it will auto-discover the skill

How llm-evaluate Compares

| Feature / Agent | llm-evaluate | Standard Approach |
|-----------------|--------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# LLM Model Evaluation

Evaluates LLMs based on their current price/performance ratio.

---

## When to use

- During `/init-project`, as part of the complexity assessment
- When optimizing the costs of existing projects
- When new models are released (check regularly)
- Before larger production deployments

---

## Step 1: Understand the use case

If no argument was passed, ask:

```
What is your use case?

Examples:
• "Customer service chat bot" (high volume, fast responses)
• "Document analysis" (long context, reasoning)
• "Code generation" (precision matters)
• "GDPR-compliant EU app" (compliance)
• "Budget project" (minimize costs)
```

---

## Step 2: Fetch current prices

**IMPORTANT:** Prices change frequently. Fetch current data.

### 2.1 Web search for current prices

Search for current prices with WebSearch:

```
Query: "[Provider] API pricing [current year]"
```

For each provider:
- Anthropic Claude pricing
- OpenAI GPT pricing
- Google Gemini pricing
- DeepSeek pricing
- xAI Grok pricing
- Mistral pricing
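As a minimal sketch, the query pattern can be filled in per provider like this. `buildPricingQueries` is a hypothetical helper, not part of the skill, and passing the year as a parameter is an assumption made so the query does not go stale.

```typescript
// Providers listed in Step 2.1.
const PROVIDERS: readonly string[] = [
  "Anthropic Claude",
  "OpenAI GPT",
  "Google Gemini",
  "DeepSeek",
  "xAI Grok",
  "Mistral",
];

// Hypothetical helper: fills the "[Provider] API pricing [year]" pattern
// once per provider.
function buildPricingQueries(
  year: number,
  providers: readonly string[] = PROVIDERS,
): string[] {
  return providers.map((provider) => `${provider} API pricing ${year}`);
}

// One search query per provider, using the current calendar year.
const queries = buildPricingQueries(new Date().getFullYear());
console.log(queries.join("\n"));
```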

### 2.2 Pricing endpoints (if available)

Some providers have public pricing pages:

| Provider | Pricing URL |
|----------|-------------|
| Anthropic | https://www.anthropic.com/pricing |
| OpenAI | https://openai.com/api/pricing |
| Google | https://ai.google.dev/pricing |
| DeepSeek | https://platform.deepseek.com/api-docs/pricing |
| Mistral | https://mistral.ai/technology/#pricing |
| xAI | https://x.ai/api |

### 2.3 Fallback: Cached Reference

If the web fetch fails, use `.claude/reference/llm-configuration.md` as a fallback, and point out that its data may be outdated.
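The fallback decision could be sketched as a pure function, kept separate from the actual fetching. The type and field names here (`ModelPrice`, `PriceResult`, `resolvePrices`) are illustrative assumptions; the skill itself does not define a data format.

```typescript
// Illustrative shape of one price entry (an assumption, not the skill's format).
interface ModelPrice {
  model: string;
  inputPerMTok: number; // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

interface PriceResult {
  prices: ModelPrice[];
  source: "live" | "cache";
  warning?: string;
}

// Prefer live data; otherwise return the cached reference and flag it
// as possibly outdated so the caller can tell the user.
function resolvePrices(
  live: ModelPrice[] | null,
  cached: ModelPrice[],
): PriceResult {
  if (live !== null && live.length > 0) {
    return { prices: live, source: "live" };
  }
  return {
    prices: cached,
    source: "cache",
    warning:
      "Live pricing unavailable; using the cached reference, which may be outdated.",
  };
}

// Simulate a failed web fetch (live === null) with a placeholder cache entry.
const cachedReference: ModelPrice[] = [
  { model: "model-a", inputPerMTok: 1, outputPerMTok: 2 },
];
const result = resolvePrices(null, cachedReference);
console.log(result.source, result.warning ?? "");
```

Keeping the decision free of I/O makes it easy to test without network access.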

---

## Step 3: Score the models

### 3.1 Evaluation criteria

| Criterion | Weight | Description |
|-----------|------------|--------------|
| **Cost** | 30% | Input + output token prices |
| **Quality** | 30% | Benchmark scores, practical experience |
| **Latency** | 20% | Time to first token, throughput |
| **Context** | 10% | Maximum context window |
| **Features** | 10% | Vision, tools, streaming |
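One way to turn these weights into a single score is a weighted sum. This is a sketch under a stated assumption: each sub-score is already normalized to 0..100 with higher meaning better (so a cheap model gets a high cost score). The skill does not specify how sub-scores are derived, and the sample values below are made up.

```typescript
type Criterion = "cost" | "quality" | "latency" | "context" | "features";

// Default weights from the criteria table; they sum to 1.0.
const DEFAULT_WEIGHTS: Record<Criterion, number> = {
  cost: 0.3,
  quality: 0.3,
  latency: 0.2,
  context: 0.1,
  features: 0.1,
};

// Weighted sum of sub-scores, rounded to an integer on a 0..100 scale.
// Assumes every sub-score is pre-normalized to 0..100, higher = better.
function weightedScore(
  subScores: Record<Criterion, number>,
  weights: Record<Criterion, number> = DEFAULT_WEIGHTS,
): number {
  const criteria = Object.keys(weights) as Criterion[];
  const total = criteria.reduce((sum, c) => sum + weights[c] * subScores[c], 0);
  return Math.round(total);
}

// Made-up sub-scores for illustration only.
const example = weightedScore({
  cost: 90,
  quality: 70,
  latency: 95,
  context: 80,
  features: 60,
});
console.log(example); // 27 + 21 + 19 + 8 + 6 = 81
```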

### 3.2 Use Case Mapping

| Use Case | Important | Unimportant |
|----------|---------|-----------|
| **Chat-Bot** | Latency, cost | Context |
| **Document analysis** | Context, quality | Latency |
| **Code-Gen** | Quality | Cost |
| **High-Volume** | Cost, latency | Quality |
| **GDPR** | Compliance | Cost |
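The mapping could be applied by boosting the weights of important criteria, damping unimportant ones, and renormalizing so the weights still sum to 1. The multipliers (2 and 0.5) are arbitrary illustrative values, not taken from the skill. The GDPR row is left out of this sketch because compliance is not one of the weighted criteria; treating it as a hard filter before scoring is one possible reading.

```typescript
type Criterion = "cost" | "quality" | "latency" | "context" | "features";
type Weights = Record<Criterion, number>;

// Base weights from the evaluation criteria in 3.1.
const BASE_WEIGHTS: Weights = {
  cost: 0.3,
  quality: 0.3,
  latency: 0.2,
  context: 0.1,
  features: 0.1,
};

interface Emphasis {
  important: Criterion[];
  unimportant: Criterion[];
}

// Rows of the mapping table that correspond to weighted criteria.
const USE_CASES: Record<string, Emphasis> = {
  "chat-bot": { important: ["latency", "cost"], unimportant: ["context"] },
  "document-analysis": { important: ["context", "quality"], unimportant: ["latency"] },
  "code-gen": { important: ["quality"], unimportant: ["cost"] },
  "high-volume": { important: ["cost", "latency"], unimportant: ["quality"] },
};

// Hypothetical helper: boost/damp the base weights, then renormalize.
// Unknown use cases fall back to the unmodified base weights.
function adjustWeights(useCase: string, boost = 2, damp = 0.5): Weights {
  const emphasis = USE_CASES[useCase];
  const raw: Weights = { ...BASE_WEIGHTS };
  if (!emphasis) return raw;
  for (const c of emphasis.important) raw[c] *= boost;
  for (const c of emphasis.unimportant) raw[c] *= damp;
  const sum = Object.values(raw).reduce((a, b) => a + b, 0);
  const normalized: Weights = { ...raw };
  for (const c of Object.keys(raw) as Criterion[]) normalized[c] = raw[c] / sum;
  return normalized;
}

console.log(adjustWeights("chat-bot"));
```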

---

## Step 4: Output the recommendation

### 4.1 Recommendation template

```
┌─────────────────────────────────────────────────────────────────────────────┐
│  LLM EVALUATION - [Use Case]                                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  📅 Prices as of: [query date]                                              │
│                                                                             │
│  TOP 3 RECOMMENDATIONS:                                                     │
│                                                                             │
│  🥇 #1: [Model]                                                             │
│      Provider: [Provider]                                                   │
│      Input:    $[X]/1M tokens                                               │
│      Output:   $[X]/1M tokens                                               │
│      Context:  [X]K                                                         │
│      Score:    [X]/100 (based on use case)                                  │
│      Why:      [Rationale]                                                  │
│                                                                             │
│  🥈 #2: [Model]                                                             │
│      ...                                                                    │
│                                                                             │
│  🥉 #3: [Model]                                                             │
│      ...                                                                    │
│                                                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  COST ESTIMATE (at 1M requests/month, 1000 tokens avg):                     │
│                                                                             │
│  Model #1: ~$[X]/month                                                      │
│  Model #2: ~$[X]/month                                                      │
│  Model #3: ~$[X]/month                                                      │
│                                                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  FALLBACK STRATEGY:                                                         │
│                                                                             │
│  Primary:  [Model #1]                                                       │
│  Fallback: [Model #2]                                                       │
│  Budget:   [Model #3]                                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
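The cost estimate in the template can be worked out as below. The template gives only an average of 1000 tokens per request, so the split into 750 input and 250 output tokens is an assumption for illustration. The prices are the Gemini 2.0 Flash figures from the reference tables.

```typescript
// Monthly cost in USD for a given request volume and per-1M-token prices.
function monthlyCostUsd(
  requestsPerMonth: number,
  avgInputTokens: number,
  avgOutputTokens: number,
  inputPerMTok: number, // USD per 1M input tokens
  outputPerMTok: number, // USD per 1M output tokens
): number {
  const inputCost = ((requestsPerMonth * avgInputTokens) / 1_000_000) * inputPerMTok;
  const outputCost = ((requestsPerMonth * avgOutputTokens) / 1_000_000) * outputPerMTok;
  return inputCost + outputCost;
}

// 1M requests/month, assumed 750 input + 250 output tokens per request,
// at $0.10 input and $0.40 output per 1M tokens.
const cost = monthlyCostUsd(1_000_000, 750, 250, 0.1, 0.4);
console.log(`~$${cost.toFixed(2)}/month`); // 750M * $0.10/1M + 250M * $0.40/1M = $175.00
```

Because output tokens are usually priced higher than input tokens, the assumed split can change the estimate substantially; it is worth stating it next to the result.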

### 4.2 Generate a Portkey config

If requested, generate the Portkey configuration:

```typescript
// Recommended Portkey configuration for [Use Case]
const config = {
  strategy: {
    mode: 'fallback',
  },
  targets: [
    { provider: '[primary]', model: '[model]' },
    { provider: '[fallback]', model: '[model]' },
  ],
  cache: {
    mode: 'semantic',
    ttl: 3600,
  },
};
```
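A config of this shape could be generated from the ranked recommendations. `buildFallbackConfig` and the `GatewayConfig` type are hypothetical: they only mirror the fields shown in the snippet above and are not validated against Portkey's actual config schema. The provider and model strings are placeholders.

```typescript
interface Target {
  provider: string;
  model: string;
}

// Mirrors the fields of the config snippet; not an official Portkey type.
interface GatewayConfig {
  strategy: { mode: "fallback" };
  targets: Target[];
  cache: { mode: "semantic"; ttl: number };
}

// Hypothetical helper: the first ranked target becomes the primary,
// the remaining ones are tried in order as fallbacks.
function buildFallbackConfig(ranked: Target[], cacheTtlSeconds = 3600): GatewayConfig {
  if (ranked.length < 2) {
    throw new Error("A fallback strategy needs at least two targets");
  }
  return {
    strategy: { mode: "fallback" },
    targets: ranked.map(({ provider, model }) => ({ provider, model })),
    cache: { mode: "semantic", ttl: cacheTtlSeconds },
  };
}

// Placeholder identifiers; substitute the evaluated primary and fallback models.
const generated = buildFallbackConfig([
  { provider: "provider-a", model: "model-a" },
  { provider: "provider-b", model: "model-b" },
]);
console.log(JSON.stringify(generated, null, 2));
```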

---

## Step 5: Update the documentation

If significant price changes were found:

1. Point them out to the user
2. Ask whether `.claude/reference/llm-configuration.md` should be updated
3. If "yes": update the price tables
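The skill does not define what counts as "significant". As a sketch, the check below compares cached and live prices and reports any change of at least 10%. The threshold, the type names, and the sample figures are all assumptions.

```typescript
type Prices = Record<string, { input: number; output: number }>;

interface PriceChange {
  model: string;
  field: "input" | "output";
  oldPrice: number;
  newPrice: number;
  changePct: number; // negative = price drop
}

// Hypothetical helper: list price fields whose relative change meets
// the threshold. Models missing from the live data are skipped.
function findSignificantChanges(
  cached: Prices,
  live: Prices,
  thresholdPct = 10,
): PriceChange[] {
  const changes: PriceChange[] = [];
  for (const [model, oldEntry] of Object.entries(cached)) {
    const newEntry = live[model];
    if (!newEntry) continue;
    for (const field of ["input", "output"] as const) {
      const oldPrice = oldEntry[field];
      const newPrice = newEntry[field];
      if (oldPrice === 0) continue; // avoid division by zero
      const changePct = ((newPrice - oldPrice) / oldPrice) * 100;
      if (Math.abs(changePct) >= thresholdPct) {
        changes.push({ model, field, oldPrice, newPrice, changePct });
      }
    }
  }
  return changes;
}

// Made-up figures: the input price of "model-a" was halved.
const changes = findSignificantChanges(
  { "model-a": { input: 5, output: 15 } },
  { "model-a": { input: 2.5, output: 15 } },
);
console.log(changes);
```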

---

## Automatic interval checks

### Monthly reminder

This skill should be run regularly:

```
Recommendation: run /llm-evaluate monthly to:
- Discover new models
- Account for price changes
- Review cost optimizations
```
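A monthly cadence could be checked with a small date comparison. `isEvaluationDue` is a hypothetical helper, and treating "monthly" as 30 days is an assumption; the skill does not say where the last run date would be stored.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns true when the last run is at least `intervalDays` old,
// or when the skill has never been run.
function isEvaluationDue(
  lastRun: Date | null,
  now: Date,
  intervalDays = 30,
): boolean {
  if (lastRun === null) return true;
  return now.getTime() - lastRun.getTime() >= intervalDays * DAY_MS;
}

const lastRun = new Date("2026-01-01T00:00:00Z");
console.log(isEvaluationDue(lastRun, new Date("2026-02-05T00:00:00Z"))); // 35 days later: true
console.log(isEvaluationDue(lastRun, new Date("2026-01-15T00:00:00Z"))); // 14 days later: false
```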

### At project init

During `/init-project`, this skill is invoked automatically as part of the complexity assessment (Step 0.2) to recommend the optimal model for the use case.

---

## Model database (reference)

### Anthropic

| Model | Input/1M | Output/1M | Context | Strengths |
|--------|----------|-----------|---------|---------|
| Claude Opus 4.5 | $15 | $75 | 200K | Best reasoning |
| Claude Sonnet 4 | $3 | $15 | 200K | Best coding |
| Claude Haiku 3.5 | $0.25 | $1.25 | 200K | Fast, cheap |

### OpenAI

| Model | Input/1M | Output/1M | Context | Strengths |
|--------|----------|-----------|---------|---------|
| GPT-4o | $5 | $15 | 128K | Multimodal |
| GPT-4o-mini | $0.15 | $0.60 | 128K | Budget GPT-4 |
| o1 | $15 | $60 | 200K | Deep reasoning |

### Google

| Model | Input/1M | Output/1M | Context | Strengths |
|--------|----------|-----------|---------|---------|
| Gemini 2.0 Pro | $7 | $21 | 1M | Long context |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M | Fast, cheap |
| Gemini Flash 8B | $0.04 | $0.15 | 1M | Ultra cheap |

### DeepSeek

| Model | Input/1M | Output/1M | Context | Strengths |
|--------|----------|-----------|---------|---------|
| DeepSeek V3 | $0.27 | $1.10 | 64K | Open-weight |
| DeepSeek R1 | $0.55 | $2.19 | 64K | Reasoning |

### xAI

| Model | Input/1M | Output/1M | Context | Strengths |
|--------|----------|-----------|---------|---------|
| Grok 2 | $2 | $10 | 128K | Realtime data |
| Grok 2 Vision | $2 | $10 | 32K | Image understanding |

### Mistral

| Model | Input/1M | Output/1M | Context | Strengths |
|--------|----------|-----------|---------|---------|
| Mistral Large | $2 | $6 | 128K | EU-hosted |
| Codestral | $0.30 | $0.90 | 256K | Code |
| Mistral Small | $0.20 | $0.60 | 128K | Budget |

---

## Example run

```
User: /llm-evaluate Chat bot for customer service

Claude:
┌─────────────────────────────────────────────────────────────────────────────┐
│  LLM EVALUATION - Customer Service Chat Bot                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  📅 Prices as of: January 2026                                              │
│                                                                             │
│  TOP 3 RECOMMENDATIONS:                                                     │
│                                                                             │
│  🥇 #1: Claude Haiku 3.5                                                    │
│      Provider: Anthropic                                                    │
│      Input:    $0.25/1M tokens                                              │
│      Output:   $1.25/1M tokens                                              │
│      Context:  200K                                                         │
│      Score:    92/100                                                       │
│      Why:      Fast, cheap, good quality for chat                           │
│                                                                             │
│  🥈 #2: Gemini 2.0 Flash                                                    │
│      Provider: Google                                                       │
│      Input:    $0.10/1M tokens                                              │
│      Output:   $0.40/1M tokens                                              │
│      Context:  1M                                                           │
│      Score:    88/100                                                       │
│      Why:      Even cheaper, but less consistent                            │
│                                                                             │
│  🥉 #3: DeepSeek V3                                                         │
│      Provider: DeepSeek                                                     │
│      Input:    $0.27/1M tokens                                              │
│      Output:   $1.10/1M tokens                                              │
│      Context:  64K                                                          │
│      Score:    82/100                                                       │
│      Why:      Very cheap, but China-based (compliance?)                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## Integration with init-project

This skill is invoked automatically during `/init-project`:

1. User describes the project (Step 0.1)
2. The complexity level is determined (Step 0.2)
3. **`/llm-evaluate` runs in the background**
4. Stack recommendation including the LLM recommendation (Step 0.3)

---

**Version:** 1.0
**Last Updated:** January 2026
