AI Engineering & LLM Operations

llm-ops

LLM Operations -- RAG, embeddings, vector databases, fine-tuning, advanced prompt engineering, LLM costs, quality evals, and production-ready AI architectures.

31,392 stars
Complexity: medium

About this skill

This skill empowers an AI agent to design, implement, and optimize advanced Large Language Model (LLM) systems for production environments. It covers critical areas such as Retrieval Augmented Generation (RAG), creating robust embedding pipelines, integrating various vector databases (e.g., Pinecone, Chroma, pgvector), performing LLM fine-tuning, advanced prompt engineering techniques, LLM cost optimization, quality evaluations (evals), semantic caching, streaming, and designing robust agentic architectures for scalable AI applications.

Best use case

Designing and implementing a RAG system for knowledge retrieval; optimizing LLM inference costs and performance; developing custom LLM solutions through fine-tuning; evaluating the quality and robustness of LLM outputs; architecting scalable and reliable AI systems for enterprise production; building sophisticated AI agents.


A detailed plan, architectural design, implementation code, cost analysis, or evaluation report for a production-grade LLM system. The agent can provide solutions for RAG implementation, embedding pipeline creation, vector database integration, fine-tuning strategies, prompt engineering improvements, cost reduction, quality evaluation frameworks, and scalable AI architectures.

Practical example

Example input

Design a cost-effective RAG system for our internal documentation, including choices for vector database and embedding model, and suggest advanced prompt engineering strategies for accuracy.

Example output

{
  "system_design_proposal": {
    "title": "Cost-Effective RAG System for Internal Documentation",
    "overview": "This proposal outlines a Retrieval Augmented Generation (RAG) system designed to provide accurate and context-aware responses from internal documentation while optimizing operational costs.",
    "components": {
      "embedding_model": {
        "choice": "HuggingFace `all-MiniLM-L6-v2` (open-source, cost-efficient, good performance for semantic search)",
        "pipeline": "Batch processing with pre-computed embeddings, update mechanism for new documents."
      },
      "vector_database": {
        "choice": "pgvector on PostgreSQL (leverages existing infrastructure, cost-effective, scalable)",
        "alternatives": [
          "Chroma (simpler for smaller scale)",
          "Pinecone (managed service for very large scale/high QPS if budget allows)"
        ]
      },
      "llm_for_generation": {
        "choice": "Claude 3 Haiku (balanced cost/performance, strong reasoning)",
        "alternatives": [
          "GPT-3.5 Turbo (similar profile)",
          "Fine-tuned open-source model (e.g., Llama 3 8B) for domain-specific nuances if data available and compute budget allows for self-hosting"
        ]
      },
      "retrieval_strategy": "Hybrid search (semantic + keyword), re-ranking with cross-encoders."
    },
    "prompt_engineering_strategies": [
      "Zero-shot prompting with clear instructions.",
      "Few-shot examples for specific query types.",
      "Chain-of-thought prompting for complex reasoning.",
      "System prompts to enforce persona and safety guidelines.",
      "Contextual compression and query expansion techniques."
    ],
    "cost_optimization_measures": [
      "Optimized embedding model choice.",
      "Leveraging open-source vector database.",
      "Caching of frequently accessed embeddings and LLM responses (semantic cache).",
      "Monitoring token usage and fine-tuning prompt length.",
      "Strategic use of different LLM tiers (e.g., Haiku for general queries, Opus for complex)."
    ],
    "quality_evaluation": [
      "Automated metrics (faithfulness, relevancy, answer correctness).",
      "Human-in-the-loop evaluation for critical responses.",
      "A/B testing for prompt variations."
    ],
    "next_steps": [
      "Prototype RAG system with selected components.",
      "Develop data ingestion pipeline.",
      "Implement initial prompt templates.",
      "Conduct pilot testing and gather feedback."
    ]
  }
}

When to use this skill

  • When a complex LLM-based application needs to be designed or developed from scratch; when an existing LLM system requires optimization for performance, cost, or accuracy; when evaluating and benchmarking different LLM models or strategies; when integrating multiple AI components (e.g., vector databases, RAG, agents) into a cohesive system; when advanced prompt engineering is needed to achieve specific outputs or overcome limitations.

When not to use this skill

  • For simple, single-prompt text generation tasks that do not involve complex system design or optimization; when the task is purely about data analysis without an LLM component; when only basic knowledge retrieval is needed, and a full RAG system implementation is overkill; for tasks strictly outside the realm of LLM system development or optimization.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/llm-ops/SKILL.md --create-dirs "https://raw.githubusercontent.com/sickn33/antigravity-awesome-skills/main/plugins/antigravity-awesome-skills-claude/skills/llm-ops/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/llm-ops/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How llm-ops Compares

| Feature / Agent | llm-ops | Standard Approach |
|-----------------|---------|-------------------|
| Platform Support | Claude, Gemini, Codex, Cursor | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | medium | N/A |

Frequently Asked Questions

What does this skill do?

LLM Operations -- RAG, embeddings, vector databases, fine-tuning, advanced prompt engineering, LLM costs, quality evals, and production-ready AI architectures.

Which AI agents support this skill?

This skill is designed for Claude, Gemini, Codex, Cursor.

How difficult is it to install?

The installation complexity is rated as medium. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.


SKILL.md Source

# LLM-OPS -- Production AI

## Overview

LLM Operations -- RAG, embeddings, vector databases, fine-tuning, advanced prompt engineering, LLM costs, quality evals, and production AI architectures. Activate for: implementing RAG, building embedding pipelines, Pinecone/Chroma/pgvector, fine-tuning, prompt engineering, LLM cost reduction, evals, semantic caching, streaming, agents.

## When to Use This Skill

- When you need specialized assistance with this domain

## Do Not Use This Skill When

- The task is unrelated to LLM operations
- A simpler, more specific tool can handle the request
- The user needs general-purpose assistance without domain expertise

## How It Works

> The difference between an AI prototype and an AI product is operability.
> LLM-Ops is the engineering that makes AI reliable, scalable, and economical.

---

## Complete RAG Architecture

    [Documents] -> [Chunking] -> [Embeddings] -> [Vector DB]
                                                      |
        [Query] -> [Embed query] -> [Semantic Search] -> [Top K chunks]
                                                              |
                                               [LLM + Context] -> [Response]

## Indexing Pipeline

    from anthropic import Anthropic
    import chromadb

    client = Anthropic()
    chroma = chromadb.PersistentClient(path="./chroma_db")
    # Create the collection the indexing and query functions write to / read from
    collection = chroma.get_or_create_collection("docs")

    def chunk_text(text, chunk_size=500, overlap=50):
        """Split text into word-based chunks with overlap between neighbors."""
        words = text.split()
        chunks = []
        for i in range(0, len(words), chunk_size - overlap):
            chunk = " ".join(words[i:i + chunk_size])
            if chunk:
                chunks.append(chunk)
        return chunks

    def index_document(doc_id, content_text, metadata=None):
        chunks = chunk_text(content_text)
        ids = [f"{doc_id}_chunk_{i}" for i in range(len(chunks))]
        if metadata:
            collection.upsert(ids=ids, documents=chunks,
                              metadatas=[metadata] * len(chunks))
        else:
            collection.upsert(ids=ids, documents=chunks)
        return len(chunks)
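With `chunk_size=500` and `overlap=50`, consecutive chunks share 50 words, so chunk starts advance by `chunk_size - overlap = 450` words. A quick self-contained check of that stride logic (restating only the start-index arithmetic of the chunker above):

```python
def chunk_starts(n_words, chunk_size, overlap):
    # Start index of each chunk produced by a word-based chunker
    # that steps by (chunk_size - overlap)
    return list(range(0, n_words, chunk_size - overlap))

print(chunk_starts(1000, 500, 50))  # → [0, 450, 900]
```

A 1,000-word document therefore yields three chunks, the last one shorter than `chunk_size`.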

## RAG Query Pipeline

    def rag_query(query, top_k=5, system=None):
        results = collection.query(
            query_texts=[query], n_results=top_k,
            include=["documents", "metadatas", "distances"])
        context_parts = []
        for doc, meta, dist in zip(results["documents"][0],
                                   results["metadatas"][0],
                                   results["distances"][0]):
            if dist < 1.5:  # drop weakly related chunks
                src = (meta or {}).get("source", "doc")
                context_parts.append(f"[Source: {src}]\n{doc}")
        context = "\n\n---\n\n".join(context_parts)
        response = client.messages.create(
            model="claude-opus-4-20250805", max_tokens=1024,
            system=system or "Answer based on the context.",
            messages=[{"role": "user",
                       "content": f"Context:\n{context}\n\n{query}"}])
        return response.content[0].text

---

## Choosing a Vector DB

| DB | Best For | Hosting | Cost |
|----|----------|---------|------|
| Chroma | Development, local use | Self-hosted | Free |
| pgvector | Already on PostgreSQL | Self/Cloud | Free |
| Pinecone | Managed production | Cloud | USD 70+/mo |
| Weaviate | Multi-modal | Self/Cloud | Free+ |
| Qdrant | High performance | Self/Cloud | Free+ |

## Pgvector

    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE knowledge_embeddings (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        content TEXT NOT NULL,
        embedding vector(1536),
        metadata JSONB,
        created_at TIMESTAMPTZ DEFAULT NOW()
    );

    CREATE INDEX ON knowledge_embeddings
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

    -- QUERY_VECTOR stands in for the query embedding parameter.
    -- Ordering by the raw <=> distance (not the aliased similarity)
    -- lets PostgreSQL use the ivfflat index.
    SELECT content, 1 - (embedding <=> QUERY_VECTOR) AS similarity
    FROM knowledge_embeddings
    ORDER BY embedding <=> QUERY_VECTOR
    LIMIT 5;
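From Python, embeddings can be passed to pgvector as text literals when no adapter is installed. A minimal sketch; `to_pgvector_literal` is a hypothetical helper, and the commented execution line assumes a psycopg-style cursor:

```python
def to_pgvector_literal(embedding):
    """Format a list of floats as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# Parameterized insert; the ::vector cast parses the text literal server-side.
insert_sql = (
    "INSERT INTO knowledge_embeddings (content, embedding) "
    "VALUES (%s, %s::vector)"
)
# cur.execute(insert_sql, (text, to_pgvector_literal(vec)))  # with psycopg
```

Using a parameterized query keeps the literal safe from SQL injection; dedicated adapters (e.g. the pgvector Python package) remove the need for manual formatting.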

---

## Elite Prompt Structure

Components of the Auri system prompt:

- Identity: Name (Auri), Tone (natural, warm, direct), Platform (Amazon Alexa)
- Rules: Maximum of 3 short paragraphs, no markdown, conversational language
- Capabilities: business analysis, data-driven advice, creativity
- Limitations: no real-time internet access, no financial transactions
- Personalization: {user_name}, {user_preferences}, {relevant_history}
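The components above can be assembled into a reusable template. A minimal sketch, where `SYSTEM_TEMPLATE` and `build_system_prompt` are hypothetical names:

```python
SYSTEM_TEMPLATE = """You are Auri, a voice assistant on Amazon Alexa.
Tone: natural, warm, direct.
Rules: answer in at most 3 short paragraphs, no markdown, conversational language.
Capabilities: business analysis, data-driven advice, creativity.
Limitations: no real-time internet access, no financial transactions.
User: {user_name}. Preferences: {user_preferences}.
Relevant history: {relevant_history}."""

def build_system_prompt(user_name, user_preferences, relevant_history):
    # Fill the personalization slots; an empty history degrades gracefully
    return SYSTEM_TEMPLATE.format(
        user_name=user_name,
        user_preferences=user_preferences,
        relevant_history=relevant_history or "none")
```

Keeping identity and rules in a static template while injecting only the personalization slots makes the prompt easy to version and diff.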

## Chain-of-Thought

    def cot_analysis(problem: str) -> str:
        steps = [
            "1. What exactly is being asked?",
            "2. What information is critical to solve it?",
            "3. What possible approaches exist?",
            "4. Which approach is best, and why?",
            "5. What risks or limitations exist?",
        ]
        prompt = f"Analyze step by step:\n\nPROBLEM: {problem}\n\n"
        prompt += "\n".join(steps) + "\n\nFinal answer (concise, for voice):"
        # call_claude: assumed thin wrapper around client.messages.create
        return call_claude(prompt)

---

## Semantic Cache

    import numpy as np

    def cosine_similarity(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    class SemanticCache:
        def __init__(self, similarity_threshold=0.95):
            self.threshold = similarity_threshold
            self.cache = {}  # {embedding tuple: (response, query)}

        def get_cached(self, query, embedding):
            # Linear scan; fine for small caches, use a vector index at scale
            for cached_emb, (response, _) in self.cache.items():
                if cosine_similarity(embedding, cached_emb) >= self.threshold:
                    return response
            return None

        def set_cache(self, query, embedding, response):
            self.cache[tuple(embedding)] = (response, query)

## Claude Cost Estimation

    PRICING = {  # USD per million tokens
        "claude-opus-4-20250805": {"input": 15.00, "output": 75.00},
        "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
        "claude-haiku-3-5": {"input": 0.80, "output": 4.00},
    }

    def estimate_monthly_cost(model, avg_input, avg_output, req_per_day):
        p = PRICING[model]
        # Price input and output token streams separately
        daily = (avg_input * p["input"] +
                 avg_output * p["output"]) * req_per_day / 1e6
        monthly = daily * 30
        return {"model": model, "monthly_cost": "USD %.2f" % monthly}
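As a sanity check of the per-token arithmetic, assume Sonnet-tier pricing of USD 3/M input and USD 15/M output tokens, with 1,000 input and 500 output tokens per request at 1,000 requests/day:

```python
# Input and output tokens are priced separately (USD per million tokens)
input_cost = 1_000 / 1e6 * 3.00    # 0.003 USD per request
output_cost = 500 / 1e6 * 15.00    # 0.0075 USD per request
daily = (input_cost + output_cost) * 1_000   # 10.5 USD/day
monthly = daily * 30
print(f"USD {monthly:.2f}")  # → USD 315.00
```

Note that output tokens dominate here despite being half the volume, which is why trimming verbose completions often saves more than shortening prompts.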

---

## Evaluation Framework

    import json

    from anthropic import Anthropic

    client = Anthropic()

    def evaluate_response(question, expected, actual, criteria):
        criteria_text = "\n".join(f"- {c}" for c in criteria)
        eval_prompt = (
            f"Evaluate the AI assistant's response.\n\n"
            f"QUESTION: {question}\nEXPECTED ANSWER: {expected}\n"
            f"ACTUAL ANSWER: {actual}\n\nCriteria:\n{criteria_text}\n\n"
            "Score 0-10 with a justification for each criterion. JSON format."
        )
        response = client.messages.create(
            model="claude-haiku-3-5", max_tokens=1024,
            messages=[{"role": "user", "content": eval_prompt}]
        )
        return json.loads(response.content[0].text)

    AURI_EVALS = [
        {
            "question": "What are the main risks of launching a startup right now?",
            "criteria": ["factual_accuracy", "relevance", "clarity_for_voice"]
        },
    ]
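Calling `json.loads` on raw model output is brittle: models often wrap the JSON in prose or code fences. A hypothetical `extract_json` helper is a common hardening step:

```python
import json
import re

def extract_json(text):
    """Parse the first {...} object found in an LLM response.
    Handles raw JSON as well as JSON wrapped in prose or ``` fences."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))
```

Swapping this in for the bare `json.loads` call keeps the eval loop running when the judge model adds a preamble; truly malformed output still raises and can be retried.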

---

## Commands

| Command | Action |
|---------|--------|
| /rag-setup | Sets up a complete RAG pipeline |
| /embed-docs | Indexes documents into the vector DB |
| /prompt-optimize | Optimizes a prompt for quality and cost |
| /cost-estimate | Estimates monthly LLM cost |
| /eval-run | Runs the quality eval suite |
| /cache-setup | Sets up the semantic cache |
| /model-select | Picks the ideal model for the use case |

## Best Practices

- Provide clear, specific context about your project and requirements
- Review all suggestions before applying them to production code
- Combine with other complementary skills for comprehensive analysis

## Common Pitfalls

- Using this skill for tasks outside its domain expertise
- Applying recommendations without understanding your specific context
- Not providing enough project context for accurate analysis

Related Skills

All from sickn33/antigravity-awesome-skills:

- nft-standards -- Master ERC-721 and ERC-1155 NFT standards, metadata best practices, and advanced NFT features. (Web3 & Blockchain)
- nextjs-app-router-patterns -- Comprehensive patterns for Next.js 14+ App Router architecture, Server Components, and modern full-stack React development. (Web Frameworks)
- new-rails-project -- Create a new Rails project. (Code Generation)
- networkx -- NetworkX is a Python package for creating, manipulating, and analyzing complex networks and graphs. (Network Analysis)
- network-engineer -- Expert network engineer specializing in modern cloud networking, security architectures, and performance optimization. (Network Engineering)
- nestjs-expert -- Expert in Nest.js with deep knowledge of enterprise-grade Node.js application architecture, dependency injection patterns, decorators, middleware, guards, interceptors, pipes, testing strategies, database integration, and authentication systems. (Frameworks & Libraries)
- nerdzao-elite -- Senior Elite Software Engineer (15+) and Senior Product Designer. Full workflow with planning, architecture, TDD, clean code, and pixel-perfect UX validation. (Software Development)
- nerdzao-elite-gemini-high -- Elite Coder + Pixel-Perfect UX mode optimized specifically for Gemini 3.1 Pro High. Complete workflow focused on maximum quality and token efficiency. (Software Development)
- native-data-fetching -- Use when implementing or debugging ANY network request, API call, or data fetching. Covers fetch API, React Query, SWR, error handling, caching, offline support, and Expo Router data loaders (useLoaderData). (API Integration)
- n8n-workflow-patterns -- Proven architectural patterns for building n8n workflows. (Workflow Automation)
- n8n-validation-expert -- Expert guide for interpreting and fixing n8n validation errors. (Workflow Automation)
- n8n-node-configuration -- Operation-aware node configuration guidance. Use when configuring nodes, understanding property dependencies, determining required fields, choosing between get_node detail levels, or learning common configuration patterns by node type. (Workflow Automation)