AI Engineering & LLM Operations

llm-ops

LLM Operations -- RAG, embeddings, vector databases, fine-tuning, advanced prompt engineering, LLM costs, quality evals, and production-ready AI architectures.

31,392 stars
Complexity: medium

About this skill

This skill empowers an AI agent to design, implement, and optimize advanced Large Language Model (LLM) systems for production environments. It covers critical areas such as Retrieval Augmented Generation (RAG), creating robust embedding pipelines, integrating various vector databases (e.g., Pinecone, Chroma, pgvector), performing LLM fine-tuning, advanced prompt engineering techniques, LLM cost optimization, quality evaluations (evals), semantic caching, streaming, and designing robust agentic architectures for scalable AI applications.

Best use case

Designing and implementing a RAG system for knowledge retrieval; optimizing LLM inference costs and performance; developing custom LLM solutions through fine-tuning; evaluating the quality and robustness of LLM outputs; architecting scalable and reliable AI systems for enterprise production; building sophisticated AI agents.


A detailed plan, architectural design, implementation code, cost analysis, or evaluation report for a production-grade LLM system. The agent can provide solutions for RAG implementation, embedding pipeline creation, vector database integration, fine-tuning strategies, prompt engineering improvements, cost reduction, quality evaluation frameworks, and scalable AI architectures.

Practical example

Example input

Design a cost-effective RAG system for our internal documentation, including choices for vector database and embedding model, and suggest advanced prompt engineering strategies for accuracy.

Example output

{
  "system_design_proposal": {
    "title": "Cost-Effective RAG System for Internal Documentation",
    "overview": "This proposal outlines a Retrieval Augmented Generation (RAG) system designed to provide accurate and context-aware responses from internal documentation while optimizing operational costs.",
    "components": {
      "embedding_model": {
        "choice": "HuggingFace `all-MiniLM-L6-v2` (open-source, cost-efficient, good performance for semantic search)",
        "pipeline": "Batch processing with pre-computed embeddings, update mechanism for new documents."
      },
      "vector_database": {
        "choice": "pgvector on PostgreSQL (leverages existing infrastructure, cost-effective, scalable)",
        "alternatives": [
          "Chroma (simpler for smaller scale)",
          "Pinecone (managed service for very large scale/high QPS if budget allows)"
        ]
      },
      "llm_for_generation": {
        "choice": "Claude 3 Haiku (balanced cost/performance, strong reasoning)",
        "alternatives": [
          "GPT-3.5 Turbo (similar profile)",
          "Fine-tuned open-source model (e.g., Llama 3 8B) for domain-specific nuances if data available and compute budget allows for self-hosting"
        ]
      },
      "retrieval_strategy": "Hybrid search (semantic + keyword), re-ranking with cross-encoders."
    },
    "prompt_engineering_strategies": [
      "Zero-shot prompting with clear instructions.",
      "Few-shot examples for specific query types.",
      "Chain-of-thought prompting for complex reasoning.",
      "System prompts to enforce persona and safety guidelines.",
      "Contextual compression and query expansion techniques."
    ],
    "cost_optimization_measures": [
      "Optimized embedding model choice.",
      "Leveraging open-source vector database.",
      "Caching of frequently accessed embeddings and LLM responses (semantic cache).",
      "Monitoring token usage and fine-tuning prompt length.",
      "Strategic use of different LLM tiers (e.g., Haiku for general queries, Opus for complex)."
    ],
    "quality_evaluation": [
      "Automated metrics (faithfulness, relevancy, answer correctness).",
      "Human-in-the-loop evaluation for critical responses.",
      "A/B testing for prompt variations."
    ],
    "next_steps": [
      "Prototype RAG system with selected components.",
      "Develop data ingestion pipeline.",
      "Implement initial prompt templates.",
      "Conduct pilot testing and gather feedback."
    ]
  }
}

When to use this skill

  • When a complex LLM-based application needs to be designed or developed from scratch; when an existing LLM system requires optimization for performance, cost, or accuracy; when evaluating and benchmarking different LLM models or strategies; when integrating multiple AI components (e.g., vector databases, RAG, agents) into a cohesive system; when advanced prompt engineering is needed to achieve specific outputs or overcome limitations.

When not to use this skill

  • For simple, single-prompt text generation tasks that do not involve complex system design or optimization; when the task is purely about data analysis without an LLM component; when only basic knowledge retrieval is needed, and a full RAG system implementation is overkill; for tasks strictly outside the realm of LLM system development or optimization.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/llm-ops/SKILL.md --create-dirs "https://raw.githubusercontent.com/sickn33/antigravity-awesome-skills/main/plugins/antigravity-awesome-skills-claude/skills/llm-ops/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/llm-ops/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How llm-ops Compares

| Feature / Agent | llm-ops | Standard Approach |
|-----------------|---------|-------------------|
| Platform Support | Claude, Gemini, Codex, Cursor | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | medium | N/A |

Frequently Asked Questions

What does this skill do?

LLM Operations -- RAG, embeddings, vector databases, fine-tuning, advanced prompt engineering, LLM costs, quality evals, and production-ready AI architectures.

Which AI agents support this skill?

This skill is designed for Claude, Gemini, Codex, Cursor.

How difficult is it to install?

The installation complexity is rated as medium. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.


SKILL.md Source

# LLM-OPS -- Production AI

## Overview

LLM Operations -- RAG, embeddings, vector databases, fine-tuning, advanced prompt engineering, LLM costs, quality evals, and production AI architectures. Activate for: implementing RAG, building embedding pipelines, Pinecone/Chroma/pgvector, fine-tuning, prompt engineering, LLM cost reduction, evals, semantic caching, streaming, agents.

## When to Use This Skill

- When you need specialized assistance with this domain

## Do Not Use This Skill When

- The task is unrelated to LLM operations
- A simpler, more specific tool can handle the request
- The user needs general-purpose assistance without domain expertise

## How It Works

> The difference between an AI prototype and an AI product is operability.
> LLM-Ops is the engineering that makes AI reliable, scalable, and economical.

---

## Complete RAG Architecture

    [Documents] -> [Chunking] -> [Embeddings] -> [Vector DB]
                                                      |
        [Query] -> [Embed query] -> [Semantic Search] -> [Top K chunks]
                                                              |
                                               [LLM + Context] -> [Response]

## Indexing Pipeline

    from anthropic import Anthropic
    import chromadb

    client = Anthropic()
    chroma = chromadb.PersistentClient(path="./chroma_db")
    # Create the collection the indexing and query functions write to / read from
    collection = chroma.get_or_create_collection("docs")

    def chunk_text(text, chunk_size=500, overlap=50):
        """Split text into word-based chunks with overlap between neighbors."""
        words = text.split()
        chunks = []
        for i in range(0, len(words), chunk_size - overlap):
            chunk = " ".join(words[i:i + chunk_size])
            if chunk:
                chunks.append(chunk)
        return chunks

    def index_document(doc_id, content_text, metadata=None):
        chunks = chunk_text(content_text)
        ids = [f"{doc_id}_chunk_{i}" for i in range(len(chunks))]
        if metadata:
            collection.upsert(ids=ids, documents=chunks,
                              metadatas=[metadata] * len(chunks))
        else:
            collection.upsert(ids=ids, documents=chunks)
        return len(chunks)
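With `chunk_size=500` and `overlap=50`, consecutive chunks share 50 words, so chunk starts advance by `chunk_size - overlap = 450` words. A quick self-contained check of that stride logic (restating only the start-index arithmetic of the chunker above):

```python
def chunk_starts(n_words, chunk_size, overlap):
    # Start index of each chunk produced by a word-based chunker
    # that steps by (chunk_size - overlap)
    return list(range(0, n_words, chunk_size - overlap))

print(chunk_starts(1000, 500, 50))  # → [0, 450, 900]
```

A 1,000-word document therefore yields three chunks, the last one shorter than `chunk_size`.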

## RAG Query Pipeline

    def rag_query(query, top_k=5, system=None):
        results = collection.query(
            query_texts=[query], n_results=top_k,
            include=["documents", "metadatas", "distances"])
        context_parts = []
        for doc, meta, dist in zip(results["documents"][0],
                                   results["metadatas"][0],
                                   results["distances"][0]):
            if dist < 1.5:  # drop weakly related chunks
                src = (meta or {}).get("source", "doc")
                context_parts.append(f"[Source: {src}]\n{doc}")
        context = "\n\n---\n\n".join(context_parts)
        response = client.messages.create(
            model="claude-opus-4-20250805", max_tokens=1024,
            system=system or "Answer based on the context.",
            messages=[{"role": "user",
                       "content": f"Context:\n{context}\n\n{query}"}])
        return response.content[0].text

---

## Choosing a Vector DB

| DB | Best For | Hosting | Cost |
|----|----------|---------|------|
| Chroma | Development, local use | Self-hosted | Free |
| pgvector | Already on PostgreSQL | Self/Cloud | Free |
| Pinecone | Managed production | Cloud | USD 70+/mo |
| Weaviate | Multi-modal | Self/Cloud | Free+ |
| Qdrant | High performance | Self/Cloud | Free+ |

## Pgvector

    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE knowledge_embeddings (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        content TEXT NOT NULL,
        embedding vector(1536),
        metadata JSONB,
        created_at TIMESTAMPTZ DEFAULT NOW()
    );

    CREATE INDEX ON knowledge_embeddings
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

    -- QUERY_VECTOR stands in for the query embedding parameter.
    -- Ordering by the raw <=> distance (not the aliased similarity)
    -- lets PostgreSQL use the ivfflat index.
    SELECT content, 1 - (embedding <=> QUERY_VECTOR) AS similarity
    FROM knowledge_embeddings
    ORDER BY embedding <=> QUERY_VECTOR
    LIMIT 5;
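From Python, embeddings can be passed to pgvector as text literals when no adapter is installed. A minimal sketch; `to_pgvector_literal` is a hypothetical helper, and the commented execution line assumes a psycopg-style cursor:

```python
def to_pgvector_literal(embedding):
    """Format a list of floats as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# Parameterized insert; the ::vector cast parses the text literal server-side.
insert_sql = (
    "INSERT INTO knowledge_embeddings (content, embedding) "
    "VALUES (%s, %s::vector)"
)
# cur.execute(insert_sql, (text, to_pgvector_literal(vec)))  # with psycopg
```

Using a parameterized query keeps the literal safe from SQL injection; dedicated adapters (e.g. the pgvector Python package) remove the need for manual formatting.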

---

## Elite Prompt Structure

Components of the Auri system prompt:

- Identity: Name (Auri), Tone (natural, warm, direct), Platform (Amazon Alexa)
- Rules: Maximum of 3 short paragraphs, no markdown, conversational language
- Capabilities: business analysis, data-driven advice, creativity
- Limitations: no real-time internet access, no financial transactions
- Personalization: {user_name}, {user_preferences}, {relevant_history}
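The components above can be assembled into a reusable template. A minimal sketch, where `SYSTEM_TEMPLATE` and `build_system_prompt` are hypothetical names:

```python
SYSTEM_TEMPLATE = """You are Auri, a voice assistant on Amazon Alexa.
Tone: natural, warm, direct.
Rules: answer in at most 3 short paragraphs, no markdown, conversational language.
Capabilities: business analysis, data-driven advice, creativity.
Limitations: no real-time internet access, no financial transactions.
User: {user_name}. Preferences: {user_preferences}.
Relevant history: {relevant_history}."""

def build_system_prompt(user_name, user_preferences, relevant_history):
    # Fill the personalization slots; an empty history degrades gracefully
    return SYSTEM_TEMPLATE.format(
        user_name=user_name,
        user_preferences=user_preferences,
        relevant_history=relevant_history or "none")
```

Keeping identity and rules in a static template while injecting only the personalization slots makes the prompt easy to version and diff.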

## Chain-of-Thought

    def cot_analysis(problem: str) -> str:
        steps = [
            "1. What exactly is being asked?",
            "2. What information is critical to solve it?",
            "3. What possible approaches exist?",
            "4. Which approach is best, and why?",
            "5. What risks or limitations exist?",
        ]
        prompt = f"Analyze step by step:\n\nPROBLEM: {problem}\n\n"
        prompt += "\n".join(steps) + "\n\nFinal answer (concise, for voice):"
        # call_claude: assumed thin wrapper around client.messages.create
        return call_claude(prompt)

---

## Semantic Cache

    import numpy as np

    def cosine_similarity(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    class SemanticCache:
        def __init__(self, similarity_threshold=0.95):
            self.threshold = similarity_threshold
            self.cache = {}  # {embedding tuple: (response, query)}

        def get_cached(self, query, embedding):
            # Linear scan; fine for small caches, use a vector index at scale
            for cached_emb, (response, _) in self.cache.items():
                if cosine_similarity(embedding, cached_emb) >= self.threshold:
                    return response
            return None

        def set_cache(self, query, embedding, response):
            self.cache[tuple(embedding)] = (response, query)

## Claude Cost Estimation

    PRICING = {  # USD per million tokens
        "claude-opus-4-20250805": {"input": 15.00, "output": 75.00},
        "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
        "claude-haiku-3-5": {"input": 0.80, "output": 4.00},
    }

    def estimate_monthly_cost(model, avg_input, avg_output, req_per_day):
        p = PRICING[model]
        # Price input and output token streams separately
        daily = (avg_input * p["input"] +
                 avg_output * p["output"]) * req_per_day / 1e6
        monthly = daily * 30
        return {"model": model, "monthly_cost": "USD %.2f" % monthly}
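As a sanity check of the per-token arithmetic, assume Sonnet-tier pricing of USD 3/M input and USD 15/M output tokens, with 1,000 input and 500 output tokens per request at 1,000 requests/day:

```python
# Input and output tokens are priced separately (USD per million tokens)
input_cost = 1_000 / 1e6 * 3.00    # 0.003 USD per request
output_cost = 500 / 1e6 * 15.00    # 0.0075 USD per request
daily = (input_cost + output_cost) * 1_000   # 10.5 USD/day
monthly = daily * 30
print(f"USD {monthly:.2f}")  # → USD 315.00
```

Note that output tokens dominate here despite being half the volume, which is why trimming verbose completions often saves more than shortening prompts.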

---

## Evaluation Framework

    import json

    from anthropic import Anthropic

    client = Anthropic()

    def evaluate_response(question, expected, actual, criteria):
        criteria_text = "\n".join(f"- {c}" for c in criteria)
        eval_prompt = (
            f"Evaluate the AI assistant's response.\n\n"
            f"QUESTION: {question}\nEXPECTED ANSWER: {expected}\n"
            f"ACTUAL ANSWER: {actual}\n\nCriteria:\n{criteria_text}\n\n"
            "Score 0-10 with a justification for each criterion. JSON format."
        )
        response = client.messages.create(
            model="claude-haiku-3-5", max_tokens=1024,
            messages=[{"role": "user", "content": eval_prompt}]
        )
        return json.loads(response.content[0].text)

    AURI_EVALS = [
        {
            "question": "What are the main risks of launching a startup right now?",
            "criteria": ["factual_accuracy", "relevance", "clarity_for_voice"]
        },
    ]
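Calling `json.loads` on raw model output is brittle: models often wrap the JSON in prose or code fences. A hypothetical `extract_json` helper is a common hardening step:

```python
import json
import re

def extract_json(text):
    """Parse the first {...} object found in an LLM response.
    Handles raw JSON as well as JSON wrapped in prose or ``` fences."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))
```

Swapping this in for the bare `json.loads` call keeps the eval loop running when the judge model adds a preamble; truly malformed output still raises and can be retried.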

---

## Commands

| Command | Action |
|---------|--------|
| /rag-setup | Sets up a complete RAG pipeline |
| /embed-docs | Indexes documents into the vector DB |
| /prompt-optimize | Optimizes a prompt for quality and cost |
| /cost-estimate | Estimates monthly LLM cost |
| /eval-run | Runs the quality eval suite |
| /cache-setup | Sets up the semantic cache |
| /model-select | Picks the ideal model for the use case |

## Best Practices

- Provide clear, specific context about your project and requirements
- Review all suggestions before applying them to production code
- Combine with other complementary skills for comprehensive analysis

## Common Pitfalls

- Using this skill for tasks outside its domain expertise
- Applying recommendations without understanding your specific context
- Not providing enough project context for accurate analysis

Related Skills

All from sickn33/antigravity-awesome-skills:

- nft-standards -- Master ERC-721 and ERC-1155 NFT standards, metadata best practices, and advanced NFT features. (Web3 & Blockchain)
- nextjs-app-router-patterns -- Comprehensive patterns for Next.js 14+ App Router architecture, Server Components, and modern full-stack React development. (Web Frameworks)
- new-rails-project -- Create a new Rails project. (Code Generation)
- networkx -- NetworkX is a Python package for creating, manipulating, and analyzing complex networks and graphs. (Network Analysis)
- network-engineer -- Expert network engineer specializing in modern cloud networking, security architectures, and performance optimization. (Network Engineering)
- nestjs-expert -- Expert in Nest.js with deep knowledge of enterprise-grade Node.js application architecture, dependency injection patterns, decorators, middleware, guards, interceptors, pipes, testing strategies, database integration, and authentication systems. (Frameworks & Libraries)
- nerdzao-elite -- Senior Elite Software Engineer (15+) and Senior Product Designer. Full workflow with planning, architecture, TDD, clean code, and pixel-perfect UX validation. (Software Development)
- nerdzao-elite-gemini-high -- Elite Coder + Pixel-Perfect UX mode optimized specifically for Gemini 3.1 Pro High. Complete workflow focused on maximum quality and token efficiency. (Software Development)
- native-data-fetching -- Use when implementing or debugging ANY network request, API call, or data fetching. Covers fetch API, React Query, SWR, error handling, caching, offline support, and Expo Router data loaders (useLoaderData). (API Integration)
- n8n-workflow-patterns -- Proven architectural patterns for building n8n workflows. (Workflow Automation)
- n8n-validation-expert -- Expert guide for interpreting and fixing n8n validation errors. (Workflow Automation)
- n8n-node-configuration -- Operation-aware node configuration guidance. Use when configuring nodes, understanding property dependencies, determining required fields, choosing between get_node detail levels, or learning common configuration patterns by node type. (Workflow Automation)