ai-llm-inference

Operational patterns for LLM inference: latency budgeting, tail-latency control, caching, batching/scheduling, quantization/compression, parallelism, and reliable serving at scale. Emphasizes production-grade performance, cost control, and observability.

16 stars

bydiegosouzapw

View on GitHub Installation ↓

Best use case

ai-llm-inference is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using ai-llm-inference should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/ai-llm-inference/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/ai-agents/ai-llm-inference/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/ai-llm-inference/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How ai-llm-inference Compares

Feature / Agent	ai-llm-inference	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# LLMOps - Inference & Optimization - Production Skill Hub

**Modern Best Practices (January 2026)**:

- Treat inference as a **systems problem**: SLOs, tail latency, retries, overload, and cache strategy.
- Use **continuous batching / smart scheduling** when serving many concurrent requests (Orca scheduling: https://www.usenix.org/conference/osdi22/presentation/yu).
- Use **KV-cache aware serving** (PagedAttention/vLLM: https://arxiv.org/abs/2309.06180) and **efficient attention kernels** (FlashAttention: https://arxiv.org/abs/2205.14135).
- Use **speculative decoding** when latency is critical and draft-model quality is acceptable (speculative decoding: https://arxiv.org/abs/2302.01318).
- Quantize only with **measured** quality impact and rollback plan (quantization must be validated on your eval set).

This skill provides **production-ready operational patterns** for optimizing LLM inference performance, cost, and reliability. It centralizes **decision rules**, **optimization strategies**, **configuration templates**, and **operational checklists** for inference workloads.

No theory. No narrative. Only what Codex can execute.

---

## When to Use This Skill

Codex should activate this skill whenever the user asks for:

- Optimizing LLM inference latency or throughput
- Choosing quantization strategies (FP8/FP4/INT8/INT4)
- Configuring vLLM, TensorRT-LLM, or DeepSpeed inference
- Scaling LLM inference across GPUs (tensor/pipeline parallelism)
- Building high-throughput LLM APIs
- Improving context window performance (KV cache optimization)
- Using speculative decoding for faster generation
- Reducing cost per token
- Profiling and benchmarking inference workloads
- Planning infrastructure capacity
- CPU/edge deployment patterns
- High availability and resilience patterns

## Scope Boundaries (Use These Skills for Depth)

- **Prompting, tuning, datasets** -> [ai-llm](../ai-llm/SKILL.md)
- **RAG pipeline construction** -> [ai-rag](../ai-rag/SKILL.md)
- **Deployment, APIs, monitoring** -> [ai-mlops](../ai-mlops/SKILL.md)
- **Safety, governance** -> [ai-mlops](../ai-mlops/SKILL.md)
- **Performance monitoring** -> [qa-observability](../qa-observability/SKILL.md)
- **Infrastructure operations** -> [ops-devops-platform](../ops-devops-platform/SKILL.md)

---

## Quick Reference

| Task | Tool/Framework | Command/Pattern | When to Use |
|------|----------------|-----------------|-------------|
| Latency budget | SLO + load model | TTFT/ITL + P95/P99 under load | Any production endpoint |
| Tail-latency control | Scheduling + timeouts | Admission control + queue caps + backpressure | Prevent p99 explosions |
| Throughput | Batching + KV-cache aware serving | Continuous batching + KV paging | High concurrency serving |
| Cost control | Model tiering + caching | Cache (prefix/response) + quotas | Reduce spend and overload risk |
| Long context | Prefill optimization | Chunked prefill + prompt compression | Long inputs and RAG-heavy apps |
| Parallelism | TP/PP/DP | Choose by model size and interconnect | Models that do not fit one device |
| Reliability | Resilience patterns | Timeouts + circuit breakers + idempotency | Avoid cascading failures |

---

## Decision Tree: Inference Optimization Strategy

```text
Need to optimize LLM inference: [Optimization Path]
    │
    ├─ High throughput (>10k tok/s) OR P99 variance > 3x P50?
    │   └─ YES -> Disaggregated inference (prefill/decode separation)
    │            See references/disaggregated-inference.md
    │
    ├─ Primary constraint: Throughput?
    │   ├─ Many concurrent users? -> batching + KV-cache aware serving + admission control
    │   ├─ Chat/agents with KV reuse? -> SGLang (RadixAttention)
    │   └─ Mostly batch/offline? -> batch inference jobs + large batches + spot capacity
    │
    ├─ Primary constraint: Cost?
    │   ├─ Can accept lower quality tier? -> model tiering (small/medium/large router)
    │   └─ Must keep quality? -> caching + prompt/context reduction before quantization
    │
    ├─ Primary constraint: Latency?
    │   ├─ Draft model acceptable? -> speculative decoding
    │   └─ Long context? -> prefill optimizations + FlashAttention-3 + context budgets
    │
    ├─ Large model (>70B)?
    │   ├─ Multiple GPUs? -> Tensor parallelism (NVLink required)
    │   └─ Deep model? -> Pipeline parallelism (minimize bubbles)
    │
    ├─ Hardware selection?
    │   ├─ Memory-bound? -> more HBM, higher bandwidth
    │   ├─ Latency-bound? -> faster clocks + kernel support
    │   └─ Multi-node? -> prioritize interconnect (NVLink/RDMA) and topology
    │
    │   Notes: treat GPU/SKU advice as time-sensitive; verify with vendor docs and your own benchmarks.
    │   See references/gpu-optimization-checklists.md and references/infrastructure-tuning.md
    │
    └─ Edge deployment?
        └─ CPU + quantization -> llama.cpp/GGUF for constrained resources
```

---

## Intake Checklist (REQUIRED)

Before recommending changes, collect (or infer) these inputs:

- Model + variant (size, context length, precision/quantization, tokenizer)
- Traffic shape (prompt/output length distributions, concurrency, QPS, streaming vs non-streaming)
- SLOs and budgets (TTFT/ITL/total latency targets, error budget, cost per request)
- Serving stack (engine/version, batching/scheduling settings, caching, parallelism, autoscaling)
- Hardware and topology (GPU type/count, VRAM, NVLink/RDMA, CPU/RAM, storage, cluster/runtime)
- Constraints (quality floor, safety requirements, rollout/rollback constraints)

## Core Concepts & Practices

### Core Concepts (Vendor-Agnostic)

- **Latency components**: queueing + prefill + decode; optimize the largest contributor first.
- **Tail latency**: p99 is dominated by queuing and long prompts; fix with admission control and context budgets.
- **Retries**: retries can multiply load; bound retries and use hedged requests only with strict budgets.
- **Caching**: prefix caching helps repeated system/tool scaffolds; response caching helps repeated questions (requires invalidation).
- **Security & privacy**: prompts/outputs can contain sensitive data; scrub logs, enforce auth/tenancy, and rate-limit abuse (OWASP LLM Top 10: https://owasp.org/www-project-top-10-for-large-language-model-applications/).

### Implementation Practices (Tooling Examples)

- **Measure under load**: benchmark TTFT/ITL and p95/p99 with realistic concurrency and prompt lengths.
- **Separate environments**: dev/stage/prod model configs; promote only after passing the inference review checklist.
- **Export telemetry**: request-level tokens, TTFT/ITL, queue depth, GPU memory headroom, and error classes (OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/).

### Do / Avoid

**Do**
- Do enforce `max_input_tokens` and `max_output_tokens` at the API boundary.
- Do cap concurrency and queue depth; return overload errors quickly.
- Do validate quality after any quantization or kernel change.

**Avoid**
- Avoid unbounded retries (amplifies outages).
- Avoid unbounded context windows (OOM + latency spikes).
- Avoid benchmarking on single requests; always test with realistic concurrency.

---

## Accuracy Protocol (REQUIRED)

- Treat performance ratios (for example, "2x faster") as hypotheses unless a source is cited and the workload is comparable.
- Do not recommend hardware/SKU changes without stating assumptions (model size, context length, concurrency, interconnect).
- Prefer a measured baseline + checklist-driven rollout over "best practice" claims.

---

## Resources (Detailed Operational Guides)

For comprehensive guides on specific topics, see:

### Infrastructure & Serving

- [Disaggregated Inference](references/disaggregated-inference.md) - Prefill/decode separation (2025+ standard)
- [Infrastructure Tuning](references/infrastructure-tuning.md) - OS, container, Kubernetes optimization for GPU workloads
- [Serving Architectures](references/serving-architectures.md) - Production serving stack patterns (vLLM, SGLang, TensorRT-LLM, NVIDIA Dynamo)
- [Resilience & HA Patterns](references/resilience-ha-patterns.md) - Multi-region, failover, traffic management

### Performance Optimization

- [Quantization Patterns](references/quantization-patterns.md) - FP8/FP4/INT8/INT4 decision trees (FP8 first, INT8 not on Blackwell)
- [KV Cache Optimization](references/kv-cache-optimization.md) - PagedAttention, FlashAttention-3, FlashInfer, RadixAttention
- [Parallelism Patterns](references/parallelism-patterns.md) - Tensor/pipeline/expert parallelism strategies
- [Optimization Strategies](references/optimization-strategies.md) - Throughput, cost, memory optimization
- [Batching & Scheduling](references/batching-and-scheduling.md) - Continuous batching and throughput patterns

### Deployment & Operations

- [Edge & CPU Optimization](references/edge-cpu-optimization.md) - llama.cpp, GGUF, mobile/browser deployment
- [GPU Optimization Checklists](references/gpu-optimization-checklists.md) - Hardware-specific tuning
- [Speculative Decoding Guide](references/speculative-decoding-guide.md) - Advanced generation acceleration
- [Profiling & Capacity Planning](references/profiling-and-capacity-planning.md) - Benchmarking, SLOs, replica sizing

---

## Templates

### Inference Configs

Production-ready configuration templates for leading inference engines:

- [vLLM Configuration](assets/inference/template-vllm-config.md) - Continuous batching, PagedAttention setup
- [TensorRT-LLM Configuration](assets/inference/template-tensorrtllm-config.md) - NVIDIA kernel optimizations
- [DeepSpeed Inference](assets/inference/template-deepspeed-inference.md) - PyTorch-friendly inference

### Quantization & Compression

Model compression templates for reducing memory and cost:

- [GPTQ Quantization](assets/quantization/template-gptq.md) - GPU post-training quantization
- [AWQ Quantization](assets/quantization/template-awq.md) - Activation-aware weight quantization
- [GGUF Format](assets/quantization/template-gguf.md) - CPU/edge optimized formats

### Serving Pipelines

High-throughput serving architectures:

- [LLM API Server](assets/serving/template-llm-api.md) - FastAPI + vLLM production setup
- [High-Throughput Setup](assets/serving/template-high-throughput-setup.md) - Multi-replica scaling patterns

### Caching & Batching

Performance optimization templates:

- [Prefix Caching](assets/caching/template-prefix-caching.md) - KV cache reuse strategies
- [Batching Configuration](assets/batching/template-batching-config.md) - Continuous batching tuning

### Benchmarking

Performance measurement and validation:

- [Latency & Throughput Testing](assets/benchmarking/template-latency-throughput-test.md) - Load testing framework

### Checklists

- [Inference Performance Review Checklist](assets/checklists/inference-review-checklist.md) - Baseline, bottlenecks, rollout readiness

## Navigation

**Resources**

- [references/disaggregated-inference.md](references/disaggregated-inference.md)
- [references/serving-architectures.md](references/serving-architectures.md)
- [references/profiling-and-capacity-planning.md](references/profiling-and-capacity-planning.md)
- [references/gpu-optimization-checklists.md](references/gpu-optimization-checklists.md)
- [references/speculative-decoding-guide.md](references/speculative-decoding-guide.md)
- [references/resilience-ha-patterns.md](references/resilience-ha-patterns.md)
- [references/optimization-strategies.md](references/optimization-strategies.md)
- [references/kv-cache-optimization.md](references/kv-cache-optimization.md)
- [references/batching-and-scheduling.md](references/batching-and-scheduling.md)
- [references/quantization-patterns.md](references/quantization-patterns.md)
- [references/parallelism-patterns.md](references/parallelism-patterns.md)
- [references/edge-cpu-optimization.md](references/edge-cpu-optimization.md)
- [references/infrastructure-tuning.md](references/infrastructure-tuning.md)

**Templates**
- [assets/serving/template-llm-api.md](assets/serving/template-llm-api.md)
- [assets/serving/template-high-throughput-setup.md](assets/serving/template-high-throughput-setup.md)
- [assets/inference/template-vllm-config.md](assets/inference/template-vllm-config.md)
- [assets/inference/template-tensorrtllm-config.md](assets/inference/template-tensorrtllm-config.md)
- [assets/inference/template-deepspeed-inference.md](assets/inference/template-deepspeed-inference.md)
- [assets/quantization/template-awq.md](assets/quantization/template-awq.md)
- [assets/quantization/template-gptq.md](assets/quantization/template-gptq.md)
- [assets/quantization/template-gguf.md](assets/quantization/template-gguf.md)
- [assets/batching/template-batching-config.md](assets/batching/template-batching-config.md)
- [assets/caching/template-prefix-caching.md](assets/caching/template-prefix-caching.md)
- [assets/benchmarking/template-latency-throughput-test.md](assets/benchmarking/template-latency-throughput-test.md)
- [assets/checklists/inference-review-checklist.md](assets/checklists/inference-review-checklist.md)

**Data**
- [data/sources.json](data/sources.json) - Curated external references

---

## Trend Awareness Protocol

**IMPORTANT**: When users ask recommendation questions about LLM inference, you MUST use WebSearch to check current trends before answering.

### Trigger Conditions

- "What's the best inference engine for [use case]?"
- "What should I use for [serving/quantization/batching]?"
- "What's the latest in LLM inference optimization?"
- "Current best practices for [vLLM/TensorRT/quantization]?"
- "Is [inference tool] still relevant in 2026?"
- "[vLLM] vs [TensorRT-LLM] vs [SGLang]?"
- "Best quantization method for [model size]?"
- "What GPU should I use for inference?"

### Required Searches

1. Search: `"LLM inference optimization best practices 2026"`
2. Search: `"[vLLM/TensorRT-LLM/SGLang] comparison 2026"`
3. Search: `"LLM quantization trends January 2026"`
4. Search: `"LLM serving new releases 2026"`

### What to Report

After searching, provide:

- **Current landscape**: What serving engines are popular NOW (not 6 months ago)
- **Emerging trends**: New inference optimizations gaining traction
- **Deprecated/declining**: Techniques or tools losing relevance
- **Recommendation**: Based on fresh data, not just static knowledge

### Example Topics (verify with fresh search)

- Inference engines (vLLM 0.7+, TensorRT-LLM, SGLang, llama.cpp)
- Quantization methods (FP8, AWQ, GPTQ, GGUF, bitsandbytes)
- Attention kernels (FlashAttention-3, FlashInfer, xFormers)
- Speculative decoding advances
- KV cache optimization techniques
- New GPU architectures (H200, Blackwell) and their optimizations

---

## Related Skills

This skill focuses on **inference-time performance**. For related workflows:

- See "Scope Boundaries" above.

---

## External Resources

See [data/sources.json](data/sources.json) for:

- Serving frameworks (vLLM, TensorRT-LLM, DeepSpeed-MII)
- Quantization libraries (GPTQ, AWQ, bitsandbytes, LLM Compressor)
- FlashAttention, FlashInfer, xFormers
- GPU hardware guides and optimization docs
- Benchmarking frameworks and tools

---

Use this skill whenever the user needs **LLM inference performance, cost reduction, or serving architecture** guidance.

Related Skills

inference-smoke-tests

from diegosouzapw/awesome-omni-skill

Run repeatable inference smoke tests using geppetto/pinocchio example binaries (single-pass, streaming, tool-loop, OpenAI Responses thinking) including tmux-driven TUI tests. Use when refactors touch InferenceState/Session/EngineBuilder, tool calling loop, event sinks, provider request formatting, or when you need a quick 'does inference still work?' checklist.

chromatin-state-inference

from diegosouzapw/awesome-omni-skill

This skill should be used when users need to infer chromatin states from histone modification ChIP-seq data using chromHMM. It provides workflows for chromatin state segmentation, model training, state annotation.

bgo

from diegosouzapw/awesome-omni-skill

Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.

Coding & Development

moai-lang-r

from diegosouzapw/awesome-omni-skill

R 4.4+ best practices with testthat 3.2, lintr 3.2, and data analysis patterns.

moai-lang-python

from diegosouzapw/awesome-omni-skill

Python 3.13+ development specialist covering FastAPI, Django, async patterns, data science, testing with pytest, and modern Python features. Use when developing Python APIs, web applications, data pipelines, or writing tests.

moai-icons-vector

from diegosouzapw/awesome-omni-skill

Vector icon libraries ecosystem guide covering 10+ major libraries with 200K+ icons, including React Icons (35K+), Lucide (1000+), Tabler Icons (5900+), Iconify (200K+), Heroicons, Phosphor, and Radix Icons with implementation patterns, decision trees, and best practices.

moai-foundation-trust

from diegosouzapw/awesome-omni-skill

Complete TRUST 4 principles guide covering Test First, Readable, Unified, Secured. Validation methods, enterprise quality gates, metrics, and November 2025 standards. Enterprise v4.0 with 50+ software quality standards references.

moai-foundation-memory

from diegosouzapw/awesome-omni-skill

Persistent memory across sessions using MCP Memory Server for user preferences, project context, and learned patterns

moai-foundation-core

from diegosouzapw/awesome-omni-skill

MoAI-ADK's foundational principles - TRUST 5, SPEC-First TDD, delegation patterns, token optimization, progressive disclosure, modular architecture, agent catalog, command reference, and execution rules for building AI-powered development workflows

moai-cc-claude-md

from diegosouzapw/awesome-omni-skill

Authoring CLAUDE.md Project Instructions. Design project-specific AI guidance, document workflows, define architecture patterns. Use when creating CLAUDE.md files for projects, documenting team standards, or establishing AI collaboration guidelines.

moai-alfred-language-detection

from diegosouzapw/awesome-omni-skill

Auto-detects project language and framework from package.json, pyproject.toml, etc.

mnemonic

from diegosouzapw/awesome-omni-skill

Unified memory system - aggregates communications and AI sessions across all channels into searchable, analyzable memory