vLLM — High-Throughput LLM Inference Engine
You are an expert in vLLM, the high-throughput LLM serving engine. You help developers deploy open-source models (Llama, Mistral, Qwen, Phi, Gemma) with PagedAttention for efficient memory management, continuous batching, tensor parallelism for multi-GPU, OpenAI-compatible API, and quantization support — achieving 2-24x higher throughput than HuggingFace Transformers for production LLM serving.
Best use case
vLLM — High-Throughput LLM Inference Engine is best used when you need a repeatable LLM serving workflow (deploying, tuning, and scaling open-source models) rather than a one-off answer.
Teams using vLLM — High-Throughput LLM Inference Engine should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/vllm/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
How vLLM — High-Throughput LLM Inference Engine Compares
| Feature / Agent | vLLM — High-Throughput LLM Inference Engine | Standard Approach |
|---|---|---|
| Throughput | 2-24x higher (PagedAttention + continuous batching) | Baseline |
| Multi-GPU Scaling | Built-in tensor parallelism | Manual sharding / varies |
| API Compatibility | OpenAI-compatible, drop-in for existing SDK clients | Custom integration |
Frequently Asked Questions
What does this skill do?
It turns your AI agent into a vLLM expert that helps you deploy open-source models (Llama, Mistral, Qwen, Phi, Gemma) with PagedAttention for efficient memory management, continuous batching, tensor parallelism for multi-GPU, an OpenAI-compatible API, and quantization support, achieving 2-24x higher throughput than HuggingFace Transformers for production LLM serving.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# vLLM — High-Throughput LLM Inference Engine
You are an expert in vLLM, the high-throughput LLM serving engine. You help developers deploy open-source models (Llama, Mistral, Qwen, Phi, Gemma) with PagedAttention for efficient memory management, continuous batching, tensor parallelism for multi-GPU, OpenAI-compatible API, and quantization support — achieving 2-24x higher throughput than HuggingFace Transformers for production LLM serving.
## Core Capabilities
### Server Deployment
```bash
# Start OpenAI-compatible API server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --quantization awq \
    --api-key my-secret-key
# note: --quantization awq assumes an AWQ-quantized checkpoint;
# omit the flag when serving the base fp16 weights

# Multi-GPU (tensor parallelism)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1 \
    --max-num-seqs 256

# With Docker
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct
```
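Once the server is running, it is worth a quick smoke test before wiring up a client. A minimal sketch in Python, assuming the single-GPU command above (port 8000, API key `my-secret-key`):

```python
# Smoke test: list the models the vLLM server is serving.
# Assumes the server started above on localhost:8000 with --api-key my-secret-key.
import requests

resp = requests.get(
    "http://localhost:8000/v1/models",
    headers={"Authorization": "Bearer my-secret-key"},
    timeout=10,
)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # e.g. meta-llama/Llama-3.1-8B-Instruct
```

If this prints the model ID, the OpenAI-compatible endpoints (`/v1/chat/completions`, `/v1/completions`) are ready to accept requests.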
### OpenAI-Compatible Client
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: "my-secret-key",
});

// Chat completion
const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-8B-Instruct",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Write a Python fibonacci function" },
  ],
  temperature: 0.7,
  max_tokens: 1024,
});

// Streaming
const stream = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-8B-Instruct",
  messages: [{ role: "user", content: "Explain quantum computing" }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

// Embeddings — assumes a separate vLLM server launched with the embedding model
const embeddings = await client.embeddings.create({
  model: "BAAI/bge-large-en-v1.5",
  input: ["Your text here"],
});
```
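Because the server speaks the OpenAI protocol, the official Python SDK works unchanged as well; only `base_url` and `api_key` differ from a stock OpenAI setup. A minimal sketch against the server configured above:

```python
# Same server, official OpenAI Python SDK; no vLLM-specific client code needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="my-secret-key")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a Python fibonacci function"}],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```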
### Python Offline Inference
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="awq",  # assumes an AWQ-quantized checkpoint; omit for fp16 weights
    gpu_memory_utilization=0.9,
    max_model_len=8192,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Batch inference (processes all prompts efficiently)
prompts = [
    "Explain machine learning in simple terms",
    "Write a haiku about programming",
    "What is the capital of France?",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Output: {output.outputs[0].text}")
    # metrics.finished_time and metrics.arrival_time are wall-clock timestamps,
    # so their difference gives the request duration
    elapsed = output.metrics.finished_time - output.metrics.arrival_time
    print(f"Tokens/sec: {len(output.outputs[0].token_ids) / elapsed:.1f}")
```
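Recent vLLM releases also expose a chat-oriented entry point, `LLM.chat`, which applies the model's chat template before generating. A minimal sketch reusing the `llm` and `sampling_params` objects above (check that your installed version supports it):

```python
# Chat-template-aware offline inference via LLM.chat (recent vLLM versions).
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python fibonacci function"},
]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```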
## Installation
```bash
pip install vllm
# Requires: CUDA 12.1+, PyTorch 2.4+
# GPU: NVIDIA A100, H100, L40S, RTX 4090 recommended
```
## Best Practices
1. **PagedAttention** — vLLM's core innovation; manages KV cache like OS virtual memory, eliminates waste
2. **Continuous batching** — Processes new requests immediately without waiting; maximizes GPU utilization
3. **Quantization** — Use AWQ or GPTQ for 4-bit inference; 2-3x more throughput, minimal quality loss
4. **Tensor parallelism** — Split model across GPUs with `--tensor-parallel-size`; serve 70B+ models
5. **OpenAI compatibility** — Drop-in replacement for OpenAI API; any OpenAI SDK client works unchanged
6. **GPU memory** — Set `--gpu-memory-utilization 0.9` for max throughput; leave 10% for overhead
7. **Max sequences** — Tune `--max-num-seqs` based on your workload; higher = more concurrent requests
8. **Prefix caching** — Enable for shared system prompts; reuses KV cache across requests with the same prefix (see the sketch below)
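A minimal sketch of prefix caching with the offline API (the server-side equivalent is the `--enable-prefix-caching` flag); the shared system prompt here is a hypothetical example:

```python
from vllm import LLM, SamplingParams

# Reuse cached KV blocks for identical prompt prefixes across requests.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)
params = SamplingParams(temperature=0.7, max_tokens=256)

# Hypothetical long system prompt shared by every request.
shared_prefix = "You are a support agent for ACME Corp. Follow the policy below.\n..."
questions = ["How do I reset my password?", "What is the refund window?"]

# After the first request, the KV cache for shared_prefix is reused,
# so later prompts only pay prefill cost for their unique suffix.
outputs = llm.generate([f"{shared_prefix}\n\nUser: {q}" for q in questions], params)
for out in outputs:
    print(out.outputs[0].text)
```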
Related Skills
Socratic Method: The Dialectic Engine
This skill transforms Claude into a Socratic agent — a cognitive partner who guides
vertex-engine-inspector
Inspect and validate Vertex AI Agent Engine deployments including Code Execution Sandbox, Memory Bank, A2A protocol compliance, and security posture. Generates production readiness scores. Use when asked to inspect, validate, or audit an Agent Engine deployment. Trigger with "inspect agent engine", "validate agent engine deployment", "check agent engine config", "audit agent engine security", "agent engine readiness check", "vertex engine health", or "reasoning engine status".
triton-inference-config
Triton Inference Config - Auto-activating skill for ML Deployment. Triggers on: "triton inference config". Part of the ML Deployment skill category.
throughput-calculator
Throughput Calculator - Auto-activating skill for Performance Testing. Triggers on: "throughput calculator". Part of the Performance Testing skill category.
analyzing-system-throughput
This skill enables Claude to analyze and optimize system throughput. It is triggered when the user requests throughput analysis, performance improvements, or bottleneck identification. The skill uses the `throughput-analyzer` plugin to assess request throughput, data throughput, concurrency limits, queue processing, and resource saturation. Use this skill when the user mentions "analyze throughput", "optimize performance", "identify bottlenecks", or asks about system capacity. It helps determine limiting factors and evaluate scaling strategies.
inference-latency-profiler
Inference Latency Profiler - Auto-activating skill for ML Deployment. Triggers on: "inference latency profiler". Part of the ML Deployment skill category.
engineering-features-for-machine-learning
This skill empowers Claude to perform feature engineering tasks for machine learning. It creates, selects, and transforms features to improve model performance. Use this skill when the user requests feature creation, feature selection, feature transformation, or any request that involves improving the features used in a machine learning model. Trigger terms include "feature engineering", "feature selection", "feature transformation", "create features", "select features", "transform features", "improve model performance", and similar phrases related to feature manipulation.
feature-engineering-helper
Feature Engineering Helper - Auto-activating skill for ML Training. Triggers on: "feature engineering helper". Part of the ML Training skill category.
clade-model-inference
Stream Claude responses, use system prompts, handle multi-turn conversations, and process structured output with the Messages API. Use when working with model-inference patterns. Trigger with "anthropic streaming", "claude messages api", "claude inference", "stream claude response".
conducting-chaos-engineering
This skill enables Claude to design and execute chaos engineering experiments to test system resilience. It is used when the user requests help with failure injection, latency simulation, resource exhaustion testing, or resilience validation. The skill is triggered by discussions of chaos experiments (GameDays), failure injection strategies, resilience testing, and validation of recovery mechanisms like circuit breakers and retry logic. It leverages tools like Chaos Mesh, Gremlin, Toxiproxy, and AWS FIS to simulate real-world failures and assess system behavior.
batch-inference-pipeline
Batch Inference Pipeline - Auto-activating skill for ML Deployment. Triggers on: "batch inference pipeline". Part of the ML Deployment skill category.
adk-engineer
Expert software engineer specializing in creating production-ready ADK agents with best practices, code structure, testing, and deployment automation. Use when asked to "build ADK agent", "create agent code", or "engineer ADK application".