vLLM — High-Throughput LLM Inference Engine
You are an expert in vLLM, the high-throughput LLM serving engine. You help developers deploy open-source models (Llama, Mistral, Qwen, Phi, Gemma) with PagedAttention for efficient memory management, continuous batching, tensor parallelism for multi-GPU, OpenAI-compatible API, and quantization support — achieving 2-24x higher throughput than HuggingFace Transformers for production LLM serving.
Best use case
vLLM — High-Throughput LLM Inference Engine is best used when you need a repeatable LLM serving workflow (deploying, tuning, and scaling open-source models) rather than a one-off answer.
Teams using vLLM — High-Throughput LLM Inference Engine should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/vllm/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
How vLLM — High-Throughput LLM Inference Engine Compares
| Feature / Agent | vLLM — High-Throughput LLM Inference Engine | Standard Approach |
|---|---|---|
| Throughput | 2-24x higher (PagedAttention + continuous batching) | Baseline |
| Multi-GPU Scaling | Built-in tensor parallelism | Manual sharding / varies |
| API Compatibility | OpenAI-compatible, drop-in for existing SDK clients | Custom integration |
Frequently Asked Questions
What does this skill do?
It turns your AI agent into a vLLM expert that helps you deploy open-source models (Llama, Mistral, Qwen, Phi, Gemma) with PagedAttention for efficient memory management, continuous batching, tensor parallelism for multi-GPU, an OpenAI-compatible API, and quantization support, achieving 2-24x higher throughput than HuggingFace Transformers for production LLM serving.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# vLLM — High-Throughput LLM Inference Engine
You are an expert in vLLM, the high-throughput LLM serving engine. You help developers deploy open-source models (Llama, Mistral, Qwen, Phi, Gemma) with PagedAttention for efficient memory management, continuous batching, tensor parallelism for multi-GPU, OpenAI-compatible API, and quantization support — achieving 2-24x higher throughput than HuggingFace Transformers for production LLM serving.
## Core Capabilities
### Server Deployment
```bash
# Start OpenAI-compatible API server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --quantization awq \
    --api-key my-secret-key
# note: --quantization awq assumes an AWQ-quantized checkpoint;
# omit the flag when serving the base fp16 weights

# Multi-GPU (tensor parallelism)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1 \
    --max-num-seqs 256

# With Docker
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct
```
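Once the server is running, it is worth a quick smoke test before wiring up a client. A minimal sketch in Python, assuming the single-GPU command above (port 8000, API key `my-secret-key`):

```python
# Smoke test: list the models the vLLM server is serving.
# Assumes the server started above on localhost:8000 with --api-key my-secret-key.
import requests

resp = requests.get(
    "http://localhost:8000/v1/models",
    headers={"Authorization": "Bearer my-secret-key"},
    timeout=10,
)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # e.g. meta-llama/Llama-3.1-8B-Instruct
```

If this prints the model ID, the OpenAI-compatible endpoints (`/v1/chat/completions`, `/v1/completions`) are ready to accept requests.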
### OpenAI-Compatible Client
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: "my-secret-key",
});

// Chat completion
const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-8B-Instruct",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Write a Python fibonacci function" },
  ],
  temperature: 0.7,
  max_tokens: 1024,
});

// Streaming
const stream = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-8B-Instruct",
  messages: [{ role: "user", content: "Explain quantum computing" }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

// Embeddings — assumes a separate vLLM server launched with the embedding model
const embeddings = await client.embeddings.create({
  model: "BAAI/bge-large-en-v1.5",
  input: ["Your text here"],
});
```
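Because the server speaks the OpenAI protocol, the official Python SDK works unchanged as well; only `base_url` and `api_key` differ from a stock OpenAI setup. A minimal sketch against the server configured above:

```python
# Same server, official OpenAI Python SDK; no vLLM-specific client code needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="my-secret-key")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a Python fibonacci function"}],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```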
### Python Offline Inference
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="awq",  # assumes an AWQ-quantized checkpoint; omit for fp16 weights
    gpu_memory_utilization=0.9,
    max_model_len=8192,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Batch inference (processes all prompts efficiently)
prompts = [
    "Explain machine learning in simple terms",
    "Write a haiku about programming",
    "What is the capital of France?",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Output: {output.outputs[0].text}")
    # metrics.finished_time and metrics.arrival_time are wall-clock timestamps,
    # so their difference gives the request duration
    elapsed = output.metrics.finished_time - output.metrics.arrival_time
    print(f"Tokens/sec: {len(output.outputs[0].token_ids) / elapsed:.1f}")
```
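Recent vLLM releases also expose a chat-oriented entry point, `LLM.chat`, which applies the model's chat template before generating. A minimal sketch reusing the `llm` and `sampling_params` objects above (check that your installed version supports it):

```python
# Chat-template-aware offline inference via LLM.chat (recent vLLM versions).
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python fibonacci function"},
]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```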
## Installation
```bash
pip install vllm
# Requires: CUDA 12.1+, PyTorch 2.4+
# GPU: NVIDIA A100, H100, L40S, RTX 4090 recommended
```
## Best Practices
1. **PagedAttention** — vLLM's core innovation; manages KV cache like OS virtual memory, eliminates waste
2. **Continuous batching** — Processes new requests immediately without waiting; maximizes GPU utilization
3. **Quantization** — Use AWQ or GPTQ for 4-bit inference; 2-3x more throughput, minimal quality loss
4. **Tensor parallelism** — Split model across GPUs with `--tensor-parallel-size`; serve 70B+ models
5. **OpenAI compatibility** — Drop-in replacement for OpenAI API; any OpenAI SDK client works unchanged
6. **GPU memory** — Set `--gpu-memory-utilization 0.9` for max throughput; leave 10% for overhead
7. **Max sequences** — Tune `--max-num-seqs` based on your workload; higher = more concurrent requests
8. **Prefix caching** — Enable for shared system prompts; reuses KV cache across requests with the same prefix (see the sketch below)
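A minimal sketch of prefix caching with the offline API (the server-side equivalent is the `--enable-prefix-caching` flag); the shared system prompt here is a hypothetical example:

```python
from vllm import LLM, SamplingParams

# Reuse cached KV blocks for identical prompt prefixes across requests.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)
params = SamplingParams(temperature=0.7, max_tokens=256)

# Hypothetical long system prompt shared by every request.
shared_prefix = "You are a support agent for ACME Corp. Follow the policy below.\n..."
questions = ["How do I reset my password?", "What is the refund window?"]

# After the first request, the KV cache for shared_prefix is reused,
# so later prompts only pay prefill cost for their unique suffix.
outputs = llm.generate([f"{shared_prefix}\n\nUser: {q}" for q in questions], params)
for out in outputs:
    print(out.outputs[0].text)
```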
Related Skills
Socratic Method: The Dialectic Engine
This skill transforms Claude into a Socratic agent — a cognitive partner who guides
vertex-engine-inspector
Inspect and validate Vertex AI Agent Engine deployments including Code Execution Sandbox, Memory Bank, A2A protocol compliance, and security posture. Generates production readiness scores. Use when asked to inspect, validate, or audit an Agent Engine deployment. Trigger with "inspect agent engine", "validate agent engine deployment", "check agent engine config", "audit agent engine security", "agent engine readiness check", "vertex engine health", or "reasoning engine status".
triton-inference-config
Triton Inference Config - Auto-activating skill for ML Deployment. Triggers on: "triton inference config". Part of the ML Deployment skill category.
throughput-calculator
Throughput Calculator - Auto-activating skill for Performance Testing. Triggers on: "throughput calculator". Part of the Performance Testing skill category.
analyzing-system-throughput
This skill enables Claude to analyze and optimize system throughput. It is triggered when the user requests throughput analysis, performance improvements, or bottleneck identification. The skill uses the `throughput-analyzer` plugin to assess request throughput, data throughput, concurrency limits, queue processing, and resource saturation. Use this skill when the user mentions "analyze throughput", "optimize performance", "identify bottlenecks", or asks about system capacity. It helps determine limiting factors and evaluate scaling strategies.
inference-latency-profiler
Inference Latency Profiler - Auto-activating skill for ML Deployment. Triggers on: "inference latency profiler". Part of the ML Deployment skill category.
engineering-features-for-machine-learning
This skill empowers Claude to perform feature engineering tasks for machine learning. It creates, selects, and transforms features to improve model performance. Use this skill when the user requests feature creation, feature selection, feature transformation, or any request that involves improving the features used in a machine learning model. Trigger terms include "feature engineering", "feature selection", "feature transformation", "create features", "select features", "transform features", "improve model performance", and similar phrases related to feature manipulation.
feature-engineering-helper
Feature Engineering Helper - Auto-activating skill for ML Training. Triggers on: "feature engineering helper". Part of the ML Training skill category.
clade-model-inference
Stream Claude responses, use system prompts, handle multi-turn conversations, and process structured output with the Messages API. Use when working with model-inference patterns. Trigger with "anthropic streaming", "claude messages api", "claude inference", "stream claude response".
conducting-chaos-engineering
This skill enables Claude to design and execute chaos engineering experiments to test system resilience. It is used when the user requests help with failure injection, latency simulation, resource exhaustion testing, or resilience validation. The skill is triggered by discussions of chaos experiments (GameDays), failure injection strategies, resilience testing, and validation of recovery mechanisms like circuit breakers and retry logic. It leverages tools like Chaos Mesh, Gremlin, Toxiproxy, and AWS FIS to simulate real-world failures and assess system behavior.
batch-inference-pipeline
Batch Inference Pipeline - Auto-activating skill for ML Deployment. Triggers on: "batch inference pipeline". Part of the ML Deployment skill category.
adk-engineer
Expert software engineer specializing in creating production-ready ADK agents with best practices, code structure, testing, and deployment automation. Use when asked to "build ADK agent", "create agent code", or "engineer ADK application".