openrouter-load-balancing
Distribute OpenRouter requests across multiple keys and models for high throughput. Use when scaling beyond single-key rate limits or building high-availability systems. Triggers: 'openrouter load balance', 'openrouter scaling', 'distribute openrouter requests', 'multiple api keys'.
Best use case
openrouter-load-balancing is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using openrouter-load-balancing should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/openrouter-load-balancing/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How openrouter-load-balancing Compares
| Feature / Agent | openrouter-load-balancing | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It distributes OpenRouter requests across multiple API keys and models so you can scale past single-key rate limits and build high-availability systems.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# OpenRouter Load Balancing
## Overview
A single OpenRouter API key has rate limits (requests/minute and tokens/minute). To scale beyond those limits, distribute requests across multiple keys. OpenRouter also provides server-side load balancing via provider routing and the `:nitro` model suffix, which prioritizes providers by throughput. This skill covers multi-key rotation, health-based routing, circuit breakers, and concurrent request patterns.
## Multi-Key Round Robin
```python
import os, itertools, time, logging
from dataclasses import dataclass, field
from openai import OpenAI, RateLimitError

log = logging.getLogger("openrouter.lb")

COOLDOWN_SECONDS = 60   # how long an unhealthy key sits out before retrying it
ERROR_THRESHOLD = 3     # consecutive errors before the circuit breaker trips


@dataclass
class KeyPool:
    """Round-robin API key pool with health tracking."""
    keys: list[str]
    _cycle: itertools.cycle = field(init=False, repr=False)
    _health: dict[str, dict] = field(init=False, default_factory=dict)

    def __post_init__(self):
        # Drop unset keys so the pool never hands out an empty string
        self.keys = [k for k in self.keys if k]
        if not self.keys:
            raise ValueError("KeyPool needs at least one non-empty API key")
        self._cycle = itertools.cycle(self.keys)
        self._health = {k: {"errors": 0, "last_error": 0.0, "healthy": True} for k in self.keys}

    def next_key(self) -> str:
        """Return the next healthy key, reviving keys whose cooldown has expired."""
        for _ in range(len(self.keys)):
            key = next(self._cycle)
            h = self._health[key]
            if not h["healthy"] and time.time() - h["last_error"] > COOLDOWN_SECONDS:
                h["healthy"] = True
                h["errors"] = 0
            if h["healthy"]:
                return key
        # All keys unhealthy: fall back to plain round robin
        return next(self._cycle)

    def mark_error(self, key: str):
        h = self._health[key]
        h["errors"] += 1
        h["last_error"] = time.time()
        if h["errors"] >= ERROR_THRESHOLD:  # circuit breaker trips
            h["healthy"] = False
            log.warning("Key %s... marked unhealthy after %d errors", key[:12], h["errors"])

    def mark_success(self, key: str):
        self._health[key]["errors"] = 0
        self._health[key]["healthy"] = True


pool = KeyPool(keys=[
    os.environ.get("OPENROUTER_KEY_1", ""),
    os.environ.get("OPENROUTER_KEY_2", ""),
    os.environ.get("OPENROUTER_KEY_3", ""),
])

_clients: dict[str, OpenAI] = {}


def _client_for(key: str) -> OpenAI:
    """Reuse one client per key instead of building a new client per request."""
    if key not in _clients:
        _clients[key] = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=key,
            max_retries=0,  # let the pool rotate keys instead of retrying the same key
            default_headers={"HTTP-Referer": "https://my-app.com", "X-Title": "my-app"},
        )
    return _clients[key]


def balanced_completion(messages, model="anthropic/claude-3.5-sonnet", **kwargs):
    """Send a request with the next healthy key, rotating to another key on 429."""
    last_exc = None
    # Try each key at most once instead of recursing without bound
    for _ in range(len(pool.keys)):
        key = pool.next_key()
        try:
            response = _client_for(key).chat.completions.create(
                model=model, messages=messages, **kwargs
            )
        except RateLimitError as exc:
            pool.mark_error(key)
            last_exc = exc
            continue
        pool.mark_success(key)
        return response
    # Every key was rate-limited on this pass
    raise last_exc
```
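When every key in the pool is rate-limited at the same moment, retrying immediately only produces more 429s. A minimal sketch of capped exponential backoff with full jitter, assuming you wrap any request callable with it; `with_backoff`, `is_retryable`, and the delay defaults are hypothetical names and values, not part of the OpenAI or OpenRouter SDKs.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    fn: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying retryable errors with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise immediately on non-retryable errors or on the final attempt
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            # Full jitter: wait a random time between 0 and the capped exponential delay
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable: the loop always returns or raises")
```

For example, `with_backoff(lambda: client.chat.completions.create(model=..., messages=msgs), lambda e: isinstance(e, RateLimitError))`. Injecting `sleep` keeps the helper testable without real waits; jitter spreads retries from many workers so they do not hit the API in lockstep.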
## Concurrent Request Processing
```python
import asyncio
import os
from openai import AsyncOpenAI


async def parallel_completions(prompts: list[str], model="openai/gpt-4o-mini",
                               max_concurrent=5, **kwargs):
    """Process multiple prompts concurrently, capping the number of in-flight requests."""
    semaphore = asyncio.Semaphore(max_concurrent)
    client = AsyncOpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
        default_headers={"HTTP-Referer": "https://my-app.com", "X-Title": "my-app"},
    )

    async def process_one(prompt: str):
        # At most max_concurrent requests run at the same time
        async with semaphore:
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                **kwargs,
            )
            return response.choices[0].message.content

    # gather returns results in the same order as the input prompts
    return await asyncio.gather(*[process_one(p) for p in prompts])


# Usage
results = asyncio.run(parallel_completions(
    ["Summarize X", "Translate Y", "Analyze Z"],
    max_concurrent=3,
    max_tokens=500,
))
```
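A semaphore caps how many requests are in flight, not how many start per interval. If a key's limit is expressed as "N requests per interval", a rolling-window limiter can sit in front of the semaphore. A minimal sketch, assuming you set `max_requests` and `interval` from your own key's limits; `AsyncRateLimiter` is a hypothetical helper, not an SDK class.

```python
import asyncio
import time
from collections import deque


class AsyncRateLimiter:
    """Allow at most max_requests acquisitions in any rolling `interval` seconds."""

    def __init__(self, max_requests: int, interval: float):
        self.max_requests = max_requests
        self.interval = interval
        self._stamps = deque()       # start times of recent requests
        self._lock = asyncio.Lock()  # serializes waiters so they are admitted in order

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                # Drop start times that have left the window
                while self._stamps and now - self._stamps[0] >= self.interval:
                    self._stamps.popleft()
                if len(self._stamps) < self.max_requests:
                    self._stamps.append(now)
                    return
                # Sleep until the oldest request leaves the window, then re-check
                await asyncio.sleep(self.interval - (now - self._stamps[0]))
```

Inside `process_one`, call `await limiter.acquire()` just before entering the semaphore, so requests are both rate-limited and concurrency-limited.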
## Provider-Level Load Balancing
```python
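import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# OpenRouter can route the same model to different upstream providers
response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=200,
    extra_body={
        "provider": {
            # Try these providers in this order; slugs must match
            # OpenRouter's provider list (verify in the provider routing docs)
            "order": ["anthropic", "amazon-bedrock", "google-vertex"],
            # If the listed providers fail, let OpenRouter fall back to others
            "allow_fallbacks": True,
        },
    },
)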
```
## Rate Limit Awareness
```python
import requests


def check_rate_limits(api_key: str) -> dict:
    """Check current rate limit and credit status for a key."""
    resp = requests.get(
        "https://openrouter.ai/api/v1/auth/key",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()["data"]
    rate_limit = data.get("rate_limit") or {}  # tolerate a missing rate_limit field
    return {
        "requests_limit": rate_limit.get("requests"),
        "interval": rate_limit.get("interval"),
        "credits_used": data["usage"],
        "credits_limit": data.get("limit"),  # None when the key has no credit limit
    }


# Check all keys in the pool (uses `pool` from the round-robin example)
for key in pool.keys:
    limits = check_rate_limits(key)
    print(f"Key {key[:12]}...: {limits}")
```
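The per-key status is most useful when it feeds back into routing. A minimal sketch, assuming each value has the shape returned by `check_rate_limits` and treating a `None` credit limit as uncapped; `keys_with_headroom` and the `min_remaining` threshold are hypothetical.

```python
def keys_with_headroom(limits_by_key: dict[str, dict], min_remaining: float = 1.0) -> list[str]:
    """Return keys whose remaining credit is at least min_remaining.

    A missing or None credits_limit is treated as "no per-key cap".
    """
    usable = []
    for key, limits in limits_by_key.items():
        cap = limits.get("credits_limit")
        if cap is None or cap - limits["credits_used"] >= min_remaining:
            usable.append(key)
    return usable
```

For example, rebuild the pool periodically with `KeyPool(keys=keys_with_headroom({k: check_rate_limits(k) for k in pool.keys}))` so nearly exhausted keys stop receiving traffic before they start failing.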
## Error Handling
| Error | Cause | Fix |
|-------|-------|-----|
| 429 on all keys | All keys rate-limited simultaneously | Add more keys; implement request queuing |
| Uneven load distribution | Round-robin not accounting for in-flight requests | Use weighted distribution based on current load |
| Key health false positive | Transient error marked key unhealthy | Use sliding window (3 errors in 60s) before marking unhealthy |
| Concurrent request failures | Too many parallel requests | Reduce semaphore limit; add backoff |
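For the uneven-distribution row, one load-aware alternative to plain round robin is least-in-flight selection: each request goes to the key currently serving the fewest requests. A minimal sketch, assuming a single process; `LeastLoadedPool` and `lease` are hypothetical names.

```python
import threading
from contextlib import contextmanager


class LeastLoadedPool:
    """Pick the key with the fewest in-flight requests (ties go to the earliest key)."""

    def __init__(self, keys: list[str]):
        if not keys:
            raise ValueError("need at least one key")
        self._in_flight = {k: 0 for k in keys}
        self._lock = threading.Lock()

    @contextmanager
    def lease(self):
        """Yield a key for the duration of one request, then release it."""
        with self._lock:
            key = min(self._in_flight, key=self._in_flight.__getitem__)
            self._in_flight[key] += 1
        try:
            yield key
        finally:
            # Release even if the request raised
            with self._lock:
                self._in_flight[key] -= 1

    def in_flight(self, key: str) -> int:
        with self._lock:
            return self._in_flight[key]
```

Wrap each request as `with lb.lease() as key: ...`. Because nothing is awaited while the lock is held, the same class also works from asyncio code running on one event loop.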
## Enterprise Considerations
- Create separate API keys per service/team with individual credit limits for cost isolation
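- Use 3+ keys to raise effective throughput, but first confirm in OpenRouter's rate-limit docs whether your limits are enforced per key or per account; extra keys only help if each key gets its own quota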
- Implement circuit breakers: mark keys unhealthy after N consecutive errors, recover after cooldown
- Use `asyncio.Semaphore` to control concurrency and prevent overwhelming the API
- Monitor per-key error rates and latency to detect degraded keys early
- Combine multi-key rotation with provider routing for maximum resilience
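The sliding-window idea from the error table and the per-key monitoring advice above can share one structure. A minimal sketch, assuming you record each request's outcome and latency yourself; `KeyHealthWindow`, its 60-second window, and the 3-error threshold are illustrative defaults, and the injectable `clock` exists only to make the logic testable.

```python
import time
from collections import defaultdict, deque


class KeyHealthWindow:
    """Track per-key outcomes in a rolling time window."""

    def __init__(self, window_seconds: float = 60.0, max_errors: int = 3, clock=time.monotonic):
        self.window = window_seconds
        self.max_errors = max_errors
        self.clock = clock
        # key -> deque of (timestamp, ok, latency_seconds), oldest first
        self._events = defaultdict(deque)

    def record(self, key: str, ok: bool, latency: float) -> None:
        self._events[key].append((self.clock(), ok, latency))

    def _recent(self, key: str):
        events = self._events[key]
        cutoff = self.clock() - self.window
        # Evict events older than the window
        while events and events[0][0] < cutoff:
            events.popleft()
        return events

    def is_healthy(self, key: str) -> bool:
        """Unhealthy once max_errors failures fall inside the window."""
        errors = sum(1 for _, ok, _ in self._recent(key) if not ok)
        return errors < self.max_errors

    def stats(self, key: str) -> dict:
        events = self._recent(key)
        n = len(events)
        errors = sum(1 for _, ok, _ in events if not ok)
        return {
            "requests": n,
            "error_rate": errors / n if n else 0.0,
            "avg_latency": sum(lat for _, _, lat in events) / n if n else 0.0,
        }
```

Unlike the consecutive-error counter in `KeyPool`, a single success does not reset the count here, so a key that fails intermittently is still caught, while old errors age out on their own.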
## References
- [Examples](${CLAUDE_SKILL_DIR}/references/examples.md) | [Errors](${CLAUDE_SKILL_DIR}/references/errors.md)
- [Rate Limits](https://openrouter.ai/docs/api/limits) | [Provider Routing](https://openrouter.ai/docs/features/provider-routing)