anth-load-scale
Implement load testing, auto-scaling, and capacity planning for Claude API. Use when running performance benchmarks, planning for traffic spikes, or configuring horizontal scaling for Claude-powered services. Trigger with phrases like "anthropic load test", "claude scaling", "anthropic capacity planning", "scale claude api".
Best use case
anth-load-scale is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using anth-load-scale can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/anth-load-scale/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
How anth-load-scale Compares
| Feature / Agent | anth-load-scale | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Implement load testing, auto-scaling, and capacity planning for Claude API. Use when running performance benchmarks, planning for traffic spikes, or configuring horizontal scaling for Claude-powered services. Trigger with phrases like "anthropic load test", "claude scaling", "anthropic capacity planning", "scale claude api".
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Anthropic Load & Scale
## Overview
Capacity planning and load testing for Claude API integrations. Key constraint: your rate limits (RPM/ITPM/OTPM) are the ceiling, not your infrastructure.
## Capacity Planning
```python
# Calculate required tier based on traffic
def plan_capacity(
    requests_per_minute: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "claude-sonnet-4-20250514",
) -> dict:
    itpm = requests_per_minute * avg_input_tokens
    otpm = requests_per_minute * avg_output_tokens

    # Estimate monthly cost (USD per million input / output tokens)
    pricing = {
        "claude-haiku-4-20250514": (0.80, 4.00),
        "claude-sonnet-4-20250514": (3.00, 15.00),
        "claude-opus-4-20250514": (15.00, 75.00),
    }
    rates = pricing[model]
    cost_per_request = (avg_input_tokens * rates[0] + avg_output_tokens * rates[1]) / 1_000_000
    monthly_cost = cost_per_request * requests_per_minute * 60 * 24 * 30

    return {
        "rpm_needed": requests_per_minute,
        "itpm_needed": itpm,
        "otpm_needed": otpm,
        "cost_per_request": f"${cost_per_request:.4f}",
        "monthly_estimate": f"${monthly_cost:,.0f}",
        "recommendation": "Contact Anthropic sales for Scale tier" if requests_per_minute > 500 else "Self-serve tiers sufficient",
    }

print(plan_capacity(100, 500, 200))
```
## Load Testing Script
```python
import anthropic
import asyncio
import time
from dataclasses import dataclass

@dataclass
class LoadTestResult:
    total_requests: int = 0
    successful: int = 0
    failed: int = 0
    rate_limited: int = 0
    avg_latency_ms: float = 0
    p99_latency_ms: float = 0
    total_input_tokens: int = 0
    total_output_tokens: int = 0

async def load_test(
    concurrency: int = 10,
    total_requests: int = 100,
    model: str = "claude-haiku-4-20250514",
) -> LoadTestResult:
    client = anthropic.AsyncAnthropic()  # async client so requests actually overlap
    result = LoadTestResult()
    latencies = []
    semaphore = asyncio.Semaphore(concurrency)

    async def single_request():
        async with semaphore:
            start = time.monotonic()
            try:
                msg = await client.messages.create(
                    model=model,
                    max_tokens=64,
                    messages=[{"role": "user", "content": "Respond with exactly: OK"}],
                )
                duration = (time.monotonic() - start) * 1000
                latencies.append(duration)
                result.successful += 1
                result.total_input_tokens += msg.usage.input_tokens
                result.total_output_tokens += msg.usage.output_tokens
            except anthropic.RateLimitError:
                result.rate_limited += 1
            except Exception:
                result.failed += 1
            result.total_requests += 1

    tasks = [single_request() for _ in range(total_requests)]
    await asyncio.gather(*tasks)

    if latencies:
        latencies.sort()
        result.avg_latency_ms = sum(latencies) / len(latencies)
        result.p99_latency_ms = latencies[int(len(latencies) * 0.99)]
    return result

# Run: asyncio.run(load_test(concurrency=10, total_requests=50))
```
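Once a run completes, the raw counts can be converted into observed per-minute rates for comparison against your tier's RPM/ITPM/OTPM limits. A minimal sketch (the `summarize` helper and the sample numbers are illustrative, not part of the script above; `LoadTestResult` is repeated here so the snippet stands alone):

```python
from dataclasses import dataclass

@dataclass
class LoadTestResult:
    total_requests: int = 0
    successful: int = 0
    failed: int = 0
    rate_limited: int = 0
    avg_latency_ms: float = 0
    p99_latency_ms: float = 0
    total_input_tokens: int = 0
    total_output_tokens: int = 0

def summarize(result: LoadTestResult, duration_s: float) -> dict:
    """Convert raw counts into per-minute rates for tier comparison."""
    minutes = duration_s / 60
    return {
        "observed_rpm": result.successful / minutes,
        "observed_itpm": result.total_input_tokens / minutes,
        "observed_otpm": result.total_output_tokens / minutes,
        "error_rate": (result.failed + result.rated() if False else result.rate_limited + result.failed) / max(result.total_requests, 1),
    }

# Synthetic example: 50 requests completed in 30 seconds
r = LoadTestResult(total_requests=50, successful=48, failed=1, rate_limited=1,
                   total_input_tokens=600, total_output_tokens=240)
print(summarize(r, duration_s=30.0))
# → observed_rpm 96.0, observed_itpm 1200.0, observed_otpm 480.0, error_rate 0.04
```

If `observed_rpm` is close to your tier ceiling at target load, the fix is a tier upgrade or one of the scaling strategies below, not more application servers.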
## Scaling Strategies
| Strategy | When | Implementation |
|----------|------|---------------|
| Queue-based processing | > 50 RPM sustained | Redis/SQS queue + worker pool |
| Model routing | Mixed workloads | Haiku for simple, Sonnet for complex |
| Message Batches | Offline processing | 100K requests, 50% cheaper, no RPM impact |
| Prompt caching | Repeated system prompts | 90% input token savings |
| Request coalescing | Duplicate prompts | Cache identical request hashes |
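Of these strategies, request coalescing is the simplest to add in-process. A minimal sketch, assuming identical prompts should yield identical answers (e.g. temperature 0); the cache, hashing scheme, and `call_api` hook are illustrative, not part of any SDK:

```python
import hashlib
import json

_cache: dict[str, str] = {}

def request_key(model: str, messages: list) -> str:
    """Stable hash over the request fields that determine the output."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def coalesced_call(model: str, messages: list, call_api) -> str:
    """Serve identical requests from cache; hit the API only on a miss."""
    key = request_key(model, messages)
    if key not in _cache:
        _cache[key] = call_api(model, messages)  # real API call only here
    return _cache[key]

# Demo with a stub in place of the real client
calls = []
def fake_api(model, messages):
    calls.append(1)
    return "OK"

msgs = [{"role": "user", "content": "Respond with exactly: OK"}]
coalesced_call("claude-haiku-4-20250514", msgs, fake_api)
coalesced_call("claude-haiku-4-20250514", msgs, fake_api)
print(len(calls))  # → 1: the duplicate request hit the cache
```

In production the dict would be a shared store (e.g. Redis) with a TTL, so coalescing works across instances as well.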
## Horizontal Scaling Pattern
```python
# Multiple application instances sharing the same API key
# Rate limits are per-organization, NOT per-instance
# Use a shared rate limiter (Redis) to coordinate
import redis
r = redis.Redis()
def check_rate_limit(key: str = "claude:rpm", limit: int = 100, window: int = 60) -> bool:
    current = r.incr(key)
    if current == 1:
        r.expire(key, window)  # start the window on the first request
    return current <= limit
```
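The fixed-window semantics of that INCR/EXPIRE pattern can be sketched without a Redis server using an in-memory stand-in (illustration only; in production the counter must live in Redis so all instances share it, and the `FixedWindowLimiter` class is a hypothetical name):

```python
import time

class FixedWindowLimiter:
    """In-memory stand-in for the Redis INCR/EXPIRE pattern above."""
    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.count = 0
        self.window_start = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= self.window:  # mirrors EXPIRE resetting the key
            self.count = 0
            self.window_start = now
        self.count += 1  # mirrors INCR
        return self.count <= self.limit

def call_with_limit(limiter: FixedWindowLimiter, do_request, poll_s: float = 0.05):
    """Block until the limiter admits us, then issue the request."""
    while not limiter.allow():
        time.sleep(poll_s)
    return do_request()

limiter = FixedWindowLimiter(limit=3, window=60.0)
results = [call_with_limit(limiter, lambda: "sent") for _ in range(3)]
print(results)  # → ['sent', 'sent', 'sent']
```

Fixed windows allow bursts at window boundaries; if that matters, a sliding-window or token-bucket limiter is the usual upgrade.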
## Error Handling
| Issue | Cause | Fix |
|-------|-------|-----|
| 429 during load test | Exceeded tier limits | Reduce concurrency or upgrade tier |
| Increasing latency under load | Output queue saturation | Reduce max_tokens |
| Uneven request distribution | No load balancing | Use queue for fair distribution |
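For the 429 row, the standard mitigation short of a tier upgrade is exponential backoff with jitter. A sketch (the helper name, attempt count, and injectable `sleep` are assumptions for testability; note the official Anthropic SDKs also ship built-in retries):

```python
import random
import time

def with_backoff(fn, max_attempts: int = 5, base_s: float = 1.0,
                 retryable=(Exception,), sleep=time.sleep):
    """Retry fn on retryable errors, doubling the wait with random jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = base_s * (2 ** attempt) * (0.5 + random.random() / 2)
            sleep(delay)

# Demo: a flaky call that succeeds on the third try
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("429")
    return "ok"

print(with_backoff(flaky, sleep=lambda s: None))  # → ok
```

In real use, pass `retryable=(anthropic.RateLimitError,)` so only 429s are retried and genuine bugs still fail fast.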
## Resources
- [Rate Limits](https://docs.anthropic.com/en/api/rate-limits)
- [Service Tiers](https://docs.anthropic.com/en/api/service-tiers)
## Next Steps
For reliability patterns, see `anth-reliability-patterns`.
Related Skills
running-load-tests
Create and execute load tests for performance validation using k6, JMeter, and Artillery. Use when validating application performance under load conditions or identifying bottlenecks. Trigger with phrases like "run load test", "create stress test", or "validate performance under load".
load-testing-apis
Execute comprehensive load and stress testing to validate API performance and scalability. Use when validating API performance under load. Trigger with phrases like "load test the API", "stress test API", or "benchmark API performance".
load-test-scenario-planner
Load Test Scenario Planner: an auto-activating skill in the Performance Testing category. Triggers on "load test scenario planner".
testing-load-balancers
This skill enables Claude to test load balancing strategies. It validates traffic distribution across backend servers, tests failover scenarios when servers become unavailable, verifies sticky sessions, and assesses health check functionality. Use this skill when the user asks to "test load balancer", "validate traffic distribution", "test failover", "verify sticky sessions", or "test health checks". It is specifically designed for testing load balancing configurations using the `load-balancer-tester` plugin.
configuring-load-balancers
This skill configures load balancers, including ALB, NLB, Nginx, and HAProxy. It generates production-ready configurations based on specified requirements and infrastructure. Use this skill when the user asks to "configure load balancer", "create load balancer config", "generate nginx config", "setup HAProxy", or mentions specific load balancer types like "ALB" or "NLB". It's ideal for DevOps tasks, infrastructure automation, and generating load balancer configurations for different environments.
lazy-loading-implementer
Lazy Loading Implementer: an auto-activating skill in the Frontend Development category. Triggers on "lazy loading implementer".
incremental-load-setup
Incremental Load Setup: an auto-activating skill in the Data Pipelines category. Triggers on "incremental load setup".
exa-load-scale
Implement Exa load testing, capacity planning, and scaling strategies. Use when running performance tests, planning capacity for Exa integrations, or designing high-throughput search architectures. Trigger with phrases like "exa load test", "exa scale", "exa capacity", "exa k6", "exa benchmark", "exa throughput".
dataset-loader-creator
Dataset Loader Creator: an auto-activating skill in the ML Training category. Triggers on "dataset loader creator".
customerio-load-scale
Implement Customer.io load testing and horizontal scaling. Use when preparing for high traffic, running load tests, or designing queue-based architectures for scale. Trigger: "customer.io load test", "customer.io scale", "customer.io high volume", "customer.io k6", "customer.io performance test".
clay-load-scale
Scale Clay enrichment pipelines for high-volume processing (10K-100K+ leads/month). Use when planning capacity for large enrichment runs, optimizing batch processing, or designing high-volume Clay architectures. Trigger with phrases like "clay scale", "clay high volume", "clay large batch", "clay capacity planning", "clay 100k leads", "clay bulk enrichment".
clade-load-scale
Scale Claude usage for high-throughput applications: batches, queues, concurrency control, and tier upgrades. Use when working with load-scale patterns. Trigger with "anthropic scale", "claude high volume", "anthropic throughput", "scale claude api", "anthropic concurrent requests".