anth-load-scale
Implement load testing, auto-scaling, and capacity planning for Claude API. Use when running performance benchmarks, planning for traffic spikes, or configuring horizontal scaling for Claude-powered services. Trigger with phrases like "anthropic load test", "claude scaling", "anthropic capacity planning", "scale claude api".
Best use case
anth-load-scale is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using anth-load-scale can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/anth-load-scale/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
How anth-load-scale Compares
| Feature / Agent | anth-load-scale | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Implement load testing, auto-scaling, and capacity planning for Claude API. Use when running performance benchmarks, planning for traffic spikes, or configuring horizontal scaling for Claude-powered services. Trigger with phrases like "anthropic load test", "claude scaling", "anthropic capacity planning", "scale claude api".
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Anthropic Load & Scale
## Overview
Capacity planning and load testing for Claude API integrations. Key constraint: your rate limits (RPM/ITPM/OTPM) are the ceiling, not your infrastructure.
## Capacity Planning
```python
# Calculate required tier based on traffic
def plan_capacity(
    requests_per_minute: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "claude-sonnet-4-20250514",
) -> dict:
    itpm = requests_per_minute * avg_input_tokens
    otpm = requests_per_minute * avg_output_tokens

    # Estimate monthly cost (USD per million input / output tokens)
    pricing = {
        "claude-haiku-4-20250514": (0.80, 4.00),
        "claude-sonnet-4-20250514": (3.00, 15.00),
        "claude-opus-4-20250514": (15.00, 75.00),
    }
    rates = pricing[model]
    cost_per_request = (avg_input_tokens * rates[0] + avg_output_tokens * rates[1]) / 1_000_000
    monthly_cost = cost_per_request * requests_per_minute * 60 * 24 * 30

    return {
        "rpm_needed": requests_per_minute,
        "itpm_needed": itpm,
        "otpm_needed": otpm,
        "cost_per_request": f"${cost_per_request:.4f}",
        "monthly_estimate": f"${monthly_cost:,.0f}",
        "recommendation": "Contact Anthropic sales for Scale tier" if requests_per_minute > 500 else "Self-serve tiers sufficient",
    }

print(plan_capacity(100, 500, 200))
```
## Load Testing Script
```python
import anthropic
import asyncio
import time
from dataclasses import dataclass

@dataclass
class LoadTestResult:
    total_requests: int = 0
    successful: int = 0
    failed: int = 0
    rate_limited: int = 0
    avg_latency_ms: float = 0
    p99_latency_ms: float = 0
    total_input_tokens: int = 0
    total_output_tokens: int = 0

async def load_test(
    concurrency: int = 10,
    total_requests: int = 100,
    model: str = "claude-haiku-4-20250514",
) -> LoadTestResult:
    client = anthropic.AsyncAnthropic()  # async client so requests actually overlap
    result = LoadTestResult()
    latencies = []
    semaphore = asyncio.Semaphore(concurrency)

    async def single_request():
        async with semaphore:
            start = time.monotonic()
            try:
                msg = await client.messages.create(
                    model=model,
                    max_tokens=64,
                    messages=[{"role": "user", "content": "Respond with exactly: OK"}],
                )
                duration = (time.monotonic() - start) * 1000
                latencies.append(duration)
                result.successful += 1
                result.total_input_tokens += msg.usage.input_tokens
                result.total_output_tokens += msg.usage.output_tokens
            except anthropic.RateLimitError:
                result.rate_limited += 1
            except Exception:
                result.failed += 1
            result.total_requests += 1

    tasks = [single_request() for _ in range(total_requests)]
    await asyncio.gather(*tasks)

    if latencies:
        latencies.sort()
        result.avg_latency_ms = sum(latencies) / len(latencies)
        result.p99_latency_ms = latencies[int(len(latencies) * 0.99)]
    return result

# Run: asyncio.run(load_test(concurrency=10, total_requests=50))
```
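Once a run completes, the raw counts can be converted into observed per-minute rates for comparison against your tier's RPM/ITPM/OTPM limits. A minimal sketch (the `summarize` helper and the sample numbers are illustrative, not part of the script above; `LoadTestResult` is repeated here so the snippet stands alone):

```python
from dataclasses import dataclass

@dataclass
class LoadTestResult:
    total_requests: int = 0
    successful: int = 0
    failed: int = 0
    rate_limited: int = 0
    avg_latency_ms: float = 0
    p99_latency_ms: float = 0
    total_input_tokens: int = 0
    total_output_tokens: int = 0

def summarize(result: LoadTestResult, duration_s: float) -> dict:
    """Convert raw counts into per-minute rates for tier comparison."""
    minutes = duration_s / 60
    return {
        "observed_rpm": result.successful / minutes,
        "observed_itpm": result.total_input_tokens / minutes,
        "observed_otpm": result.total_output_tokens / minutes,
        "error_rate": (result.failed + result.rated() if False else result.rate_limited + result.failed) / max(result.total_requests, 1),
    }

# Synthetic example: 50 requests completed in 30 seconds
r = LoadTestResult(total_requests=50, successful=48, failed=1, rate_limited=1,
                   total_input_tokens=600, total_output_tokens=240)
print(summarize(r, duration_s=30.0))
# → observed_rpm 96.0, observed_itpm 1200.0, observed_otpm 480.0, error_rate 0.04
```

If `observed_rpm` is close to your tier ceiling at target load, the fix is a tier upgrade or one of the scaling strategies below, not more application servers.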
## Scaling Strategies
| Strategy | When | Implementation |
|----------|------|---------------|
| Queue-based processing | > 50 RPM sustained | Redis/SQS queue + worker pool |
| Model routing | Mixed workloads | Haiku for simple, Sonnet for complex |
| Message Batches | Offline processing | 100K requests, 50% cheaper, no RPM impact |
| Prompt caching | Repeated system prompts | 90% input token savings |
| Request coalescing | Duplicate prompts | Cache identical request hashes |
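Of these strategies, request coalescing is the simplest to add in-process. A minimal sketch, assuming identical prompts should yield identical answers (e.g. temperature 0); the cache, hashing scheme, and `call_api` hook are illustrative, not part of any SDK:

```python
import hashlib
import json

_cache: dict[str, str] = {}

def request_key(model: str, messages: list) -> str:
    """Stable hash over the request fields that determine the output."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def coalesced_call(model: str, messages: list, call_api) -> str:
    """Serve identical requests from cache; hit the API only on a miss."""
    key = request_key(model, messages)
    if key not in _cache:
        _cache[key] = call_api(model, messages)  # real API call only here
    return _cache[key]

# Demo with a stub in place of the real client
calls = []
def fake_api(model, messages):
    calls.append(1)
    return "OK"

msgs = [{"role": "user", "content": "Respond with exactly: OK"}]
coalesced_call("claude-haiku-4-20250514", msgs, fake_api)
coalesced_call("claude-haiku-4-20250514", msgs, fake_api)
print(len(calls))  # → 1: the duplicate request hit the cache
```

In production the dict would be a shared store (e.g. Redis) with a TTL, so coalescing works across instances as well.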
## Horizontal Scaling Pattern
```python
# Multiple application instances sharing the same API key
# Rate limits are per-organization, NOT per-instance
# Use a shared rate limiter (Redis) to coordinate
import redis
r = redis.Redis()
def check_rate_limit(key: str = "claude:rpm", limit: int = 100, window: int = 60) -> bool:
    current = r.incr(key)
    if current == 1:
        r.expire(key, window)  # start the window on the first request
    return current <= limit
```
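The fixed-window semantics of that INCR/EXPIRE pattern can be sketched without a Redis server using an in-memory stand-in (illustration only; in production the counter must live in Redis so all instances share it, and the `FixedWindowLimiter` class is a hypothetical name):

```python
import time

class FixedWindowLimiter:
    """In-memory stand-in for the Redis INCR/EXPIRE pattern above."""
    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.count = 0
        self.window_start = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= self.window:  # mirrors EXPIRE resetting the key
            self.count = 0
            self.window_start = now
        self.count += 1  # mirrors INCR
        return self.count <= self.limit

def call_with_limit(limiter: FixedWindowLimiter, do_request, poll_s: float = 0.05):
    """Block until the limiter admits us, then issue the request."""
    while not limiter.allow():
        time.sleep(poll_s)
    return do_request()

limiter = FixedWindowLimiter(limit=3, window=60.0)
results = [call_with_limit(limiter, lambda: "sent") for _ in range(3)]
print(results)  # → ['sent', 'sent', 'sent']
```

Fixed windows allow bursts at window boundaries; if that matters, a sliding-window or token-bucket limiter is the usual upgrade.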
## Error Handling
| Issue | Cause | Fix |
|-------|-------|-----|
| 429 during load test | Exceeded tier limits | Reduce concurrency or upgrade tier |
| Increasing latency under load | Output queue saturation | Reduce max_tokens |
| Uneven request distribution | No load balancing | Use queue for fair distribution |
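For the 429 row, the standard mitigation short of a tier upgrade is exponential backoff with jitter. A sketch (the helper name, attempt count, and injectable `sleep` are assumptions for testability; note the official Anthropic SDKs also ship built-in retries):

```python
import random
import time

def with_backoff(fn, max_attempts: int = 5, base_s: float = 1.0,
                 retryable=(Exception,), sleep=time.sleep):
    """Retry fn on retryable errors, doubling the wait with random jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = base_s * (2 ** attempt) * (0.5 + random.random() / 2)
            sleep(delay)

# Demo: a flaky call that succeeds on the third try
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("429")
    return "ok"

print(with_backoff(flaky, sleep=lambda s: None))  # → ok
```

In real use, pass `retryable=(anthropic.RateLimitError,)` so only 429s are retried and genuine bugs still fail fast.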
## Resources
- [Rate Limits](https://docs.anthropic.com/en/api/rate-limits)
- [Service Tiers](https://docs.anthropic.com/en/api/service-tiers)
## Next Steps
For reliability patterns, see `anth-reliability-patterns`.
Related Skills
running-load-tests
Create and execute load tests for performance validation using k6, JMeter, and Artillery. Use when validating application performance under load conditions or identifying bottlenecks. Trigger with phrases like "run load test", "create stress test", or "validate performance under load".
load-testing-apis
Execute comprehensive load and stress testing to validate API performance and scalability. Use when validating API performance under load. Trigger with phrases like "load test the API", "stress test API", or "benchmark API performance".
load-test-scenario-planner
Load Test Scenario Planner: an auto-activating skill in the Performance Testing category. Triggers on "load test scenario planner".
testing-load-balancers
This skill enables Claude to test load balancing strategies. It validates traffic distribution across backend servers, tests failover scenarios when servers become unavailable, verifies sticky sessions, and assesses health check functionality. Use this skill when the user asks to "test load balancer", "validate traffic distribution", "test failover", "verify sticky sessions", or "test health checks". It is specifically designed for testing load balancing configurations using the `load-balancer-tester` plugin.
configuring-load-balancers
This skill configures load balancers, including ALB, NLB, Nginx, and HAProxy. It generates production-ready configurations based on specified requirements and infrastructure. Use this skill when the user asks to "configure load balancer", "create load balancer config", "generate nginx config", "setup HAProxy", or mentions specific load balancer types like "ALB" or "NLB". It's ideal for DevOps tasks, infrastructure automation, and generating load balancer configurations for different environments.
lazy-loading-implementer
Lazy Loading Implementer: an auto-activating skill in the Frontend Development category. Triggers on "lazy loading implementer".
incremental-load-setup
Incremental Load Setup: an auto-activating skill in the Data Pipelines category. Triggers on "incremental load setup".
exa-load-scale
Implement Exa load testing, capacity planning, and scaling strategies. Use when running performance tests, planning capacity for Exa integrations, or designing high-throughput search architectures. Trigger with phrases like "exa load test", "exa scale", "exa capacity", "exa k6", "exa benchmark", "exa throughput".
dataset-loader-creator
Dataset Loader Creator: an auto-activating skill in the ML Training category. Triggers on "dataset loader creator".
customerio-load-scale
Implement Customer.io load testing and horizontal scaling. Use when preparing for high traffic, running load tests, or designing queue-based architectures for scale. Trigger: "customer.io load test", "customer.io scale", "customer.io high volume", "customer.io k6", "customer.io performance test".
clay-load-scale
Scale Clay enrichment pipelines for high-volume processing (10K-100K+ leads/month). Use when planning capacity for large enrichment runs, optimizing batch processing, or designing high-volume Clay architectures. Trigger with phrases like "clay scale", "clay high volume", "clay large batch", "clay capacity planning", "clay 100k leads", "clay bulk enrichment".
clade-load-scale
Scale Claude usage for high-throughput applications: batches, queues, concurrency control, and tier upgrades. Use when working with load-scale patterns. Trigger with "anthropic scale", "claude high volume", "anthropic throughput", "scale claude api", "anthropic concurrent requests".