freellmapi-proxy

OpenAI-compatible proxy aggregating 14 free-tier LLM providers with automatic failover and per-key rate tracking.

22 stars

Best use case

freellmapi-proxy is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using freellmapi-proxy should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/freellmapi-proxy/SKILL.md --create-dirs "https://raw.githubusercontent.com/Aradotso/trending-skills/main/skills/freellmapi-proxy/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/freellmapi-proxy/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How freellmapi-proxy Compares

| Feature / Agent | freellmapi-proxy | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

OpenAI-compatible proxy aggregating 14 free-tier LLM providers with automatic failover and per-key rate tracking.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# FreeLLMAPI Proxy

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

FreeLLMAPI is a self-hosted OpenAI-compatible proxy that aggregates free-tier API keys from ~14 AI providers (Google, Groq, Cerebras, SambaNova, NVIDIA, Mistral, OpenRouter, GitHub Models, Hugging Face, Cohere, Cloudflare, Zhipu, Moonshot, MiniMax) behind a single `/v1/chat/completions` endpoint. It handles automatic failover on 429/5xx, per-key rate tracking, sticky sessions for multi-turn conversations, and AES-256-GCM encrypted key storage.

---

## Installation

**Prerequisites:** Node.js 20+, npm.

```bash
git clone https://github.com/tashfeenahmed/freellmapi.git
cd freellmapi
npm install

# Generate encryption key and set up environment
cp .env.example .env
echo "ENCRYPTION_KEY=$(node -e "console.log(require('crypto').randomBytes(32).toString('hex'))")" >> .env

# Development (server + Vite dashboard on :5173)
npm run dev

# Production build
npm run build
node server/dist/index.js   # serves API + dashboard on :3001
```

---

## Environment Variables

```bash
# .env
ENCRYPTION_KEY=<64-char hex string>   # Required — AES-256 key for provider key storage
PORT=3001                              # Optional — defaults to 3001
NODE_ENV=production                    # Optional
```

Never commit `.env`. The `ENCRYPTION_KEY` protects all stored provider API keys.
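
If Node is not on hand when preparing `.env`, a key of the same shape (32 random bytes, hex-encoded to 64 characters) can be produced with Python's standard library. This is a minimal sketch; `generate_encryption_key` is a hypothetical helper name, not part of the project.

```python
import secrets


def generate_encryption_key() -> str:
    """Return 32 random bytes encoded as a 64-character lowercase hex string."""
    return secrets.token_hex(32)


key = generate_encryption_key()
# Paste this line into .env (and keep it out of version control).
print(f"ENCRYPTION_KEY={key}")
```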

---

## Key Commands

```bash
npm run dev        # Start Express server + Vite dashboard in watch mode
npm run build      # Compile TypeScript server + build React dashboard
npm run lint       # ESLint across server/ and client/
npm run test       # Run test suite
```

---

## Provider Setup

1. Open the dashboard at `http://localhost:5173` (dev) or `http://localhost:3001` (prod).
2. Navigate to **Keys** page.
3. Add raw API keys for each provider you have. Keys are encrypted before SQLite storage.
4. Navigate to **Fallback Chain** to reorder provider priority.
5. Copy your unified `freellmapi-…` bearer token from the **Keys** page header.

**Supported providers and what to put in:**

| Provider | Where to get a free key |
|---|---|
| Google Gemini | https://ai.google.dev |
| Groq | https://groq.com |
| Cerebras | https://cerebras.ai |
| SambaNova | https://cloud.sambanova.ai |
| NVIDIA NIM | https://build.nvidia.com |
| Mistral | https://mistral.ai |
| OpenRouter | https://openrouter.ai |
| GitHub Models | https://github.com/marketplace/models |
| Hugging Face | https://huggingface.co |
| Cohere | https://cohere.com |
| Cloudflare Workers AI | https://developers.cloudflare.com/workers-ai |
| Zhipu | https://bigmodel.cn |
| Moonshot | https://platform.moonshot.cn |
| MiniMax | https://platform.minimax.io |

---

## Using the API

### Python (openai SDK)

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="freellmapi-your-unified-key",  # from dashboard Keys page
)

# Let the router pick the best available provider.
# with_raw_response exposes the HTTP headers alongside the parsed completion.
raw = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain async/await in Python in two sentences."}],
)
response = raw.parse()  # the usual ChatCompletion object

print(response.choices[0].message.content)
# Which provider actually served this request:
print("Routed via:", raw.headers.get("x-routed-via"))
```

### Request a specific model

```python
# Request a specific model — router finds a provider that has it
response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Write a haiku about SQLite."}],
)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "List 5 TypeScript best practices."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

### curl

```bash
# Non-streaming
curl http://localhost:3001/v1/chat/completions \
  -H "Authorization: Bearer $FREELLMAPI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Streaming
curl http://localhost:3001/v1/chat/completions \
  -H "Authorization: Bearer $FREELLMAPI_KEY" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Count to 5 slowly"}],
    "stream": true
  }'

# List available models
curl http://localhost:3001/v1/models \
  -H "Authorization: Bearer $FREELLMAPI_KEY"
```

### TypeScript / Node.js

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3001/v1",
  apiKey: process.env.FREELLMAPI_KEY,
});

async function chat(userMessage: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "auto",
    messages: [{ role: "user", content: userMessage }],
  });
  return response.choices[0].message.content ?? "";
}

// Streaming version
async function streamChat(userMessage: string): Promise<void> {
  const stream = await client.chat.completions.create({
    model: "auto",
    messages: [{ role: "user", content: userMessage }],
    stream: true,
  });

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) process.stdout.write(delta);
  }
  console.log();
}
```

---

## Tool Calling

Tool calling works with any provider whose model supports function calling. OpenAI-compatible providers receive requests verbatim; Gemini requests are automatically translated to `functionDeclarations`/`functionResponse` format and back.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="freellmapi-your-unified-key",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# Step 1: Model requests a tool call
first = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the weather in Karachi?"}],
    tools=tools,
    tool_choice="required",
)

call = first.choices[0].message.tool_calls[0]
print(f"Tool requested: {call.function.name}({call.function.arguments})")

# Step 2: Execute the tool locally, feed result back
final = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "user", "content": "What's the weather in Karachi?"},
        first.choices[0].message,  # assistant message with tool_calls
        {
            "role": "tool",
            "tool_call_id": call.id,
            "content": '{"temp_c": 32, "condition": "sunny"}',
        },
    ],
    tools=tools,
)

print(final.choices[0].message.content)
```
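
Step 2 above hardcodes the tool result. In practice the requested call has to be dispatched to a local function. The sketch below shows one way to do that with only the standard library; `TOOL_REGISTRY`, `run_tool`, and the stubbed `get_weather` are hypothetical names introduced for illustration.

```python
import json


def get_weather(city: str) -> dict:
    # Stub: a real implementation would query a weather service.
    return {"city": city, "temp_c": 32, "condition": "sunny"}


# Hypothetical registry mapping tool names to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool(name: str, arguments_json: str) -> str:
    """Execute a registered tool and return its result as a JSON string."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments_json or "{}")
        return json.dumps(func(**kwargs))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or arguments that do not match the function signature.
        return json.dumps({"error": str(exc)})


result = run_tool("get_weather", '{"city": "Karachi"}')
print(result)
```

The returned string would become the `content` of the `"role": "tool"` message, with `tool_call_id` set to `call.id`. Returning errors as JSON rather than raising lets the model see what went wrong and try again.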

### Streaming tool calls

```python
stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the weather in Karachi?"}],
    tools=tools,
    tool_choice="required",
    stream=True,
)

tool_call_chunks = []
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        tool_call_chunks.extend(delta.tool_calls)
    if chunk.choices[0].finish_reason == "tool_calls":
        print("Tool call complete — assemble chunks and execute")
```
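
The collected chunks still need to be merged before the tool can run. With the OpenAI streaming format, each fragment carries an `index`, and the `function.arguments` string arrives in pieces. The sketch below merges fragments by index. It assumes the proxy emits deltas in that same shape for every provider, which the text above does not confirm; `assemble_tool_calls` and `frag` are hypothetical helpers, and the fragments are simulated rather than read from a live stream.

```python
from types import SimpleNamespace


def assemble_tool_calls(deltas):
    """Merge streamed tool-call fragments into complete calls, ordered by index."""
    calls = {}
    for d in deltas:
        entry = calls.setdefault(d.index, {"id": None, "name": "", "arguments": ""})
        if getattr(d, "id", None):
            entry["id"] = d.id
        fn = getattr(d, "function", None)
        if fn is not None:
            if getattr(fn, "name", None):
                entry["name"] += fn.name
            if getattr(fn, "arguments", None):
                entry["arguments"] += fn.arguments
    return [calls[i] for i in sorted(calls)]


def frag(index, id=None, name=None, arguments=None):
    # Builds a stand-in for one item of delta.tool_calls.
    return SimpleNamespace(
        index=index, id=id, function=SimpleNamespace(name=name, arguments=arguments)
    )


fragments = [
    frag(0, id="call_1", name="get_weather", arguments=""),
    frag(0, arguments='{"city": '),
    frag(0, arguments='"Karachi"}'),
]
assembled = assemble_tool_calls(fragments)
print(assembled[0]["name"], assembled[0]["arguments"])
```

Passing `tool_call_chunks` from the loop above to `assemble_tool_calls` would yield one dictionary per tool call, whose `arguments` string can then be parsed with `json.loads`.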

---

## Multi-turn Conversations (Sticky Sessions)

The proxy keeps multi-turn conversations on the same model for 30 minutes to avoid hallucination spikes from mid-conversation model switches. Pass a consistent `session_id` in requests if the provider supports it, or rely on the proxy's automatic session tracking.

```python
messages = [{"role": "system", "content": "You are a helpful coding assistant."}]

# Turn 1
messages.append({"role": "user", "content": "Write a Python function to flatten a nested list."})
resp1 = client.chat.completions.create(model="auto", messages=messages)
assistant_msg = resp1.choices[0].message
messages.append({"role": "assistant", "content": assistant_msg.content})
print(assistant_msg.content)

# Turn 2 — sticky session keeps same provider
messages.append({"role": "user", "content": "Now add type hints to that function."})
resp2 = client.chat.completions.create(model="auto", messages=messages)
print(resp2.choices[0].message.content)
```
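
To make the 30-minute stickiness concrete, here is a sketch of a pin table with expiry. It is not the project's implementation: how the proxy derives a session key, and whether each turn refreshes the timer, are assumptions. `StickySessions` and the `"conversation-abc"` key are hypothetical.

```python
import time

STICKY_TTL_SECONDS = 30 * 60  # 30 minutes, as described above


class StickySessions:
    """Remember which route served a conversation until the pin expires."""

    def __init__(self, ttl=STICKY_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._pins = {}  # session key -> (route, expires_at)

    def pin(self, session_key, route):
        # Assumption: pinning again on a later turn restarts the timer.
        self._pins[session_key] = (route, self._clock() + self._ttl)

    def get(self, session_key):
        entry = self._pins.get(session_key)
        if entry is None:
            return None
        route, expires_at = entry
        if self._clock() >= expires_at:
            del self._pins[session_key]  # expired: the router may choose freshly
            return None
        return route


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
sessions = StickySessions(clock=lambda: now[0])
sessions.pin("conversation-abc", ("groq", "llama-4-scout"))

now[0] += 10 * 60
print(sessions.get("conversation-abc"))  # still pinned after 10 minutes
now[0] += 31 * 60
print(sessions.get("conversation-abc"))  # None once the pin has expired
```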

---

## LangChain Integration

```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
import os

llm = ChatOpenAI(
    model="auto",
    openai_api_base="http://localhost:3001/v1",
    openai_api_key=os.environ["FREELLMAPI_KEY"],
    streaming=True,
)

response = llm.invoke([HumanMessage(content="Summarise the CAP theorem in one paragraph.")])
print(response.content)
```

---

## Response Headers

Every response includes diagnostic headers:

| Header | Description |
|---|---|
| `X-Routed-Via` | `<platform>/<model>` — which provider served the request |
| `X-Fallback-Attempts` | Number of providers tried before success (only present if > 0) |

```python
# The parsed completion object does not carry HTTP headers,
# so request the raw response to read them.
raw = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "hi"}],
)
response = raw.parse()  # the usual ChatCompletion object
print(raw.headers.get("x-routed-via"))        # e.g. "groq/llama-4-scout"
print(raw.headers.get("x-fallback-attempts")) # e.g. "2"
```
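
Since `X-Routed-Via` has the form `<platform>/<model>`, it can be split for logging or metrics. A small sketch follows; `Route` and `parse_routed_via` are hypothetical names. It splits on the first slash only, on the assumption that some model identifiers may themselves contain slashes.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    platform: str
    model: str


def parse_routed_via(value: Optional[str]) -> Optional[Route]:
    """Split '<platform>/<model>' on the first slash; return None if absent or malformed."""
    if not value or "/" not in value:
        return None
    platform, model = value.split("/", 1)
    return Route(platform, model)


print(parse_routed_via("groq/llama-4-scout"))
# prints Route(platform='groq', model='llama-4-scout')
```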

---

## How the Router Works

```
Request arrives
      │
      ▼
Router scans fallback chain (priority order)
      │
      ├─ For each model: is there a healthy key under all rate caps?
      │     RPM / RPD / TPM / TPD tracked per (platform, model, key)
      │
      ├─ Picks first viable (platform, model, key) tuple
      │
      ├─ Decrypts key in-memory, calls provider SDK
      │
      └─ On 429 / 5xx / timeout:
            Put key on cooldown → retry next model (up to 20 attempts)
```

**Rate limit tracking:** The router tracks `RPM`, `RPD`, `TPM`, and `TPD` counters per `(platform, model, key)` triple. When a key hits a cap it's cooled down automatically and the next viable key/model is tried.
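
One way to picture the per-triple counters is a sliding window over recent requests. The sketch below is illustrative only and is not taken from the project's source. `UsageTracker`, the cap values, and the `"key-1"` identifier are hypothetical, and the real router may use fixed windows or a different store.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = {"minute": 60.0, "day": 86_400.0}


class UsageTracker:
    """Sliding-window request and token counts per (platform, model, key_id)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._events = defaultdict(deque)  # triple -> deque of (timestamp, tokens)

    def record(self, triple, tokens):
        self._events[triple].append((self._clock(), tokens))

    def usage(self, triple, window):
        """Return (requests, tokens) seen within the given window."""
        now = self._clock()
        events = self._events[triple]
        # Events older than a day can never count again, so drop them.
        while events and now - events[0][0] >= WINDOW_SECONDS["day"]:
            events.popleft()
        span = WINDOW_SECONDS[window]
        recent = [tokens for ts, tokens in events if now - ts < span]
        return len(recent), sum(recent)

    def under_caps(self, triple, caps):
        rpm, tpm = self.usage(triple, "minute")
        rpd, tpd = self.usage(triple, "day")
        return (rpm < caps["rpm"] and tpm < caps["tpm"]
                and rpd < caps["rpd"] and tpd < caps["tpd"])


# Demo with a fake clock and made-up caps.
now = [0.0]
tracker = UsageTracker(clock=lambda: now[0])
triple = ("groq", "llama-4-scout", "key-1")
caps = {"rpm": 2, "rpd": 100, "tpm": 10_000, "tpd": 500_000}

tracker.record(triple, tokens=300)
tracker.record(triple, tokens=200)
print(tracker.under_caps(triple, caps))  # False: 2 requests already in this minute
now[0] += 61
print(tracker.under_caps(triple, caps))  # True: the minute window has rolled over
```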

**Health checks:** Background probes classify each key as `healthy`, `rate_limited`, `invalid`, or `error`. The router skips non-healthy keys without making a live request.
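
Putting the pieces together, the selection and failover loop might look roughly like the following. This is a sketch under stated assumptions, not the project's code. `call_with_failover`, the `send` callback, the `(status, body)` return shape, and the third chain entry are hypothetical. A real router would also persist cooldowns between requests.

```python
def is_retryable(status):
    return status == 429 or 500 <= status < 600


def call_with_failover(chain, health, under_caps, send, max_attempts=20):
    """Walk the chain in priority order; fail over on 429, 5xx, or timeout."""
    cooled_down = []
    attempts = 0
    for route in chain:
        if attempts >= max_attempts:
            break
        _platform, _model, key_id = route
        if health.get(key_id) != "healthy" or not under_caps(route):
            continue  # skipped without making a live request
        attempts += 1
        try:
            status, body = send(route)
        except TimeoutError:
            status, body = None, None
        if status is None or is_retryable(status):
            cooled_down.append(key_id)  # stand-in for a real cooldown timer
            continue
        return {
            "route": route,
            "body": body,
            "fallback_attempts": attempts - 1,
            "cooled_down": cooled_down,
        }
    raise RuntimeError("No healthy keys available")


# Demo: first key is invalid, second is rate limited, third succeeds.
chain = [
    ("google", "gemini-2.5-flash", "g1"),
    ("groq", "llama-4-scout", "q1"),
    ("mistral", "example-model", "m1"),
]
health = {"g1": "invalid", "q1": "healthy", "m1": "healthy"}
sent = []


def fake_send(route):
    sent.append(route[2])
    return (429, None) if route[2] == "q1" else (200, "hello")


outcome = call_with_failover(chain, health, lambda route: True, fake_send)
print(outcome["route"], outcome["fallback_attempts"])
```

In the demo the invalid key is never contacted, which mirrors the point above that non-healthy keys are skipped without a live request.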

---

## Dashboard Pages

| Page | Purpose |
|---|---|
| **Keys** | Add/remove provider credentials, view health status, copy unified API key |
| **Fallback Chain** | Drag to reorder provider priority |
| **Playground** | Interactive chat showing which provider served each message + latency |
| **Analytics** | Request volume, success rate, token counts, latency, per-provider breakdown (24h/7d/30d) |

---

## Production Deployment (Raspberry Pi / Linux)

```bash
# Build
npm run build

# Install PM2
npm install -g pm2

# Start
pm2 start server/dist/index.js --name freellmapi
pm2 save
pm2 startup

# nginx reverse proxy (optional). The block below is nginx configuration, not shell;
# save it as /etc/nginx/sites-available/freellmapi
server {
    listen 80;
    server_name your.domain.com;
    location / {
        proxy_pass http://localhost:3001;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_buffering off;          # Required for SSE streaming
        proxy_cache off;              # Do not cache streamed (SSE) responses
    }
}
```

Memory footprint: ~40 MB RSS at idle on a Pi 4.

---

## Adding a New Provider

Create a new adapter in `server/src/providers/`:

```typescript
// server/src/providers/myprovider.ts
import type { ProviderAdapter, ChatRequest, ChatResponse } from "../types";

export const myProviderAdapter: ProviderAdapter = {
  name: "myprovider",
  models: ["my-model-v1", "my-model-v2"],

  async chat(request: ChatRequest, apiKey: string): Promise<ChatResponse> {
    // Call provider API, return OpenAI-shaped response
    const res = await fetch("https://api.myprovider.com/v1/chat", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: request.model,
        messages: request.messages,
      }),
    });
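    if (!res.ok) {
      // Surface HTTP failures (e.g. 429/5xx) rather than parsing an error body as a completion.
      // Assumption: the router treats a thrown error as a failed attempt.
      throw new Error(`myprovider request failed with status ${res.status}`);
    }
    const data = await res.json();
    return {
      id: data.id,
      object: "chat.completion",
      choices: [{ message: data.choices[0].message, finish_reason: data.choices[0].finish_reason ?? "stop", index: 0 }],
      usage: data.usage,
    };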
  },

  async *stream(request: ChatRequest, apiKey: string): AsyncGenerator<string> {
    // Yield SSE chunks
  },
};
```

Register in `server/src/providers/index.ts` and add rate limit caps to the router config.

---

## Troubleshooting

**"No healthy keys available"**
- Check the Keys dashboard — all keys may be rate-limited or invalid.
- Wait for cooldown (usually a few minutes for RPM limits) or add more keys.
- Verify the key is valid by testing it directly against the provider's API.
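
While waiting for a cooldown, a client can retry with exponential backoff instead of failing immediately. The sketch below is generic. How the proxy reports exhaustion to the client (status code, error body) is not stated here, so the `should_retry` predicate is a placeholder that callers would adapt to the exception their SDK raises. `retry_with_backoff` and `flaky_call` are hypothetical names.

```python
import time


def retry_with_backoff(call, should_retry, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the delay after each retryable failure."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            last_try = attempt == attempts - 1
            if last_try or not should_retry(exc):
                raise
            sleep(base_delay * (2 ** attempt))


# Demo: fail twice, then succeed. Delays are recorded instead of slept.
delays = []
outcomes = iter([
    RuntimeError("No healthy keys available"),
    RuntimeError("No healthy keys available"),
    "ok",
])


def flaky_call():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = retry_with_backoff(
    flaky_call,
    should_retry=lambda exc: "No healthy keys" in str(exc),
    sleep=delays.append,
)
print(result, delays)  # prints: ok [1.0, 2.0]
```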

**Requests always fall back to the same provider**
- Check the Fallback Chain order in the dashboard.
- Ensure keys for higher-priority providers are marked `healthy`.

**Streaming stops mid-response**
- If behind nginx, ensure `proxy_buffering off` is set.
- Check provider-side token/minute caps — the stream may be cut by a mid-stream rate limit.

**`ENCRYPTION_KEY` error on startup**
- Ensure `ENCRYPTION_KEY` in `.env` is exactly 64 hex characters (32 bytes).
- Regenerate: `node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"`
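
A quick format check can rule out a malformed key before restarting the server. This sketch only tests the shape described above (exactly 64 hex characters). Whether the server accepts uppercase hex is an assumption, and `is_valid_encryption_key` is a hypothetical helper.

```python
import re

_HEX_64 = re.compile(r"[0-9a-fA-F]{64}")


def is_valid_encryption_key(value: str) -> bool:
    """True if value is exactly 64 hexadecimal characters (32 bytes)."""
    return bool(_HEX_64.fullmatch(value))


print(is_valid_encryption_key("ab" * 32))          # True
print(is_valid_encryption_key("ab" * 31))          # False: only 62 characters
print(is_valid_encryption_key("zz" + "ab" * 31))   # False: not hexadecimal
```

Stray whitespace or quotes copied into `.env` will also fail this check, which is a common cause of the startup error.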

**Tool calls not working with a specific provider**
- Not all free-tier models support function calling. Check the provider's docs.
- Try `model="auto"` — the router will pick a tool-capable model.
- Gemini tool calls are auto-translated; others pass through as-is.

**High latency on first request**
- Health checks run periodically in the background. The first request after startup may probe a few keys. Subsequent requests are faster.

---

## Limitations

- Text-only — no vision/multimodal inputs
- No embeddings (`/v1/embeddings`)
- No image generation (`/v1/images/*`)
- No audio/speech (`/v1/audio/*`)
- No legacy completions (`/v1/completions`)
- No moderation (`/v1/moderations`)
- `n > 1` not supported (single completion per request)
- Single-user by design — no per-user billing or multi-tenant auth
- Personal/experimental use only — review each provider's ToS before production use
