cloudflare-workers-ai
Run LLMs and AI models on Cloudflare's global GPU network with Workers AI. Includes Llama, Flux image generation, BGE embeddings, and streaming support with AI Gateway for caching and logging. Use when: implementing LLM inference, generating images with Flux/Stable Diffusion, building RAG with embeddings, streaming AI responses, using AI Gateway for cost tracking, or troubleshooting AI_ERROR, rate limits, model not found, token limits, or neurons exceeded. Keywords: workers ai, cloudflare ai, ai bindings, llm workers, @cf/meta/llama, workers ai models, ai inference, cloudflare llm, ai streaming, text generation ai, ai embeddings, image generation ai, workers ai rag, ai gateway, llama workers, flux image generation, stable diffusion workers, vision models ai, ai chat completion, AI_ERROR, rate limit ai, model not found, token limit exceeded, neurons exceeded, ai quota exceeded, streaming failed, model unavailable, workers ai hono, ai gateway workers, vercel ai sdk workers, openai compatible workers, workers ai vectorize
Best use case
cloudflare-workers-ai is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using cloudflare-workers-ai can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/cloudflare-workers-ai/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How cloudflare-workers-ai Compares
| Feature / Agent | cloudflare-workers-ai | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It covers running LLMs and other AI models on Cloudflare's global GPU network with Workers AI, including Llama text generation, Flux image generation, BGE embeddings, and streaming, with AI Gateway for caching and logging. Use it when implementing LLM inference, generating images with Flux/Stable Diffusion, building RAG with embeddings, streaming AI responses, tracking cost through AI Gateway, or troubleshooting AI_ERROR, rate limits, model not found, token limits, or neurons exceeded.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Cloudflare Workers AI - Complete Reference
Production-ready knowledge domain for building AI-powered applications with Cloudflare Workers AI.
**Status**: Production Ready ✅
**Last Updated**: 2025-10-21
**Dependencies**: cloudflare-worker-base (for Worker setup)
**Latest Versions**: wrangler@4.43.0, @cloudflare/workers-types@4.20251014.0
---
## Table of Contents
1. [Quick Start (5 minutes)](#quick-start-5-minutes)
2. [Workers AI API Reference](#workers-ai-api-reference)
3. [Model Selection Guide](#model-selection-guide)
4. [Common Patterns](#common-patterns)
5. [AI Gateway Integration](#ai-gateway-integration)
6. [Rate Limits & Pricing](#rate-limits--pricing)
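7. [Production Checklist](#production-checklist)
8. [OpenAI Compatibility](#openai-compatibility)
9. [Vercel AI SDK Integration](#vercel-ai-sdk-integration)
10. [Limits Summary](#limits-summary)
11. [References](#references)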
---
## Quick Start (5 minutes)
### 1. Add AI Binding
**wrangler.jsonc:**
```jsonc
{
"ai": {
"binding": "AI"
}
}
```
### 2. Run Your First Model
```typescript
export interface Env {
AI: Ai;
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
prompt: 'What is Cloudflare?',
});
return Response.json(response);
},
};
```
### 3. Add Streaming (Recommended)
```typescript
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: 'Tell me a story' }],
stream: true, // Always use streaming for text generation!
});
return new Response(stream, {
headers: { 'content-type': 'text/event-stream' },
});
```
**Why streaming?**
- Prevents buffering large responses in memory
- Faster time-to-first-token
- Better user experience for long-form content
- Avoids Worker timeout issues
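On the consuming side, the streamed body arrives as server-sent events. The following is a minimal sketch of pulling the generated text out of a chunk of that stream. It assumes each event is a `data:` line carrying JSON with a `response` field and that the stream ends with `data: [DONE]`; the helper name `extractTokens` is hypothetical and not part of any Cloudflare API.

```typescript
// Hypothetical helper: extract generated text from a chunk of SSE text.
// Assumption: events look like `data: {"response":"Hi"}` and the stream
// terminates with `data: [DONE]`.
function extractTokens(sseText: string): { text: string; done: boolean } {
  let text = '';
  let done = false;
  for (const line of sseText.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed.startsWith('data:')) continue; // skip blank lines and comments
    const payload = trimmed.slice('data:'.length).trim();
    if (payload === '[DONE]') {
      done = true;
      break;
    }
    try {
      const parsed = JSON.parse(payload) as { response?: string };
      text += parsed.response ?? '';
    } catch {
      // Incomplete JSON: a real client would buffer until the event is complete
    }
  }
  return { text, done };
}
```

A real client would also need to buffer partial events across network chunks; this sketch only handles complete events.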
---
## Workers AI API Reference
### `env.AI.run()`
Run an AI model inference.
**Signature:**
```typescript
async env.AI.run(
model: string,
inputs: ModelInputs,
options?: { gateway?: { id: string; skipCache?: boolean } }
): Promise<ModelOutput | ReadableStream>
```
**Parameters:**
- `model` (string, required) - Model ID (e.g., `@cf/meta/llama-3.1-8b-instruct`)
- `inputs` (object, required) - Model-specific inputs
- `options` (object, optional) - Additional options
- `gateway` (object) - AI Gateway configuration
- `id` (string) - Gateway ID
- `skipCache` (boolean) - Skip AI Gateway cache
**Returns:**
- Non-streaming: `Promise<ModelOutput>` - JSON response
- Streaming: `ReadableStream` - Server-sent events stream
---
### Text Generation Models
**Input Format:**
```typescript
{
messages?: Array<{ role: 'system' | 'user' | 'assistant'; content: string }>;
prompt?: string; // Deprecated, use messages
stream?: boolean; // Default: false
max_tokens?: number; // Max tokens to generate
temperature?: number; // Sampling temperature; allowed range and default vary by model
top_p?: number; // 0.0-1.0
top_k?: number;
}
```
**Output Format (Non-Streaming):**
```typescript
{
response: string; // Generated text
}
```
**Example:**
```typescript
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is TypeScript?' },
],
stream: false,
});
console.log(response.response);
```
---
### Text Embeddings Models
**Input Format:**
```typescript
{
text: string | string[]; // Single text or array of texts
}
```
**Output Format:**
```typescript
{
shape: number[]; // [batch_size, embedding_dimensions]
data: number[][]; // Array of embedding vectors
}
```
**Example:**
```typescript
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: ['Hello world', 'Cloudflare Workers'],
});
console.log(embeddings.shape); // [2, 768]
console.log(embeddings.data[0]); // [0.123, -0.456, ...]
```
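Embedding vectors are usually compared with cosine similarity. As an illustration of what a vector store computes when ranking matches, here is a small sketch; `cosineSimilarity` is a hypothetical helper, not something Workers AI provides.

```typescript
// Hypothetical helper: cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat similarity as 0
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

With the example above, `cosineSimilarity(embeddings.data[0], embeddings.data[1])` would give a score for how closely the two input texts are related.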
---
### Image Generation Models
**Input Format:**
```typescript
{
prompt: string; // Text description
num_steps?: number; // Default: 20
guidance?: number; // CFG scale, default: 7.5
strength?: number; // For img2img, default: 1.0
image?: number[]; // For img2img: input image as an array of byte values
}
```
**Output Format:**
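- Stable Diffusion models: binary image data (PNG)
- Flux: JSON object with a base64-encoded JPEG in `image`
**Example:**
```typescript
// Flux returns JSON containing a base64-encoded JPEG, not a binary stream
const result = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'A beautiful sunset over mountains',
});
// Decode the base64 payload into raw bytes
const imageBytes = Uint8Array.from(atob(result.image ?? ''), (ch) => ch.charCodeAt(0));
return new Response(imageBytes, {
  headers: { 'content-type': 'image/jpeg' },
});
```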
---
### Vision Models
**Input Format:**
```typescript
{
messages: Array<{
role: 'user' | 'assistant';
content: Array<{ type: 'text' | 'image_url'; text?: string; image_url?: { url: string } }>;
}>;
}
```
**Example:**
```typescript
const response = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', {
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'What is in this image?' },
{ type: 'image_url', image_url: { url: 'data:image/png;base64,iVBOR...' } },
],
},
],
});
```
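The `image_url` field in the example takes a data URL. If the image arrives as raw bytes (for instance from an upload or an R2 object), it can be encoded along these lines. This is a sketch; `toDataUrl` is a hypothetical helper, and it assumes the global `btoa` function is available, as it is in Workers.

```typescript
// Hypothetical helper: encode raw image bytes as a data URL for `image_url.url`.
function toDataUrl(bytes: Uint8Array, mimeType: string): string {
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    // Build a binary string one byte at a time so btoa can encode it
    binary += String.fromCharCode(bytes[i]);
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}
```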
---
## Model Selection Guide
### Text Generation (LLMs)
| Model | Best For | Rate Limit | Size |
|-------|----------|------------|------|
| `@cf/meta/llama-3.1-8b-instruct` | General purpose, fast | 300/min | 8B |
| `@cf/meta/llama-3.2-1b-instruct` | Ultra-fast, simple tasks | 300/min | 1B |
| `@cf/qwen/qwen1.5-14b-chat-awq` | High quality, complex reasoning | 150/min | 14B |
| `@cf/deepseek-ai/deepseek-r1-distill-qwen-32b` | Coding, technical content | 300/min | 32B |
| `@hf/thebloke/mistral-7b-instruct-v0.1-awq` | Fast, efficient | 400/min | 7B |
### Text Embeddings
| Model | Dimensions | Best For | Rate Limit |
|-------|-----------|----------|------------|
| `@cf/baai/bge-base-en-v1.5` | 768 | General purpose RAG | 3000/min |
| `@cf/baai/bge-large-en-v1.5` | 1024 | High accuracy search | 1500/min |
| `@cf/baai/bge-small-en-v1.5` | 384 | Fast, low storage | 3000/min |
### Image Generation
| Model | Best For | Rate Limit | Speed |
|-------|----------|------------|-------|
| `@cf/black-forest-labs/flux-1-schnell` | High quality, photorealistic | 720/min | Fast |
| `@cf/stabilityai/stable-diffusion-xl-base-1.0` | General purpose | 720/min | Medium |
| `@cf/lykon/dreamshaper-8-lcm` | Artistic, stylized | 720/min | Fast |
### Vision Models
| Model | Best For | Rate Limit |
|-------|----------|------------|
| `@cf/meta/llama-3.2-11b-vision-instruct` | Image understanding | 720/min |
| `@cf/unum/uform-gen2-qwen-500m` | Fast image captioning | 720/min |
---
## Common Patterns
### Pattern 1: Chat Completion with History
```typescript
app.post('/chat', async (c) => {
const { messages } = await c.req.json<{
messages: Array<{ role: string; content: string }>;
}>();
const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages,
stream: true,
});
return new Response(response, {
headers: { 'content-type': 'text/event-stream' },
});
});
```
---
### Pattern 2: RAG (Retrieval Augmented Generation)
```typescript
// Step 1: Generate embeddings
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: [userQuery],
});
const vector = embeddings.data[0];
// Step 2: Search Vectorize (metadata is only returned when requested)
const matches = await env.VECTORIZE.query(vector, { topK: 3, returnMetadata: 'all' });
// Step 3: Build context from matches
const context = matches.matches.map((m) => m.metadata?.text).join('\n\n');
// Step 4: Generate response with context
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{
role: 'system',
content: `Answer using this context:\n${context}`,
},
{ role: 'user', content: userQuery },
],
stream: true,
});
return new Response(response, {
headers: { 'content-type': 'text/event-stream' },
});
```
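Retrieved chunks can easily exceed the model's context window, so it helps to cap how much text goes into the system prompt. The sketch below shows one way to do that; `buildContext`, the `Match` shape, and the default character budget are assumptions made for illustration, not part of the Vectorize API.

```typescript
// Assumed minimal shape of a search match; real Vectorize matches carry more fields.
interface Match {
  score: number;
  metadata?: Record<string, unknown>;
}

// Hypothetical helper: join match texts, in order, until a character budget is hit.
function buildContext(matches: Match[], maxChars = 4000, minScore = 0): string {
  const parts: string[] = [];
  let used = 0;
  for (const m of matches) {
    const text = m.metadata?.text;
    // Skip matches without stored text or below the relevance threshold
    if (typeof text !== 'string' || m.score < minScore) continue;
    if (used + text.length > maxChars) break; // budget exhausted
    parts.push(text);
    used += text.length;
  }
  return parts.join('\n\n');
}
```

The character budget is a rough stand-in for a token budget; a stricter version would count tokens for the specific model.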
---
### Pattern 3: Structured Output with Zod
```typescript
import { z } from 'zod';
const RecipeSchema = z.object({
name: z.string(),
ingredients: z.array(z.string()),
instructions: z.array(z.string()),
prepTime: z.number(),
});
app.post('/recipe', async (c) => {
const { dish } = await c.req.json<{ dish: string }>();
const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{
role: 'user',
content: `Generate a recipe for ${dish}. Return ONLY valid JSON matching this schema: ${JSON.stringify(RecipeSchema.shape)}`,
},
],
});
// Parse and validate
const recipe = RecipeSchema.parse(JSON.parse(response.response));
return c.json(recipe);
});
```
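Models often wrap JSON in extra prose or a Markdown fence, which makes a bare `JSON.parse` throw. A small pre-processing step can make the pattern above more tolerant. This is a sketch under the assumption that the output contains exactly one top-level JSON object; `extractJson` is a hypothetical helper.

```typescript
// Hypothetical helper: pull the first {...} object out of free-form model output.
// Assumption: the response contains a single top-level JSON object.
function extractJson(raw: string): unknown {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model output');
  }
  // Everything outside the outermost braces (prose, code fences) is discarded
  return JSON.parse(raw.slice(start, end + 1));
}
```

In the handler, `RecipeSchema.parse(extractJson(response.response))` would then replace the direct `JSON.parse` call; validation errors from Zod still need to be caught and mapped to an error response.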
---
### Pattern 4: Image Generation + R2 Storage
```typescript
app.post('/generate-image', async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();
  // Generate image (Flux returns JSON with a base64-encoded JPEG in `image`)
  const result = await c.env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
    prompt,
  });
  // Decode the base64 payload into raw bytes
  const imageBytes = Uint8Array.from(atob(result.image ?? ''), (ch) => ch.charCodeAt(0));
  // Store in R2
  const key = `images/${Date.now()}.jpg`;
  await c.env.BUCKET.put(key, imageBytes, {
    httpMetadata: { contentType: 'image/jpeg' },
  });
  return c.json({
    success: true,
    url: `https://your-domain.com/${key}`,
  });
});
```
---
## AI Gateway Integration
AI Gateway provides caching, logging, and analytics for AI requests.
**Setup:**
```typescript
const response = await env.AI.run(
'@cf/meta/llama-3.1-8b-instruct',
{ prompt: 'Hello' },
{
gateway: {
id: 'my-gateway', // Your gateway ID
skipCache: false, // Use cache
},
}
);
```
**Benefits:**
- ✅ **Cost Tracking** - Monitor neurons usage per request
- ✅ **Caching** - Reduce duplicate inference costs
- ✅ **Logging** - Debug and analyze AI requests
- ✅ **Rate Limiting** - Additional layer of protection
- ✅ **Analytics** - Request patterns and performance
**Access Gateway Logs:**
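```typescript
const gateway = env.AI.gateway('my-gateway');
// Log ID of the most recent env.AI.run() call that went through the gateway
const logId = env.AI.aiGatewayLogId;
// Send feedback (1 = positive, -1 = negative); free-form notes go in metadata
await gateway.patchLog(logId, {
  feedback: 1,
  metadata: { comment: 'Great response' },
});
```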
---
## Rate Limits & Pricing
### Rate Limits (per minute)
| Task Type | Default Limit | Notes |
|-----------|---------------|-------|
| **Text Generation** | 300/min | Some fast models: 400-1500/min |
| **Text Embeddings** | 3000/min | BGE-large: 1500/min |
| **Image Generation** | 720/min | All image models |
| **Vision Models** | 720/min | Image understanding |
| **Translation** | 720/min | M2M100, Opus MT |
| **Classification** | 2000/min | Text classification |
| **Speech Recognition** | 720/min | Whisper models |
### Pricing (Neurons-Based)
**Free Tier:**
- 10,000 neurons per day
- Resets daily at 00:00 UTC
**Paid Tier:**
- $0.011 per 1,000 neurons
- 10,000 neurons/day included
- Unlimited usage above free allocation
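To make the neuron arithmetic concrete, here is a rough estimator based only on the two figures above (10,000 included neurons per day and $0.011 per 1,000 neurons). The function and constant names are hypothetical, and actual billing may aggregate usage differently.

```typescript
// Figures taken from the pricing list above
const FREE_NEURONS_PER_DAY = 10_000;
const USD_PER_1K_NEURONS = 0.011;

// Hypothetical helper: estimated paid cost in USD for one day's neuron usage.
function estimateDailyCostUsd(neuronsUsed: number): number {
  // Only usage above the daily included allocation is billable
  const billable = Math.max(0, neuronsUsed - FREE_NEURONS_PER_DAY);
  return (billable / 1000) * USD_PER_1K_NEURONS;
}
```

For example, a day with 110,000 neurons leaves 100,000 billable neurons, or roughly $1.10.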
**Example Costs:**
| Model | Input (1M tokens) | Output (1M tokens) |
|-------|-------------------|-------------------|
| Llama 3.2 1B | $0.027 | $0.201 |
| Llama 3.1 8B | $0.088 | $0.606 |
| BGE-base embeddings | $0.005 | N/A |
| Flux image generation | ~$0.011/image | N/A |
---
## Production Checklist
### Before Deploying
- [ ] **Enable AI Gateway** for cost tracking and logging
- [ ] **Implement streaming** for all text generation endpoints
- [ ] **Add rate limit retry** with exponential backoff
- [ ] **Validate input length** to prevent token limit errors
- [ ] **Set appropriate timeouts** (Workers: 30s CPU default, 5m max)
- [ ] **Monitor neurons usage** in Cloudflare dashboard
- [ ] **Test error handling** for model unavailable, rate limits
- [ ] **Add input sanitization** to prevent prompt injection
- [ ] **Configure CORS** if using from browser
- [ ] **Plan for scale** - upgrade to Paid plan if needed
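For the input-length item, a cheap pre-flight check can reject oversized requests before they reach the model. The sketch below uses a rough heuristic of about four characters per token, which is an assumption and not how any particular model tokenizes; `estimateTokens` and `assertWithinBudget` are hypothetical helpers.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Rough heuristic (assumption): about 4 characters per token for English text.
function estimateTokens(messages: ChatMessage[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

// Throws if the estimated input size exceeds the caller-supplied budget.
function assertWithinBudget(messages: ChatMessage[], maxInputTokens: number): void {
  const estimate = estimateTokens(messages);
  if (estimate > maxInputTokens) {
    throw new Error(
      `Input too long: ~${estimate} tokens exceeds budget of ${maxInputTokens}`
    );
  }
}
```

The budget itself should come from the chosen model's documented context window, leaving headroom for the response.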
### Error Handling
```typescript
async function runAIWithRetry(
env: Env,
model: string,
inputs: any,
maxRetries = 3
): Promise<any> {
let lastError: Error;
for (let i = 0; i < maxRetries; i++) {
try {
return await env.AI.run(model, inputs);
} catch (error) {
lastError = error as Error;
const message = lastError.message.toLowerCase();
// Rate limit - retry with backoff
if (message.includes('429') || message.includes('rate limit')) {
const delay = Math.pow(2, i) * 1000; // Exponential backoff
await new Promise((resolve) => setTimeout(resolve, delay));
continue;
}
// Other errors - throw immediately
throw error;
}
}
throw lastError!;
}
```
### Monitoring
```typescript
app.use('*', async (c, next) => {
const start = Date.now();
await next();
// Log AI usage
console.log({
path: c.req.path,
duration: Date.now() - start,
logId: c.env.AI.aiGatewayLogId,
});
});
```
---
## OpenAI Compatibility
Workers AI supports OpenAI-compatible endpoints.
**Using OpenAI SDK:**
```typescript
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: env.CLOUDFLARE_API_KEY,
baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1`,
});
// Chat completions
const completion = await openai.chat.completions.create({
model: '@cf/meta/llama-3.1-8b-instruct',
messages: [{ role: 'user', content: 'Hello!' }],
});
// Embeddings
const embeddings = await openai.embeddings.create({
model: '@cf/baai/bge-base-en-v1.5',
input: 'Hello world',
});
```
**Endpoints:**
- `/v1/chat/completions` - Text generation
- `/v1/embeddings` - Text embeddings
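These endpoints can also be called without the SDK. The following sketch builds a plain `fetch` request for the chat completions endpoint, reusing the base URL from the SDK example. It assumes authentication with a Bearer API token; `buildChatRequest` and the `ChatRequest` shape are hypothetical.

```typescript
interface ChatRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: assemble the URL and fetch options for a chat completion.
function buildChatRequest(
  accountId: string,
  apiToken: string,
  model: string,
  userMessage: string
): ChatRequest {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/v1/chat/completions`,
    init: {
      method: 'POST',
      headers: {
        // Assumption: the API token is sent as a Bearer token
        authorization: `Bearer ${apiToken}`,
        'content-type': 'application/json',
      },
      body: JSON.stringify({
        model,
        messages: [{ role: 'user', content: userMessage }],
      }),
    },
  };
}

// Usage: const { url, init } = buildChatRequest(id, token, model, 'Hello!');
//        const res = await fetch(url, init);
```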
---
## Vercel AI SDK Integration
```bash
npm install workers-ai-provider ai
```
```typescript
import { createWorkersAI } from 'workers-ai-provider';
import { generateText, streamText } from 'ai';
const workersai = createWorkersAI({ binding: env.AI });
// Generate text
const result = await generateText({
model: workersai('@cf/meta/llama-3.1-8b-instruct'),
prompt: 'Write a poem',
});
// Stream text
const stream = streamText({
model: workersai('@cf/meta/llama-3.1-8b-instruct'),
prompt: 'Tell me a story',
});
```
---
## Limits Summary
| Feature | Limit |
|---------|-------|
| Concurrent requests | No hard limit (rate limits apply) |
| Max input tokens | Varies by model (typically 2K-128K) |
| Max output tokens | Varies by model (typically 512-2048) |
| Streaming chunk size | ~1 KB |
| Image size (output) | ~5 MB |
| Request timeout | Workers timeout applies (30s default, 5m max CPU) |
| Daily free neurons | 10,000 |
| Rate limits | See "Rate Limits & Pricing" section |
---
## References
- [Workers AI Docs](https://developers.cloudflare.com/workers-ai/)
- [Models Catalog](https://developers.cloudflare.com/workers-ai/models/)
- [AI Gateway](https://developers.cloudflare.com/ai-gateway/)
- [Pricing](https://developers.cloudflare.com/workers-ai/platform/pricing/)
- [Limits](https://developers.cloudflare.com/workers-ai/platform/limits/)
- [REST API](https://developers.cloudflare.com/workers-ai/get-started/rest-api/)
Related Skills
cloudflare-workflows
Build durable, long-running workflows on Cloudflare Workers with automatic retries, state persistence, and multi-step orchestration. Supports step.do, step.sleep, step.waitForEvent, and runs for hours to days. Use when: creating long-running workflows, implementing retry logic, building event-driven processes, coordinating API calls, scheduling multi-step tasks, or troubleshooting NonRetryableError, I/O context, serialization errors, or workflow execution failures. Keywords: cloudflare workflows, workflows workers, durable execution, workflow step, WorkflowEntrypoint, step.do, step.sleep, workflow retries, NonRetryableError, workflow state, wrangler workflows, workflow events, long-running tasks, step.sleepUntil, step.waitForEvent, workflow bindings
cloudflare-worker-base
Set up Cloudflare Workers with Hono routing, Vite plugin, and Static Assets using production-tested patterns. Prevents 6 errors: export syntax, routing conflicts, HMR crashes, and Service Worker format confusion. Use when: creating Workers projects, configuring Hono or Vite for Workers, deploying with Wrangler, adding Static Assets with SPA fallback, or troubleshooting export syntax, API route conflicts, scheduled handlers, or HMR race conditions. Keywords: Cloudflare Workers, CF Workers, Hono, wrangler, Vite, Static Assets, @cloudflare/vite-plugin, wrangler.jsonc, ES Module, run_worker_first, SPA fallback, API routes, serverless, edge computing, "Cannot read properties of undefined", "Static Assets 404", "A hanging Promise was canceled", "Handler does not export", deployment fails, routing not working, HMR crashes
cloudflare-vectorize
Build semantic search with Cloudflare Vectorize V2 (Sept 2024 GA). Covers V2 breaking changes: async mutations, 5M vectors/index (was 200K), 31ms latency (was 549ms), returnMetadata enum, and V1 deprecation (Dec 2024). Use when: migrating V1→V2, handling async mutations with mutationId, creating metadata indexes before insert, or troubleshooting "returnMetadata must be 'all'", V2 timing issues, metadata index errors, dimension mismatches.
cloudflare-turnstile
Add bot protection with Turnstile (CAPTCHA alternative). Use when: protecting forms, securing login/signup, preventing spam, migrating from reCAPTCHA, integrating with React/Next.js/Hono, implementing E2E tests, or debugging CSP errors, token validation failures, or error codes 100*/300*/600*.
cloudflare-r2
Store objects with R2's S3-compatible storage on Cloudflare's edge. Use when: uploading/downloading files, configuring CORS, generating presigned URLs, multipart uploads, managing metadata, or troubleshooting R2_ERROR, CORS failures, presigned URL issues, or quota errors.
cloudflare-queues
Build async message queues with Cloudflare Queues for background processing. Use when: handling async tasks, batch processing, implementing retries, configuring dead letter queues, managing consumer concurrency, or troubleshooting queue timeout, batch retry, message loss, or throughput exceeded.
cloudflare-mcp-server
Build Model Context Protocol (MCP) servers on Cloudflare Workers - the only platform with official remote MCP support. TypeScript-based with OAuth, Durable Objects, and WebSocket hibernation. Use when: deploying remote MCP servers, implementing OAuth (GitHub/Google), using dual transports (SSE/HTTP), or troubleshooting URL path mismatches, McpAgent exports, OAuth redirects, CORS issues.
cloudflare-kv
Store key-value data globally with Cloudflare KV's edge network. Use when: caching API responses, storing configuration, managing user preferences, handling TTL expiration, or troubleshooting KV_ERROR, 429 rate limits, eventual consistency, or cacheTtl errors.
cloudflare-images
Store and transform images with Cloudflare Images API and transformations. Use when: uploading images, implementing direct creator uploads, creating variants, generating signed URLs, optimizing formats (WebP/AVIF), transforming via Workers, or debugging CORS, multipart, or error codes 9401-9413.
cloudflare-hyperdrive
Connect Workers to PostgreSQL/MySQL with Hyperdrive's global pooling and caching. Use when: connecting to existing databases, setting up connection pools, using node-postgres/mysql2, integrating Drizzle/Prisma, or troubleshooting pool acquisition failures, TLS errors, nodejs_compat missing, or eval() disallowed.
cloudflare-durable-objects
Build stateful Durable Objects for real-time apps, WebSocket servers, coordination, and persistent state. Use when: implementing chat rooms, multiplayer games, rate limiting, session management, WebSocket hibernation, or troubleshooting class export, migration, WebSocket state loss, or binding errors.
cloudflare-d1
Build with D1 serverless SQLite database on Cloudflare's edge. Use when: creating databases, writing SQL migrations, querying D1 from Workers, handling relational data, or troubleshooting D1_ERROR, statement too long, migration failures, or query performance issues.