Cloudflare Workers AI — AI Inference at the Edge
You are an expert in Cloudflare Workers AI, the serverless AI inference platform running on Cloudflare's global network. You help developers run LLMs, embedding models, image generation, speech-to-text, and translation models at the edge with zero cold starts, pay-per-use pricing, and integration with Workers, Pages, and Vectorize — enabling AI features without managing GPU infrastructure.
Best use case
Cloudflare Workers AI — AI Inference at the Edge is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using this skill should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/cloudflare-ai/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
Frequently Asked Questions
What does this skill do?
It configures your AI agent as an expert in Cloudflare Workers AI, covering edge inference for LLMs, embedding models, image generation, speech-to-text, and translation, plus integration with Workers, Pages, and Vectorize.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Cloudflare Workers AI — AI Inference at the Edge
You are an expert in Cloudflare Workers AI, the serverless AI inference platform running on Cloudflare's global network. You help developers run LLMs, embedding models, image generation, speech-to-text, and translation models at the edge with zero cold starts, pay-per-use pricing, and integration with Workers, Pages, and Vectorize — enabling AI features without managing GPU infrastructure.
## Core Capabilities
### AI Inference in Workers
```typescript
// src/worker.ts — AI-powered API at the edge
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    // Text generation (LLM)
    if (url.pathname === "/api/chat") {
      const { messages } = await request.json();
      const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
        messages,
        max_tokens: 1024,
        temperature: 0.7,
        stream: true,
      });
      return new Response(response, {
        headers: { "Content-Type": "text/event-stream" },
      });
    }

    // Text embeddings (for RAG)
    if (url.pathname === "/api/embed") {
      const { text } = await request.json();
      const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
        text: Array.isArray(text) ? text : [text],
      });
      return Response.json({ embeddings: embeddings.data });
    }

    // Image generation
    if (url.pathname === "/api/generate-image") {
      const { prompt } = await request.json();
      const image = await env.AI.run("@cf/stabilityai/stable-diffusion-xl-base-1.0", {
        prompt,
        num_steps: 20,
      });
      return new Response(image, {
        headers: { "Content-Type": "image/png" },
      });
    }

    // Speech to text
    if (url.pathname === "/api/transcribe") {
      const audioData = await request.arrayBuffer();
      const result = await env.AI.run("@cf/openai/whisper", {
        audio: [...new Uint8Array(audioData)],
      });
      return Response.json({ text: result.text });
    }

    // Translation
    if (url.pathname === "/api/translate") {
      const { text, source_lang, target_lang } = await request.json();
      const result = await env.AI.run("@cf/meta/m2m100-1.2b", {
        text,
        source_lang,
        target_lang,
      });
      return Response.json({ translated: result.translated_text });
    }

    return new Response("Not Found", { status: 404 });
  },
};
```
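The handlers above reference an `Env` type carrying the platform bindings. A minimal declaration might look like the sketch below; the binding names `AI` and `VECTORIZE` are assumptions that must match the names declared in your `wrangler.toml`, and the `Ai` and `VectorizeIndex` types come from the `@cloudflare/workers-types` package:

```typescript
// env.d.ts — a sketch of the bindings assumed by the handlers above.
// Binding names must match those declared in wrangler.toml.
interface Env {
  AI: Ai;                    // Workers AI binding
  VECTORIZE: VectorizeIndex; // Vectorize index binding (used by the RAG example)
}
```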
### RAG with Vectorize
```typescript
// RAG pipeline: Embed → Store in Vectorize → Query → Generate
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { question } = await request.json();

    // Step 1: Embed the question
    const queryEmbedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: [question],
    });

    // Step 2: Search Vectorize
    const matches = await env.VECTORIZE.query(queryEmbedding.data[0], {
      topK: 5,
      returnMetadata: "all",
    });

    // Step 3: Generate answer with context
    const context = matches.matches.map(m => m.metadata?.text).join("\n\n");
    const answer = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: `Answer based on this context:\n${context}` },
        { role: "user", content: question },
      ],
    });

    return Response.json({
      answer: answer.response,
      sources: matches.matches.map(m => ({ text: m.metadata?.text, score: m.score })),
    });
  },
};
```
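The pipeline above assumes documents were already embedded and upserted into the Vectorize index. Before embedding, long documents are usually split into overlapping chunks so each piece fits the embedding model's input window. A minimal sketch of such a helper in plain TypeScript (`chunkText`, the window size, and the overlap are illustrative assumptions, not part of the Workers AI API):

```typescript
// chunk.ts — split a long document into overlapping windows before embedding.
// The 500-character window and 50-character overlap are illustrative defaults.
export function chunkText(text: string, size = 500, overlap = 50): string[] {
  if (size <= overlap) throw new Error("size must exceed overlap");
  const chunks: string[] = [];
  // Step forward by (size - overlap) so consecutive chunks share context.
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Each chunk would then be passed to `@cf/baai/bge-base-en-v1.5` and upserted into Vectorize with the chunk text stored as metadata, so the RAG query above can recover it.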
## Installation
```bash
# Create Workers project
npm create cloudflare@latest my-ai-app
```

```toml
# wrangler.toml
[ai]
binding = "AI"

[[vectorize]]
binding = "VECTORIZE"
index_name = "my-index"
```

```bash
# Deploy
npx wrangler deploy
```
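The `[[vectorize]]` binding assumes the index already exists; it can be created once with Wrangler. The 768-dimension setting here matches the `@cf/baai/bge-base-en-v1.5` embedding model used in the examples; adjust it for other embedding models.

```bash
# One-time index creation; the name must match index_name in wrangler.toml.
npx wrangler vectorize create my-index --dimensions=768 --metric=cosine
```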
## Best Practices
1. **Edge inference** — Models run on Cloudflare's network; <50ms latency worldwide, zero cold starts
2. **Streaming** — Use `stream: true` for LLM responses; first token in ~200ms at the edge
3. **Vectorize for RAG** — Use Cloudflare Vectorize for embedding storage; integrated with Workers AI
4. **Free tier** — 10K neurons/day free; enough for prototyping and low-volume production
5. **Model catalog** — Browse `@cf/` models; Llama 3.1, Mistral, Stable Diffusion, Whisper, BGE all available
6. **Gateway for routing** — Use AI Gateway for caching, rate limiting, analytics, and fallback to OpenAI/Anthropic
7. **R2 for storage** — Store generated images, audio in R2 (S3-compatible); zero egress fees
8. **No GPU management** — Cloudflare manages the GPU fleet; you pay per inference, not per GPU-hour

Related Skills
triton-inference-config
Triton Inference Config - Auto-activating skill for ML Deployment. Triggers on: triton inference config. Part of the ML Deployment skill category.
inference-latency-profiler
Inference Latency Profiler - Auto-activating skill for ML Deployment. Triggers on: inference latency profiler. Part of the ML Deployment skill category.
clade-model-inference
Stream Claude responses, use system prompts, handle multi-turn conversations, and process structured output with the Messages API. Use when working with model-inference patterns. Trigger with "anthropic streaming", "claude messages api", "claude inference", "stream claude response".
batch-inference-pipeline
Batch Inference Pipeline - Auto-activating skill for ML Deployment. Triggers on: batch inference pipeline. Part of the ML Deployment skill category.
cloudflare-troubleshooting
Investigate and resolve Cloudflare configuration issues using API-driven evidence gathering. Use when troubleshooting ERR_TOO_MANY_REDIRECTS, SSL errors, DNS issues, or any Cloudflare-related problems. Focus on systematic investigation using Cloudflare API to examine actual configuration rather than making assumptions.
sharp-edges
Identify error-prone APIs and dangerous configurations
inference-sh
Run 150+ AI apps via inference.sh CLI - image generation, video creation, LLMs, search, 3D, Twitter automation. Models: FLUX, Veo, Gemini, Grok, Claude, Seedance, OmniHuman, Tavily, Exa, OpenRouter, and many more. Use when running AI apps, generating images/videos, calling LLMs, web search, or automating Twitter. Triggers: inference.sh, infsh, ai model, run ai, serverless ai, ai api, flux, veo, claude api, image generation, video generation, openrouter, tavily, exa search, twitter api, grok
knowledge-base
A professional knowledge-base management system designed to counter the "curse of knowledge" and cognitive biases. It makes tacit knowledge explicit, scans code to extract domain concepts, and incorporates industry best practices to build a structured Markdown knowledge base.
notion-knowledge-capture
Capture conversations and decisions into structured Notion pages; use when turning chats/notes into wiki entries, how-tos, decisions, or FAQs with proper linking.
recursive-knowledge
Process large document corpora (1000+ docs, millions of tokens) through knowledge graph construction and stateful multi-hop reasoning. Use when (1) User provides a large corpus exceeding context limits, (2) Questions require connections across multiple documents, (3) Multi-hop reasoning needed for complex queries, (4) User wants persistent queryable knowledge from documents. Replaces brute-force document stuffing with intelligent graph traversal.
project-knowledge
Load project architecture and structure knowledge. Use when you need to understand how this project is organized.
reconnaissance-knowledge
Comprehensive knowledge about network reconnaissance and service enumeration. Provides methodologies for port scanning, service fingerprinting, web directory discovery, and vulnerability identification. Includes best practices for structured data collection.