Cloudflare Workers AI — AI Inference at the Edge
You are an expert in Cloudflare Workers AI, the serverless AI inference platform running on Cloudflare's global network. You help developers run LLMs, embedding models, image generation, speech-to-text, and translation models at the edge with zero cold starts, pay-per-use pricing, and integration with Workers, Pages, and Vectorize — enabling AI features without managing GPU infrastructure.
Best use case
Cloudflare Workers AI — AI Inference at the Edge is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using this skill should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/cloudflare-ai/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
Frequently Asked Questions
What does this skill do?
It configures your AI agent as an expert in Cloudflare Workers AI, covering edge inference for LLMs, embedding models, image generation, speech-to-text, and translation, plus integration with Workers, Pages, and Vectorize.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Cloudflare Workers AI — AI Inference at the Edge
You are an expert in Cloudflare Workers AI, the serverless AI inference platform running on Cloudflare's global network. You help developers run LLMs, embedding models, image generation, speech-to-text, and translation models at the edge with zero cold starts, pay-per-use pricing, and integration with Workers, Pages, and Vectorize — enabling AI features without managing GPU infrastructure.
## Core Capabilities
### AI Inference in Workers
```typescript
// src/worker.ts — AI-powered API at the edge
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    // Text generation (LLM)
    if (url.pathname === "/api/chat") {
      const { messages } = await request.json();
      const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
        messages,
        max_tokens: 1024,
        temperature: 0.7,
        stream: true,
      });
      return new Response(response, {
        headers: { "Content-Type": "text/event-stream" },
      });
    }

    // Text embeddings (for RAG)
    if (url.pathname === "/api/embed") {
      const { text } = await request.json();
      const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
        text: Array.isArray(text) ? text : [text],
      });
      return Response.json({ embeddings: embeddings.data });
    }

    // Image generation
    if (url.pathname === "/api/generate-image") {
      const { prompt } = await request.json();
      const image = await env.AI.run("@cf/stabilityai/stable-diffusion-xl-base-1.0", {
        prompt,
        num_steps: 20,
      });
      return new Response(image, {
        headers: { "Content-Type": "image/png" },
      });
    }

    // Speech to text
    if (url.pathname === "/api/transcribe") {
      const audioData = await request.arrayBuffer();
      const result = await env.AI.run("@cf/openai/whisper", {
        audio: [...new Uint8Array(audioData)],
      });
      return Response.json({ text: result.text });
    }

    // Translation
    if (url.pathname === "/api/translate") {
      const { text, source_lang, target_lang } = await request.json();
      const result = await env.AI.run("@cf/meta/m2m100-1.2b", {
        text,
        source_lang,
        target_lang,
      });
      return Response.json({ translated: result.translated_text });
    }

    return new Response("Not Found", { status: 404 });
  },
};
```
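The handlers above reference an `Env` type carrying the platform bindings. A minimal declaration might look like the sketch below; the binding names `AI` and `VECTORIZE` are assumptions that must match the names declared in your `wrangler.toml`, and the `Ai` and `VectorizeIndex` types come from the `@cloudflare/workers-types` package:

```typescript
// env.d.ts — a sketch of the bindings assumed by the handlers above.
// Binding names must match those declared in wrangler.toml.
interface Env {
  AI: Ai;                    // Workers AI binding
  VECTORIZE: VectorizeIndex; // Vectorize index binding (used by the RAG example)
}
```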
### RAG with Vectorize
```typescript
// RAG pipeline: Embed → Store in Vectorize → Query → Generate
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { question } = await request.json();

    // Step 1: Embed the question
    const queryEmbedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: [question],
    });

    // Step 2: Search Vectorize
    const matches = await env.VECTORIZE.query(queryEmbedding.data[0], {
      topK: 5,
      returnMetadata: "all",
    });

    // Step 3: Generate answer with context
    const context = matches.matches.map(m => m.metadata?.text).join("\n\n");
    const answer = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: `Answer based on this context:\n${context}` },
        { role: "user", content: question },
      ],
    });

    return Response.json({
      answer: answer.response,
      sources: matches.matches.map(m => ({ text: m.metadata?.text, score: m.score })),
    });
  },
};
```
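The pipeline above assumes documents were already embedded and upserted into the Vectorize index. Before embedding, long documents are usually split into overlapping chunks so each piece fits the embedding model's input window. A minimal sketch of such a helper in plain TypeScript (`chunkText`, the window size, and the overlap are illustrative assumptions, not part of the Workers AI API):

```typescript
// chunk.ts — split a long document into overlapping windows before embedding.
// The 500-character window and 50-character overlap are illustrative defaults.
export function chunkText(text: string, size = 500, overlap = 50): string[] {
  if (size <= overlap) throw new Error("size must exceed overlap");
  const chunks: string[] = [];
  // Step forward by (size - overlap) so consecutive chunks share context.
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Each chunk would then be passed to `@cf/baai/bge-base-en-v1.5` and upserted into Vectorize with the chunk text stored as metadata, so the RAG query above can recover it.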
## Installation
```bash
# Create Workers project
npm create cloudflare@latest my-ai-app
```

```toml
# wrangler.toml
[ai]
binding = "AI"

[[vectorize]]
binding = "VECTORIZE"
index_name = "my-index"
```

```bash
# Deploy
npx wrangler deploy
```
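The `[[vectorize]]` binding assumes the index already exists; it can be created once with Wrangler. The 768-dimension setting here matches the `@cf/baai/bge-base-en-v1.5` embedding model used in the examples; adjust it for other embedding models.

```bash
# One-time index creation; the name must match index_name in wrangler.toml.
npx wrangler vectorize create my-index --dimensions=768 --metric=cosine
```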
## Best Practices
1. **Edge inference** — Models run on Cloudflare's network; <50ms latency worldwide, zero cold starts
2. **Streaming** — Use `stream: true` for LLM responses; first token in ~200ms at the edge
3. **Vectorize for RAG** — Use Cloudflare Vectorize for embedding storage; integrated with Workers AI
4. **Free tier** — 10K neurons/day free; enough for prototyping and low-volume production
5. **Model catalog** — Browse `@cf/` models; Llama 3.1, Mistral, Stable Diffusion, Whisper, BGE all available
6. **Gateway for routing** — Use AI Gateway for caching, rate limiting, analytics, and fallback to OpenAI/Anthropic
7. **R2 for storage** — Store generated images, audio in R2 (S3-compatible); zero egress fees
8. **No GPU management** — Cloudflare manages the GPU fleet; you pay per inference, not per GPU-hour

Related Skills
triton-inference-config
Triton Inference Config - Auto-activating skill for ML Deployment. Triggers on: triton inference config. Part of the ML Deployment skill category.
inference-latency-profiler
Inference Latency Profiler - Auto-activating skill for ML Deployment. Triggers on: inference latency profiler. Part of the ML Deployment skill category.
clade-model-inference
Stream Claude responses, use system prompts, handle multi-turn conversations, and process structured output with the Messages API. Use when working with model-inference patterns. Trigger with "anthropic streaming", "claude messages api", "claude inference", "stream claude response".
batch-inference-pipeline
Batch Inference Pipeline - Auto-activating skill for ML Deployment. Triggers on: batch inference pipeline. Part of the ML Deployment skill category.
cloudflare-troubleshooting
Investigate and resolve Cloudflare configuration issues using API-driven evidence gathering. Use when troubleshooting ERR_TOO_MANY_REDIRECTS, SSL errors, DNS issues, or any Cloudflare-related problems. Focus on systematic investigation using Cloudflare API to examine actual configuration rather than making assumptions.
sharp-edges
Identify error-prone APIs and dangerous configurations
inference-sh
Run 150+ AI apps via inference.sh CLI - image generation, video creation, LLMs, search, 3D, Twitter automation. Models: FLUX, Veo, Gemini, Grok, Claude, Seedance, OmniHuman, Tavily, Exa, OpenRouter, and many more. Use when running AI apps, generating images/videos, calling LLMs, web search, or automating Twitter. Triggers: inference.sh, infsh, ai model, run ai, serverless ai, ai api, flux, veo, claude api, image generation, video generation, openrouter, tavily, exa search, twitter api, grok
knowledge-base
A professional knowledge-base management system designed to counter the "curse of knowledge" and cognitive biases. It makes tacit knowledge explicit, scans code to extract domain concepts, and incorporates industry best practices to build a structured Markdown knowledge base.
notion-knowledge-capture
Capture conversations and decisions into structured Notion pages; use when turning chats/notes into wiki entries, how-tos, decisions, or FAQs with proper linking.
recursive-knowledge
Process large document corpora (1000+ docs, millions of tokens) through knowledge graph construction and stateful multi-hop reasoning. Use when (1) User provides a large corpus exceeding context limits, (2) Questions require connections across multiple documents, (3) Multi-hop reasoning needed for complex queries, (4) User wants persistent queryable knowledge from documents. Replaces brute-force document stuffing with intelligent graph traversal.
project-knowledge
Load project architecture and structure knowledge. Use when you need to understand how this project is organized.
reconnaissance-knowledge
Comprehensive knowledge about network reconnaissance and service enumeration. Provides methodologies for port scanning, service fingerprinting, web directory discovery, and vulnerability identification. Includes best practices for structured data collection.