# LocalAI — Self-Hosted OpenAI Alternative
## Overview
### Best use case
Use LocalAI — Self-Hosted OpenAI Alternative when you need a repeatable AI agent workflow rather than a one-off prompt.
Teams using this skill should expect more consistent output, faster repeated execution, and less prompt rewriting.
### When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
### When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
## Installation
### Claude Code / Cursor / Codex
#### Manual Installation
- Download `SKILL.md` from GitHub
- Place it in `.claude/skills/localai/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
## How LocalAI — Self-Hosted OpenAI Alternative Compares
| Feature / Agent | LocalAI — Self-Hosted OpenAI Alternative | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
## Frequently Asked Questions
### What does this skill do?
It gives your agent a reusable workflow for LocalAI, an open-source drop-in replacement for OpenAI's API that runs locally: self-hosting LLMs, image generation, audio transcription, and text-to-speech behind an OpenAI-compatible API.
### Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
## SKILL.md Source
# LocalAI — Self-Hosted OpenAI Alternative
## Overview
LocalAI is an open-source drop-in replacement for OpenAI's API that runs locally. It helps developers self-host LLMs, image generators, audio transcription, and text-to-speech models behind an OpenAI-compatible API — no GPU required, completely offline and private.
## Instructions
### Quick Start with Docker
```bash
# Run LocalAI with Docker (CPU-only, no GPU needed)
docker run -p 8080:8080 \
  -v ./models:/build/models \
  localai/localai:latest-cpu

# With GPU support (NVIDIA CUDA)
docker run -p 8080:8080 --gpus all \
  -v ./models:/build/models \
  localai/localai:latest-gpu-nvidia-cuda-12
```
```yaml
# docker-compose.yml — Production LocalAI setup
version: "3.8"
services:
  localai:
    image: localai/localai:latest-cpu
    ports:
      - "8080:8080"
    volumes:
      - ./models:/build/models
    environment:
      - THREADS=4 # CPU threads for inference
      - CONTEXT_SIZE=4096 # Default context window
      - GALLERIES=[{"name":"model-gallery","url":"github:mudler/LocalAI/gallery/index.yaml@master"}]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 30s
      timeout: 10s
```
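Once the stack is up, an application can gate traffic on the same `/readyz` endpoint the compose healthcheck polls. A minimal sketch (assumes Node 18+ for global `fetch` and the default `localhost:8080` address from the compose file above):

```typescript
// check-localai.ts: wait until LocalAI reports ready before sending requests
const BASE_URL = "http://localhost:8080";

async function waitForLocalAI(timeoutMs = 60_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      // Same endpoint the compose healthcheck polls
      const res = await fetch(`${BASE_URL}/readyz`);
      if (res.ok) return;
    } catch {
      // Server not accepting connections yet; fall through to the retry delay
    }
    await new Promise((resolve) => setTimeout(resolve, 2_000));
  }
  throw new Error("LocalAI did not become ready in time");
}

waitForLocalAI().then(() => console.log("LocalAI is ready"));
```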
### Model Installation
```bash
# Install models from the gallery (via API)
curl -X POST http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{"id": "huggingface://TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q5_K_M.gguf"}'

# Or download GGUF files directly into the models directory
wget -P ./models/ \
  https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf

# Create a model configuration
cat > ./models/mistral.yaml << 'EOF'
name: mistral
backend: llama-cpp
parameters:
  model: mistral-7b-instruct-v0.2.Q5_K_M.gguf
  temperature: 0.7
  top_p: 0.9
  top_k: 40
context_size: 8192
template:
  chat_message: |
    {{.RoleName}}: {{.Content}}
  chat: |
    [INST] {{.Input}} [/INST]
EOF

# List available models
curl http://localhost:8080/v1/models | jq '.data[].id'
```
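The gallery apply call is asynchronous: LocalAI downloads the model in the background. A hedged sketch of applying a model and polling the job from code; it assumes `POST /models/apply` returns a `uuid` and that `GET /models/jobs/<uuid>` reports a `processed` flag, which has varied across LocalAI releases, so verify against your version's docs:

```typescript
// install-model.ts: apply a gallery model, then poll the install job.
// Assumed response shapes: { uuid } from /models/apply and { processed }
// from /models/jobs/<uuid>.
const BASE_URL = "http://localhost:8080";

async function installModel(id: string): Promise<void> {
  const res = await fetch(`${BASE_URL}/models/apply`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ id }),
  });
  const { uuid } = (await res.json()) as { uuid: string };

  // Poll until the background download/install job finishes
  for (;;) {
    const job = await fetch(`${BASE_URL}/models/jobs/${uuid}`);
    const status = (await job.json()) as { processed?: boolean; message?: string };
    if (status.processed) return;
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }
}

installModel(
  "huggingface://TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q5_K_M.gguf",
);
```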
### OpenAI-Compatible API
```typescript
// src/local-ai.ts — Use LocalAI with OpenAI SDK
import fs from "node:fs";
import OpenAI from "openai";

const ai = new OpenAI({
  apiKey: "not-needed", // LocalAI does not require a real key
  baseURL: "http://localhost:8080/v1",
});

// Chat completions
async function chat(prompt: string) {
  const response = await ai.chat.completions.create({
    model: "mistral", // Model name from config
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: prompt },
    ],
    temperature: 0.7,
  });
  return response.choices[0].message.content;
}

// Embeddings
async function embed(texts: string[]) {
  const response = await ai.embeddings.create({
    model: "text-embedding-ada-002", // Mapped to local embedding model
    input: texts,
  });
  return response.data.map(d => d.embedding);
}

// Image generation (Stable Diffusion backend)
async function generateImage(prompt: string) {
  const response = await ai.images.generate({
    model: "stablediffusion",
    prompt,
    n: 1,
    size: "512x512",
  });
  return response.data[0].url;
}

// Audio transcription (Whisper backend)
async function transcribe(audioPath: string) {
  const response = await ai.audio.transcriptions.create({
    model: "whisper-1",
    file: fs.createReadStream(audioPath),
  });
  return response.text;
}

// Text-to-speech
async function textToSpeech(text: string) {
  const response = await ai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: text,
  });
  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync("output.mp3", buffer);
}
```
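Streaming needs no LocalAI-specific handling: the OpenAI SDK's `stream: true` mode works against the same endpoint, provided the configured model supports it. A minimal sketch reusing the `ai` client and `mistral` config above:

```typescript
// Streaming chat completions: tokens arrive incrementally as SSE chunks
async function chatStream(prompt: string) {
  const stream = await ai.chat.completions.create({
    model: "mistral",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
  }
}
```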
### Multi-Model Configuration
```yaml
# models/chat-model.yaml — Chat model
name: chat
backend: llama-cpp
parameters:
  model: llama-3.1-8b-instruct.Q5_K_M.gguf
context_size: 8192
threads: 4
gpu_layers: 0 # 0 = CPU only, increase for GPU offloading
---
# models/code-model.yaml — Code completion model
name: code
backend: llama-cpp
parameters:
  model: codellama-7b-instruct.Q5_K_M.gguf
context_size: 16384
threads: 4
---
# models/embedding-model.yaml — Embedding model
name: embedding
backend: sentencetransformers
parameters:
  model: all-MiniLM-L6-v2
---
# models/whisper-model.yaml — Audio transcription
name: whisper-1
backend: whisper
parameters:
  model: whisper-base.bin
language: en
```
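With one config per purpose, application code just picks a model by its `name` field. A small sketch; the task-to-model mapping below is illustrative, not part of LocalAI:

```typescript
// route-model.ts: pick a model name per task (mapping is hypothetical)
import OpenAI from "openai";

const ai = new OpenAI({
  apiKey: "not-needed",
  baseURL: "http://localhost:8080/v1",
});

type Task = "chat" | "code";

// Names match the YAML configs above
const MODEL_FOR_TASK: Record<Task, string> = {
  chat: "chat", // llama-3.1-8b-instruct
  code: "code", // codellama-7b-instruct
};

async function complete(task: Task, prompt: string) {
  const response = await ai.chat.completions.create({
    model: MODEL_FOR_TASK[task],
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0].message.content;
}
```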
### Function Calling
```typescript
// LocalAI supports function calling with compatible models
async function chatWithFunctions(prompt: string) {
  const response = await ai.chat.completions.create({
    model: "mistral",
    messages: [{ role: "user", content: prompt }],
    tools: [
      {
        type: "function",
        function: {
          name: "get_current_weather",
          description: "Get the weather for a location",
          parameters: {
            type: "object",
            properties: {
              location: { type: "string" },
              unit: { type: "string", enum: ["celsius", "fahrenheit"] },
            },
            required: ["location"],
          },
        },
      },
    ],
    tool_choice: "auto",
  });
  return response;
}
```
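The response above only carries the model's tool call; the application still executes the function and returns the result for a final answer. A sketch of that round trip, continuing from `chatWithFunctions`; the `lookupWeather` helper is a hypothetical stand-in:

```typescript
// Execute the requested tool and send its result back for a final answer.
// lookupWeather is a hypothetical stand-in for a real data source.
async function lookupWeather(location: string) {
  return { location, tempC: 21, condition: "sunny" };
}

async function answerWithWeather(prompt: string) {
  const first = await chatWithFunctions(prompt);
  const message = first.choices[0].message;
  const call = message.tool_calls?.[0];
  if (!call || call.type !== "function") return message.content; // No tool needed

  const args = JSON.parse(call.function.arguments) as { location: string };
  const weather = await lookupWeather(args.location);

  const second = await ai.chat.completions.create({
    model: "mistral",
    messages: [
      { role: "user", content: prompt },
      message, // Assistant turn containing the tool call
      { role: "tool", tool_call_id: call.id, content: JSON.stringify(weather) },
    ],
  });
  return second.choices[0].message.content;
}
```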
## Installation
```bash
# Docker (recommended)
docker pull localai/localai:latest-cpu

# Binary (Linux/macOS)
curl -Lo local-ai https://github.com/mudler/LocalAI/releases/latest/download/local-ai-$(uname -s)-$(uname -m)
chmod +x local-ai
./local-ai --models-path ./models

# Homebrew (macOS)
brew install localai
```
## Examples
### Example 1: Integrating LocalAI into an existing application
**User request:**
```
Add LocalAI to my Next.js app for the AI chat feature. I want streaming responses.
```
The agent installs the SDK, creates an API route that initializes the LocalAI client, configures streaming, selects an appropriate model, and wires up the frontend to consume the stream. It handles error cases and sets up environment variable management for the base URL (LocalAI needs no real API key).
### Example 2: Optimizing LocalAI inference performance
**User request:**
```
My LocalAI calls are slow and resource-hungry. Help me optimize the setup.
```
The agent reviews the current implementation, identifies issues (wrong model selection, missing caching, inefficient prompting, no batching), and applies optimizations specific to LocalAI's capabilities — adjusting model parameters, adding response caching, and implementing retry logic with exponential backoff (sketched below).
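As a concrete instance of that last optimization, a minimal retry wrapper with exponential backoff (a generic pattern, not a LocalAI API):

```typescript
// retry.ts: generic retry with exponential backoff (1s, 2s, 4s, ...)
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await new Promise((resolve) => setTimeout(resolve, 1_000 * 2 ** attempt));
    }
  }
  throw lastError;
}

// Usage: const reply = await withRetry(() => chat("Summarize this document"));
```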
## Guidelines
1. **CPU is fine for most use cases** — 7B models run well on CPU; GPU helps for 13B+ and image generation
2. **Q5_K_M quantization** — Best balance of quality and speed; Q4_K_M for faster inference, Q6_K for higher quality
3. **One model per purpose** — Run separate models for chat, embedding, and code; don't force one model to do everything
4. **Docker for production** — Use Docker Compose with health checks and restart policies; don't run the binary directly
5. **OpenAI SDK compatibility** — Your existing OpenAI code works with LocalAI; just change the base URL
6. **Context size = memory** — Each model uses roughly context_size × 2 MB of RAM; set context_size based on available memory (see the sketch after this list)
7. **Thread count = physical cores** — Set `THREADS` to your physical CPU core count; hyperthreading doesn't help inference
8. **Gallery for easy setup** — Use the model gallery for one-click model installation instead of manual GGUF downloads
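To make guideline 6 concrete, a back-of-the-envelope estimator based on the ~2 MB-per-context-token rule of thumb above (treat the constant as an approximation, not a guarantee):

```typescript
// memory-estimate.ts: rough RAM needs per guideline 6 (~2 MB per context token)
const MB_PER_CONTEXT_TOKEN = 2; // Rule-of-thumb constant, not a measured value

function estimateRamGb(contextSize: number): number {
  return (contextSize * MB_PER_CONTEXT_TOKEN) / 1024;
}

console.log(estimateRamGb(4096)); // ≈ 8 GB
console.log(estimateRamGb(8192)); // ≈ 16 GB
```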
## Related Skills
- **hosted-agents**: This skill should be used when the user asks to "build background agent", "create hosted coding agent", "set up sandboxed execution", "implement multiplayer agent", or mentions background agents, sandboxed VMs, agent infrastructure, Modal sandboxes, self-spawning agents, or remote coding environments.
- **OpenAI Whisper API (curl)**: Transcribe an audio file via OpenAI’s `/v1/audio/transcriptions` endpoint.
- **OpenAI Image Gen**: Generate a handful of “random but structured” prompts and render them via the OpenAI Images API.
- **hosted-agents-v2-py**: Build hosted agents using Azure AI Projects SDK with ImageBasedHostedAgentDefinition. Use when creating container-based agents that run custom code in Azure AI Foundry. Triggers: "ImageBasedHostedAgentDefinition", "hosted agent", "container agent", "create_version", "ProtocolVersionRecord", "AgentProtocol.RESPONSES".
- **azure-ai-openai-dotnet**: Azure OpenAI SDK for .NET. Client library for Azure OpenAI and OpenAI services. Use for chat completions, embeddings, image generation, audio transcription, and assistants. Triggers: "Azure OpenAI", "AzureOpenAIClient", "ChatClient", "chat completions .NET", "GPT-4", "embeddings", "DALL-E", "Whisper", "OpenAI .NET".
- **azure-hosted-copilot-sdk**: Build and deploy GitHub Copilot SDK apps to Azure. USE FOR: build copilot app, create copilot app, copilot SDK, @github/copilot-sdk, scaffold copilot project, copilot-powered app, deploy copilot app, host on azure, azure model, BYOM, bring your own model, use my own model, azure openai model, DefaultAzureCredential, self-hosted model, copilot SDK service, chat app with copilot, copilot-sdk-service template, azd init copilot, CopilotClient, createSession, sendAndWait, GitHub Models API. DO NOT USE FOR: using Copilot (not building with it), Copilot Extensions, Azure Functions without Copilot, general web apps without copilot SDK, Foundry agent hosting (use microsoft-foundry skill), agent evaluation (use microsoft-foundry skill).
- **self-test**: Pattern for testing your own code during implementation. Ensures quality before declaring complete.
- **self-improving-agent**: A universal self-improving agent that learns from ALL skill experiences. Uses multi-memory architecture (semantic + episodic + working) to continuously evolve the codebase. Auto-triggers on skill completion/error with hooks-based self-correction.
- **scaffolding-openai-agents**: Builds AI agents using OpenAI Agents SDK with async/await patterns and multi-agent orchestration. Use when creating tutoring agents, building agent handoffs, implementing tool-calling agents, or orchestrating multiple specialists. Covers Agent class, Runner patterns, function tools, guardrails, and streaming responses. NOT when using raw OpenAI API without SDK or other agent frameworks like LangChain.
- **openai-docs-skill**: Query the OpenAI developer documentation via the OpenAI Docs MCP server using CLI (curl/jq). Use whenever a task involves the OpenAI API (Responses, Chat Completions, Realtime, etc.), OpenAI SDKs, ChatGPT Apps SDK, Codex, MCP integrations, endpoint schemas, parameters, limits, or migrations and you need up-to-date official guidance.