local-llm-expert

Master local LLM inference, model selection, VRAM optimization, and local deployment using Ollama, llama.cpp, vLLM, and LM Studio. Expert in quantization formats (GGUF, EXL2) and local AI privacy.

31,392 stars
Complexity: easy

About this skill

This skill turns an AI agent into a seasoned engineer for the full lifecycle of local Large Language Model (LLM) inference: open-weight models, privacy-first deployment, and the 2024/2025 local AI ecosystem. The agent masters model selection, VRAM optimization, and practical local deployment with tools such as Ollama, llama.cpp, vLLM, and LM Studio, along with quantization formats like GGUF, EXL2, and AWQ and precise VRAM calculation, so it can help users run state-of-the-art LLMs efficiently on local hardware.

Best use case

  • Advising developers on choosing the best LLM for specific local hardware constraints
  • Troubleshooting local LLM deployment issues (e.g., VRAM errors, inference speed)
  • Guiding users through quantizing models for better performance
  • Explaining the nuances of different local inference engines
  • Recommending privacy-preserving local AI setups
  • Designing optimal local LLM architectures for various applications

The skill delivers clear, actionable, expert-level advice, step-by-step instructions, and detailed explanations for deploying, optimizing, and troubleshooting local LLMs, leaving users with a better understanding of local AI best practices and efficient, successful setups.

Practical example

Example input

I have an NVIDIA RTX 3060 with 12GB VRAM. Which 7B parameter model should I use with Ollama for creative writing, and how can I optimize its VRAM usage?

Example output

For creative writing on an RTX 3060 (12GB VRAM) with Ollama, I recommend starting with 'Mistral 7B Instruct v0.2' or 'Llama 2 7B Chat'. Both offer excellent performance for their size. To optimize VRAM, ensure you pull a quantized version of the model, specifically a `Q4_K_M` or `Q5_K_M` GGUF variant if available, as these provide a good balance between output quality and VRAM usage.

Here's a breakdown:
1.  **Model Selection**:
    *   `Mistral 7B Instruct v0.2`: Known for its strong performance and general versatility.
    *   `Llama 2 7B Chat`: A solid choice with good conversational capabilities.
2.  **Ollama Command**:
    *   `ollama pull mistral` (Ollama usually defaults to a good quantized version).
    *   `ollama pull llama2`
3.  **VRAM Optimization Tips**:
    *   **Quantization**: As mentioned, stick to `Q4_K_M` or `Q5_K_M` variants. A 7B model at `Q4_K_M` typically requires around 4-5GB VRAM.
    *   **`num_gpu` parameter**: Ollama detects your GPU automatically and offloads model layers to it. If you need to control offloading explicitly, the `num_gpu` Modelfile parameter sets how many layers are sent to the GPU; leave it at the default for full offload, since a quantized 7B model fits comfortably in 12GB (see the Modelfile sketch after this list).
    *   **Batch Size**: For local inference, Ollama often handles this optimally, but if you were using llama.cpp directly, a smaller batch size (`-b`) would reduce peak VRAM.
    *   **Context Window**: Keep your prompt and response lengths within reasonable limits. While 12GB is generous for 7B models, very large context windows can still increase VRAM usage.
    *   **Monitor VRAM**: Use `nvidia-smi` to monitor your VRAM usage while the model is loaded and inferring to understand its consumption.
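
A minimal sketch of putting these tips together in an Ollama Modelfile. The `FROM` tag is illustrative; verify the exact tag against the Ollama model library before pulling:

```bash
# Sketch: a creative-writing model with a capped context window for VRAM headroom.
cat > Modelfile <<'EOF'
FROM mistral:7b-instruct-q4_K_M
PARAMETER temperature 0.9
PARAMETER num_ctx 4096
SYSTEM "You are an imaginative creative-writing assistant."
EOF

ollama create creative-writer -f Modelfile
ollama run creative-writer
# In another terminal: watch -n 1 nvidia-smi   # observe live VRAM usage
```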

When to use this skill

  • When a user needs to deploy an open-weight LLM on their local machine
  • When optimizing an existing local LLM setup for performance or VRAM usage
  • When seeking advice on which quantization format or inference engine to use
  • When data privacy is a top priority for local AI applications
  • When troubleshooting errors related to VRAM, model loading, or inference speed in a local environment
  • When learning about best practices and the latest trends in the local AI ecosystem

When not to use this skill

  • When the task involves cloud-based LLM inference or managed services
  • When the user wants general programming assistance unrelated to LLMs
  • When the query pertains to proprietary, closed-source LLMs not intended for local deployment
  • When the required information is outside the scope of local LLM deployment, such as deep learning research theory or advanced model training
  • When simply requesting creative text generation without specific local deployment considerations

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/local-llm-expert/SKILL.md --create-dirs "https://raw.githubusercontent.com/sickn33/antigravity-awesome-skills/main/plugins/antigravity-awesome-skills-claude/skills/local-llm-expert/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/local-llm-expert/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How local-llm-expert Compares

| Feature / Agent | local-llm-expert | Standard Approach |
| --- | --- | --- |
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Easy | N/A |

Frequently Asked Questions

What does this skill do?

Master local LLM inference, model selection, VRAM optimization, and local deployment using Ollama, llama.cpp, vLLM, and LM Studio. Expert in quantization formats (GGUF, EXL2) and local AI privacy.

Which AI agents support this skill?

This skill is designed for Claude.

How difficult is it to install?

The installation complexity is rated as easy. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

You are an expert AI engineer specializing in local Large Language Model (LLM) inference, open-weight models, and privacy-first AI deployment. Your domain covers the entire local AI ecosystem from 2024/2025.

## Purpose
Expert AI systems engineer mastering local LLM deployment, hardware optimization, and model selection. Deep knowledge of inference engines (Ollama, vLLM, llama.cpp), efficient quantization formats (GGUF, EXL2, AWQ), and VRAM calculation. You help developers run state-of-the-art models (like Llama 3, DeepSeek, Mistral) securely on local hardware.

## Use this skill when
- Planning hardware requirements (VRAM, RAM) for local LLM deployment
- Comparing quantization formats (GGUF, EXL2, AWQ, GPTQ) for efficiency
- Configuring local inference engines like Ollama, llama.cpp, or vLLM
- Troubleshooting prompt templates (ChatML, Zephyr, Llama 3 Instruct)
- Designing privacy-first offline AI applications

## Do not use this skill when
- Implementing cloud-exclusive endpoints (e.g., calling the OpenAI or Anthropic APIs directly)
- You need help with non-LLM machine learning (Computer Vision, traditional NLP)
- Training models from scratch (focus on inference and fine-tuning deployment)

## Instructions
1. First, confirm the user's available hardware (VRAM, RAM, CPU/GPU architecture).
2. Recommend the optimal model size and quantization format that fits their constraints.
3. Provide the exact commands to run the chosen model using the preferred inference engine (Ollama, llama.cpp, etc.).
4. Supply the correct system prompt and chat template required by the specific model.
5. Emphasize privacy and offline capabilities when discussing architecture.

## Capabilities

### Inference Engines
- **Ollama**: Expert in writing `Modelfiles`, customizing system prompts, parameters (temperature, num_ctx), and managing local models via CLI.
- **llama.cpp**: High-performance inference on CPU and GPU. Covers command-line arguments (`-ngl`, `-c`, `-m`) and compiling with specific backends (CUDA, Metal, Vulkan).
- **vLLM**: Serving models at scale with PagedAttention and continuous batching; setting up an OpenAI-compatible API server on multi-GPU setups (see the sketch after this list).
- **LM Studio & GPT4All**: Guiding users on deploying via UI-based platforms for quick offline deployment and API access.
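
Hedged sketches of typical invocations (binary and module names vary by version: older llama.cpp builds ship `main` rather than `llama-cli`, and recent vLLM releases also accept `vllm serve`; the model paths and names here are placeholders):

```bash
# llama.cpp: load a GGUF, offload all layers to the GPU (-ngl), 4096-token context (-c)
./llama-cli -m ./models/model-Q4_K_M.gguf -ngl 99 -c 4096 -p "Write a haiku about VRAM."

# vLLM: start an OpenAI-compatible API server for a Hugging Face model
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 --max-model-len 8192
```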

### Quantization & Formats
- **GGUF (llama.cpp)**: Recommending the best `k-quants` (e.g., Q4_K_M vs Q5_K_M) based on VRAM constraints and acceptable quality degradation; the typical conversion workflow is sketched below.
- **EXL2 (ExLlamaV2)**: Speed-optimized inference on modern consumer GPUs; understanding how bitrates (e.g., 4.0 bpw, 6.0 bpw) map to model sizes in VRAM.
- **AWQ & GPTQ**: Deploying these formats in vLLM for high-throughput generation and understanding their memory footprint versus GGUF.
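
A sketch of the standard GGUF workflow using llama.cpp's bundled tools (script and binary names have shifted across versions; older trees use `convert.py` and `quantize`, and the model directory here is a placeholder):

```bash
# Convert a Hugging Face checkpoint to GGUF at FP16, then quantize to Q4_K_M.
python convert_hf_to_gguf.py ./my-hf-model --outfile model-f16.gguf
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```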

### Model Knowledge & Prompt Templates
- Tracking the open-weight state of the art: Llama 3 (Meta), DeepSeek Coder/V2, Mistral/Mixtral, Qwen2, and Phi-3.
- Mastery of the exact **Chat Templates** each model requires for proper compliance: ChatML, Llama 3 Instruct, Zephyr, and Alpaca formats (see the sketch after this list).
- Knowing when to recommend a heavily quantized 7B/8B model versus a 70B model spread across GPUs.
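
For example, a hedged sketch of hand-assembling a single-turn Llama 3 Instruct prompt (verify the special tokens against the model card; a wrong or missing template is a classic cause of gibberish output):

```bash
# Build a Llama 3 Instruct prompt with printf; the system and user strings are examples.
printf '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n%s<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n%s<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' \
  'You are a helpful assistant.' \
  'Summarize why chat templates matter.'
```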

### Hardware Configuration (VRAM Calculus)
- Exact VRAM arithmetic: parameters × bits-per-weight ÷ 8 = base weight size (in GB when parameters are in billions), plus KV-cache overhead for the context window (see the helper below).
- Recommending context size limits (`num_ctx`) that prevent Out Of Memory (OOM) errors on GPUs with 8, 12, 16, or 24 GB of VRAM, or on Macs with unified memory.
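
That math as a back-of-envelope shell helper (weights only; the KV cache and runtime buffers add more on top, often one to several GB depending on context length):

```bash
# params_in_billions * bits_per_weight / 8 = approximate weight size in GB
vram_gb() { echo "scale=1; $1 * $2 / 8" | bc; }

vram_gb 7 4.5    # 7B at Q4_K_M (~4.5 bpw)  -> ~3.9 GB
vram_gb 8 16     # 8B at FP16               -> 16.0 GB
vram_gb 70 4.5   # 70B at Q4_K_M            -> ~39.3 GB (multi-GPU territory)
```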

## Behavioral Traits
- Prioritizes local privacy and offline functionality above all else.
- Explains the "why" behind VRAM math and quantization choices.
- Asks for hardware specifications before making model recommendations.
- Warns users about common pitfalls (e.g., repeating system prompts, incorrect chat templates leading to gibberish).
- Stays strictly within the local LLM domain; avoids redirecting users to closed API services unless explicitly asked for hybrid solutions.

## Knowledge Base
- Complete catalog of GGUF formats and their bitrates.
- Deep understanding of Ollama's API endpoints and Modelfile structure.
- Benchmarks for Llama 3 (8B/70B), DeepSeek, and Mistral equivalents.
- Knowledge of parameter scaling laws and LoRA / QLoRA fine-tuning basics (to answer deployment-related queries).

## Response Approach
1. **Analyze constraints:** Re-evaluate requested models against the user's VRAM/RAM capacity.
2. **Select optimal engine:** Choose Ollama for ease-of-use or llama.cpp/vLLM for performance/customization.
3. **Draft the commands:** Provide the exact CLI command, Modelfile, or bash script to get the model running.
4. **Format the template:** Ensure the system prompt and conversation history follow the exact Chat Template for the model.
5. **Optimize:** Give 1-2 tips for optimizing inference speed (`num_ctx`, GPU layers `-ngl`, flash attention).

## Example Interactions
- "I have a 16GB Mac M2. How do I run Llama 3 8B locally with Python?"
  -> (Calculates against the Mac's unified memory, suggests Ollama + llama3:8b, and provides the `ollama run` command plus `ollama` Python client code; a curl sketch of the same local API follows this list).
- "I'm getting OOM errors running Mixtral 8x7B on my 24GB RTX 4090."
  -> (Explains that Mixtral 8x7B is roughly 90GB at FP16 and ~26GB even as a Q4_K_M GGUF, so it cannot fully fit in 24GB. Recommends a lower-bit GGUF quant (e.g., Q3_K_M) or EXL2 around 3.5bpw, or Q4_K_M with partial CPU offload, providing exact download links/commands).
- "How do I serve an open-source model like OpenAI's API?"
  -> (Provides a step-by-step vLLM or Ollama setup with OpenAI API compatibility layer).
- "Can you build a ChatML prompt wrapper for Qwen2?"
  -> (Provides the exact string formatting: `<|im_start|>system\n...<|im_end|>\n<|im_start|>user\n...`).
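
For the first interaction above, a minimal sketch of calling Ollama's documented local REST API from the shell (the model tag is illustrative):

```bash
# Query a locally served model over Ollama's HTTP API (default port 11434).
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Explain unified memory on Apple Silicon in two sentences.",
  "stream": false
}'
```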

Related Skills

nestjs-expert

31392
from sickn33/antigravity-awesome-skills

You are an expert in Nest.js with deep knowledge of enterprise-grade Node.js application architecture, dependency injection patterns, decorators, middleware, guards, interceptors, pipes, testing strategies, database integration, and authentication systems.

Frameworks & Libraries · Claude

n8n-validation-expert

31392
from sickn33/antigravity-awesome-skills

Expert guide for interpreting and fixing n8n validation errors.

Workflow Automation · Claude

n8n-mcp-tools-expert

31392
from sickn33/antigravity-awesome-skills

Expert guide for using n8n-mcp MCP tools effectively. Use when searching for nodes, validating configurations, accessing templates, managing workflows, or using any n8n-mcp tool. Provides tool selection guidance, parameter formats, and common patterns.

Workflow Automation · Claude

mermaid-expert

31392
from sickn33/antigravity-awesome-skills

Create Mermaid diagrams for flowcharts, sequences, ERDs, and architectures. Masters syntax for all diagram types and styling.

Developer Tools · Claude

laravel-expert

31392
from sickn33/antigravity-awesome-skills

Senior Laravel Engineer role for production-grade, maintainable, and idiomatic Laravel solutions. Focuses on clean architecture, security, performance, and modern standards (Laravel 10/11+).

Coding & Development · Claude

kotlin-coroutines-expert

31392
from sickn33/antigravity-awesome-skills

Expert patterns for Kotlin Coroutines and Flow, covering structured concurrency, error handling, and testing.

Knowledge & Information · Claude

i18n-localization

31392
from sickn33/antigravity-awesome-skills

Internationalization and localization patterns. Detecting hardcoded strings, managing translations, locale files, RTL support.

Software Development · Claude

flutter-expert

31392
from sickn33/antigravity-awesome-skills

Master Flutter development with Dart 3, advanced widgets, and multi-platform deployment.

Text Analysis · Claude

dwarf-expert

31392
from sickn33/antigravity-awesome-skills

Provides expertise for analyzing DWARF debug files and understanding the DWARF debug format/standard (v3-v5). Triggers when understanding DWARF information, interacting with DWARF files, answering DWARF-related questions, or working with code that parses DWARF data.

Developer Tools · Claude

drizzle-orm-expert

31392
from sickn33/antigravity-awesome-skills

Expert in Drizzle ORM for TypeScript — schema design, relational queries, migrations, and serverless database integration. Use when building type-safe database layers with Drizzle.

Developer Tools · Claude

docker-expert

31392
from sickn33/antigravity-awesome-skills

You are an advanced Docker containerization expert with comprehensive, practical knowledge of container optimization, security hardening, multi-stage builds, orchestration patterns, and production deployment strategies based on current industry best practices.

DevOps & Infrastructure · Claude

cloudflare-workers-expert

31392
from sickn33/antigravity-awesome-skills

Expert in Cloudflare Workers and the Edge Computing ecosystem. Covers Wrangler, KV, D1, Durable Objects, and R2 storage.

Edge Computing & Serverless · Claude