Build Your Model Serving Skill

Create your model-serving skill from Ollama documentation before learning deployment theory

16 stars

Best use case

Build Your Model Serving Skill is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Create your model-serving skill from Ollama documentation before learning deployment theory

Teams using Build Your Model Serving Skill should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/build-your-model-serving-skill/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/build-your-model-serving-skill/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/build-your-model-serving-skill/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How Build Your Model Serving Skill Compares

Feature / Agent	Build Your Model Serving Skill	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Create your model-serving skill from Ollama documentation before learning deployment theory

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Build Your Model Serving Skill

You have a fine-tuned model from Chapter 69. Now you need to deploy it so real users can interact with it. But here is the pattern that separates effective AI-native developers from those who struggle: **build your skill first, then learn the technology**.

In traditional learning, you study deployment options, configure servers, troubleshoot errors, and hope you remember the patterns later. In Skill-First learning, you create a reusable intelligence asset before you even understand the technology deeply. This asset grows with you as you learn, and by the end of the chapter, you own a production-ready skill you can sell or deploy.

This lesson follows the same pattern you used in Part 6, Part 7, and earlier Part 8 chapters. Clone a fresh skills-lab, fetch official documentation, and build your `model-serving` skill from authoritative sources rather than memory.

## Why Skill-First for Model Serving?

Model serving involves multiple components: export formats (GGUF, safetensors), quantization levels (Q4_K_M, Q8_0), inference servers (Ollama, vLLM), and performance tuning (batch sizes, context lengths, memory management). Trying to memorize all configuration options is futile. But encoding it into a skill that you can invoke, test, and improve makes this knowledge permanently accessible and actionable.

| Traditional Approach | Skill-First Approach |
|---------------------|---------------------|
| Read docs, forget configs | Build skill, query anytime |
| Scattered Stack Overflow links | Single authoritative source |
| Knowledge decays over time | Skill improves with use |
| Cannot delegate to AI | AI executes your skill |

By the end of this chapter, your `model-serving` skill will guide you through:
- Model export and format conversion
- Quantization selection for your hardware
- Ollama installation and configuration
- REST API integration with Python
- Performance optimization for latency targets

## Step 1: Clone a Fresh Skills-Lab

Start with a clean environment. This prevents state from previous experiments from affecting your work.

```bash
# Clone the skills-lab repository
git clone https://github.com/panaversity/skills-lab.git ~/skills-lab-ch70

# Navigate to the directory
cd ~/skills-lab-ch70

# Create the skill directory structure
mkdir -p .claude/skills/model-serving
```

**Output:**
```
Cloning into '/Users/you/skills-lab-ch70'...
```

## Step 2: Write Your LEARNING-SPEC.md

Before creating the skill, define what you are trying to accomplish. This specification guides both your learning and the skill you create.

```markdown
# LEARNING-SPEC.md

## What I Want to Learn
Local model serving using Ollama with GGUF models on consumer hardware
(8GB+ RAM, optional GPU).

## Why This Matters
I want to deploy my fine-tuned Task API model locally with:
- Fast response times (<500ms latency)
- No cloud dependency for inference
- REST API for integration with existing applications
- Cost-effective serving without GPU rental fees

## Success Criteria
1. I can export models to GGUF format with appropriate quantization
2. I can configure Ollama to serve custom models
3. I can achieve <500ms latency on consumer hardware
4. I can integrate with Python applications via REST API
5. My skill accurately reflects official Ollama documentation

## Constraints
- Must work on consumer hardware (8GB+ RAM minimum)
- Must use Ollama for local serving
- Must produce REST API endpoints
- Should support both CPU and GPU inference

## Running Example
Deploy the Task API model (fine-tuned in Chapter 64-69) via Ollama
with REST API access for task management applications.
```

Save this file in your skills-lab directory.

## Step 3: Fetch Official Documentation

The skill must be grounded in official documentation, not AI memory which may be outdated or hallucinated.

Use Claude Code or your AI assistant:

```
/fetching-library-docs ollama

Fetch the official Ollama documentation covering:
1. Model import and Modelfile syntax
2. REST API endpoints (/api/generate, /api/chat)
3. Performance tuning options
4. GGUF format requirements
```

Key sources to reference:
- [Ollama GitHub Repository](https://github.com/ollama/ollama)
- [Ollama API Documentation](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Ollama Modelfile Reference](https://github.com/ollama/ollama/blob/main/docs/modelfile.md)
- [llama.cpp GGUF Format](https://github.com/ggerganov/llama.cpp)

## Step 4: Create Your model-serving Skill

Based on the documentation, create your skill file. Here is a starter template:

```markdown
---
name: model-serving
description: This skill should be used when deploying and serving LLM models locally. Use when exporting to GGUF, configuring Ollama, setting up REST APIs, and optimizing inference performance.
---

# Model Serving Skill

## Purpose

Guide local deployment and serving of LLMs using Ollama with GGUF models
on consumer hardware for production-ready inference.

## When to Use This Skill

Invoke this skill when:
- Exporting fine-tuned models to GGUF format
- Selecting quantization levels for target hardware
- Configuring Ollama with custom Modelfiles
- Setting up REST API endpoints for applications
- Optimizing inference for latency and throughput
- Troubleshooting serving issues

## Hardware Context

**Consumer Hardware (Target):**
- 8GB+ RAM minimum
- Optional GPU (NVIDIA/AMD/Apple Silicon)
- CPU inference fallback available

**Performance Targets:**
- First token: <200ms
- Total latency: <500ms
- Throughput: 10+ requests/second (with batching)

## Quantization Selection Guide

| Quantization | Size (7B) | Quality | Speed | Use Case |
|--------------|-----------|---------|-------|----------|
| Q4_K_M | ~4GB | Good | Fast | **Default choice** |
| Q5_K_M | ~5GB | Better | Moderate | Quality-sensitive |
| Q8_0 | ~8GB | Best | Slower | Maximum quality |

## Ollama REST API

### Generate Endpoint

POST http://localhost:11434/api/generate

### Chat Endpoint

POST http://localhost:11434/api/chat

## Troubleshooting

### Model Not Loading
1. Check GGUF file path is correct
2. Verify sufficient RAM available
3. Check Ollama logs for errors

### Slow Inference
1. Enable GPU if available
2. Reduce context length
3. Use more aggressive quantization
```

Save this to `.claude/skills/model-serving/SKILL.md`.

## Step 5: Verify Your Skill

Test that your skill was created correctly:

```bash
# Check the skill exists
ls -la .claude/skills/model-serving/

# View the skill content
head -50 .claude/skills/model-serving/SKILL.md
```

**Output:**
```
total 8
drwxr-xr-x 3 you staff 96 Jan 1 10:00 .
drwxr-xr-x 3 you staff 96 Jan 1 10:00 ..
-rw-r--r-- 1 you staff 2048 Jan 1 10:00 SKILL.md
```

## What Happens Next

You now have a `model-serving` skill that is grounded in official documentation. As you progress through this chapter:

| Lesson | How Your Skill Improves |
|--------|------------------------|
| L01: Export Formats | Add GGUF vs safetensors decision tree |
| L02: Quantization | Add detailed quality/speed tradeoffs |
| L03: Ollama Setup | Add platform-specific installation notes |
| L04: Local Serving | Add Python client patterns |
| L05: vLLM Theory | Add production architecture context |
| L06: Performance | Add latency optimization techniques |
| L07: Capstone | Validate skill produces working deployment |

Each lesson will include a "Reflect on Your Skill" section where you update and improve this skill based on what you learned.

## Try With AI

Use your AI companion (Claude, ChatGPT, Gemini, or similar).

### Prompt 1: Verify Skill Structure

```
I just created my model-serving skill for Ollama deployment. Review the
structure and tell me:
1. Does it follow the SKILL.md format correctly?
2. Is the content grounded in documentation (not hallucinated)?
3. What sections should I add as I learn more about model serving?

Here is my skill:
[paste your SKILL.md content]
```

**What you are learning**: Critical evaluation of your own skill structure. Your AI partner helps identify gaps before you invest time in an incomplete skill.

### Prompt 2: Connect to Your Hardware

```
I have [describe your hardware: M1 Mac with 16GB RAM / Windows PC with RTX 3060 /
Linux server with 32GB RAM]. Looking at my model-serving skill, what
hardware-specific optimizations should I add? What quantization level
would you recommend for my setup?
```

**What you are learning**: Hardware-aware optimization. Model serving is not one-size-fits-all. Your AI partner helps you anticipate hardware-specific challenges.

### Prompt 3: Validate Against Official Docs

```
Compare my skill's Ollama configuration recommendations against the official
Ollama documentation. Are there any discrepancies? Any best practices
I should add?

Specifically check:
1. Modelfile syntax
2. REST API endpoints
3. Performance tuning options
```

**What you are learning**: Documentation verification. You are building the habit of validating AI-generated content against authoritative sources.

### Safety Note

As you create skills from documentation, remember that AI tools may not have the most current information. Always verify critical configuration values against the official source. The Ollama documentation is updated regularly as the project evolves.

Related Skills

ios-foundation-models-diag

from diegosouzapw/awesome-omni-skill

Use when debugging Foundation Models issues — context exceeded, guardrail violations, slow generation, availability problems, unsupported language, or unexpected output. Systematic diagnostics with production crisis defense.

feedback-observing

from diegosouzapw/awesome-omni-skill

Read agent interaction logs across packs and detect operational patterns. Emits state-transition FeedbackEvents for degradation/recovery.

fair-data-model-assessment

from diegosouzapw/awesome-omni-skill

Assess data models against FAIR principles using RDA-FDMM indicators. Use when: (1) Evaluating vendor-delivered data models for FAIR compliance, (2) Reviewing schemas, ontologies, or data dictionaries before integration, (3) Creating FAIR assessment reports for data governance reviews, (4) Preparing data model documentation for enterprise or regulatory standards, (5) Auditing existing data assets for FAIRness gaps. Covers 41 RDA indicators across Findable, Accessible, Interoperable, Reusable dimensions with maturity scoring (0-4 scale).

data-model

from diegosouzapw/awesome-omni-skill

Generate comprehensive data model documentation with ERD, DTOs, and data flow diagrams

data-model-creation

from diegosouzapw/awesome-omni-skill

Professional rules for AI-driven data modeling and creation. Use this skill when users need to create and manage MySQL databases, design data models using Mermaid ER diagrams, and implement database schemas.

building-with-llms

from diegosouzapw/awesome-omni-skill

Help users build effective AI applications. Use when someone is building with LLMs, writing prompts, designing AI features, implementing RAG, creating agents, running evals, or trying to improve AI output quality.

building-agents

from diegosouzapw/awesome-omni-skill

Expert at creating and modifying Claude Code agents (subagents). Auto-invokes when the user wants to create, update, modify, enhance, validate, or standardize agents, or when modifying agent YAML frontmatter fields (especially 'model', 'tools', 'description'), needs help designing agent architecture, or wants to understand agent capabilities. Also auto-invokes proactively when Claude is about to write agent files (*/agents/*.md), create modular agent architectures, or implement tasks that involve creating agent components.

Build Your Model Merging Skill

from diegosouzapw/awesome-omni-skill

No description provided.

Build Your LLMOps Decision Skill

from diegosouzapw/awesome-omni-skill

No description provided.

Build Your Data Engineering Skill

from diegosouzapw/awesome-omni-skill

Create your LLMOps data engineering skill in one prompt, then learn to improve it throughout the chapter

axiom-foundation-models

from diegosouzapw/awesome-omni-skill

Use when implementing on-device AI with Apple's Foundation Models framework — prevents context overflow, blocking UI, wrong model use cases, and manual JSON parsing when @Generable should be used. iOS 26+, macOS 26+, iPadOS 26+, axiom-visionOS 26+

avalonia-viewmodels-zafiro

from diegosouzapw/awesome-omni-skill

Optimal ViewModel and Wizard creation patterns for Avalonia using Zafiro and ReactiveUI.