runtime-skills

Universal Runtime best practices for PyTorch inference, Transformers models, and FastAPI serving. Covers device management, model loading, memory optimization, and performance tuning.

830 stars

by llama-farm


Best use case

runtime-skills is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using runtime-skills can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/runtime-skills/SKILL.md --create-dirs "https://raw.githubusercontent.com/llama-farm/llamafarm/main/.claude/skills/runtime-skills/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/runtime-skills/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How runtime-skills Compares

| Feature / Agent | runtime-skills | Standard Approach |
|-----------------|----------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Universal Runtime best practices for PyTorch inference, Transformers models, and FastAPI serving. Covers device management, model loading, memory optimization, and performance tuning.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Universal Runtime Skills

Best practices and code review checklists for the Universal Runtime - LlamaFarm's local ML inference server.

## Overview

The Universal Runtime provides OpenAI-compatible endpoints for HuggingFace models:
- Text generation (Causal LMs: GPT, Llama, Mistral, Qwen)
- Text embeddings (BERT, sentence-transformers, ModernBERT)
- Classification, NER, and reranking
- OCR and document understanding
- Anomaly detection

**Directory**: `runtimes/universal/`
**Python**: 3.11+
**Key Dependencies**: PyTorch, Transformers, FastAPI, llama-cpp-python

## Links to Shared Skills

This skill extends the shared Python practices. Always apply these first:

| Topic | File | Priority |
|-------|------|----------|
| Patterns | [python-skills/patterns.md](../python-skills/patterns.md) | Medium |
| Async | [python-skills/async.md](../python-skills/async.md) | High |
| Typing | [python-skills/typing.md](../python-skills/typing.md) | Medium |
| Testing | [python-skills/testing.md](../python-skills/testing.md) | Medium |
| Errors | [python-skills/error-handling.md](../python-skills/error-handling.md) | High |
| Security | [python-skills/security.md](../python-skills/security.md) | Critical |

## Runtime-Specific Checklists

| Topic | File | Key Points |
|-------|------|------------|
| PyTorch | [pytorch.md](pytorch.md) | Device management, dtype, memory cleanup |
| Transformers | [transformers.md](transformers.md) | Model loading, tokenization, inference |
| FastAPI | [fastapi.md](fastapi.md) | API design, streaming, lifespan |
| Performance | [performance.md](performance.md) | Batching, caching, optimizations |

## Architecture

```
runtimes/universal/
├── server.py              # FastAPI app, model caching, endpoints
├── core/
│   └── logging.py         # UniversalRuntimeLogger (structlog)
├── models/
│   ├── base.py            # BaseModel ABC with device management
│   ├── language_model.py  # Transformers text generation
│   ├── gguf_language_model.py  # llama-cpp-python for GGUF
│   ├── encoder_model.py   # Embeddings, classification, NER, reranking
│   └── ...                # OCR, anomaly, document models
├── routers/
│   └── chat_completions/  # Chat completions with streaming
├── utils/
│   ├── device.py          # Device detection (CUDA/MPS/CPU)
│   ├── model_cache.py     # TTL-based model caching
│   ├── model_format.py    # GGUF vs transformers detection
│   └── context_calculator.py  # GGUF context size computation
└── tests/
```
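
The tree lists `utils/model_format.py` as the place where GGUF and Transformers models are told apart, but the runtime's actual logic is not shown here. As a rough sketch of one way such a check could work for local paths, the example below relies on two facts: GGUF files begin with the four ASCII bytes `GGUF`, and Transformers checkpoints ship a `config.json`. The names `is_gguf_file`, `detect_format`, and `demo` are hypothetical and not taken from the LlamaFarm code.

```python
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def is_gguf_file(path: Path) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    try:
        with path.open("rb") as fh:
            return fh.read(4) == GGUF_MAGIC
    except OSError:
        return False


def detect_format(path: Path) -> str:
    """Classify a local model path (hypothetical helper, not the runtime's)."""
    if path.is_file():
        return "gguf" if is_gguf_file(path) else "unknown"
    if path.is_dir():
        # A directory counts as GGUF if it holds at least one valid .gguf file.
        if any(is_gguf_file(p) for p in sorted(path.glob("*.gguf"))):
            return "gguf"
        # Transformers checkpoints carry a config.json next to the weights.
        if (path / "config.json").is_file():
            return "transformers"
    return "unknown"


def demo() -> list:
    """Build throwaway fixtures and classify each of them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        gguf_file = root / "model.gguf"
        gguf_file.write_bytes(GGUF_MAGIC + b"\x03\x00\x00\x00")
        hf_dir = root / "hf-model"
        hf_dir.mkdir()
        (hf_dir / "config.json").write_text("{}")
        return [
            detect_format(gguf_file),
            detect_format(hf_dir),
            detect_format(root / "missing"),
        ]


if __name__ == "__main__":
    print(demo())
```

Checking the magic bytes rather than only the file extension avoids misclassifying a renamed or truncated download.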

## Key Patterns

### 1. Model Loading with Double-Checked Locking

```python
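import asyncio

# `_models` is the shared model cache (see pattern 3); `device` is assumed to
# be resolved once at startup.
_model_load_lock = asyncio.Lock()

async def load_encoder(model_id: str, task: str = "embedding"):
    cache_key = f"encoder:{task}:{model_id}"
    # Fast path: skip the lock entirely when the model is already cached.
    if cache_key not in _models:
        async with _model_load_lock:
            # Double-check after acquiring the lock: another coroutine may have
            # finished loading this model while we were waiting.
            if cache_key not in _models:
                model = EncoderModel(model_id, device, task=task)
                await model.load()
                _models[cache_key] = model
    return _models.get(cache_key)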
```
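
To see why the second check matters, the following self-contained sketch runs ten concurrent requests for the same model and counts how many loads actually happen. `FakeEncoder` and `EncoderRegistry` are hypothetical stand-ins written for this illustration (a plain dict replaces the runtime's cache, and the lock lives on the registry instead of at module level); only the locking shape mirrors the pattern above.

```python
import asyncio


class FakeEncoder:
    """Hypothetical stand-in for EncoderModel; load() simulates slow I/O."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id

    async def load(self) -> None:
        await asyncio.sleep(0.01)  # yields to the event loop, like a real load


class EncoderRegistry:
    """Illustrative holder for the cache, the lock, and a load counter."""

    def __init__(self) -> None:
        self.models = {}
        self.lock = asyncio.Lock()
        self.load_count = 0

    async def load_encoder(self, model_id: str) -> FakeEncoder:
        cache_key = f"encoder:embedding:{model_id}"
        if cache_key not in self.models:              # first check, no lock held
            async with self.lock:
                if cache_key not in self.models:      # second check, lock held
                    model = FakeEncoder(model_id)
                    await model.load()
                    self.load_count += 1
                    self.models[cache_key] = model
        return self.models[cache_key]


async def main() -> int:
    registry = EncoderRegistry()
    # Ten concurrent requests for the same model should trigger a single load.
    results = await asyncio.gather(
        *(registry.load_encoder("demo-model") for _ in range(10))
    )
    assert all(model is results[0] for model in results)
    return registry.load_count


if __name__ == "__main__":
    print(asyncio.run(main()))  # prints 1
```

Without the inner check, every coroutine that was waiting on the lock would reload the model once it got its turn.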

### 2. Device-Aware Tensor Operations

```python
from abc import ABC

import torch


class BaseModel(ABC):
    def get_dtype(self, force_float32: bool = False) -> torch.dtype:
        if force_float32:
            return torch.float32
        # Half precision on GPU backends; full precision on CPU.
        if self.device in ("cuda", "mps"):
            return torch.float16
        return torch.float32

    def to_device(self, tensor: torch.Tensor, dtype: torch.dtype | None = None) -> torch.Tensor:
        # Don't change dtype for integer tensors such as token IDs;
        # torch.long is an alias for torch.int64, so it needs no separate entry.
        if tensor.dtype in (torch.int32, torch.int64):
            return tensor.to(device=self.device)
        dtype = dtype or self.get_dtype()
        return tensor.to(device=self.device, dtype=dtype)
```

### 3. TTL-Based Model Caching

```python
import asyncio

_models: ModelCache[BaseModel] = ModelCache(ttl=300)  # 5 min TTL

async def _cleanup_idle_models():
    # Background task: periodically unload models whose TTL has expired.
    while True:
        await asyncio.sleep(CLEANUP_CHECK_INTERVAL)  # interval in seconds
        for cache_key, model in _models.pop_expired():
            await model.unload()
```
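
The implementation of `ModelCache` is not shown above, so the following is only a minimal sketch of how an idle-timeout cache with a `pop_expired()` method might behave. `TTLCache` is a hypothetical name, and the assumptions that `get()` refreshes the idle timer and that the clock is injectable (to make expiry testable without sleeping) are choices made for this illustration, not facts about LlamaFarm's class.

```python
import time
from typing import Callable, Dict, Generic, List, Optional, Tuple, TypeVar

T = TypeVar("T")


class TTLCache(Generic[T]):
    """Minimal idle-timeout cache; an illustrative stand-in for ModelCache."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._items: Dict[str, Tuple[T, float]] = {}  # key -> (value, last access)

    def __contains__(self, key: str) -> bool:
        return key in self._items

    def __setitem__(self, key: str, value: T) -> None:
        self._items[key] = (value, self._clock())

    def get(self, key: str) -> Optional[T]:
        """Return the value and refresh its idle timer, or None if absent."""
        entry = self._items.get(key)
        if entry is None:
            return None
        value, _ = entry
        self._items[key] = (value, self._clock())
        return value

    def pop_expired(self) -> List[Tuple[str, T]]:
        """Remove and return every entry idle for longer than the TTL."""
        now = self._clock()
        expired = [
            (key, value)
            for key, (value, seen) in self._items.items()
            if now - seen > self._ttl
        ]
        for key, _ in expired:
            del self._items[key]
        return expired


def demo() -> List[Tuple[str, str]]:
    fake_now = [0.0]                       # manually advanced fake clock
    cache: TTLCache[str] = TTLCache(ttl=300, clock=lambda: fake_now[0])
    cache["a"] = "model-a"
    cache["b"] = "model-b"
    fake_now[0] = 200.0
    cache.get("a")                         # touching "a" resets its idle timer
    fake_now[0] = 400.0
    return cache.pop_expired()             # only "b" has been idle for > 300 s


if __name__ == "__main__":
    print(demo())  # prints [('b', 'model-b')]
```

Returning the expired entries instead of silently dropping them lets the caller await each model's `unload()` and release device memory explicitly.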

### 4. Async Generation with Thread Pools

```python
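import asyncio
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# GGUF models use blocking llama-cpp-python calls, so run them in an executor.
# A single worker serializes generation for each model instance.
self._executor = ThreadPoolExecutor(max_workers=1)  # created in __init__

async def generate(self, messages, max_tokens=512, **kwargs):  # further parameters elided in the source
    loop = asyncio.get_running_loop()
    # run_in_executor forwards positional arguments only, so bind keyword
    # arguments with partial (the _generate_sync signature is assumed here).
    call = partial(self._generate_sync, messages, max_tokens=max_tokens, **kwargs)
    return await loop.run_in_executor(self._executor, call)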
```
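
The effect of offloading can be illustrated with a small standalone sketch. Here `blocking_generate` is a hypothetical stand-in for a blocking llama-cpp-python call, and `heartbeat` represents any other coroutine that needs the event loop while generation is in progress; neither name comes from the runtime.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def blocking_generate(prompt: str, max_tokens: int = 8) -> str:
    """Hypothetical blocking call; sleeps, then returns a truncated prompt."""
    time.sleep(0.05)
    return prompt[:max_tokens]


async def heartbeat(ticks: list) -> None:
    """Other work that shares the event loop with the generation request."""
    for _ in range(3):
        await asyncio.sleep(0.01)
        ticks.append(time.monotonic())


async def main() -> tuple:
    executor = ThreadPoolExecutor(max_workers=1)
    loop = asyncio.get_running_loop()
    ticks: list = []
    # Bind keyword arguments up front; run_in_executor passes positionals only.
    call = partial(blocking_generate, "hello world", max_tokens=5)
    try:
        text, _ = await asyncio.gather(
            loop.run_in_executor(executor, call),  # runs in the worker thread
            heartbeat(ticks),                      # runs on the event loop
        )
    finally:
        executor.shutdown(wait=True)
    return text, len(ticks)


if __name__ == "__main__":
    print(asyncio.run(main()))  # prints ('hello', 3)
```

Calling `blocking_generate` directly inside a coroutine would stall every other request for the duration of the call; the executor keeps the loop free to serve them.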

## Review Priority

When reviewing Universal Runtime code:

1. **Critical** - Security
   - Path traversal prevention in file endpoints
   - Input sanitization for model IDs

2. **High** - Memory & Device
   - Proper CUDA/MPS cache clearing on unload
   - `torch.no_grad()` for inference
   - Correct dtype for device

3. **Medium** - Performance
   - Model caching patterns
   - Batch processing where applicable
   - Streaming implementation

4. **Low** - Code Style
   - Consistent with patterns.md
   - Proper type hints
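
For the two Critical items in the list above, a reviewer might look for checks along the following lines. This is a sketch under assumptions: the allowed model ID shape (one or two path segments of letters, digits, `.`, `_`, `-`), the 200-character limit, and the helper names `validate_model_id` and `resolve_under` are illustrative choices, not the runtime's actual policy or API.

```python
import re
from pathlib import Path

# Assumed policy: "name" or "org/name", each segment starting with an
# alphanumeric character. Adjust to whatever IDs the deployment must accept.
_MODEL_ID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*(/[A-Za-z0-9][A-Za-z0-9._-]*)?")


def validate_model_id(model_id: str) -> str:
    """Return the ID unchanged if it matches the assumed policy, else raise."""
    if (
        len(model_id) > 200
        or ".." in model_id
        or not _MODEL_ID_RE.fullmatch(model_id)
    ):
        raise ValueError(f"invalid model id: {model_id!r}")
    return model_id


def resolve_under(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path inside base_dir, rejecting anything that escapes it."""
    base = base_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # Path.is_relative_to: Python 3.9+
        raise ValueError("path escapes the allowed directory")
    return candidate


if __name__ == "__main__":
    print(validate_model_id("sentence-transformers/all-MiniLM-L6-v2"))
    print(resolve_under(Path.cwd() / "files", "docs/report.txt"))
```

Resolving first and comparing afterwards matters: a string check for `..` alone misses absolute paths and symlinks that point outside the base directory.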

Related Skills

All of the following come from llama-farm/llamafarm (830 stars).

- typescript-skills: Shared TypeScript best practices for Designer and Electron subsystems.
- server-skills: Server-specific best practices for FastAPI, Celery, and Pydantic. Extends python-skills with framework-specific patterns.
- react-skills: React 18 patterns for LlamaFarm Designer. Covers components, hooks, TanStack Query, and testing.
- rag-skills: RAG-specific best practices for LlamaIndex, ChromaDB, and Celery workers. Covers ingestion, retrieval, embeddings, and performance.
- python-skills: Shared Python best practices for LlamaFarm. Covers patterns, async, typing, testing, error handling, and security.
- go-skills: Shared Go best practices for LlamaFarm CLI. Covers idiomatic patterns, error handling, and testing.
- generate-subsystem-skills: Generate specialized skills for each subsystem in the monorepo. Creates shared language skills and subsystem-specific checklists for high-quality AI code generation.
- config-skills: Configuration module patterns for LlamaFarm. Covers Pydantic v2 models, JSONSchema generation, YAML processing, and validation.
- common-skills: Best practices for the Common utilities package in LlamaFarm. Covers HuggingFace Hub integration, GGUF model management, and shared utilities.
- cli-skills: CLI best practices for LlamaFarm. Covers Cobra, Bubbletea, Lipgloss patterns for Go CLI development.
- electron-skills: Electron patterns for LlamaFarm Desktop. Covers main/renderer processes, IPC, security, and packaging.
- designer-skills: Designer subsystem patterns for LlamaFarm. Covers React 18, TanStack Query, TailwindCSS, and Radix UI.