remote-ollama-gpu-scheduler

Schedules remote Ollama GPU resources to run batch embedding or inference jobs, improving compute utilization across multi-machine environments.

33 stars

Best use case

remote-ollama-gpu-scheduler is best used when you need a repeatable AI agent workflow instead of a one-off prompt: it schedules remote Ollama GPU resources to run batch embedding or inference jobs, improving compute utilization across multi-machine environments.

Teams using remote-ollama-gpu-scheduler should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/remote-ollama-gpu-scheduler/SKILL.md --create-dirs "https://raw.githubusercontent.com/aAAaqwq/AGI-Super-Team/main/skills/remote-ollama-gpu-scheduler/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/remote-ollama-gpu-scheduler/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How remote-ollama-gpu-scheduler Compares

| Feature / Agent | remote-ollama-gpu-scheduler | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

It schedules remote Ollama GPU resources to run batch embedding or inference jobs, improving compute utilization across multi-machine environments.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Remote Ollama GPU Scheduler

A skill for efficiently scheduling remote Ollama GPU compute for batch embedding.

## Core Problem

**Node.js fetch does not honor NO_PROXY**, so Tailscale traffic gets intercepted by the local proxy:
```bash
# ❌ This does not work! Node ignores NO_PROXY
NO_PROXY=100.0.0.0/8 node -e "fetch('http://100.x.x.x:11434')..."

# ✅ You must unset the global proxy variables
unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy
# ...or bypass the proxy via undici's setGlobalDispatcher (sketch below)
```
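
If unsetting the proxy variables globally is undesirable, the proxy can be bypassed per process instead. A minimal sketch of the undici route, assuming Node >= 18 (where the global fetch is backed by undici) and the `undici` package installed; the address is the Mac Studio endpoint used throughout this doc:

```javascript
import { setGlobalDispatcher, Agent } from "undici";

// A plain Agent dials connections directly, so fetch no longer routes
// Tailscale traffic through the local HTTP(S) proxy.
setGlobalDispatcher(new Agent());

const resp = await fetch("http://100.65.110.126:11434/api/tags");
console.log(await resp.json()); // should list the remote models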

## Architecture Options

### Option 1: Ollama Remote Backend (stable but slow)
```
┌─────────┐    Tailscale     ┌────────────┐
│  Linux  │ ───────────────▶ │ Mac Studio │
│  QMD    │   NO_PROXY       │  Ollama    │
│         │   100.0.0.0/8    │  0.6b/8b   │
└─────────┘                  └────────────┘
```

**Configuration**:
```bash
export QMD_EMBED_BACKEND=ollama
export QMD_OLLAMA_EMBED_URL=http://100.65.110.126:11434
export QMD_OLLAMA_EMBED_MODEL=qwen3-embedding:0.6b  # or :8b
```
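
Before committing to a multi-hour run, it is worth sanity-checking the remote endpoint and the vector width it returns. A minimal sketch against Ollama's /api/embed endpoint (run with the proxy variables cleared as described above):

```javascript
const base = process.env.QMD_OLLAMA_EMBED_URL; // http://100.65.110.126:11434
const resp = await fetch(`${base}/api/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "qwen3-embedding:0.6b", input: "ping" }),
});
const data = await resp.json();
console.log(data.embeddings[0].length); // expect 1024 for 0.6b, 4096 for 8b
```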

**Performance**:
- 0.6b: ~2.5 chunks/s (1024-dim)
- 8b: ~1.1 chunks/s (4096-dim)
- 73k chunks: ~8 h with 0.6b (73,000 / 2.5 ≈ 29,200 s) or ~19 h with 8b (73,000 / 1.1 ≈ 66,400 s)

### Option 2: llama-server Backend (fast but unstable) ⚠️
```
┌─────────┐    Tailscale     ┌────────────┐
│  Linux  │ ───────────────▶ │ Mac Studio │
│  QMD    │   /v1/embeddings │llama-server│
│         │   OpenAI format  │  Flash Attn│
└─────────┘                  └────────────┘
```

**Configuration**:
```bash
export QMD_EMBED_BACKEND=llamaserver
export QMD_LLAMASERVER_URL=http://100.65.110.126:8081
export QMD_LLAMASERVER_CONCURRENCY=4
```

**Known issues**:
- llama.cpp b8352 + Qwen3-0.6b embedding crashes when parallel > 1
- Error: `GGML_ASSERT(i01 >= 0 && i01 < ne01) failed`
- Workaround: use `--parallel 1` or switch to llama.cpp b8200

## Mac Studio llama-server Launch Commands

```bash
# Path to the 0.6b model blob
MODEL=/Users/daniel/.ollama/models/blobs/sha256-06507c7b42688469c4e7298b0a1e16deff06caf291cf0a5b278c308249c3e439

# Stable configuration (parallel=1)
/tmp/llama/llama-b8352/llama-server \
  -m $MODEL \
  --embedding \
  --port 8081 \
  --host 0.0.0.0 \
  -ngl 99 \
  --parallel 1 \
  -c 4096 \
  -b 512

# Aggressive configuration (may crash) ⚠️
/tmp/llama/llama-b8352/llama-server \
  -m $MODEL \
  --embedding \
  --port 8081 \
  --host 0.0.0.0 \
  -ngl 99 \
  --parallel 8 \
  -c 16384 \
  -b 2048
```
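
Once the server is up, a quick smoke test from the Linux box confirms the OpenAI-compatible endpoint end to end; this is the same request and response shape the QMD patch below relies on:

```javascript
const resp = await fetch("http://100.65.110.126:8081/v1/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "qwen3-embedding", input: "hello world" }),
});
const data = await resp.json();
console.log(data.data[0].embedding.length); // expect 1024 for the 0.6b model
```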

## Performance Benchmarks (Mac Studio M4 Max)

| Backend | Model | Parallel | Speed | Dims | Stability |
|------|------|--------|------|------|--------|
| Ollama | qwen3:0.6b | 1 | 2.5 c/s | 1024 | ✅ |
| Ollama | qwen3:8b | 1 | 1.1 c/s | 4096 | ✅ |
| llama-server | 0.6b | 1 | ~15 c/s* | 1024 | ⚠️ |
| llama-server | 0.6b | 8 | ~120 c/s* | 1024 | ❌ crash |

*Estimated; in actual testing the server crashes whenever parallel > 1.

## QMD Embedding Code Changes

Add a `llamaserver` backend in `/path/to/qmd/dist/llm.js`:

```javascript
// --- llama-server Backend (OpenAI-compatible) ---
const _LLAMASERVER_URL = process.env.QMD_LLAMASERVER_URL || "";
const _LLAMASERVER_CONCURRENCY = parseInt(process.env.QMD_LLAMASERVER_CONCURRENCY || "4", 10);

async function _llamaserverEmbed(text) {
    const url = `${_LLAMASERVER_URL}/v1/embeddings`;
    const resp = await fetch(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model: "qwen3-embedding", input: text }),
        signal: AbortSignal.timeout(300000),
    });
    const data = await resp.json();
    const vec = data?.data?.[0]?.embedding;
    return vec ? { embedding: vec, model: "llamaserver:qwen3" } : null;
}

async function _llamaserverEmbedBatch(texts) {
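    // Split the batch into at most CONC sub-batches and embed them
    // concurrently; llama-server works the requests in parallel slots.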
    const CONC = _LLAMASERVER_CONCURRENCY;
    const subSize = Math.ceil(texts.length / CONC);
    const subBatches = [];
    for (let i = 0; i < texts.length; i += subSize) {
        subBatches.push(texts.slice(i, i + subSize));
    }
    
    const results = await Promise.all(subBatches.map(async (batch) => {
        const resp = await fetch(`${_LLAMASERVER_URL}/v1/embeddings`, {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({ model: "qwen3-embedding", input: batch }),
            signal: AbortSignal.timeout(600000),
        });
        const data = await resp.json();
        return (data?.data || []).sort((a,b) => a.index - b.index)
            .map(d => d.embedding ? { embedding: d.embedding, model: "llamaserver:qwen3" } : null);
    }));
    
    return results.flat();
}
```
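
How these helpers get selected depends on QMD's existing backend switch; the names below (`embedBatch`, `_ollamaEmbedBatch`) are placeholders for whatever dist/llm.js actually uses, so treat this as a hypothetical wiring sketch rather than the real dispatch code:

```javascript
const _EMBED_BACKEND = process.env.QMD_EMBED_BACKEND || "ollama";

// Hypothetical dispatch: route batches to the llama-server helpers when
// QMD_EMBED_BACKEND=llamaserver, otherwise keep the existing Ollama path.
async function embedBatch(texts) {
    if (_EMBED_BACKEND === "llamaserver" && _LLAMASERVER_URL) {
        return _llamaserverEmbedBatch(texts);
    }
    return _ollamaEmbedBatch(texts); // placeholder for the existing path
}
```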

## Troubleshooting

### 1. fetch failed / connection refused
```bash
# Check for proxy variables
env | grep -i proxy

# Check Tailscale connectivity
tailscale status | grep studio

# Test a direct connection
curl --noproxy "*" http://100.65.110.126:11434/api/tags
```

### 2. Embedding dimension mismatch
```
Error: dimension mismatch (expected 4096, got 1024)
```
Fix: run `qmd embed -f` to force a full index rebuild.

### 3. llama-server crashes
```bash
# Inspect the crash log
tail -20 /tmp/llama-server.log | grep -E 'ASSERT|exception|abort'

# Fall back to conservative parameters
--parallel 1 -c 4096 -b 512
```

## Recommended Configuration

**Stability first**: use the Ollama backend with the 0.6b model
```bash
export QMD_EMBED_BACKEND=ollama
export QMD_OLLAMA_EMBED_URL=http://100.65.110.126:11434
export QMD_OLLAMA_EMBED_MODEL=qwen3-embedding:0.6b
```

**Speed first**: switch to the llamaserver backend once the llama.cpp crash is fixed

---

*Created: 2026-03-15*
*Use cases: batch embedding, knowledge-base construction, vector search*

Related Skills

All of the following are from aAAaqwq/AGI-Super-Team:

  • remote-openclaw-deploy: General-purpose remote deployment of OpenClaw Agent projects. Supports arbitrary custom agent teams, macOS/Linux, and multiple channels (Feishu/Telegram/Discord), with declarative configuration injected via deploy.json. One script covers the entire flow from zero to working.
  • remote-access: ttyd + Tailscale for mobile terminal access.
  • wemp-operator: Full-featured WeChat Official Account operations: API wrappers for drafts, publishing, comments, users, media assets, bulk messaging, statistics, menus, and QR codes.
  • zsxq-smart-publish: Publish and manage content on 知识星球 (zsxq.com). Supports talk posts, Q&A, long articles, file sharing, digest/bookmark, homework tasks, and tag management. Use when publishing content to 知识星球, creating/editing posts, uploading files/images/audio, managing digests, batch publishing, or formatting content for 知识星球.
  • zoom-automation: Automate Zoom meeting creation, management, recordings, webinars, and participant tracking via Rube MCP (Composio). Always search tools first for current schemas.
  • zoho-crm-automation: Automate Zoho CRM tasks via Rube MCP (Composio): create/update records, search contacts, manage leads, and convert leads. Always search tools first for current schemas.
  • ziliu-publisher: Ziliu (字流), an AI-driven multi-platform content distribution tool: write once, auto-adapt the layout, and publish in one click to 16+ platforms (WeChat Official Accounts, Zhihu, Xiaohongshu, Bilibili, Douyin, Weibo, X, etc.). Use when users need multi-platform publishing, content layout, or format adaptation. Trigger words: 字流, ziliu, multi-platform publishing, one-click distribution, content distribution, layout publishing.
  • zhihu-post-skill: Zhihu article publishing, automating content creation and publishing on the Zhihu platform.
  • zendesk-automation: Automate Zendesk tasks via Rube MCP (Composio): tickets, users, organizations, replies. Always search tools first for current schemas.
  • youtube-knowledge-extractor: Multimodal YouTube video analysis through both information channels, audio (transcript) and visual (frame extraction + image analysis). Especially powerful for HowTo videos, tutorials, demos, and explainer videos where what is SHOWN (screenshots, UI demos, diagrams, code, physical actions) is just as important as what is SAID. Use whenever a user wants to analyze, summarize, or create step-by-step guides from YouTube videos, or shares a YouTube URL with analysis intent.
  • youtube-factory: Generate complete YouTube videos from a single prompt: script, voiceover, stock footage, captions, thumbnail. Self-contained, no external modules, 100% free tools.
  • youtube-automation: Automate YouTube tasks via Rube MCP (Composio): upload videos, manage playlists, search content, get analytics, and handle comments. Always search tools first for current schemas.