stt-tts-service

Lightweight local speech-to-text and text-to-speech service for OpenClaw

16 stars

Best use case

stt-tts-service is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using stt-tts-service should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/stt-tts-service/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/development/stt-tts-service/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/stt-tts-service/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How stt-tts-service Compares

| Feature / Agent | stt-tts-service | Standard Approach |
|-----------------|-----------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Lightweight local speech-to-text and text-to-speech service for OpenClaw

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# STT-TTS Service

A lightweight, local speech-to-text (STT) and text-to-speech (TTS) service that runs on any device connected to your OpenClaw server. Perfect for voice-enabled workflows and flexible resource allocation.

## Features

- **Speech-to-Text**: Transcribe audio using faster-whisper (up to 4x faster than the reference OpenAI Whisper implementation)
- **Text-to-Speech**: Generate natural speech using piper-tts, with pyttsx3 as a fallback
- **100% Local**: No cloud APIs, works offline after initial model download
- **Flexible Deployment**: Run on any device - Raspberry Pi, laptop, or GPU server
- **HTTP API**: Simple REST endpoints for easy integration

## Quick Start

### Installation

```bash
# Clone or download this skill
cd stt-tts-service

# Install dependencies
pip install -r requirements.txt

# Start the service
python main.py
```

### Docker Deployment

```bash
docker build -t stt-tts-service .
docker run -p 8765:8765 stt-tts-service
```

## API Endpoints

### POST /stt - Speech to Text

Transcribe audio files to text.

```bash
curl -X POST http://localhost:8765/stt \
  -F "audio=@recording.wav"
```

**Response:**
```json
{
  "text": "Hello, this is the transcribed text.",
  "language": "en",
  "duration": 3.5
}
```
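The same upload can be scripted from Python. The sketch below uses only the standard library and assumes the form field is named `audio`, as in the curl example, and that the response carries the fields shown above. The helper names `encode_multipart` and `transcribe` are illustrative and not part of the service.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(field, filename, data, mime="audio/wav"):
    """Build a multipart/form-data body holding a single file field."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path, base_url="http://localhost:8765"):
    """POST an audio file to /stt and return the decoded JSON response."""
    with open(path, "rb") as audio_file:
        body, content_type = encode_multipart(
            "audio", os.path.basename(path), audio_file.read()
        )
    request = urllib.request.Request(
        f"{base_url}/stt",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the service running, `transcribe("recording.wav")["text"]` would give the transcribed string, assuming the response shape above.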

### POST /tts - Text to Speech

Convert text to audio.

```bash
curl -X POST http://localhost:8765/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "default"}' \
  --output speech.wav
```

**Parameters:**
- `text` (required): Text to synthesize
- `voice` (optional): Voice ID to use
- `speed` (optional): Speech rate multiplier (0.5-2.0)
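A small client-side helper can catch invalid parameters before a request is sent. This is a minimal sketch: `build_tts_payload` is a hypothetical name, and whether the server itself rejects or clamps out-of-range `speed` values is not stated here, so the range check below is an assumption applied on the client.

```python
import json


def build_tts_payload(text, voice=None, speed=None):
    """Return the JSON body for POST /tts, omitting unset optional fields."""
    if not text or not text.strip():
        raise ValueError("text is required")
    payload = {"text": text}
    if voice is not None:
        payload["voice"] = voice
    if speed is not None:
        # The documented range for the rate multiplier is 0.5-2.0.
        if not 0.5 <= speed <= 2.0:
            raise ValueError("speed must be between 0.5 and 2.0")
        payload["speed"] = speed
    return json.dumps(payload)


print(build_tts_payload("Hello world", voice="default", speed=1.25))
# {"text": "Hello world", "voice": "default", "speed": 1.25}
```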

### GET /health

Health check endpoint.

```bash
curl http://localhost:8765/health
```
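Because the first start may download a model, a script that launches the service may want to wait until `/health` answers. The sketch below polls the endpoint and treats any HTTP 200 as healthy. That is an assumption, since the response body of `/health` is not documented here. `http_ok` and `wait_until_healthy` are illustrative names.

```python
import time
import urllib.error
import urllib.request


def http_ok(url):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(url, timeout=30.0, interval=0.5, probe=http_ok):
    """Poll url until probe succeeds or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would use it as `wait_until_healthy("http://localhost:8765/health")` before sending the first `/stt` or `/tts` request. The `probe` parameter exists so the polling loop can be exercised without a running server.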

### GET /models

List available models and voices.

```bash
curl http://localhost:8765/models
```

## WebSocket Streaming (Real-time Voice)

For real-time voice conversations, use WebSocket endpoints:

### WS /ws/stt - Streaming Speech-to-Text

Stream audio and receive transcriptions in real-time.

```javascript
const ws = new WebSocket('ws://localhost:8765/ws/stt');

// Receive transcriptions
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(data.text);  // Transcribed text
};

// Sending before the connection is open throws, so wait for onopen
ws.onopen = () => {
  // Send audio chunks (16kHz, 16-bit, mono PCM)
  ws.send(audioBuffer);

  // Flush remaining audio
  ws.send(JSON.stringify({action: "flush"}));
};
```
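The audio format above (16kHz, 16-bit, mono PCM) fixes how many bytes each chunk holds: 16,000 samples/s × 2 bytes = 32,000 bytes per second. A minimal sketch of splitting a raw PCM buffer into fixed-duration chunks follows. The 100 ms chunk length is an arbitrary choice for illustration, not a documented requirement, and the function names are hypothetical.

```python
SAMPLE_RATE = 16_000   # samples per second (16kHz)
BYTES_PER_SAMPLE = 2   # 16-bit samples
CHANNELS = 1           # mono


def chunk_size_bytes(chunk_ms):
    """Number of bytes in chunk_ms milliseconds of 16kHz 16-bit mono PCM."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def iter_chunks(pcm, chunk_ms=100):
    """Yield consecutive chunks of raw PCM; the last one may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)  # one second of silence
chunks = list(iter_chunks(one_second, chunk_ms=100))
print(chunk_size_bytes(100), len(chunks))  # 3200 10
```

Each yielded chunk would be sent as one binary WebSocket message, followed by the `flush` action once the buffer is exhausted.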

### WS /ws/tts - Streaming Text-to-Speech

Send text and receive audio chunks in real-time.

```javascript
const ws = new WebSocket('ws://localhost:8765/ws/tts');

// Send text to synthesize once the connection is open
ws.onopen = () => {
  ws.send(JSON.stringify({text: "Hello world"}));
};

// Receive audio chunks
ws.onmessage = (event) => {
  if (event.data instanceof Blob) {
    // Audio chunk - play it
    playAudio(event.data);
  }
};
```

### WS /ws/voice - Full Duplex Voice Conversation

Stream audio input and receive audio output for real-time voice-to-voice.

```javascript
const ws = new WebSocket('ws://localhost:8765/ws/voice');

// Stream microphone audio once the connection is open
ws.onopen = () => {
  navigator.mediaDevices.getUserMedia({audio: true})
    .then(stream => {
      // Send audio chunks to WebSocket
    });
};

// Handle responses
ws.onmessage = (event) => {
  if (typeof event.data !== "string") {
    // Binary message: presumably synthesized audio to play back
    return;
  }
  const data = JSON.parse(event.data);
  if (data.type === "transcript") {
    // User's speech transcribed - send to your AI
    sendToAI(data.text);
  }
};

// Send AI response to be spoken (call after the socket has opened)
function speak(aiResponse) {
  ws.send(JSON.stringify({action: "speak", text: aiResponse}));
}
```

## Configuration

Set environment variables or edit `config.py`:

| Variable | Default | Description |
|----------|---------|-------------|
| `STT_MODEL` | `base` | Whisper model: tiny, base, small, medium |
| `TTS_ENGINE` | `auto` | TTS engine: piper, pyttsx3, auto |
| `DEVICE` | `auto` | Compute device: cpu, cuda, auto |
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8765` | Server port |
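The contents of `config.py` are not shown here, so the following is only a sketch of how the variables in the table could be read with their defaults and validated against the listed choices. `load_config`, `DEFAULTS`, and `ALLOWED` are hypothetical names and may not match the actual file.

```python
import os

# Defaults and allowed values copied from the table above.
DEFAULTS = {
    "STT_MODEL": "base",
    "TTS_ENGINE": "auto",
    "DEVICE": "auto",
    "HOST": "0.0.0.0",
    "PORT": "8765",
}
ALLOWED = {
    "STT_MODEL": {"tiny", "base", "small", "medium"},
    "TTS_ENGINE": {"piper", "pyttsx3", "auto"},
    "DEVICE": {"cpu", "cuda", "auto"},
}


def load_config(env=None):
    """Read settings from env (os.environ by default), applying defaults."""
    env = os.environ if env is None else env
    config = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    for name, allowed in ALLOWED.items():
        if config[name] not in allowed:
            raise ValueError(f"{name}={config[name]!r} is not one of {sorted(allowed)}")
    config["PORT"] = int(config["PORT"])  # environment values are strings
    return config


print(load_config({"STT_MODEL": "small", "PORT": "9000"}))
```

Passing a plain dict instead of `os.environ` keeps the function easy to test. In a shell, the equivalent would be something like `STT_MODEL=small PORT=9000 python main.py`.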

## Model Sizes

| STT Model | Size | Speed | Accuracy |
|-----------|------|-------|----------|
| tiny | ~75MB | Fastest | Basic |
| base | ~150MB | Fast | Good |
| small | ~500MB | Medium | Better |
| medium | ~1.5GB | Slower | Best |

## OpenClaw Integration

Register this service with your OpenClaw server:

```bash
openclaw service register http://device-ip:8765
```

Then use it in your workflows:
```yaml
- action: stt
  input: ${audio_file}
  output: transcription
  
- action: tts
  input: "Hello, ${user_name}!"
  output: greeting_audio
```
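How OpenClaw expands `${...}` placeholders is not described here. As an illustration of the idea only, Python's `string.Template` happens to use the same `${name}` syntax, so the substitution in the `tts` step can be sketched as below. The `render_input` helper and the sample value `Ada` are hypothetical.

```python
from string import Template

# The tts step from the workflow above, written as a plain dict.
step = {"action": "tts", "input": "Hello, ${user_name}!", "output": "greeting_audio"}
variables = {"user_name": "Ada"}


def render_input(step, variables):
    """Expand ${name} placeholders in a step's input; raises KeyError if one is missing."""
    return Template(step["input"]).substitute(variables)


print(render_input(step, variables))  # Hello, Ada!
```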

## Requirements

- Python 3.9+
- 2GB RAM minimum (4GB recommended for medium model)
- ~500MB disk space (plus model storage)

Related Skills

service-class-conventions

16
from diegosouzapw/awesome-omni-skill

Defines the structure and implementation of service classes, enforcing the use of interfaces, ServiceImpl classes, DTOs for data transfer, and transactional management.

performing-service-account-credential-rotation

16
from diegosouzapw/awesome-omni-skill

Automate credential rotation for service accounts across Active Directory, cloud platforms, and application databases to eliminate stale secrets and reduce compromise risk.

ocr-web-service-automation

16
from diegosouzapw/awesome-omni-skill

Automate OCR Web Service tasks via Rube MCP (Composio). Always search tools first for current schemas.

nimble-service-skill

16
from diegosouzapw/awesome-omni-skill

Create and edit BLE GATT services with NimBLE. Use when creating, editing, or refactoring BLE services, characteristics, descriptors, or callbacks.

moqui-service-writer

16
from diegosouzapw/awesome-omni-skill

This skill should be used when users need to create, validate, or modify Moqui framework services, entities, and queries. It provides comprehensive guidance for writing correct Moqui XML definitions, following framework patterns and conventions.

microservices-orchestrator

16
from diegosouzapw/awesome-omni-skill

Expert skill for designing, decomposing, and managing microservices architectures. Activates when users need help with microservices design, service decomposition, bounded contexts, API contracts, or transitioning from monolithic to microservices architectures.

managed-db-services

16
from diegosouzapw/awesome-omni-skill

Configure DigitalOcean Managed MySQL, MongoDB, Valkey, Kafka, and OpenSearch for App Platform. Use when setting up non-PostgreSQL databases, configuring trusted sources, or troubleshooting database connectivity.

flox-services

16
from diegosouzapw/awesome-omni-skill

Running services and background processes in Flox environments. Use for service configuration, network services, logging, database setup, and service debugging.

effect-layers-services

16
from diegosouzapw/awesome-omni-skill

Define services, provide layers, compose dependencies, and switch live/test. Use for DI boundaries and app composition.

developing-backend-services

16
from diegosouzapw/awesome-omni-skill

Backend service development best practices. Use when designing, building, or reviewing backend services, REST APIs, gRPC services, microservices, webhooks, message queues, or server-side applications regardless of language or framework.

design-microservices

16
from diegosouzapw/awesome-omni-skill

Microservices design agent: defines the target architecture, transformation plan, and operations plan. Invoke with /design-microservices [target path].

backend-service-patterns

16
from diegosouzapw/awesome-omni-skill

Architect scalable backend services using layered architecture, dependency injection, middleware patterns, service classes, and separation of concerns. Use when building API services, implementing business logic layers, creating service classes, setting up middleware chains, implementing dependency injection, designing controller-service-repository patterns, handling cross-cutting concerns, creating domain models, implementing CQRS patterns, or establishing backend architecture standards.