stt-tts-service

Lightweight local speech-to-text and text-to-speech service for OpenClaw

by diegosouzapw

Best use case

stt-tts-service is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using stt-tts-service should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/stt-tts-service/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/development/stt-tts-service/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/stt-tts-service/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How stt-tts-service Compares

Feature / Agent	stt-tts-service	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Lightweight local speech-to-text and text-to-speech service for OpenClaw

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# STT-TTS Service

A lightweight, local speech-to-text (STT) and text-to-speech (TTS) service that runs on any device connected to your OpenClaw server. Perfect for voice-enabled workflows and flexible resource allocation.

## Features

- **Speech-to-Text**: Transcribe audio using faster-whisper (up to 4x faster than OpenAI's reference Whisper implementation)
- **Text-to-Speech**: Generate natural speech using piper-tts, with pyttsx3 as a fallback
- **100% Local**: No cloud APIs, works offline after initial model download
- **Flexible Deployment**: Run on any device - Raspberry Pi, laptop, or GPU server
- **HTTP API**: Simple REST endpoints for easy integration

## Quick Start

### Installation

```bash
# Clone or download this skill
cd stt-tts-service

# Install dependencies
pip install -r requirements.txt

# Start the service
python main.py
```

### Docker Deployment

```bash
docker build -t stt-tts-service .
docker run -p 8765:8765 stt-tts-service
```

## API Endpoints

### POST /stt - Speech to Text

Transcribe audio files to text.

```bash
curl -X POST http://localhost:8765/stt \
  -F "audio=@recording.wav"
```

**Response:**
```json
{
  "text": "Hello, this is the transcribed text.",
  "language": "en",
  "duration": 3.5
}
```
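
For callers that prefer Python to curl, the upload above could be sketched with the standard library alone. This is a minimal sketch, not the project's own client: the helper names `build_multipart` and `transcribe` are hypothetical, the `audio/wav` content type is an assumption, and the form field name `audio` is taken from the curl example.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field_name, filename, data, content_type="audio/wav"):
    """Encode one file as a multipart/form-data body (hypothetical helper)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path, base_url="http://localhost:8765"):
    """POST an audio file to /stt and return the decoded JSON response."""
    with open(path, "rb") as f:
        body, ctype = build_multipart("audio", os.path.basename(path), f.read())
    req = urllib.request.Request(
        f"{base_url}/stt",
        data=body,
        headers={"Content-Type": ctype},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the service running, `transcribe("recording.wav")["text"]` would then hold the transcription, assuming the response has the shape shown above.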

### POST /tts - Text to Speech

Convert text to audio.

```bash
curl -X POST http://localhost:8765/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "default"}' \
  --output speech.wav
```

**Parameters:**
- `text` (required): Text to synthesize
- `voice` (optional): Voice ID to use
- `speed` (optional): Speech rate multiplier (0.5-2.0)
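
The parameters above can be checked on the client before a request is sent. The following is a rough sketch under assumptions: `build_tts_payload` and `synthesize` are hypothetical helper names, and the service's own validation behaviour is not documented here, so the range check simply mirrors the 0.5-2.0 limit listed for `speed`.

```python
import json
import urllib.request


def build_tts_payload(text, voice=None, speed=None):
    """Build the JSON body for POST /tts, omitting unset optional fields."""
    if not text:
        raise ValueError("text is required")
    payload = {"text": text}
    if voice is not None:
        payload["voice"] = voice
    if speed is not None:
        # Range taken from the parameter list above.
        if not 0.5 <= speed <= 2.0:
            raise ValueError("speed must be between 0.5 and 2.0")
        payload["speed"] = speed
    return payload


def synthesize(text, out_path, base_url="http://localhost:8765", **options):
    """POST text to /tts and write the returned audio bytes to out_path."""
    body = json.dumps(build_tts_payload(text, **options)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        out.write(resp.read())
```

For example, `synthesize("Hello world", "speech.wav", voice="default", speed=1.25)` mirrors the curl call with a faster speech rate.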

### GET /health

Health check endpoint.

```bash
curl http://localhost:8765/health
```
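
Because the first start may be delayed by model downloads, a client may want to poll the health endpoint before sending work. A minimal sketch, assuming only that a healthy service answers `/health` with HTTP 200 (the response body is not documented here); `is_healthy` and `wait_until_healthy` are hypothetical names.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0):
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(base_url, attempts=10, delay=1.0, probe=is_healthy):
    """Poll the service until it reports healthy or attempts run out."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_healthy("http://localhost:8765")` returns `False` after roughly ten seconds if the service never comes up.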

### GET /models

List available models and voices.

```bash
curl http://localhost:8765/models
```

## WebSocket Streaming (Real-time Voice)

For real-time voice conversations, use WebSocket endpoints:

### WS /ws/stt - Streaming Speech-to-Text

Stream audio and receive transcriptions in real-time.

```javascript
const ws = new WebSocket('ws://localhost:8765/ws/stt');

// Receive transcriptions
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(data.text);  // Transcribed text
};

// send() throws while the socket is still connecting, so wait for open
ws.onopen = () => {
  // Send audio chunks (16kHz, 16-bit, mono PCM);
  // audioBuffer is an ArrayBuffer you captured elsewhere
  ws.send(audioBuffer);

  // Flush remaining audio
  ws.send(JSON.stringify({action: "flush"}));
};
```
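
The audio format noted in the example (16kHz, 16-bit, mono PCM) fixes the size of each chunk: one second of audio is 16,000 samples of 2 bytes, or 32,000 bytes. A short sketch of preparing such chunks from float samples follows. The little-endian byte order and the 100 ms chunk length are assumptions, since the source does not state either, and both helper names are hypothetical.

```python
import struct

SAMPLE_RATE = 16_000      # samples per second, from the format noted above
BYTES_PER_SAMPLE = 2      # 16-bit samples


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(s * 32767) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_pcm(pcm, chunk_ms=100):
    """Yield fixed-duration slices of a PCM byte string."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

Each yielded chunk could then be sent as one binary WebSocket message, followed by the `flush` action once the input ends.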

### WS /ws/tts - Streaming Text-to-Speech

Send text and receive audio chunks in real-time.

```javascript
const ws = new WebSocket('ws://localhost:8765/ws/tts');

// Send text to synthesize once the connection is open
ws.onopen = () => {
  ws.send(JSON.stringify({text: "Hello world"}));
};

// Receive audio chunks
ws.onmessage = (event) => {
  if (event.data instanceof Blob) {
    // Audio chunk - play it (playAudio is your own playback helper)
    playAudio(event.data);
  }
};
```

### WS /ws/voice - Full Duplex Voice Conversation

Stream audio input and receive audio output for real-time voice-to-voice.

```javascript
const ws = new WebSocket('ws://localhost:8765/ws/voice');

// Stream microphone audio once the connection is open
ws.onopen = () => {
  navigator.mediaDevices.getUserMedia({audio: true})
    .then(stream => {
      // Send audio chunks to WebSocket
    });
};

// Handle responses
ws.onmessage = (event) => {
  if (typeof event.data !== "string") return;  // binary frames, if any, are not JSON
  const data = JSON.parse(event.data);
  if (data.type === "transcript") {
    // User's speech transcribed - send to your AI
    sendToAI(data.text);
  }
};

// Send AI response to be spoken (call this once your AI has replied)
function speak(aiResponse) {
  ws.send(JSON.stringify({action: "speak", text: aiResponse}));
}
```
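
The two JSON messages used in this exchange, the incoming `transcript` event and the outgoing `speak` action, can be handled independently of any particular WebSocket library. A minimal sketch, assuming the message shapes shown in the example and that any binary frame carries audio rather than JSON; `speak_message` and `extract_transcript` are hypothetical names.

```python
import json


def speak_message(text):
    """Serialize the 'speak' action sent to /ws/voice."""
    return json.dumps({"action": "speak", "text": text})


def extract_transcript(raw):
    """Return the transcript text from a server message, or None."""
    if isinstance(raw, (bytes, bytearray)):
        return None  # assumed to be an audio frame, not JSON
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and data.get("type") == "transcript":
        return data.get("text")
    return None
```

A receive loop would pass every incoming message through `extract_transcript`, forward any non-`None` result to the AI, and send `speak_message(reply)` back on the same socket.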

## Configuration

Set environment variables or edit `config.py`:

| Variable | Default | Description |
|----------|---------|-------------|
| `STT_MODEL` | `base` | Whisper model: tiny, base, small, medium |
| `TTS_ENGINE` | `auto` | TTS engine: piper, pyttsx3, auto |
| `DEVICE` | `auto` | Compute device: cpu, cuda, auto |
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8765` | Server port |
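
As an illustration of how the variables in this table map to settings, the sketch below reads them with the listed defaults and rejects values outside the documented choices. It is not the project's `config.py`, whose contents are not shown here; `ServiceConfig` and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    stt_model: str = "base"
    tts_engine: str = "auto"
    device: str = "auto"
    host: str = "0.0.0.0"
    port: int = 8765


# Allowed values as listed in the table above.
ALLOWED = {
    "stt_model": {"tiny", "base", "small", "medium"},
    "tts_engine": {"piper", "pyttsx3", "auto"},
    "device": {"cpu", "cuda", "auto"},
}


def load_config(env=None):
    """Build a ServiceConfig from environment variables, with defaults."""
    env = os.environ if env is None else env
    cfg = ServiceConfig(
        stt_model=env.get("STT_MODEL", "base"),
        tts_engine=env.get("TTS_ENGINE", "auto"),
        device=env.get("DEVICE", "auto"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8765")),
    )
    for field, allowed in ALLOWED.items():
        value = getattr(cfg, field)
        if value not in allowed:
            raise ValueError(f"{field}={value!r} is not one of {sorted(allowed)}")
    return cfg
```

Passing a plain dictionary, for example `load_config({"STT_MODEL": "small", "PORT": "9000"})`, makes the mapping easy to try without touching the real environment.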

## Model Sizes

| STT Model | Size | Speed | Accuracy |
|-----------|------|-------|----------|
| tiny | ~75MB | Fastest | Basic |
| base | ~150MB | Fast | Good |
| small | ~500MB | Medium | Better |
| medium | ~1.5GB | Slower | Best |
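
One way to read this table is as a budget lookup: given the space you can spare for a model, pick the largest one that fits. The sketch below uses the approximate sizes from the table, treating ~1.5GB as 1500 MB; the function name is hypothetical and the figures are estimates, not measured values.

```python
# Approximate download sizes in MB, copied from the table above.
MODEL_SIZES_MB = {"tiny": 75, "base": 150, "small": 500, "medium": 1500}


def largest_model_within(budget_mb):
    """Return the largest listed model whose size fits the budget, or None."""
    fitting = [name for name, size in MODEL_SIZES_MB.items() if size <= budget_mb]
    if not fitting:
        return None
    return max(fitting, key=MODEL_SIZES_MB.get)
```

The result could be exported as `STT_MODEL` before starting the service.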

## OpenClaw Integration

Register this service with your OpenClaw server:

```bash
openclaw service register http://device-ip:8765
```

Then use in your workflows:
```yaml
- action: stt
  input: ${audio_file}
  output: transcription
  
- action: tts
  input: "Hello, ${user_name}!"
  output: greeting_audio
```

## Requirements

- Python 3.9+
- 2GB RAM minimum (4GB recommended for medium model)
- ~500MB disk space (plus model storage)

Related Skills

service-class-conventions

from diegosouzapw/awesome-omni-skill

Defines the structure and implementation of service classes, enforcing the use of interfaces, ServiceImpl classes, DTOs for data transfer, and transactional management.

performing-service-account-credential-rotation

from diegosouzapw/awesome-omni-skill

Automate credential rotation for service accounts across Active Directory, cloud platforms, and application databases to eliminate stale secrets and reduce compromise risk.

ocr-web-service-automation

from diegosouzapw/awesome-omni-skill

Automate OCR Web Service tasks via Rube MCP (Composio). Always search tools first for current schemas.

nimble-service-skill

from diegosouzapw/awesome-omni-skill

Create and edit BLE GATT services with NimBLE. Use when creating, editing, or refactoring BLE services, characteristics, descriptors, or callbacks.

moqui-service-writer

from diegosouzapw/awesome-omni-skill

This skill should be used when users need to create, validate, or modify Moqui framework services, entities, and queries. It provides comprehensive guidance for writing correct Moqui XML definitions, following framework patterns and conventions.

microservices-orchestrator

from diegosouzapw/awesome-omni-skill

Expert skill for designing, decomposing, and managing microservices architectures. Activates when users need help with microservices design, service decomposition, bounded contexts, API contracts, or transitioning from monolithic to microservices architectures.

managed-db-services

from diegosouzapw/awesome-omni-skill

Configure DigitalOcean Managed MySQL, MongoDB, Valkey, Kafka, and OpenSearch for App Platform. Use when setting up non-PostgreSQL databases, configuring trusted sources, or troubleshooting database connectivity.

flox-services

from diegosouzapw/awesome-omni-skill

Running services and background processes in Flox environments. Use for service configuration, network services, logging, database setup, and service debugging.

effect-layers-services

from diegosouzapw/awesome-omni-skill

Define services, provide layers, compose dependencies, and switch live/test. Use for DI boundaries and app composition.

developing-backend-services

from diegosouzapw/awesome-omni-skill

Backend service development best practices. Use when designing, building, or reviewing backend services, REST APIs, gRPC services, microservices, webhooks, message queues, or server-side applications regardless of language or framework.

design-microservices

from diegosouzapw/awesome-omni-skill

Microservices design agent: develops the target architecture, transformation plan, and operations plan. Invoke with /design-microservices [target path].

backend-service-patterns

from diegosouzapw/awesome-omni-skill

Architect scalable backend services using layered architecture, dependency injection, middleware patterns, service classes, and separation of concerns. Use when building API services, implementing business logic layers, creating service classes, setting up middleware chains, implementing dependency injection, designing controller-service-repository patterns, handling cross-cutting concerns, creating domain models, implementing CQRS patterns, or establishing backend architecture standards.