stt-tts-service
Lightweight local speech-to-text and text-to-speech service for OpenClaw
Best use case
stt-tts-service is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using stt-tts-service should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/stt-tts-service/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How stt-tts-service Compares
| Feature / Agent | stt-tts-service | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Lightweight local speech-to-text and text-to-speech service for OpenClaw
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# STT-TTS Service
A lightweight, local speech-to-text (STT) and text-to-speech (TTS) service that runs on any device connected to your OpenClaw server. Perfect for voice-enabled workflows and flexible resource allocation.
## Features
- **Speech-to-Text**: Transcribe audio using faster-whisper (up to 4x faster than the reference OpenAI Whisper implementation)
- **Text-to-Speech**: Generate natural speech using piper-tts or pyttsx3 fallback
- **100% Local**: No cloud APIs, works offline after initial model download
- **Flexible Deployment**: Run on any device - Raspberry Pi, laptop, or GPU server
- **HTTP API**: Simple REST endpoints for easy integration
## Quick Start
### Installation
```bash
# Clone or download this skill
cd stt-tts-service
# Install dependencies
pip install -r requirements.txt
# Start the service
python main.py
```
### Docker Deployment
```bash
docker build -t stt-tts-service .
docker run -p 8765:8765 stt-tts-service
```
## API Endpoints
### POST /stt - Speech to Text
Transcribe audio files to text.
```bash
curl -X POST http://localhost:8765/stt \
  -F "audio=@recording.wav"
```
**Response:**
```json
{
  "text": "Hello, this is the transcribed text.",
  "language": "en",
  "duration": 3.5
}
```
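
A client may find it convenient to wrap this response in a small typed record. The following is a minimal Python sketch, assuming the response always carries the three fields shown above; `Transcription` and `parse_stt_response` are hypothetical names introduced for illustration and are not part of the service.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcription:
    """Hypothetical client-side record mirroring the /stt response fields."""
    text: str
    language: str
    duration: float  # assumed to be seconds of audio; the unit is not documented


def parse_stt_response(body: str) -> Transcription:
    """Parse a /stt JSON body; raises KeyError if a documented field is missing."""
    data = json.loads(body)
    return Transcription(
        text=data["text"],
        language=data["language"],
        duration=float(data["duration"]),
    )


if __name__ == "__main__":
    sample = (
        '{"text": "Hello, this is the transcribed text.", '
        '"language": "en", "duration": 3.5}'
    )
    print(parse_stt_response(sample))
```

Failing loudly on a missing field keeps a malformed response from silently turning into an empty transcription further down the workflow.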
### POST /tts - Text to Speech
Convert text to audio.
```bash
curl -X POST http://localhost:8765/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "default"}' \
  --output speech.wav
```
**Parameters:**
- `text` (required): Text to synthesize
- `voice` (optional): Voice ID to use
- `speed` (optional): Speech rate multiplier (0.5-2.0)
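
The parameter rules above can be checked on the client before a request is sent. The sketch below uses only the Python standard library; `build_tts_payload` and `synthesize` are hypothetical helper names, and the base URL assumes the default port from the examples in this document.

```python
import json
import urllib.request
from typing import Optional


def build_tts_payload(text: str, voice: Optional[str] = None,
                      speed: Optional[float] = None) -> dict:
    """Build the JSON body for POST /tts, checking the documented constraints."""
    if not text:
        raise ValueError("text is required")
    payload = {"text": text}
    if voice is not None:
        payload["voice"] = voice
    if speed is not None:
        # Documented range for the speech rate multiplier
        if not 0.5 <= speed <= 2.0:
            raise ValueError("speed must be between 0.5 and 2.0")
        payload["speed"] = speed
    return payload


def synthesize(text: str, out_path: str,
               base_url: str = "http://localhost:8765", **options) -> None:
    """POST to /tts and write the returned audio bytes to out_path."""
    request = urllib.request.Request(
        f"{base_url}/tts",
        data=json.dumps(build_tts_payload(text, **options)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
```

With the service running, `synthesize("Hello world", "speech.wav", voice="default")` would mirror the curl call above.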
### GET /health
Health check endpoint.
```bash
curl http://localhost:8765/health
```
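
A script that starts the service and then calls it may want to wait until the health check succeeds. The sketch below is one way to do that in Python. The response body of `/health` is not documented here, so it assumes only that a healthy service answers with HTTP 200; `is_healthy` and `wait_until_healthy` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str = "http://localhost:8765", timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200 (assumed success signal)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy
        return False


def wait_until_healthy(base_url: str = "http://localhost:8765",
                       attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll /health up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if is_healthy(base_url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```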
### GET /models
List available models and voices.
```bash
curl http://localhost:8765/models
```
## WebSocket Streaming (Real-time Voice)
For real-time voice conversations, use WebSocket endpoints:
### WS /ws/stt - Streaming Speech-to-Text
Stream audio and receive transcriptions in real-time.
```javascript
const ws = new WebSocket('ws://localhost:8765/ws/stt');

// Wait for the connection to open before sending anything
ws.onopen = () => {
  // Send audio chunks (16kHz, 16-bit, mono PCM)
  ws.send(audioBuffer);

  // Flush remaining audio once the last chunk has been sent
  ws.send(JSON.stringify({action: "flush"}));
};

// Receive transcriptions
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(data.text); // Transcribed text
};
```
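
The streaming endpoint expects 16kHz, 16-bit, mono PCM, so a client that captures floating-point samples has to convert and chunk them first. The Python sketch below shows one way to do this. It assumes little-endian byte order and a 100 ms chunk size, neither of which is specified by the service, and both function names are hypothetical.

```python
import struct
from typing import Iterator, Sequence

SAMPLE_RATE = 16_000   # Hz, per the format noted in the example above
BYTES_PER_SAMPLE = 2   # 16-bit mono


def floats_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert samples in [-1.0, 1.0] to signed 16-bit PCM (little-endian assumed)."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(s * 32767) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_pcm(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive chunks of chunk_ms milliseconds; the last may be shorter."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

Each chunk yielded by `chunk_pcm` could then be sent as one binary WebSocket message, followed by the flush action once the audio ends.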
### WS /ws/tts - Streaming Text-to-Speech
Send text and receive audio chunks in real-time.
```javascript
const ws = new WebSocket('ws://localhost:8765/ws/tts');

// Send text to synthesize once the connection is open
ws.onopen = () => {
  ws.send(JSON.stringify({text: "Hello world"}));
};

// Receive audio chunks
ws.onmessage = (event) => {
  if (event.data instanceof Blob) {
    // Audio chunk - play it (playAudio is your own playback helper)
    playAudio(event.data);
  }
};
```
### WS /ws/voice - Full Duplex Voice Conversation
Stream audio input and receive audio output for real-time voice-to-voice.
```javascript
const ws = new WebSocket('ws://localhost:8765/ws/voice');

// Stream microphone audio
navigator.mediaDevices.getUserMedia({audio: true})
  .then(stream => {
    // Send audio chunks to WebSocket
  });

// Handle responses
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "transcript") {
    // User's speech transcribed - send to your AI
    sendToAI(data.text);
  }
};

// Send AI response to be spoken; call this once your AI has replied
function speak(aiResponse) {
  ws.send(JSON.stringify({action: "speak", text: aiResponse}));
}
```
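
The same exchange can be written as a pure function that is easy to test without a socket. The Python sketch below assumes the message shapes match the JavaScript example above (a `transcript` message in, a `speak` action out); `handle_voice_message` and the `respond` callback are hypothetical names.

```python
import json
from typing import Callable, Optional


def handle_voice_message(raw: str, respond: Callable[[str], str]) -> Optional[str]:
    """Turn a transcript message into a speak action, or return None to ignore it."""
    data = json.loads(raw)
    if data.get("type") != "transcript":
        return None
    # respond() stands in for your AI; it maps the user's text to a reply
    reply = respond(data["text"])
    return json.dumps({"action": "speak", "text": reply})
```

A WebSocket client would call this for each text message it receives and send any non-`None` result back on the same connection.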
## Configuration
Set environment variables or edit `config.py`:
| Variable | Default | Description |
|----------|---------|-------------|
| `STT_MODEL` | `base` | Whisper model: tiny, base, small, medium |
| `TTS_ENGINE` | `auto` | TTS engine: piper, pyttsx3, auto |
| `DEVICE` | `auto` | Compute device: cpu, cuda, auto |
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8765` | Server port |
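
To illustrate how these variables and defaults fit together, here is a minimal Python sketch of an environment-based loader. It is not the service's actual `config.py`; `ServiceConfig`, `ALLOWED`, and `load_config` are hypothetical names, and the validation of allowed values is an assumption based on the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ServiceConfig:
    """Defaults copied from the configuration table."""
    stt_model: str = "base"
    tts_engine: str = "auto"
    device: str = "auto"
    host: str = "0.0.0.0"
    port: int = 8765


# Allowed values as listed in the table's Description column
ALLOWED = {
    "stt_model": {"tiny", "base", "small", "medium"},
    "tts_engine": {"piper", "pyttsx3", "auto"},
    "device": {"cpu", "cuda", "auto"},
}


def load_config(env: Mapping[str, str] = os.environ) -> ServiceConfig:
    """Read settings from env, falling back to the documented defaults."""
    defaults = ServiceConfig()
    config = ServiceConfig(
        stt_model=env.get("STT_MODEL", defaults.stt_model),
        tts_engine=env.get("TTS_ENGINE", defaults.tts_engine),
        device=env.get("DEVICE", defaults.device),
        host=env.get("HOST", defaults.host),
        port=int(env.get("PORT", defaults.port)),
    )
    for field, allowed in ALLOWED.items():
        value = getattr(config, field)
        if value not in allowed:
            raise ValueError(f"{field}={value!r} is not one of {sorted(allowed)}")
    return config
```

For example, `load_config({"STT_MODEL": "small", "PORT": "9000"})` yields the defaults with the model and port overridden.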
## Model Sizes
| STT Model | Size | Speed | Accuracy |
|-----------|------|-------|----------|
| tiny | ~75MB | Fastest | Basic |
| base | ~150MB | Fast | Good |
| small | ~500MB | Medium | Better |
| medium | ~1.5GB | Slower | Best |
## OpenClaw Integration
Register this service with your OpenClaw server:
```bash
openclaw service register http://device-ip:8765
```
Then use in your workflows:
```yaml
- action: stt
  input: ${audio_file}
  output: transcription
- action: tts
  input: "Hello, ${user_name}!"
  output: greeting_audio
```
## Requirements
- Python 3.9+
- 2GB RAM minimum (4GB recommended for medium model)
- ~500MB disk space (plus model storage)