type4me-macos-voice-input

A macOS voice input tool built in Swift, with local and cloud ASR engines, LLM text optimization, and fully local storage.

Best use case

type4me-macos-voice-input is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using type4me-macos-voice-input should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/type4me-macos-voice-input/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/adisinghstudent/type4me-macos-voice-input/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/type4me-macos-voice-input/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

SKILL.md Source

# Type4Me macOS Voice Input

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

Type4Me is a macOS voice input tool that captures audio via global hotkey, transcribes it using local (SherpaOnnx/Paraformer/Zipformer) or cloud (Volcengine/Deepgram) ASR engines, optionally post-processes text via LLM, and injects the result into any app. All credentials and history are stored locally — no telemetry, no cloud sync.

## Architecture Overview

```
Type4Me/
├── ASR/                    # ASR engine abstraction
│   ├── ASRProvider.swift          # Provider enum + protocols
│   ├── ASRProviderRegistry.swift  # Plugin registry
│   ├── Providers/                 # Per-vendor config files
│   ├── SherpaASRClient.swift      # Local streaming ASR
│   ├── SherpaOfflineASRClient.swift
│   ├── VolcASRClient.swift        # Volcengine streaming ASR
│   └── DeepgramASRClient.swift    # Deepgram streaming ASR
├── Bridge/                 # SherpaOnnx C API Swift bridge
├── Audio/                  # Audio capture
├── Session/                # Core state machine: record→ASR→inject
├── Input/                  # Global hotkey management
├── Services/               # Credentials, hotwords, model manager
├── Protocol/               # Volcengine WebSocket codec
└── UI/                     # SwiftUI (FloatingBar + Settings)
```

## Installation

### Prerequisites

```bash
# Xcode Command Line Tools
xcode-select --install

# CMake (for local ASR engine)
brew install cmake
```

### Build & Deploy from Source

```bash
git clone https://github.com/joewongjc/type4me.git
cd type4me

# Step 1: Compile SherpaOnnx local engine (~5 min, one-time)
bash scripts/build-sherpa.sh

# Step 2: Build, bundle, sign, install to /Applications, and launch
bash scripts/deploy.sh
```

### Download Pre-built App

Download `Type4Me-v1.2.3.dmg` from releases (cloud ASR only, no local engine):
```
https://github.com/joewongjc/type4me/releases/tag/v1.2.3
```

If macOS blocks the app:
```bash
xattr -d com.apple.quarantine /Applications/Type4Me.app
```

### Download Local ASR Models

```bash
# The tar commands below assume the model archive is already in ~/Downloads.
mkdir -p ~/Library/Application\ Support/Type4Me/Models

# Option A: Lightweight ~20MB
tar xjf ~/Downloads/sherpa-onnx-streaming-zipformer-small-ctc-zh-int8-2025-04-01.tar.bz2 \
    -C ~/Library/Application\ Support/Type4Me/Models/

# Option B: Balanced ~236MB (recommended)
tar xjf ~/Downloads/sherpa-onnx-streaming-zipformer-ctc-multi-zh-hans-2023-12-13.tar.bz2 \
    -C ~/Library/Application\ Support/Type4Me/Models/

# Option C: Bilingual Chinese+English ~1GB
tar xjf ~/Downloads/sherpa-onnx-streaming-paraformer-bilingual-zh-en.tar.bz2 \
    -C ~/Library/Application\ Support/Type4Me/Models/
```

Expected structure for Paraformer model:
```
~/Library/Application Support/Type4Me/Models/
└── sherpa-onnx-streaming-paraformer-bilingual-zh-en/
    ├── encoder.int8.onnx
    ├── decoder.int8.onnx
    └── tokens.txt
```
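A quick way to confirm that an extracted model matches this layout is to check for the three files programmatically. The following is a minimal sketch, not code from Type4Me: the function name `missingModelFiles` is hypothetical, and the default file list is simply the one shown above for the Paraformer model.

```swift
import Foundation

/// Returns the expected model files that are absent from `modelDirectory`.
/// Hypothetical helper for illustration; not part of Type4Me.
func missingModelFiles(
    in modelDirectory: URL,
    expected: [String] = ["encoder.int8.onnx", "decoder.int8.onnx", "tokens.txt"]
) -> [String] {
    expected.filter { name in
        let path = modelDirectory.appendingPathComponent(name).path
        return !FileManager.default.fileExists(atPath: path)
    }
}

// Usage: check the Paraformer model under Application Support.
let modelsRoot = FileManager.default.homeDirectoryForCurrentUser
    .appendingPathComponent("Library/Application Support/Type4Me/Models")
let paraformer = modelsRoot
    .appendingPathComponent("sherpa-onnx-streaming-paraformer-bilingual-zh-en")
let missing = missingModelFiles(in: paraformer)
print(missing.isEmpty ? "Model files present" : "Missing: \(missing.joined(separator: ", "))")
```

An empty result means all three files are in place; any names returned point to an incomplete extraction or a nested directory.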

## Key Protocols

### SpeechRecognizer Protocol

Every ASR client must implement this protocol:

```swift
import AVFoundation  // Provides AVAudioPCMBuffer

protocol SpeechRecognizer: AnyObject {
    /// Start a new recognition session
    func startRecognition() async throws
    
    /// Feed raw PCM audio data
    func appendAudio(_ buffer: AVAudioPCMBuffer) async
    
    /// Stop and get final result
    func stopRecognition() async throws -> String
    
    /// Cancel without result
    func cancelRecognition() async
    
    /// Streaming partial results (optional)
    var partialResultHandler: ((String) -> Void)? { get set }
}
```

### ASRProviderConfig Protocol

Each vendor's credential definition:

```swift
protocol ASRProviderConfig {
    /// Unique identifier string
    static var providerID: String { get }
    
    /// Display name in Settings UI
    static var displayName: String { get }
    
    /// Credential fields shown in Settings
    static var credentialFields: [CredentialField] { get }
    
    /// Validate credentials before use
    static func validate(_ credentials: [String: String]) -> Bool
    
    /// Create the recognizer instance
    static func createClient(
        credentials: [String: String],
        config: RecognitionConfig
    ) throws -> SpeechRecognizer
}
```

## Adding a New ASR Provider

### Step 1: Create Provider Config

Create `Type4Me/ASR/Providers/OpenAIWhisperProvider.swift`:

```swift
import Foundation

struct OpenAIWhisperProvider: ASRProviderConfig {
    static let providerID = "openai_whisper"
    static let displayName = "OpenAI Whisper"
    
    static let credentialFields: [CredentialField] = [
        CredentialField(
            key: "api_key",
            label: "API Key",
            placeholder: "sk-...",
            isSecret: true
        ),
        CredentialField(
            key: "model",
            label: "Model",
            placeholder: "whisper-1",
            isSecret: false
        )
    ]
    
    static func validate(_ credentials: [String: String]) -> Bool {
        guard let apiKey = credentials["api_key"], !apiKey.isEmpty else {
            return false
        }
        return apiKey.hasPrefix("sk-")
    }
    
    static func createClient(
        credentials: [String: String],
        config: RecognitionConfig
    ) throws -> SpeechRecognizer {
        guard let apiKey = credentials["api_key"] else {
            throw ASRError.missingCredential("api_key")
        }
        let model = credentials["model"] ?? "whisper-1"
        return OpenAIWhisperASRClient(apiKey: apiKey, model: model, config: config)
    }
}
```

### Step 2: Implement the ASR Client

Create `Type4Me/ASR/OpenAIWhisperASRClient.swift`:

```swift
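import Foundation
import AVFoundation

/// Batch (non-streaming) client: buffers audio locally and uploads it on stop.
final class OpenAIWhisperASRClient: SpeechRecognizer {
    /// Unused here: the transcription endpoint returns only a final result.
    var partialResultHandler: ((String) -> Void)?
    
    private let apiKey: String
    private let model: String
    private let config: RecognitionConfig
    private var audioData = Data()           // 16-bit mono PCM samples
    private var sampleRate: Double = 16_000  // Overwritten by the format of incoming buffers
    
    init(apiKey: String, model: String, config: RecognitionConfig) {
        self.apiKey = apiKey
        self.model = model
        self.config = config
    }
    
    func startRecognition() async throws {
        audioData = Data()
    }
    
    func appendAudio(_ buffer: AVAudioPCMBuffer) async {
        // Use the first channel only; buffers that are not Float32 are skipped
        guard let channelData = buffer.floatChannelData?[0] else { return }
        sampleRate = buffer.format.sampleRate
        let samples = UnsafeBufferPointer(start: channelData, count: Int(buffer.frameLength))
        // Clamp before converting: Int16(_:) traps on out-of-range or non-finite input
        let int16Samples = samples.map { sample -> Int16 in
            let clamped = sample.isFinite ? max(-1.0, min(1.0, sample)) : 0
            return Int16(clamped * 32767)
        }
        int16Samples.withUnsafeBytes { audioData.append(contentsOf: $0) }
    }
    
    /// Prepend a 44-byte WAV header (PCM, mono, 16-bit, little-endian) to the samples.
    private func wavFile(from pcm: Data) -> Data {
        func le<T: FixedWidthInteger>(_ value: T) -> Data {
            withUnsafeBytes(of: value.littleEndian) { Data($0) }
        }
        let rate = UInt32(sampleRate)
        var wav = Data("RIFF".utf8)
        wav.append(le(UInt32(36 + pcm.count)))  // Remaining file size
        wav.append(Data("WAVEfmt ".utf8))
        wav.append(le(UInt32(16)))              // fmt chunk size
        wav.append(le(UInt16(1)))               // Format tag: PCM
        wav.append(le(UInt16(1)))               // Channels: mono
        wav.append(le(rate))                    // Sample rate
        wav.append(le(rate * 2))                // Byte rate = rate * channels * 2 bytes
        wav.append(le(UInt16(2)))               // Block align
        wav.append(le(UInt16(16)))              // Bits per sample
        wav.append(Data("data".utf8))
        wav.append(le(UInt32(pcm.count)))
        wav.append(pcm)
        return wav
    }
    
    func stopRecognition() async throws -> String {
        // Build a multipart/form-data request for the transcription endpoint
        var request = URLRequest(url: URL(string: "https://api.openai.com/v1/audio/transcriptions")!)
        request.httpMethod = "POST"
        request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
        
        let boundary = UUID().uuidString
        request.setValue("multipart/form-data; boundary=\(boundary)",
                         forHTTPHeaderField: "Content-Type")
        
        var body = Data()
        // File part: the endpoint expects a container format, so send WAV, not headerless PCM
        body.append(Data("--\(boundary)\r\n".utf8))
        body.append(Data("Content-Disposition: form-data; name=\"file\"; filename=\"audio.wav\"\r\n".utf8))
        body.append(Data("Content-Type: audio/wav\r\n\r\n".utf8))
        body.append(wavFile(from: audioData))
        body.append(Data("\r\n".utf8))
        // Model part
        body.append(Data("--\(boundary)\r\n".utf8))
        body.append(Data("Content-Disposition: form-data; name=\"model\"\r\n\r\n".utf8))
        body.append(Data("\(model)\r\n".utf8))
        body.append(Data("--\(boundary)--\r\n".utf8))
        
        request.httpBody = body
        
        let (data, response) = try await URLSession.shared.data(for: request)
        guard let httpResponse = response as? HTTPURLResponse,
              httpResponse.statusCode == 200 else {
            let status = (response as? HTTPURLResponse)?.statusCode ?? -1
            throw ASRError.networkError("Whisper API returned status \(status)")
        }
        
        let result = try JSONDecoder().decode(WhisperResponse.self, from: data)
        return result.text
    }
    
    func cancelRecognition() async {
        audioData = Data()
    }
}

/// Only the `text` field of the JSON response is needed.
private struct WhisperResponse: Codable {
    let text: String
}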
```

### Step 3: Register the Provider

In `Type4Me/ASR/ASRProviderRegistry.swift`, add to the `all` array:

```swift
struct ASRProviderRegistry {
    static let all: [any ASRProviderConfig.Type] = [
        SherpaParaformerProvider.self,
        VolcengineProvider.self,
        DeepgramProvider.self,
        OpenAIWhisperProvider.self,   // ← Add your provider here
    ]
}
```
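Because the registry is an array of provider types, looking a provider up by ID and checking for duplicate IDs are both short operations. The sketch below is self-contained and uses stand-in names (`DemoProviderConfig`, `DemoRegistry`, and the `sherpa_paraformer` ID are all hypothetical); it does not reproduce the real `ASRProviderRegistry`, which may expose different helpers.

```swift
/// Stand-in for ASRProviderConfig, reduced to its two identifying members.
protocol DemoProviderConfig {
    static var providerID: String { get }
    static var displayName: String { get }
}

struct DemoSherpaProvider: DemoProviderConfig {
    static let providerID = "sherpa_paraformer"  // Hypothetical ID
    static let displayName = "Sherpa Paraformer"
}

struct DemoWhisperProvider: DemoProviderConfig {
    static let providerID = "openai_whisper"
    static let displayName = "OpenAI Whisper"
}

enum DemoRegistry {
    static let all: [any DemoProviderConfig.Type] = [
        DemoSherpaProvider.self,
        DemoWhisperProvider.self,
    ]

    /// Find a provider type by its identifier.
    static func provider(withID id: String) -> (any DemoProviderConfig.Type)? {
        all.first { $0.providerID == id }
    }

    /// Identifiers that appear more than once in `all`.
    static var duplicateIDs: [String] {
        var counts: [String: Int] = [:]
        for provider in all {
            counts[provider.providerID, default: 0] += 1
        }
        return counts.filter { $0.value > 1 }.keys.sorted()
    }
}

if let provider = DemoRegistry.provider(withID: "openai_whisper") {
    print(provider.displayName)  // prints "OpenAI Whisper"
}
print(DemoRegistry.duplicateIDs.isEmpty ? "All provider IDs unique" : "Duplicate IDs found")
```

A duplicate-ID check of this kind could run in a debug build or unit test, since two providers sharing an ID would make lookups ambiguous.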

## Credentials Storage

Credentials are stored at `~/Library/Application Support/Type4Me/credentials.json` with permissions `0600`. Never hardcode secrets — always load via `CredentialStore`:

```swift
// Reading credentials
let store = CredentialStore.shared
let apiKey = store.get(providerID: "openai_whisper", key: "api_key")

// Writing credentials  
store.set(providerID: "openai_whisper", key: "api_key", value: userInputKey)

// Checking if configured
let isConfigured = store.isConfigured(providerID: "openai_whisper", 
                                       fields: OpenAIWhisperProvider.credentialFields)
```
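For reference, a file can be written with owner-only (`0600`) permissions using Foundation alone. This is a sketch of the general technique, not the implementation of `CredentialStore`; the function names are hypothetical, and the nested-dictionary JSON layout (provider ID, then key, then value) is an assumption about how such a file might be organized.

```swift
import Foundation

/// Write credentials as JSON and restrict the file to owner read/write (0600).
/// The nested-dictionary layout is an assumption for illustration only.
func writeCredentials(_ credentials: [String: [String: String]], to url: URL) throws {
    let data = try JSONEncoder().encode(credentials)
    try FileManager.default.createDirectory(
        at: url.deletingLastPathComponent(),
        withIntermediateDirectories: true
    )
    try data.write(to: url, options: .atomic)
    try FileManager.default.setAttributes(
        [.posixPermissions: NSNumber(value: 0o600)],
        ofItemAtPath: url.path
    )
}

/// Read the POSIX permission bits of a file.
func permissions(of url: URL) throws -> Int {
    let attributes = try FileManager.default.attributesOfItem(atPath: url.path)
    let value = attributes[.posixPermissions]
    return (value as? NSNumber)?.intValue ?? (value as? Int) ?? 0
}

// Usage with a temporary file rather than the real credentials path.
let demoURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("type4me-demo-\(UUID().uuidString)")
    .appendingPathComponent("credentials.json")
do {
    try writeCredentials(["openai_whisper": ["api_key": "sk-example"]], to: demoURL)
    let mode = try permissions(of: demoURL)
    print("Mode: \(String(mode, radix: 8))")
} catch {
    print("Failed: \(error)")
}
```

Setting the mode right after the write keeps the file unreadable to other local users, which is what the `0600` requirement is for.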

## Custom Processing Modes with Prompt Variables

Processing modes use LLM post-processing with three context variables:

| Variable | Value |
|---|---|
| `{text}` | Recognized speech text |
| `{selected}` | Text selected in active app at record start |
| `{clipboard}` | Clipboard content at record start |

Example custom mode prompts:

```swift
// Translate selection using voice command
let translatePrompt = """
The user selected this text: {selected}
Voice command: {text}
Execute the command on the selected text. Output only the result.
"""

// Code review via voice
let codeReviewPrompt = """
Code to review:
{clipboard}

Review instruction: {text}

Provide focused feedback addressing the instruction.
"""

// Email reply drafting
let emailPrompt = """
Original email: {selected}
My reply intent (spoken): {text}
Write a professional email reply. Output only the email body.
"""
```
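To make the role of the three variables concrete, here is one way the placeholders could be filled in before a prompt is sent to the LLM. It is a minimal sketch under stated assumptions: `PromptContext` and `renderPrompt` are hypothetical names, and Type4Me's own substitution logic may differ. The template is scanned in a single pass so that a placeholder appearing inside a substituted value (for example, spoken text that happens to contain `{clipboard}`) is not expanded a second time.

```swift
/// Values available to a processing-mode prompt.
struct PromptContext {
    var text: String       // Recognized speech
    var selected: String   // Selection in the active app at record start
    var clipboard: String  // Clipboard content at record start
}

/// Replace {text}, {selected}, and {clipboard} in a prompt template.
/// Hypothetical helper; unknown placeholders are left untouched.
func renderPrompt(_ template: String, with context: PromptContext) -> String {
    let values = [
        "{text}": context.text,
        "{selected}": context.selected,
        "{clipboard}": context.clipboard,
    ]
    var output = ""
    var rest = Substring(template)
    while let open = rest.firstIndex(of: "{") {
        output.append(contentsOf: rest[..<open])
        let tail = rest[open...]
        if let match = values.first(where: { tail.hasPrefix($0.key) }) {
            // Known placeholder: emit its value and skip past it.
            output.append(contentsOf: match.value)
            rest = tail.dropFirst(match.key.count)
        } else {
            // Not a known placeholder: keep the brace literally.
            output.append("{")
            rest = tail.dropFirst()
        }
    }
    output.append(contentsOf: rest)
    return output
}

let context = PromptContext(
    text: "translate this into English",
    selected: "Bonjour le monde",
    clipboard: ""
)
print(renderPrompt("Selected: {selected}\nCommand: {text}", with: context))
```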

## Built-in Processing Modes

```swift
enum ProcessingMode {
    case fast           // Direct ASR output, no LLM post-processing
    case performance    // Dual-channel: streaming + offline refinement
    case englishTranslation  // Chinese speech → English text
    case promptOptimize // Raw prompt → optimized prompt via LLM
    case command        // Voice command + selected/clipboard context → LLM action
    case custom(prompt: String)  // User-defined prompt template
}
```

## Session State Machine

The core recording flow in `Session/`:

```
[Idle]
  → hotkey pressed → [Recording] → audio streams to ASR client
  → hotkey released/pressed again → [Processing]
  → ASR returns text → [LLM Post-processing] (if mode requires)
  → [Injecting] → text injected into active app
  → [Idle]
```
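The flow above can be expressed as a small pure transition function. This is an illustrative sketch only: the state and event names are hypothetical and are not taken from the code in `Session/`, which may model the flow differently.

```swift
/// States from the flow above.
enum SessionState: Equatable {
    case idle, recording, processing, postProcessing, injecting
}

/// Events that move the session forward. Names are hypothetical.
enum SessionEvent {
    case hotkeyPressed
    case hotkeyReleased
    case asrFinished(needsLLM: Bool)
    case llmFinished
    case injected
}

/// Returns the next state, or nil if the event is not valid in the current state.
func nextState(from state: SessionState, on event: SessionEvent) -> SessionState? {
    switch (state, event) {
    case (.idle, .hotkeyPressed):
        return .recording
    case (.recording, .hotkeyReleased), (.recording, .hotkeyPressed):
        // Hold-to-talk releases the key; toggle mode presses it again.
        return .processing
    case (.processing, .asrFinished(let needsLLM)):
        return needsLLM ? .postProcessing : .injecting
    case (.postProcessing, .llmFinished):
        return .injecting
    case (.injecting, .injected):
        return .idle
    default:
        return nil
    }
}

// Walk through one session that needs LLM post-processing.
var state = SessionState.idle
let events: [SessionEvent] = [
    .hotkeyPressed, .hotkeyReleased, .asrFinished(needsLLM: true), .llmFinished, .injected,
]
for event in events {
    guard let next = nextState(from: state, on: event) else { break }
    state = next
}
print(state)  // prints "idle"
```

Keeping the transitions in one function makes invalid sequences (such as an injection event while idle) easy to reject and to unit test.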

## Updating After Source Changes

```bash
cd type4me
git pull
bash scripts/deploy.sh
# SherpaOnnx does NOT need recompiling unless engine version changed
```

## Troubleshooting

### App won't open (security warning)
```bash
xattr -d com.apple.quarantine /Applications/Type4Me.app
```

### Local model not recognized in Settings
Verify the directory structure exactly matches:
```bash
ls ~/Library/Application\ Support/Type4Me/Models/sherpa-onnx-streaming-paraformer-bilingual-zh-en/
# Must show: encoder.int8.onnx  decoder.int8.onnx  tokens.txt
```

### SherpaOnnx build fails
```bash
# Ensure cmake is installed
brew install cmake
# Clean and retry
rm -rf Frameworks/
bash scripts/build-sherpa.sh
```

### New ASR provider not appearing in Settings
- Confirm the provider type is added to `ASRProviderRegistry.all`
- Ensure `providerID` is unique across all providers
- Clean build: `swift package clean && bash scripts/deploy.sh`

### Audio not captured / no floating bar
- Grant microphone permission: System Settings → Privacy & Security → Microphone → Type4Me ✓
- Grant Accessibility permission for text injection: System Settings → Privacy & Security → Accessibility → Type4Me ✓

### Credentials not saving
```bash
# Check file exists and has correct permissions
ls -la ~/Library/Application\ Support/Type4Me/credentials.json
# Should show: -rw------- (0600)
# Fix permissions if needed:
chmod 0600 ~/Library/Application\ Support/Type4Me/credentials.json
```

### Export history to CSV
Open Settings → History → select date range → Export CSV. The SQLite database is at:
```bash
# Database location: ~/Library/Application Support/Type4Me/history.db
# Direct query:
sqlite3 ~/Library/Application\ Support/Type4Me/history.db \
  "SELECT datetime(timestamp,'unixepoch'), text FROM records ORDER BY timestamp DESC LIMIT 20;"
```

## System Requirements

- macOS 14.0 (Sonoma) or later
- Apple Silicon (M1/M2/M3/M4) recommended for local ASR inference
- Xcode Command Line Tools + CMake for source builds
- Internet connection only needed for cloud ASR providers
