asr

Transcribe audio files to text using local speech recognition. Triggers on: "转录", "transcribe", "语音转文字", "ASR", "识别音频", "把这段音频转成文字".

3,891 stars
Complexity: medium

About this skill

The ASR (Automatic Speech Recognition) skill lets AI agents transcribe audio files into text using local, offline speech recognition models. It runs the `coli asr` tool, so all processing happens on the user's machine, which suits cases where privacy is a concern or internet access is limited.

Two models are available: sensevoice, which covers Chinese, English, Japanese, Korean, and Cantonese, and whisper, which is English-only. The user provides an audio file path and the agent processes it with the configured model. The skill runs commands directly and follows a fixed interaction flow that guides the user through transcription, starting with prerequisite checks for `coli` and `ffmpeg`.

Because it operates entirely offline, the skill is a good fit for sensitive conversations, meeting notes, lectures, or any audio where cloud-based services are not preferred or feasible.

Best use case

The primary use case for the ASR skill is to convert audio recordings into text transcripts. This benefits professionals who frequently record meetings or interviews, students who record lectures, or anyone needing a written record of spoken content, especially those prioritizing data privacy or working in environments with unreliable internet access. Its offline capability and multi-language support make it particularly useful for diverse users and sensitive tasks.

The user will receive a complete text transcription of their provided audio file, generated efficiently and securely using local speech recognition models.

Practical example

Example input

Can you please transcribe the audio file located at `/home/user/recordings/meeting_notes.mp3` for me?

Example output

Okay, here is the transcription of your audio file:

"Good morning everyone, and welcome to our weekly team sync. Today, we'll be discussing the Q3 performance metrics and planning for the upcoming project launch."

When to use this skill

  • User wants to transcribe an audio file to text.
  • User provides an audio file path and asks for transcription.
  • User says "转录", "识别", "transcribe", or "语音转文字".

When not to use this skill

  • User wants to synthesize speech from text (use `/tts`).
  • User wants to create a podcast or explainer (use `/podcast` or `/explainer`).

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/marswave-asr/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/0xfango/marswave-asr/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/marswave-asr/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How asr Compares

| Feature / Agent | asr | Standard Approach |
|-----------------|-----|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | medium | N/A |

SKILL.md Source

## When to Use

- User wants to transcribe an audio file to text
- User provides an audio file path and asks for transcription
- User says "转录", "识别", "transcribe", "语音转文字"

## When NOT to Use

- User wants to synthesize speech from text (use `/tts`)
- User wants to create a podcast or explainer (use `/podcast` or `/explainer`)

## Purpose

Transcribe audio files to text using `coli asr`, which runs fully offline via local
speech recognition models. No API key required. Supports Chinese, English, Japanese,
Korean, and Cantonese (sensevoice model) or English-only (whisper model).

Run `coli asr --help` for current CLI options and supported flags.

## Hard Constraints

- No shell scripts. Use direct commands only.
- Always read config following `shared/config-pattern.md` before any interaction
- Follow `shared/common-patterns.md` for interaction patterns
- Never ask more than one question at a time

<HARD-GATE>
Use the AskUserQuestion tool for every multiple-choice step — do NOT print options as
plain text. Ask one question at a time. Wait for the user's answer before proceeding.
After all parameters are collected, summarize and ask the user to confirm before
running any transcription.

</HARD-GATE>

## Interaction Flow

### Step 0: Prerequisites Check

Before config setup, silently check the environment:

```bash
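# `command -v` prints the resolved path on success, so discard it and capture only yes/no.
COLI_OK=$(command -v coli >/dev/null 2>&1 && echo yes || echo no)
FFMPEG_OK=$(command -v ffmpeg >/dev/null 2>&1 && echo yes || echo no)
MODELS_DIR="$HOME/.coli/models"
# Models count as present once the directory contains a sherpa entry.
MODELS_OK=$([ -d "$MODELS_DIR" ] && ls "$MODELS_DIR" | grep -q sherpa && echo yes || echo no)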
```

| Issue | Action |
|-------|--------|
| `coli` not found | Block. Tell user to run `npm install -g @marswave/coli` first |
| `ffmpeg` not found | Warn (WAV files still work). Suggest `brew install ffmpeg` / `sudo apt install ffmpeg` |
| Models not downloaded | Inform user: first transcription will auto-download models (~60MB) to `~/.coli/models/` |

If `coli` is missing, stop here and do not proceed.
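
The table above can be read as a small decision function. The following is a minimal Python sketch of that reading, not part of `coli` or of the skill itself: the names `check_environment` and `prerequisite_report` and the return shapes are illustrative assumptions. Only the `~/.coli/models` path, the `sherpa` name check, and the suggested install commands come from the text above.

```python
import shutil
from pathlib import Path
from typing import Dict, List, Tuple


def check_environment(models_dir: Path = Path.home() / ".coli" / "models") -> Dict[str, bool]:
    """Probe for coli, ffmpeg, and downloaded models (hypothetical helper)."""
    # Mirrors `ls "$MODELS_DIR" | grep -q sherpa`: any entry whose name contains "sherpa".
    models_ok = models_dir.is_dir() and any(
        "sherpa" in entry.name for entry in models_dir.iterdir()
    )
    return {
        "coli": shutil.which("coli") is not None,
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "models": models_ok,
    }


def prerequisite_report(env: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Map probe results to (may_proceed, messages), following the table."""
    if not env["coli"]:
        # Blocking issue: stop here and report nothing else.
        return False, ["coli not found. Run: npm install -g @marswave/coli"]
    messages: List[str] = []
    if not env["ffmpeg"]:
        messages.append(
            "ffmpeg not found (WAV files still work). "
            "Try: brew install ffmpeg / sudo apt install ffmpeg"
        )
    if not env["models"]:
        messages.append(
            "Models not downloaded: the first transcription will fetch ~60MB to ~/.coli/models/"
        )
    return True, messages
```

Calling `prerequisite_report(check_environment())` yields a flag that says whether to continue, plus any warnings to relay to the user before config setup.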

### Step 0: Config Setup

Follow `shared/config-pattern.md` Step 0.

Initial defaults:
```bash
# Run only one of the two blocks below.

# Current directory (project-level config):
mkdir -p ".listenhub/asr"
echo '{"model":"sensevoice","polish":true}' > ".listenhub/asr/config.json"
CONFIG_PATH=".listenhub/asr/config.json"

# Global (user-level config):
mkdir -p "$HOME/.listenhub/asr"
echo '{"model":"sensevoice","polish":true}' > "$HOME/.listenhub/asr/config.json"
CONFIG_PATH="$HOME/.listenhub/asr/config.json"
```

Config summary display:
```
Current config (asr):
  Model: sensevoice / whisper-tiny.en
  Polish: on / off
```
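
Reading the config back might look like the sketch below. `shared/config-pattern.md` is not reproduced here, so the lookup order (project-level file first, then the global one, then the initial defaults) is an assumption, and `load_config` is a hypothetical helper name. The two file locations and the default values are the ones shown above.

```python
import json
from pathlib import Path
from typing import Dict

# Initial defaults, as written by the setup commands above.
DEFAULTS = {"model": "sensevoice", "polish": True}


def load_config(project_dir: Path, home_dir: Path) -> Dict[str, object]:
    """Return the first config found, layered over the defaults.

    Assumption: the project-level file takes precedence over the global one.
    """
    candidates = [
        project_dir / ".listenhub" / "asr" / "config.json",
        home_dir / ".listenhub" / "asr" / "config.json",
    ]
    for path in candidates:
        if path.is_file():
            stored = json.loads(path.read_text(encoding="utf-8"))
            # Keys missing from the stored file fall back to the defaults.
            return {**DEFAULTS, **stored}
    return dict(DEFAULTS)
```

A typical call would be `load_config(Path.cwd(), Path.home())`. Layering the stored values over the defaults means a config file that only sets `model` still produces a usable `polish` value.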

### Setup Flow (first run or reconfigure)

Ask in order:

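1. **model**: "Which speech recognition model should be the default?"
   - "sensevoice (recommended)": supports Chinese, English, Japanese, Korean, and Cantonese; can detect language, emotion, and audio events
   - "whisper-tiny.en": English only

2. **polish**: "Have AI polish the text after transcription? (Fix punctuation, remove filler words, improve readability)"
   - "Yes (recommended)" → `polish: true`
   - "No, keep the raw transcript" → `polish: false`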

Save all answers at once after collecting them.
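
One way to read "save all answers at once" is to collect the answers in memory, validate them together, and write the file in a single step. The sketch below illustrates that under stated assumptions: `build_config` and `save_config` are hypothetical names, and rejecting unknown model names is an added safeguard rather than documented behavior. The two model names are the options listed above.

```python
import json
from pathlib import Path
from typing import Dict

# The two choices offered in the setup flow.
ALLOWED_MODELS = ("sensevoice", "whisper-tiny.en")


def build_config(model: str, polish: bool) -> Dict[str, object]:
    """Validate the collected answers and return the config to store."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unsupported model: {model!r}")
    return {"model": model, "polish": bool(polish)}


def save_config(config: Dict[str, object], config_path: Path) -> None:
    """Write the whole config in one go, creating parent directories if needed."""
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config) + "\n", encoding="utf-8")
```

Nothing is written until every answer has been collected and validated, so a cancelled setup leaves any existing config untouched.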

### Step 1: Get Audio File

If the user hasn't provided a file path, ask:

> "Please provide the path to the audio file you want to transcribe."

Verify the file exists before proceeding.
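
A file check could be sketched as follows. The helper names are hypothetical. The rule that a missing `ffmpeg` matters only for non-WAV input is inferred from the "WAV files still work" note in the prerequisites table, so it is an assumption about `coli`, not documented behavior.

```python
from pathlib import Path
from typing import Optional, Tuple


def ffmpeg_warning(audio_path: str, ffmpeg_available: bool) -> Optional[str]:
    """Return a warning when a non-WAV file is given but ffmpeg is missing."""
    is_wav = Path(audio_path).suffix.lower() == ".wav"
    if is_wav or ffmpeg_available:
        return None
    return "ffmpeg not found; non-WAV input may fail to decode."


def check_audio_file(audio_path: str, ffmpeg_available: bool) -> Tuple[bool, Optional[str]]:
    """Return (exists, note). A missing file blocks; a missing ffmpeg only warns."""
    path = Path(audio_path).expanduser()
    if not path.is_file():
        return False, f"File not found: {path}"
    return True, ffmpeg_warning(audio_path, ffmpeg_available)
```

Keeping the warning separate from the existence check matches the table's distinction between issues that block and issues that only warn.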

### Step 2: Confirm

```
Ready to transcribe:

  File: {filename}
  Model: {model}
  Polish: {yes / no}

Continue?
```

### Step 3: Transcribe

Run `coli asr` with JSON output (to get metadata):

```bash
coli asr -j --model {model} "{file}"
```

On first run, `coli` will automatically download the required model. This may take a
moment — inform the user if models haven't been downloaded yet.

Parse the JSON result to extract `text`, `lang`, `emotion`, `event`, `duration`.
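
The run-and-parse step can be sketched in Python as below. The `-j` and `--model` flags and the five field names are taken from the text above. Everything else is an assumption: the helper names are hypothetical, and the sketch assumes the JSON arrives on stdout as a single object with those fields at the top level. Check `coli asr --help` and a real output sample before relying on it.

```python
import json
import subprocess
from typing import Dict, List

RESULT_FIELDS = ("text", "lang", "emotion", "event", "duration")


def build_command(model: str, audio_path: str) -> List[str]:
    """Build the argument list for `coli asr` with JSON output."""
    # An argument list (no shell) keeps paths with spaces or quotes intact.
    return ["coli", "asr", "-j", "--model", model, audio_path]


def parse_result(stdout: str) -> Dict[str, object]:
    """Extract the documented fields; absent ones come back as None."""
    data = json.loads(stdout)
    return {name: data.get(name) for name in RESULT_FIELDS}


def transcribe(model: str, audio_path: str) -> Dict[str, object]:
    """Run coli and return the parsed fields. Raises if coli exits non-zero."""
    completed = subprocess.run(
        build_command(model, audio_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_result(completed.stdout)
```

Using `data.get` rather than indexing means a missing field, such as `emotion`, yields `None` instead of raising.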

### Step 4: Polish (if enabled)

If `polish` is `true`, take the raw `text` from the transcription result and rewrite
it to fix punctuation, remove filler words, and improve readability. Preserve the
original meaning and speaker intent. Do not summarize or paraphrase.

### Step 5: Present Result

Display the transcript directly in the conversation:

```
Transcription complete

{transcript text}

─────────────────
Language: {lang} · Emotion: {emotion} · Duration: {duration}s
```

If polished, show the polished version with a note that it was AI-refined. Offer to
show the raw original on request.

### Step 6: Export as Markdown (optional)

After presenting the result, ask:

```
Question: "Save as a Markdown file in the current directory?"
Options:
  - "Yes": save to current directory
  - "No": done
```

If yes, write `{audio-filename}-transcript.md` to the **current working directory**
(where the user is running Claude Code). The file should contain the transcript text
(polished version if polish was enabled), with a front-matter header:

```markdown
---
source: {original audio filename}
date: {YYYY-MM-DD}
model: {model used}
duration: {duration}s
lang: {detected language}
---

{transcript text}
```
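
The export step can be sketched as two small helpers. Both names are hypothetical, and the sketch assumes that `{audio-filename}` means the file name without its extension, so `meeting.m4a` becomes `meeting-transcript.md`. The source does not say which reading is intended.

```python
from datetime import date
from pathlib import Path


def transcript_filename(audio_path: str) -> str:
    """Derive the output name. Assumption: the audio extension is dropped."""
    return f"{Path(audio_path).stem}-transcript.md"


def render_markdown(
    audio_path: str,
    model: str,
    duration: float,
    lang: str,
    text: str,
    today: date,
) -> str:
    """Render the front-matter header followed by the transcript text."""
    lines = [
        "---",
        f"source: {Path(audio_path).name}",
        f"date: {today.isoformat()}",  # YYYY-MM-DD
        f"model: {model}",
        f"duration: {duration}s",
        f"lang: {lang}",
        "---",
        "",
        text,
        "",
    ]
    return "\n".join(lines)
```

To save the file, the rendered string would be written to `Path.cwd() / transcript_filename(audio_path)`. Passing `today` in explicitly keeps the rendering deterministic.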

## Composability

- **Invoked by**: future skills that need to transcribe recorded audio
- **Invokes**: nothing

## Examples

> "帮我转录这个文件 meeting.m4a" ("Help me transcribe this file: meeting.m4a")

1. Check prerequisites
2. Read config
3. Confirm: meeting.m4a, sensevoice, polish on
4. Run `coli asr -j --model sensevoice "meeting.m4a"`
5. Polish the raw text
6. Display inline

> "transcribe interview.wav, no polish"

1. Check prerequisites
2. Read config
3. Override polish to false for this session
4. Confirm: interview.wav, sensevoice, polish off
5. Run `coli asr -j --model sensevoice "interview.wav"`
6. Display raw transcript inline
