asr
Transcribe audio files to text using local speech recognition. Triggers on: "转录", "transcribe", "语音转文字", "ASR", "识别音频", "把这段音频转成文字".
3,556 stars
by openclaw
Installation
Claude Code / Cursor / Codex
$ curl -o ~/.claude/skills/marswave-asr/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/0xfango/marswave-asr/SKILL.md"
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/marswave-asr/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How asr Compares
| Feature / Agent | asr | Standard Approach |
|---|---|---|
| Platform Support | multi | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Transcribe audio files to text using local speech recognition. Triggers on: "转录", "transcribe", "语音转文字", "ASR", "识别音频", "把这段音频转成文字".
Which AI agents support this skill?
This skill is listed as compatible with multiple agents; the installation section names Claude Code, Cursor, and Codex.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
## When to Use
- User wants to transcribe an audio file to text
- User provides an audio file path and asks for transcription
- User says "转录", "识别", "transcribe", "语音转文字"
## When NOT to Use
- User wants to synthesize speech from text (use `/tts`)
- User wants to create a podcast or explainer (use `/podcast` or `/explainer`)
## Purpose
Transcribe audio files to text using `coli asr`, which runs fully offline via local
speech recognition models. No API key required. Supports Chinese, English, Japanese,
Korean, and Cantonese (sensevoice model) or English-only (whisper model).
Run `coli asr --help` for current CLI options and supported flags.
## Hard Constraints
- No shell scripts. Use direct commands only.
- Always read config following `shared/config-pattern.md` before any interaction
- Follow `shared/common-patterns.md` for interaction patterns
- Never ask more than one question at a time
<HARD-GATE>
Use the AskUserQuestion tool for every multiple-choice step — do NOT print options as
plain text. Ask one question at a time. Wait for the user's answer before proceeding.
After all parameters are collected, summarize and ask the user to confirm before
running any transcription.
</HARD-GATE>
## Interaction Flow
### Step 0: Prerequisites Check
Before config setup, silently check the environment:
```bash
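# Each check prints only "yes" or "no"; the resolved path is discarded so it is not captured.
COLI_OK=$(command -v coli >/dev/null 2>&1 && echo yes || echo no)
FFMPEG_OK=$(command -v ffmpeg >/dev/null 2>&1 && echo yes || echo no)
MODELS_DIR="$HOME/.coli/models"
# Models count as present when the directory contains a sherpa entry.
MODELS_OK=$([ -d "$MODELS_DIR" ] && ls "$MODELS_DIR" | grep -q sherpa && echo yes || echo no)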
```
| Issue | Action |
|-------|--------|
| `coli` not found | Block. Tell user to run `npm install -g @marswave/coli` first |
| `ffmpeg` not found | Warn (WAV files still work). Suggest `brew install ffmpeg` / `sudo apt install ffmpeg` |
| Models not downloaded | Inform user: first transcription will auto-download models (~60MB) to `~/.coli/models/` |
If `coli` is missing, stop here and do not proceed.
### Step 0: Config Setup
Follow `shared/config-pattern.md` Step 0.
Initial defaults:
```bash
# Current directory:
mkdir -p ".listenhub/asr"
echo '{"model":"sensevoice","polish":true}' > ".listenhub/asr/config.json"
CONFIG_PATH=".listenhub/asr/config.json"
# Global:
mkdir -p "$HOME/.listenhub/asr"
echo '{"model":"sensevoice","polish":true}' > "$HOME/.listenhub/asr/config.json"
CONFIG_PATH="$HOME/.listenhub/asr/config.json"
```
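The two locations above imply a lookup when the config is read back. A minimal sketch, assuming the project-local file takes precedence over the global one; `shared/config-pattern.md` remains the authority on the real order, and `find_asr_config` is a hypothetical name used only for illustration.

```bash
# Hypothetical helper: print the first config file that exists.
# Assumption: the project-local config wins over the global one.
find_asr_config() {
  for _p in ".listenhub/asr/config.json" "$HOME/.listenhub/asr/config.json"; do
    if [ -f "$_p" ]; then
      printf '%s\n' "$_p"
      return 0
    fi
  done
  # Neither file exists: print nothing and signal "first run".
  return 1
}
```

An empty result would mean no config exists yet, which is the case the setup flow handles.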
Config summary display:
```
当前配置 (asr):
模型:sensevoice / whisper-tiny.en
润色:开启 / 关闭
```
### Setup Flow (first run or reconfigure)
Ask in order:
1. **model**: "默认使用哪个语音识别模型?"
- "sensevoice(推荐)" — 支持中英日韩粤,可检测语言、情绪、音频事件
- "whisper-tiny.en" — 仅英文
2. **polish**: "转录后由 AI 润色文本?(修正标点、去语气词、提升可读性)"
- "是(推荐)" → `polish: true`
- "否,保留原始转录" → `polish: false`
Save all answers at once after collecting them.
### Step 1: Get Audio File
If the user hasn't provided a file path, ask:
> "请提供要转录的音频文件路径。"
Verify the file exists before proceeding.
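The existence check can follow the same yes/no style as the prerequisites check. A minimal sketch; `audio_file_ok` is a hypothetical name, and wrapping the test in a function is only for reuse, since the same test works as a direct command.

```bash
# Hypothetical helper: prints "yes" only for an existing, readable regular file.
audio_file_ok() {
  [ -f "$1" ] && [ -r "$1" ] && echo yes || echo no
}

# Usage: FILE_OK=$(audio_file_ok "meeting.m4a")
```

A directory or a mistyped path yields `no`, so the problem surfaces before the confirmation step rather than as a `coli` error.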
### Step 2: Confirm
```
准备转录:
文件:{filename}
模型:{model}
润色:{是 / 否}
继续?
```
### Step 3: Transcribe
Run `coli asr` with JSON output (to get metadata):
```bash
coli asr -j --model {model} "{file}"
```
On first run, `coli` automatically downloads the required model (~60MB, as noted in the
prerequisites table). If the models are not present yet, tell the user about the
download before starting.
Parse the JSON result to extract `text`, `lang`, `emotion`, `event`, `duration`.
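Field extraction might look like the sketch below. It assumes `python3` is available and that `coli asr -j` prints a single JSON object with these names as top-level keys; the source lists the field names but not the exact shape, so check real output first. `asr_field` is a hypothetical name.

```bash
# Hypothetical helper: asr_field NAME reads one JSON object on stdin and
# prints the value of the top-level key NAME (empty line if the key is absent).
# Assumption: the fields are top-level keys of a single JSON object.
asr_field() {
  python3 -c 'import json, sys
value = json.load(sys.stdin).get(sys.argv[1], "")
print(value)' "$1"
}

# Usage, with RESULT holding the captured output of the coli command:
#   TEXT=$(printf '%s' "$RESULT" | asr_field text)
#   DURATION=$(printf '%s' "$RESULT" | asr_field duration)
```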
### Step 4: Polish (if enabled)
If `polish` is `true`, take the raw `text` from the transcription result and rewrite
it to fix punctuation, remove filler words, and improve readability. Preserve the
original meaning and speaker intent. Do not summarize or paraphrase.
### Step 5: Present Result
Display the transcript directly in the conversation:
```
转录完成
{transcript text}
─────────────────
语言:{lang} · 情绪:{emotion} · 时长:{duration}s
```
If polished, show the polished version with a note that it was AI-refined. Offer to
show the raw original on request.
### Step 6: Export as Markdown (optional)
After presenting the result, ask:
```
Question: "保存为 Markdown 文件到当前目录?"
Options:
- "是" — save to current directory
- "否" — done
```
If yes, write `{audio-filename}-transcript.md` to the **current working directory**
(where the user is running Claude Code). The file should contain the transcript text
(polished version if polish was enabled), with a front-matter header:
```markdown
---
source: {original audio filename}
date: {YYYY-MM-DD}
model: {model used}
duration: {duration}s
lang: {detected language}
---
{transcript text}
```
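Filling the template above can be done with a here-document. A minimal sketch under two assumptions: `{audio-filename}` means the file name with its extension dropped (so `meeting.m4a` becomes `meeting-transcript.md`), and the date is today's date. `write_transcript` is a hypothetical name.

```bash
# Hypothetical helper: write_transcript AUDIO MODEL DURATION LANG TEXT
# Writes the transcript into the current directory and prints the file name.
write_transcript() (
  audio=$(basename "$1")
  # Assumption: the audio extension is dropped before adding the suffix.
  out="${audio%.*}-transcript.md"
  cat > "$out" <<EOF
---
source: $audio
date: $(date +%Y-%m-%d)
model: $2
duration: ${3}s
lang: $4
---
$5
EOF
  printf '%s\n' "$out"
)

# Usage: write_transcript "meeting.m4a" sensevoice 42.0 zh "$TEXT"
```

The function body runs in a subshell, so its variables do not leak into the caller, and the transcript text is inserted as-is without being re-expanded by the shell.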
## Composability
- **Invoked by**: future skills that need to transcribe recorded audio
- **Invokes**: nothing
## Examples
> "帮我转录这个文件 meeting.m4a"
1. Check prerequisites
2. Read config
3. Confirm: meeting.m4a, sensevoice, polish on
4. Run `coli asr -j --model sensevoice "meeting.m4a"`
5. Polish the raw text
6. Display inline
> "transcribe interview.wav, no polish"
1. Check prerequisites
2. Read config
3. Override polish to false for this session
4. Run `coli asr -j --model sensevoice "interview.wav"`
5. Display raw transcript inline