audio-video-transcription

An atomic audio/video transcription skill for converting recordings to text, suited to general-industry document parsing scenarios.

105 stars

Best use case

audio-video-transcription is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using audio-video-transcription should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/audio-video-transcription/SKILL.md --create-dirs "https://raw.githubusercontent.com/aifinlab/FinClaw/main/skills/archive/audio-video-transcription/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/audio-video-transcription/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How audio-video-transcription Compares

Feature / Agent	audio-video-transcription	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

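It is an atomic audio/video transcription skill that converts recordings to text, suited to general-industry document parsing scenarios.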

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

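# Audio/Video Transcription Skill

## Data Sources

This Skill accepts audio and video files in a range of input formats. The main data sources are:

### 1. Audio file types
- **Audio formats**: MP3, WAV, AAC, FLAC, M4A, and others
- **Recordings**: phone calls, meetings, interviews
- **Voice files**: voice messages, voice commands, voice reports

### 2. Video file types
- **Video formats**: MP4, AVI, MOV, MKV, WMV, and others
- **Video meetings**: meeting recordings, training videos, demo videos
- **Livestream replays**: livestream recordings, webinar replays

### 3. Audio/video characteristics
- **Language**: Chinese, English, or mixed-language
- **Audio quality**: clarity, background noise, sample rate
- **Video quality**: resolution, frame rate, clarity
- **Duration range**: from short audio (under 5 minutes) to long audio (over 1 hour)

### 4. Data format requirements
- **File path**: a local file path or a network file URL
- **File encoding**: common audio and video codecs are supported
- **File permissions**: read permission is required

> Note: This Skill does not collect files; the user must supply the audio or video files. Good-quality files are recommended for accurate transcription.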
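
The input requirements above (a local path or URL, a recognized format, read permission) can be checked before any transcription is attempted. The following is a minimal sketch only: `classify_input` is a hypothetical helper, not part of this Skill, and the extension sets simply mirror the formats named in the lists above, which are not exhaustive.

```python
import os
from pathlib import Path
from urllib.parse import urlparse

# Extensions mirror the formats listed in this document; the lists end with
# "and others", so these sets are illustrative rather than exhaustive.
AUDIO_EXTS = {".mp3", ".wav", ".aac", ".flac", ".m4a"}
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv", ".wmv"}


def classify_input(source: str) -> dict:
    """Classify a source as local/url and audio/video (hypothetical helper)."""
    parsed = urlparse(source)
    is_url = parsed.scheme in ("http", "https")
    # For URLs, inspect only the path component so query strings are ignored.
    path = parsed.path if is_url else source
    ext = Path(path).suffix.lower()

    if ext in AUDIO_EXTS:
        media = "audio"
    elif ext in VIDEO_EXTS:
        media = "video"
    else:
        media = "unknown"

    # Read permission can only be checked locally; remote files are left as None.
    readable = None if is_url else os.access(source, os.R_OK)
    return {
        "location": "url" if is_url else "local",
        "media": media,
        "ext": ext,
        "readable": readable,
    }


if __name__ == "__main__":
    print(classify_input("https://example.com/replays/webinar.MP4?token=abc"))
    print(classify_input("meeting_recording.mp3"))
```

A caller might reject sources whose `media` is `"unknown"` or whose `readable` is `False` before handing the file to the transcription step.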

---

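## Features

This Skill provides a broad set of audio/video transcription capabilities:

### 1. Speech recognition and transcription
- **Real-time transcription**: recognizes and transcribes speech as it arrives
- **Batch transcription**: processes multiple audio or video files in one run
- **Segmented transcription**: transcribes by time segment
- **Multi-speaker recognition**: identifies and distinguishes different speakers

### 2. Text output formats
- **Plain text**: outputs the transcript as plain text
- **Timestamped text**: outputs text with timestamps
- **Segmented text**: outputs the transcript paragraph by paragraph
- **Structured text**: outputs the transcript as structured content

### 3. Language processing
- **Multi-language recognition**: automatically detects the language of the audio or video
- **Dialect recognition**: recognizes dialects and accents
- **Domain terminology recognition**: recognizes specialized terms in finance, law, and similar fields
- **Punctuation**: adds punctuation automatically

### 4. Audio processing
- **Noise reduction**: reduces the impact of background noise
- **Volume equalization**: evens out volume across different parts of the recording
- **Speech-rate handling**: detects and handles different speaking rates
- **Silence detection**: detects and marks silent segments

### 5. Video processing
- **Subtitle extraction**: extracts subtitles from the video
- **On-screen text recognition**: recognizes text that appears in video frames
- **Scene-change detection**: detects scene changes in the video
- **Keyframe extraction**: extracts keyframes

### 6. Advanced processing
- **Speaker separation**: separates the speech of different speakers
- **Emotion recognition**: recognizes emotional tone in speech
- **Keyword extraction**: extracts keywords from the transcript
- **Summary generation**: generates a summary of the transcript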

---

## Usage Example

### Sample output
```json
{
  "file_info": {
    "filename": "meeting_recording.mp3",
    "file_size": 51200000,
    "duration": 3600,
    "format": "mp3",
    "language": "zh-CN"
  },
  "transcription": {
    "full_text": "会议转写完整文本内容...",
    "segments": [
      {
        "start_time": 0,
        "end_time": 120,
        "speaker": "speaker_1",
        "text": "大家好，今天我们讨论一下项目进展。",
        "confidence": 0.95
      },
      {
        "start_time": 120,
        "end_time": 240,
        "speaker": "speaker_2",
        "text": "项目目前进展顺利，已完成80%的工作。",
        "confidence": 0.92
      }
    ]
  },
  "speakers": [
    {
      "speaker_id": "speaker_1",
      "name": "张三",
      "total_duration": 1800,
      "segment_count": 15
    },
    {
      "speaker_id": "speaker_2",
      "name": "李四",
      "total_duration": 1200,
      "segment_count": 10
    }
  ],
  "keywords": [
    "项目进展",
    "完成度",
    "下一步计划"
  ],
  "summary": "会议主要讨论了项目进展情况，目前已完成80%的工作，下一步将进行测试和验收。"
}
```
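
To show how the structured result above could be turned into timestamped text, here is a small sketch. It assumes that `start_time` and `end_time` are expressed in seconds (the source does not state the unit, though a `duration` of 3600 suggests it) and that the `speakers` list maps `speaker_id` to a display name. The helper names `format_timestamp` and `render_timestamped` are hypothetical.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a second count as HH:MM:SS (fractions of a second are dropped)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_timestamped(result: Dict) -> List[str]:
    """Turn the segments of a result dict into one timestamped line each."""
    # Map speaker ids to names, falling back to the id when no name is present.
    names = {
        sp["speaker_id"]: sp.get("name", sp["speaker_id"])
        for sp in result.get("speakers", [])
    }
    lines = []
    for seg in result["transcription"]["segments"]:
        speaker = names.get(seg["speaker"], seg["speaker"])
        start = format_timestamp(seg["start_time"])
        end = format_timestamp(seg["end_time"])
        lines.append(f"[{start} - {end}] {speaker}: {seg['text']}")
    return lines


# Trimmed copy of the sample output above, used only for demonstration.
SAMPLE = {
    "transcription": {
        "segments": [
            {"start_time": 0, "end_time": 120, "speaker": "speaker_1",
             "text": "大家好，今天我们讨论一下项目进展。", "confidence": 0.95},
            {"start_time": 120, "end_time": 240, "speaker": "speaker_2",
             "text": "项目目前进展顺利，已完成80%的工作。", "confidence": 0.92},
        ]
    },
    "speakers": [
        {"speaker_id": "speaker_1", "name": "张三"},
        {"speaker_id": "speaker_2", "name": "李四"},
    ],
}

if __name__ == "__main__":
    for line in render_timestamped(SAMPLE):
        print(line)
```

Keeping the rendering separate from the JSON structure means the same result can feed plain-text, timestamped, or segmented output without re-running transcription.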

---

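## Notes and Limitations

### 1. File format requirements
- Common audio and video formats are supported
- File quality affects transcription accuracy
- Very large files may need to be processed in segments

### 2. Transcription accuracy
- Clear audio yields higher transcription accuracy
- Background noise reduces recognition accuracy
- Dialects and accents may affect recognition

### 3. Performance considerations
- Long audio may take a long time to process
- Processing time is proportional to audio duration
- Splitting very long audio into segments is recommended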
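
One way to act on the segmentation advice above is to plan fixed-length time windows before processing. The sketch below only computes window boundaries in seconds; it does not cut any audio. The function name `plan_chunks`, the 600-second window, and the 5-second overlap are all assumptions chosen for illustration, not values specified by this Skill.

```python
from typing import List, Tuple


def plan_chunks(duration: float, chunk_len: float = 600.0,
                overlap: float = 5.0) -> List[Tuple[float, float]]:
    """Split [0, duration] into windows of at most chunk_len seconds.

    Consecutive windows share `overlap` seconds so that a sentence cut at a
    boundary appears whole in at least one window. Defaults are illustrative.
    """
    if chunk_len <= overlap:
        raise ValueError("chunk_len must be greater than overlap")
    chunks = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        chunks.append((start, end))
        if end >= duration:
            break
        # Step back by the overlap so adjacent windows share some audio.
        start = end - overlap
    return chunks


if __name__ == "__main__":
    # A one-hour recording, matching the 3600-second sample in this document.
    for begin, finish in plan_chunks(3600):
        print(f"{begin:>7.1f} -> {finish:>7.1f}")
```

Because processing time scales with duration, windows planned this way can also be transcribed independently and merged afterwards, with the overlap used to reconcile text at the boundaries.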

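### 4. Language support
- Chinese and English are the primary supported languages
- Recognition accuracy for other languages may be lower
- Mixed-language audio may affect recognition

### 5. Usage restrictions
- This Skill does not include audio or video editing
- Transcription results require manual review
- Protected files may not be processable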
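
Since results require manual review, a reviewer may want to start with the segments the recognizer was least sure about. The sketch below assumes each segment carries a `confidence` value between 0 and 1, as in the sample output; the helper name `segments_needing_review` and the 0.93 threshold are hypothetical and would need tuning against real data.

```python
from typing import Dict, List


def segments_needing_review(segments: List[Dict],
                            threshold: float = 0.93) -> List[Dict]:
    """Return segments below the threshold, least confident first.

    Segments with no confidence value are treated as 0.0 so they are
    always surfaced for review.
    """
    flagged = [s for s in segments if s.get("confidence", 0.0) < threshold]
    return sorted(flagged, key=lambda s: s.get("confidence", 0.0))


if __name__ == "__main__":
    # Values copied from the sample output in this document.
    sample_segments = [
        {"start_time": 0, "end_time": 120, "speaker": "speaker_1", "confidence": 0.95},
        {"start_time": 120, "end_time": 240, "speaker": "speaker_2", "confidence": 0.92},
    ]
    for seg in segments_needing_review(sample_segments):
        print(seg["start_time"], seg["end_time"], seg["confidence"])
```

This only prioritizes the review queue; it does not replace a full read-through of the transcript.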

---

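## References
- See the related documents in the references/ directory, including:
  - Audio/video transcription method handbook
  - Speech recognition algorithm notes
  - Multi-speaker recognition guide
  - Performance optimization guide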