augent
Audio intelligence toolkit. Transcribe, search by keyword or meaning, take notes, detect chapters, identify speakers, separate audio, export clips, tag, highlights, visual context, and text-to-speech — all local, all private. 21 MCP tools for audio and video.
Best use case
augent is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using augent should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/augent/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How augent Compares
| Feature / Agent | augent | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Audio intelligence toolkit. Transcribe, search by keyword or meaning, take notes, detect chapters, identify speakers, separate audio, export clips, tag, highlights, visual context, and text-to-speech — all local, all private. 21 MCP tools for audio and video.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Augent — Audio Intelligence for AI Agents
Augent is an MCP server that gives your agent a full set of audio intelligence tools. Transcribe, search, take notes, identify speakers, detect chapters, separate audio, export clips, and generate speech, all fully local and fully private.
## Config
```json
{
  "mcpServers": {
    "augent": {
      "command": "augent-mcp"
    }
  }
}
```
If `augent-mcp` is not in PATH, use the full Python module path:
```json
{
  "mcpServers": {
    "augent": {
      "command": "python3",
      "args": ["-m", "augent.mcp"]
    }
  }
}
```
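If your MCP client config already lists other servers, the `augent` entry has to be merged in rather than pasted over the file. The sketch below shows one way to do that in Python. The helper name `add_augent_server` and the `use_module_path` flag are hypothetical, introduced only for illustration; the two entries it writes are the ones shown above.

```python
import json


def add_augent_server(config: dict, use_module_path: bool = False) -> dict:
    """Return a copy of an MCP client config with an 'augent' entry added.

    Hypothetical helper: it only builds the JSON structure shown above and
    does not know where any particular client stores its config file.
    """
    if use_module_path:
        # Fallback for when `augent-mcp` is not on PATH.
        entry = {"command": "python3", "args": ["-m", "augent.mcp"]}
    else:
        entry = {"command": "augent-mcp"}

    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # keep existing servers
    servers["augent"] = entry
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    existing = {"mcpServers": {"other": {"command": "other-mcp"}}}
    print(json.dumps(add_augent_server(existing), indent=2))
```

The function copies the input instead of mutating it, so the original config can be kept as a backup before the merged version is written to disk.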
## Install
**One-liner (recommended):** Installs augent, FFmpeg, yt-dlp, aria2, and configures MCP automatically.
```bash
curl -fsSL https://augent.app/install.sh | bash
```
**Via uv:**
```bash
uv tool install augent
```
For all features (semantic search, speaker diarization, TTS, source separation):
```bash
uv tool install "augent[all]"
```
**Via pip:**
```bash
pip install "augent[all]"
```
**System dependencies:** FFmpeg is required. Install with `brew install ffmpeg` (macOS) or `apt install ffmpeg` (Linux). For fast audio downloads, also install yt-dlp and aria2.
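Before a first run it can help to confirm that the system dependencies are actually on PATH. This is a minimal sketch using Python's standard `shutil.which`. The helper name `check_tools` and the required/optional grouping are assumptions for illustration, and the executable names `yt-dlp` and `aria2c` are the usual binary names for those projects rather than something Augent reports.

```python
import shutil

REQUIRED = ["ffmpeg"]            # needed for audio processing
OPTIONAL = ["yt-dlp", "aria2c"]  # only needed for fast audio downloads


def check_tools(names, which=shutil.which):
    """Map each executable name to True if it is found on PATH."""
    return {name: which(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_tools(REQUIRED + OPTIONAL).items():
        kind = "required" if name in REQUIRED else "optional"
        status = "found" if found else "missing"
        print(f"{name} ({kind}): {status}")
```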
## Tools
Augent exposes 21 MCP tools:
### Core
| Tool | Description |
|------|-------------|
| `download_audio` | Download audio from video URLs at maximum speed. Supports YouTube, Vimeo, TikTok, Twitter/X, SoundCloud, and 1000+ sites. Uses aria2c multi-connection + concurrent fragments. |
| `transcribe_audio` | Full transcription of any audio file with per-segment timestamps. Returns text, language, duration, and segments. Cached by file hash. |
| `search_audio` | Search audio for keywords. Returns timestamped matches with context snippets. Supports clip export. |
| `deep_search` | Semantic search — find moments by meaning, not just keywords. Uses sentence-transformers embeddings. |
| `search_memory` | Search across ALL stored transcriptions in one query. Keyword or semantic mode. |
| `take_notes` | All-in-one: download audio from URL, transcribe, and save formatted notes. Supports 5 styles: tldr, notes, highlight, eye-candy, quiz. |
| `clip_export` | Export a video clip from any URL for a specific time range. Downloads only the requested segment. |
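To make "timestamped matches with context snippets" concrete, here is a small sketch of keyword search over transcript segments. It is not Augent's implementation. The segment shape (`start`, `end`, `text`) and the function name `search_segments` are assumptions chosen for illustration.

```python
def search_segments(segments, keyword, context=1):
    """Return timestamped matches for `keyword`, each with a context snippet.

    `segments` is assumed to be a list of dicts with `start`, `end`
    (seconds) and `text`. `context` is how many neighbouring segments
    to include on each side of a match.
    """
    needle = keyword.lower()
    hits = []
    for i, seg in enumerate(segments):
        if needle in seg["text"].lower():  # case-insensitive substring match
            lo = max(0, i - context)
            hi = min(len(segments), i + context + 1)
            snippet = " ".join(s["text"] for s in segments[lo:hi])
            hits.append({"start": seg["start"], "end": seg["end"], "snippet": snippet})
    return hits


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 4.0, "text": "Welcome to the show."},
        {"start": 4.0, "end": 9.0, "text": "Today we cover AI regulation."},
        {"start": 9.0, "end": 12.0, "text": "Let's begin."},
    ]
    for hit in search_segments(demo, "ai regulation"):
        print(hit)
```

A semantic search such as `deep_search` would replace the substring test with an embedding similarity score; the timestamp and snippet bookkeeping could stay the same.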
### Analysis
| Tool | Description |
|------|-------------|
| `chapters` | Auto-detect topic chapters with timestamps using embedding similarity. |
| `search_proximity` | Find where two keywords appear near each other (e.g., "startup" within 30 words of "funding"). |
| `identify_speakers` | Speaker diarization — identify who speaks when. No API keys required. |
| `separate_audio` | Isolate vocals from music/noise using Meta's Demucs v4. Feed clean vocals into transcription. |
| `batch_search` | Search multiple audio files in parallel. Ideal for podcast libraries or interview collections. |
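The proximity idea behind `search_proximity` ("startup" within 30 words of "funding") can be sketched in a few lines. This is an illustration of the technique only, not the tool's actual code. The function name `find_near`, the tokenizer regex, and the return shape (pairs of word indices) are all assumptions.

```python
import re


def find_near(text, first, second, max_distance=30):
    """Return (i, j) word-index pairs where `first` and `second`
    occur within `max_distance` words of each other."""
    words = re.findall(r"[\w']+", text.lower())
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [j for j, w in enumerate(words) if w == second.lower()]
    return [
        (i, j)
        for i in first_pos
        for j in second_pos
        if abs(i - j) <= max_distance
    ]


if __name__ == "__main__":
    sample = "The startup raised funding last year."
    print(find_near(sample, "startup", "funding"))
```

Word indices can be mapped back to timestamps if the transcript keeps per-word or per-segment timing.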
### Utilities
| Tool | Description |
|------|-------------|
| `text_to_speech` | Convert text to natural speech using Kokoro TTS. 54 voices, 9 languages. Runs in background. |
| `list_files` | List media files in a directory with size info. |
| `list_memories` | Browse all stored transcriptions by title, duration, and date. |
| `memory_stats` | View memory statistics (file count, total duration). |
| `clear_memory` | Clear the transcription memory to free disk space. |
| `tag` | Add, remove, or list tags on transcriptions. Broad topic categories for organizing memories. |
| `highlights` | Export the best moments from a transcription. Auto mode picks top moments; focused mode finds moments matching a topic. |
| `visual` | Extract visual context from video at moments that matter. Query, auto, manual, and assist modes. Frames saved to Obsidian vault. |
| `rebuild_graph` | Rebuild Obsidian graph view data for all transcriptions. Migrates files, computes wikilinks, generates MOC hubs. |
## Usage Examples
### Take notes from a video
> "Take notes from https://youtube.com/watch?v=xxx"
The agent calls `take_notes` which downloads, transcribes, and returns formatted notes. One tool call does everything.
### Search a podcast for topics
> "Search this podcast for every mention of AI regulation" — provide the file path or URL.
The agent uses `search_audio` for exact keyword matches, or `deep_search` for semantic matches (finds relevant discussion even without exact words).
### Transcribe and identify speakers
> "Transcribe this meeting recording and tell me who said what"
The agent calls `transcribe_audio` then `identify_speakers` to label each segment by speaker.
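Labeling "who said what" amounts to joining two timelines: transcript segments and speaker turns. The sketch below assigns each segment the speaker whose turn overlaps it the most. The data shapes, the speaker label strings, and the function names are assumptions for illustration; the real tool outputs may differ.

```python
def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns):
    """Attach a `speaker` to each segment using the turn with most overlap.

    `segments`: dicts with `start`, `end`, `text` (assumed shape).
    `turns`: dicts with `start`, `end`, `speaker` (assumed shape).
    Segments that overlap no turn are labeled "unknown".
    """
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn in turns:
            shared = overlap_seconds(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


if __name__ == "__main__":
    segments = [
        {"start": 0.0, "end": 2.0, "text": "Hi everyone."},
        {"start": 2.0, "end": 5.0, "text": "Thanks for joining."},
    ]
    turns = [
        {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00"},
        {"start": 2.5, "end": 6.0, "speaker": "SPEAKER_01"},
    ]
    for row in label_segments(segments, turns):
        print(f'{row["speaker"]}: {row["text"]}')
```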
### Search across all transcriptions
> "Search everything I've ever transcribed for mentions of funding"
The agent uses `search_memory` to search across all stored transcriptions without needing a file path.
### Export a clip
> "Clip the part where they talk about pricing"
The agent uses `search_audio` or `deep_search` to find the moment, then `clip_export` to extract just that segment.
### Separate vocals from noisy audio
> "This recording has music in the background, clean it up and transcribe"
The agent calls `separate_audio` to isolate vocals, then `transcribe_audio` on the clean vocals track.
### Generate speech from text
> "Read these notes aloud"
The agent calls `text_to_speech` to generate an MP3 with natural speech. Supports multiple voices and languages.
## Note Styles
When using `take_notes`, the `style` parameter controls formatting:
| Style | Description |
|-------|-------------|
| `tldr` | Shortest possible summary. One screen. Bold key terms. |
| `notes` | Clean sections with nested bullets (default). |
| `highlight` | Notes with callout blocks for key insights and blockquotes with timestamps. |
| `eye-candy` | Maximum visual formatting — callouts, tables, checklists, blockquotes. |
| `quiz` | Multiple-choice questions with answer key. |
## Model Sizes
`tiny` is the default and handles nearly everything. Use a larger model only for heavy accents, poor audio quality, or when you need maximum accuracy.
| Model | Speed | Accuracy |
|-------|-------|----------|
| **tiny** | Fastest | Excellent (default) |
| base | Fast | Excellent |
| small | Medium | Superior |
| medium | Slow | Outstanding |
| large | Slowest | Maximum |
## Memory
Transcriptions are keyed by file content hash plus model size, so repeat searches on the same file with the same model return instantly. Memory persists at `~/.augent/memory/transcriptions.db`. Source URLs from any platform are stored permanently, keyed by file hash. Use `memory_stats` to check usage and `clear_memory` to free space.
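Keying on content hash plus model size means a renamed or moved file still hits the cache, while the same file transcribed with a different model does not. The sketch below illustrates that idea. SHA-256, the `hash:model` key format, and both function names are assumptions; the source does not say which hash or key layout Augent uses.

```python
import hashlib


def cache_key_for_bytes(data: bytes, model_size: str = "tiny") -> str:
    """Build a cache key from file content and model size.

    SHA-256 and the 'hash:model' layout are illustrative assumptions.
    """
    return f"{hashlib.sha256(data).hexdigest()}:{model_size}"


def cache_key_for_file(path: str, model_size: str = "tiny", chunk_size: int = 1 << 20) -> str:
    """Same key, computed by streaming the file in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"{digest.hexdigest()}:{model_size}"


if __name__ == "__main__":
    audio = b"fake audio bytes"
    print(cache_key_for_bytes(audio, "tiny"))
    print(cache_key_for_bytes(audio, "small"))  # different model, different key
```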
## Requirements
- Python 3.10+
- FFmpeg (audio processing)
- yt-dlp + aria2 (optional, for audio downloads)
## Links
- [GitHub](https://github.com/AugentDevs/Augent)
- [Documentation](https://docs.augent.app)
- [Install Script](https://augent.app/install.sh)