gemini-stt

Transcribe audio files using Google's Gemini API or Vertex AI

3,891 stars

byopenclaw

View on GitHub Installation ↓

Best use case

gemini-stt is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Transcribe audio files using Google's Gemini API or Vertex AI

Teams using gemini-stt should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/gemini-stt/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/araa47/gemini-stt/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/gemini-stt/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How gemini-stt Compares

Feature / Agent	gemini-stt	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Transcribe audio files using Google's Gemini API or Vertex AI

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

AI Agents for Marketing

Discover AI agents for marketing workflows, from SEO and content production to campaign research, outreach, and analytics.

AI Agents for Startups

Explore AI agent skills for startup validation, product research, growth experiments, documentation, and fast execution with small teams.

AI Agents for Coding

Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.

SKILL.md Source

# Gemini Speech-to-Text Skill

Transcribe audio files using Google's Gemini API or Vertex AI. Default model is `gemini-2.0-flash-lite` for fastest transcription.

## Authentication (choose one)

### Option 1: Vertex AI with Application Default Credentials (Recommended)

```bash
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
```

The script will automatically detect and use ADC when available.

### Option 2: Direct Gemini API Key

Set `GEMINI_API_KEY` in environment (e.g., `~/.env` or `~/.clawdbot/.env`)

## Requirements

- Python 3.10+ (no external dependencies)
- Either GEMINI_API_KEY or gcloud CLI with ADC configured

## Supported Formats

- `.ogg` / `.opus` (Telegram voice messages)
- `.mp3`
- `.wav`
- `.m4a`

## Usage

```bash
# Auto-detect auth (tries ADC first, then GEMINI_API_KEY)
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg

# Force Vertex AI
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex

# With a specific model
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --model gemini-2.5-pro

# Vertex AI with specific project and region
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex --project my-project --region us-central1

# With Clawdbot media
python ~/.claude/skills/gemini-stt/transcribe.py ~/.clawdbot/media/inbound/voice-message.ogg
```

## Options

| Option | Description |
|--------|-------------|
| `<audio_file>` | Path to the audio file (required) |
| `--model`, `-m` | Gemini model to use (default: `gemini-2.0-flash-lite`) |
| `--vertex`, `-v` | Force use of Vertex AI with ADC |
| `--project`, `-p` | GCP project ID (for Vertex, defaults to gcloud config) |
| `--region`, `-r` | GCP region (for Vertex, default: `us-central1`) |

## Supported Models

Any Gemini model that supports audio input can be used. Recommended models:

| Model | Notes |
|-------|-------|
| `gemini-2.0-flash-lite` | **Default.** Fastest transcription speed. |
| `gemini-2.0-flash` | Fast and cost-effective. |
| `gemini-2.5-flash-lite` | Lightweight 2.5 model. |
| `gemini-2.5-flash` | Balanced speed and quality. |
| `gemini-2.5-pro` | Higher quality, slower. |
| `gemini-3-flash-preview` | Latest flash model. |
| `gemini-3-pro-preview` | Latest pro model, best quality. |

See [Gemini API Models](https://ai.google.dev/gemini-api/docs/models) for the latest list.

## How It Works

1. Reads the audio file and base64 encodes it
2. Auto-detects authentication:
   - If ADC is available (gcloud), uses Vertex AI endpoint
   - Otherwise, uses GEMINI_API_KEY with direct Gemini API
3. Sends to the selected Gemini model with transcription prompt
4. Returns the transcribed text

## Example Integration

For Clawdbot voice message handling:

```bash
# Transcribe incoming voice message
TRANSCRIPT=$(python ~/.claude/skills/gemini-stt/transcribe.py "$AUDIO_PATH")
echo "User said: $TRANSCRIPT"
```

## Error Handling

The script exits with code 1 and prints to stderr on:
- No authentication available (neither ADC nor GEMINI_API_KEY)
- File not found
- API errors
- Missing GCP project (when using Vertex)

## Notes

- Uses Gemini 2.0 Flash Lite by default for fastest transcription
- No external Python dependencies (uses stdlib only)
- Automatically detects MIME type from file extension
- Prefers Vertex AI with ADC when available (no API key management needed)

Related Skills

enable-chrome-gemini

3891

from openclaw/skills

Set up or repair Gemini in Chrome (Glic) on Windows, macOS, or Linux when enabling it for the first time outside the US or when the sidebar, floating panel, Alt+G shortcut, or top-bar entry disappears. Back up and patch Chrome Local State, restore region/eligibility fields, and check the required Glic flags and Chrome language.

PDF OCR using Gemini LLM

3891

from openclaw/skills

Extract text from PDFs using Google Gemini OCR. Use when extracting text from PDFs, performing OCR on scanned documents, or processing image-based PDFs.

gemini-deep-research

3891

from openclaw/skills

Perform complex, long-running research tasks using Gemini Deep Research Agent. Use when asked to research topics requiring multi-source synthesis, competitive analysis, market research, or comprehensive technical investigations that benefit from systematic web search and analysis.

gemini-computer-use

3891

from openclaw/skills

Build and run Gemini 2.5 Computer Use browser-control agents with Playwright. Use when a user wants to automate web browser tasks via the Gemini Computer Use model, needs an agent loop (screenshot → function_call → action → function_response), or asks to integrate safety confirmation for risky UI actions.

gemini-voice-assistant

3891

from openclaw/skills

Voice-to-voice AI assistant using Gemini Live API. Speak to the AI and get spoken responses. Use when you want to have natural voice conversations with an AI assistant powered by Google's Gemini models.

gemini-assistant

3891

from openclaw/skills

General-purpose AI assistant using Gemini API with voice and text support. Use when you need a smart AI assistant that can answer questions, have conversations, or help with general tasks using Google's Gemini models with audio/text capabilities.

gemini-video-analyzer

3891

from openclaw/skills

Native video analysis using Google Gemini API. Upload and analyze video files — describe scenes, extract text/UI, answer questions about content, transcribe speech, identify objects and actions. Use when: (1) User sends a video file and wants it analyzed, (2) Video summarization or description needed, (3) Extracting text, UI elements, or information from screen recordings, (4) Answering questions about video content, (5) Comparing multiple videos, (6) Analyzing tutorials, demos, or walkthroughs.

gemini-nano-banana

3891

from openclaw/skills

Auto-generated skill for gemini tools via OneKey Gateway.

gemini Models for vwu.ai

3891

from openclaw/skills

vwu.ai 平台上的 gemini 模型调用技能。

---

3891

from openclaw/skills

name: article-factory-wechat

Content & Documentation

humanizer

3891

from openclaw/skills

Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases.

Content & Documentation

find-skills

3891

from openclaw/skills

Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.

General Utilities

gemini-stt

Best use case

When to use this skill

When not to use this skill

Installation

How gemini-stt Compares

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

Related Guides

AI Agents for Marketing

AI Agents for Startups

AI Agents for Coding

SKILL.md Source

Related Skills

enable-chrome-gemini

PDF OCR using Gemini LLM

gemini-deep-research

gemini-computer-use

gemini-voice-assistant

gemini-assistant

gemini-video-analyzer

gemini-nano-banana

gemini Models for vwu.ai

﻿---

humanizer

find-skills

---