gemini-stt

Transcribe audio files using Google's Gemini API or Vertex AI

by Demerzels-lab

Best use case

gemini-stt is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using gemini-stt should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/gemini-stt/SKILL.md --create-dirs "https://raw.githubusercontent.com/Demerzels-lab/elsamultiskillagent/main/public/skills/araa47/gemini-stt/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/gemini-stt/SKILL.md inside your project.
3. Restart your AI agent so that it auto-discovers the skill.

How gemini-stt Compares

Feature / Agent	gemini-stt	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Transcribe audio files using Google's Gemini API or Vertex AI

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Gemini Speech-to-Text Skill

Transcribe audio files using Google's Gemini API or Vertex AI. The default model is `gemini-2.0-flash-lite`, chosen for transcription speed.

## Authentication (choose one)

### Option 1: Vertex AI with Application Default Credentials (Recommended)

```bash
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
```

The script will automatically detect and use ADC when available.

### Option 2: Direct Gemini API Key

Set `GEMINI_API_KEY` in the environment (e.g., in `~/.env` or `~/.clawdbot/.env`).

## Requirements

- Python 3.10+ (no external dependencies)
- Either `GEMINI_API_KEY` or the gcloud CLI with ADC configured

## Supported Formats

- `.ogg` / `.opus` (Telegram voice messages)
- `.mp3`
- `.wav`
- `.m4a`
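
The Notes section says the MIME type is detected from the file extension. A minimal sketch of such a lookup follows; the extension list comes from the formats above, while the MIME strings and the name `mime_type_for` are assumptions that may differ from what the script sends.

```python
from pathlib import Path

# Extensions from the list above; the MIME values are assumed, not confirmed.
AUDIO_MIME_TYPES = {
    ".ogg": "audio/ogg",
    ".opus": "audio/ogg",
    ".mp3": "audio/mpeg",
    ".wav": "audio/wav",
    ".m4a": "audio/mp4",
}


def mime_type_for(path: str) -> str:
    """Map a file name to a MIME type using its (case-insensitive) extension."""
    suffix = Path(path).suffix.lower()
    try:
        return AUDIO_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"unsupported audio format: {suffix or path!r}") from None
```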

## Usage

```bash
# Auto-detect auth (tries ADC first, then GEMINI_API_KEY)
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg

# Force Vertex AI
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex

# With a specific model
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --model gemini-2.5-pro

# Vertex AI with specific project and region
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex --project my-project --region us-central1

# With Clawdbot media
python ~/.claude/skills/gemini-stt/transcribe.py ~/.clawdbot/media/inbound/voice-message.ogg
```

## Options

| Option | Description |
|--------|-------------|
| `<audio_file>` | Path to the audio file (required) |
| `--model`, `-m` | Gemini model to use (default: `gemini-2.0-flash-lite`) |
| `--vertex`, `-v` | Force use of Vertex AI with ADC |
| `--project`, `-p` | GCP project ID (for Vertex, defaults to gcloud config) |
| `--region`, `-r` | GCP region (for Vertex, default: `us-central1`) |
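
For readers who want to see how these options fit together, here is a hedged `argparse` sketch that mirrors the table. It is a reconstruction from the documented flags and defaults, not the script's source; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented options and defaults."""
    parser = argparse.ArgumentParser(
        prog="transcribe.py",
        description="Transcribe an audio file with Gemini.",
    )
    parser.add_argument("audio_file", help="Path to the audio file")
    parser.add_argument("--model", "-m", default="gemini-2.0-flash-lite",
                        help="Gemini model to use")
    parser.add_argument("--vertex", "-v", action="store_true",
                        help="Force use of Vertex AI with ADC")
    parser.add_argument("--project", "-p", default=None,
                        help="GCP project ID (Vertex only)")
    parser.add_argument("--region", "-r", default="us-central1",
                        help="GCP region (Vertex only)")
    return parser
```

Leaving `--project` as `None` by default lets the caller fall back to the gcloud configuration, as the table describes.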

## Supported Models

Any Gemini model that supports audio input can be used. Recommended models:

| Model | Notes |
|-------|-------|
| `gemini-2.0-flash-lite` | **Default.** Fastest transcription speed. |
| `gemini-2.0-flash` | Fast and cost-effective. |
| `gemini-2.5-flash-lite` | Lightweight 2.5 model. |
| `gemini-2.5-flash` | Balanced speed and quality. |
| `gemini-2.5-pro` | Higher quality, slower. |
| `gemini-3-flash-preview` | Latest flash model. |
| `gemini-3-pro-preview` | Latest pro model, best quality. |

See [Gemini API Models](https://ai.google.dev/gemini-api/docs/models) for the latest list.

## How It Works

1. Reads the audio file and base64-encodes it
2. Auto-detects authentication:
   - If ADC is available (gcloud), uses the Vertex AI endpoint
   - Otherwise, uses `GEMINI_API_KEY` with the direct Gemini API
3. Sends the audio to the selected Gemini model with a transcription prompt
4. Returns the transcribed text

## Example Integration

For Clawdbot voice message handling:

```bash
# Transcribe incoming voice message
TRANSCRIPT=$(python ~/.claude/skills/gemini-stt/transcribe.py "$AUDIO_PATH")
echo "User said: $TRANSCRIPT"
```
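
The same integration can be written in Python when the caller is not a shell script. This is a sketch under the assumption that `transcribe.py` prints only the transcript to stdout and signals failure with a non-zero exit code, as the Error Handling section describes; the `transcribe` wrapper and its `command` parameter are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Install location used in the usage examples of this document.
SCRIPT = Path("~/.claude/skills/gemini-stt/transcribe.py").expanduser()


def transcribe(audio_path: str, command=None) -> str:
    """Run the transcription script and return its stdout as the transcript.

    `command` overrides the default interpreter-plus-script prefix,
    which is mainly useful for testing.
    """
    cmd = list(command) if command is not None else [sys.executable, str(SCRIPT)]
    result = subprocess.run(cmd + [audio_path], capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"transcription failed: {result.stderr.strip()}")
    return result.stdout.strip()
```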

## Error Handling

The script exits with code 1 and prints to stderr on:
- No authentication available (neither ADC nor GEMINI_API_KEY)
- File not found
- API errors
- Missing GCP project (when using Vertex)
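
The exit-code convention above could be implemented roughly as follows. This is an illustrative sketch, not the script's code: `run` and `transcribe_fn` are hypothetical names, and the error messages are made up for the example.

```python
import sys
from pathlib import Path


def run(audio_file: str, transcribe_fn) -> int:
    """Return a process exit code: 0 on success, 1 on any reported failure."""
    path = Path(audio_file)
    if not path.is_file():
        print(f"Error: file not found: {path}", file=sys.stderr)
        return 1
    try:
        # transcribe_fn stands in for the auth and API steps, which may raise.
        text = transcribe_fn(path)
    except Exception as exc:
        print(f"Error: {exc}", file=sys.stderr)
        return 1
    print(text)
    return 0
```

A command-line entry point would then end with `sys.exit(run(...))`, so that shell callers can branch on the exit status.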

## Notes

- Uses Gemini 2.0 Flash Lite by default for fastest transcription
- No external Python dependencies (uses stdlib only)
- Automatically detects MIME type from file extension
- Prefers Vertex AI with ADC when available (no API key management needed)
