google-gemini-media

Use the Gemini API (Nano Banana image generation, Veo video, Gemini TTS speech and audio understanding) to deliver end-to-end multimodal media workflows and code templates for "generation + understanding".

533 stars

Best use case

google-gemini-media is best used when you need a repeatable AI agent workflow instead of a one-off prompt.


Teams using google-gemini-media should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/google-gemini-media/SKILL.md --create-dirs "https://raw.githubusercontent.com/sundial-org/awesome-openclaw-skills/main/skills/google-gemini-media/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/google-gemini-media/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How google-gemini-media Compares

Feature / Agent           google-gemini-media   Standard Approach
Platform Support          Not specified         Limited / Varies
Context Awareness         High                  Baseline
Installation Complexity   Unknown               N/A

Frequently Asked Questions

What does this skill do?

It packages Gemini API media capabilities (Nano Banana image generation, Veo video generation, Gemini TTS, and image/video/audio understanding) into reusable "generation + understanding" workflows and code templates.

Where can I find the source code?

The skill lives in the sundial-org/awesome-openclaw-skills repository on GitHub; the raw URL in the installation command above points directly at its SKILL.md.

SKILL.md Source

# Gemini Multimodal Media (Image/Video/Speech) Skill

## 1. Goals and scope

This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates:

- Image generation (Nano Banana: text-to-image, image editing, multi-turn iteration)
- Image understanding (caption/VQA/classification/comparison, multi-image prompts; supports inline and Files API)
- Video generation (Veo 3.1: text-to-video, aspect ratio/resolution control, reference-image guidance, first/last frames, video extension, native audio)
- Video understanding (upload/inline/YouTube URL; summaries, Q&A, timestamped evidence)
- Speech generation (Gemini native TTS: single-speaker and multi-speaker; controllable style/accent/pace/tone)
- Audio understanding (upload/inline; description, transcription, time-range transcription, token counting)

> Convention: This Skill follows the official Google Gen AI SDK (Node.js/REST) as its primary reference; currently only Node.js/REST examples are provided. If your project already wraps another language or framework, map this Skill's request structure, model selection, and I/O spec onto your wrapper layer.

---

## 2. Quick routing (decide which capability to use)

1) **Do you need to produce images?**
- Need to generate images from scratch or edit based on an image -> use **Nano Banana image generation** (see Section 5)

2) **Do you need to understand images?**
- Need recognition, description, Q&A, comparison, or info extraction -> use **Image understanding** (see Section 6)

3) **Do you need to produce video?**
- Need to generate an 8-second video (optionally with native audio) -> use **Veo 3.1 video generation** (see Section 7)

4) **Do you need to understand video?**
- Need summaries/Q&A/segment extraction with timestamps -> use **Video understanding** (see Section 8)

5) **Do you need to read text aloud?**
- Need controllable narration, podcast/audiobook style, etc. -> use **Speech generation (TTS)** (see Section 9)

6) **Do you need to understand audio?**
- Need audio descriptions, transcription, time-range transcription, token counting -> use **Audio understanding** (see Section 10)

---

## 3. Unified engineering constraints and I/O spec (must read)

### 3.0 Prerequisites (dependencies and tools)

- Node.js 18+ (match your project version)
- Install SDK (example):
```bash
npm install @google/genai
```
- The REST examples only need `curl`; optionally install `jq` if you want to extract Base64 image data from JSON responses.

### 3.1 Authentication and environment variables

- Put your API key in `GEMINI_API_KEY`
- REST requests use `x-goog-api-key: $GEMINI_API_KEY`

### 3.2 Two file input modes: Inline vs Files API

**Inline (embedded bytes/Base64)**
- Pros: shorter call chain, good for small files.
- Key constraint: total request size (text prompt + system instructions + embedded bytes) typically has a ~20MB ceiling.

**Files API (upload then reference)**
- Pros: good for large files, reusing the same file, or multi-turn conversations.
- Typical flow:
  1. `files.upload(...)` (SDK) or `POST /upload/v1beta/files` (REST resumable)
  2. Use `file_data` / `file_uri` in `generateContent`

> Engineering suggestion: implement an `ensure_file_uri()` helper that automatically routes through the Files API when a file will be reused or exceeds a size threshold (for example, warn above 10-15MB).
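A minimal sketch of such a routing helper (function name and thresholds are assumptions for illustration, not part of the SDK):

```javascript
// Hypothetical routing helper: decide whether to send bytes inline or go
// through the Files API. Thresholds reflect the ~20MB total-request ceiling
// described above, with a conservative margin for prompt text.
const INLINE_LIMIT_BYTES = 20 * 1024 * 1024;
const WARN_THRESHOLD_BYTES = 15 * 1024 * 1024;

function chooseInputMode(fileSizeBytes, { willReuse = false } = {}) {
  if (willReuse || fileSizeBytes >= INLINE_LIMIT_BYTES) return "files_api";
  if (fileSizeBytes >= WARN_THRESHOLD_BYTES) return "files_api"; // play safe near the ceiling
  return "inline";
}

console.log(chooseInputMode(1 * 1024 * 1024));           // small one-off file -> "inline"
console.log(chooseInputMode(16 * 1024 * 1024));          // near the ceiling   -> "files_api"
console.log(chooseInputMode(1024, { willReuse: true })); // reused file        -> "files_api"
```

A real `ensure_file_uri()` would call `ai.files.upload(...)` in the `"files_api"` branch and return the resulting URI.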

### 3.3 Unified handling of binary media outputs

- **Images**: usually returned as `inline_data` (Base64) in response parts; in Node.js read `part.inlineData.data` and decode the Base64 to PNG/JPG (the Python SDK also offers `part.as_image()`).
- **Speech (TTS)**: usually returns **PCM** bytes (Base64); save as `.pcm` or wrap into `.wav` (commonly 24kHz, 16-bit, mono).
- **Video (Veo)**: long-running async task; poll the operation; download the file (or use the returned URI).
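For the TTS case above, raw PCM is not playable in most players; a minimal sketch of wrapping it in a 44-byte WAV header, assuming the common 24kHz / 16-bit / mono output format:

```javascript
// Wrap raw PCM bytes in a standard RIFF/WAVE header so the file opens in
// ordinary audio players. Defaults assume 24 kHz, 16-bit, mono.
function pcmToWav(pcm, sampleRate = 24000, bitsPerSample = 16, channels = 1) {
  const byteRate = (sampleRate * channels * bitsPerSample) / 8;
  const blockAlign = (channels * bitsPerSample) / 8;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4); // RIFF chunk size
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size (PCM)
  header.writeUInt16LE(1, 20);              // audio format: 1 = linear PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);     // data chunk size
  return Buffer.concat([header, pcm]);
}

const wav = pcmToWav(Buffer.alloc(48000)); // one second of silence at 24 kHz
console.log(wav.length); // 48044
```

Feed it the decoded Base64 from the TTS response and write the result to `out.wav`.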

---

## 4. Model selection matrix (choose by scenario)

> Important: model names, versions, limits, and quotas can change over time. Verify against official docs before use. Last updated: 2026-01-22.

### 4.1 Image generation (Nano Banana)
- **gemini-2.5-flash-image**: optimized for speed/throughput; good for frequent, low-latency generation/editing.
- **gemini-3-pro-image-preview**: stronger instruction following and high-fidelity text rendering; better for professional assets and complex edits.

### 4.2 General image/video/audio understanding
- Docs use `gemini-3-flash-preview` for image, video, and audio understanding (choose stronger models as needed for quality/cost).

### 4.3 Video generation (Veo)
- Example model: `veo-3.1-generate-preview` (generates 8-second video and can natively generate audio).

### 4.4 Speech generation (TTS)
- Example model: `gemini-2.5-flash-preview-tts` (native TTS, currently in preview).

---

## 5. Image generation (Nano Banana)

### 5.1 Text-to-Image

**SDK (Node.js) minimal template**
```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents:
    "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
});

const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.text) console.log(part.text);
  if (part.inlineData?.data) {
    fs.writeFileSync("out.png", Buffer.from(part.inlineData.data, "base64"));
  }
}
```

**REST (with imageConfig) minimal template**
```bash
curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents":[{"parts":[{"text":"Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme"}]}],
    "generationConfig": {"imageConfig": {"aspectRatio":"16:9"}}
  }'
```

**REST image parsing (Base64 decode)**
```bash
curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents":[{"parts":[{"text":"A minimal studio product shot of a nano banana"}]}]}' \
  | jq -r '.candidates[0].content.parts[] | select(.inlineData) | .inlineData.data' \
  | base64 --decode > out.png

# macOS can use: base64 -D > out.png
```

### 5.2 Text-and-Image-to-Image

Use case: given an image, **add/remove/modify elements**, change style, color grading, etc.

**SDK (Node.js) minimal template**
```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const prompt =
  "Add a nano banana on the table, keep lighting consistent, cinematic tone.";
const imageBase64 = fs.readFileSync("input.png").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents: [
    { text: prompt },
    { inlineData: { mimeType: "image/png", data: imageBase64 } },
  ],
});

const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.inlineData?.data) {
    fs.writeFileSync("edited.png", Buffer.from(part.inlineData.data, "base64"));
  }
}
```

### 5.3 Multi-turn image iteration (Multi-turn editing)

Best practice: use chat for continuous iteration (for example: generate first, then "only edit a specific region/element", then "make variants in the same style").  
To output mixed "text + image" results, set `responseModalities` (`response_modalities` in the Python SDK) to `["TEXT", "IMAGE"]`.
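A minimal sketch of the chat-based flow (the live calls are commented out because they need an API key; `ai.chats.create` / `chat.sendMessage` follow the `@google/genai` SDK surface, so verify against your SDK version):

```javascript
// Build the config for an image-editing chat: both TEXT and IMAGE must be
// requested so the model can return commentary alongside each revision.
function imageChatConfig(extra = {}) {
  return { responseModalities: ["TEXT", "IMAGE"], ...extra };
}

// Usage sketch (requires GEMINI_API_KEY and a live connection):
// const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
// const chat = ai.chats.create({ model: "gemini-2.5-flash-image", config: imageChatConfig() });
// await chat.sendMessage({ message: "Generate a studio shot of a nano banana." });
// await chat.sendMessage({ message: "Keep everything, but switch the backdrop to slate gray." });

console.log(imageChatConfig().responseModalities.join("+")); // TEXT+IMAGE
```

Because the chat carries history, each follow-up edit is applied to the previously generated image rather than starting from scratch.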

### 5.4 ImageConfig

You can set in `generationConfig.imageConfig` or the SDK config:
- `aspectRatio`: e.g. `16:9`, `1:1`.
- `imageSize`: e.g. `2K`, `4K` (higher resolution is usually slower/more expensive and model support can vary).

---

## 6. Image understanding

### 6.1 Two ways to provide input images

- **Inline image data**: suitable for small files (total request size < 20MB).
- **Files API upload**: better for large files or reuse across multiple requests.

### 6.2 Inline images (Node.js) minimal template
```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const imageBase64 = fs.readFileSync("image.jpg").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: [
    { inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
    { text: "Caption this image, and list any visible brands." },
  ],
});

console.log(response.text);
```

### 6.3 Upload and reference with Files API (Node.js) minimal template
```js
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "image.jpg" });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    "Caption this image.",
  ]),
});

console.log(response.text);
```

### 6.4 Multi-image prompts

Append multiple images as multiple `Part` entries in the same `contents`; you can mix uploaded references and inline bytes.
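A sketch of assembling such a mixed `contents` array (the helper name and the `files/abc123` URI are illustrative; part field names follow the JS SDK's `fileData`/`inlineData` convention):

```javascript
// Build one contents array that mixes a Files API reference, inline bytes,
// and the question text, in the order they should be read by the model.
function multiImageContents(fileUri, fileMimeType, inlineBase64, question) {
  return [
    { fileData: { fileUri, mimeType: fileMimeType } },
    { inlineData: { mimeType: "image/png", data: inlineBase64 } },
    { text: question },
  ];
}

const parts = multiImageContents(
  "files/abc123",        // hypothetical URI returned by ai.files.upload
  "image/jpeg",
  "aGVsbG8=",            // placeholder Base64 payload
  "Compare these two images and list the differences."
);
console.log(parts.length); // 3
```

Pass the array as `contents` to `generateContent`; add more image parts in the same way for three or more images.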

---

## 7. Video generation (Veo 3.1)

### 7.1 Core features (must know)
- Generates **8-second** high-fidelity video at 720p / 1080p / 4k and supports native audio generation (dialogue, ambience, SFX).
- Supports:
  - Aspect ratio (16:9 / 9:16)
  - Video extension (extend a generated video; typically limited to 720p)
  - First/last frame control (generate the motion between a supplied first and last frame)
  - Up to 3 reference images (image-based direction)

### 7.2 SDK (Node.js) minimal template: async polling + download
```js
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const prompt =
  "A cinematic shot of a cat astronaut walking on the moon. Include subtle wind ambience.";
let operation = await ai.models.generateVideos({
  model: "veo-3.1-generate-preview",
  prompt,
  config: { resolution: "1080p" },
});

while (!operation.done) {
  await new Promise((resolve) => setTimeout(resolve, 10_000));
  operation = await ai.operations.getVideosOperation({ operation });
}

const video = operation.response?.generatedVideos?.[0]?.video;
if (!video) throw new Error("No video returned");
await ai.files.download({ file: video, downloadPath: "out.mp4" });
```

### 7.3 REST minimal template: predictLongRunning + poll + download

Key point: Veo REST uses `:predictLongRunning` to return an operation name, then poll `GET /v1beta/{operation_name}`; once done, download from the video URI in the response.
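The poll step can be sketched with an injectable fetcher so the loop logic is testable without network access (swap `fetchJson` for a real `fetch(url, { headers: { "x-goog-api-key": ... } }).then(r => r.json())` in production):

```javascript
// Poll GET /v1beta/{operationName} until the long-running operation reports
// done, with a bounded number of tries. fetchJson is injected so the loop
// can be exercised offline.
async function pollOperation(fetchJson, operationName, { intervalMs = 10_000, maxTries = 30 } = {}) {
  for (let i = 0; i < maxTries; i++) {
    const op = await fetchJson(
      `https://generativelanguage.googleapis.com/v1beta/${operationName}`
    );
    if (op.done) return op; // response contains the video URI to download
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`operation ${operationName} did not finish after ${maxTries} polls`);
}
```

Once `done` is true, read the video URI from the operation response and download it promptly (see the retention caveat in 7.5).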

### 7.4 Common controls (recommend a unified wrapper)

- `aspectRatio`: `"16:9"` or `"9:16"`
- `resolution`: `"720p" | "1080p" | "4k"` (higher resolutions are usually slower/more expensive)
- When writing prompts: put dialogue in quotes; explicitly call out SFX and ambience; use cinematography language (camera position, movement, composition, lens effects, mood).
- Negative constraints: if the API supports a negative prompt field, use it; otherwise list elements you do not want to see.

### 7.5 Important limits (engineering fallback needed)

- Latency can vary from seconds to minutes; implement timeouts and retries.
- Generated videos are only retained on the server for a limited time (download promptly).
- Outputs include a SynthID watermark.

**Polling fallback (with timeout/backoff) pseudocode**
```js
const deadline = Date.now() + 300_000; // 5 min
let sleepMs = 2000;
while (!operation.done && Date.now() < deadline) {
  await new Promise((resolve) => setTimeout(resolve, sleepMs));
  sleepMs = Math.min(Math.floor(sleepMs * 1.5), 15_000);
  operation = await ai.operations.getVideosOperation({ operation });
}
if (!operation.done) throw new Error("video generation timed out");
```

---

## 8. Video understanding

### 8.1 Video input options
- **Files API upload**: recommended when file > 100MB, video length > ~1 minute, or you need reuse.
- **Inline video data**: for smaller files.
- **Direct YouTube URL**: can analyze public videos.

### 8.2 Files API (Node.js) minimal template
```js
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "sample.mp4" });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    "Summarize this video. Provide timestamps for key events.",
  ]),
});

console.log(response.text);
```

### 8.3 Timestamp prompting strategy
- Ask for segmented bullets with "(mm:ss)" timestamps.
- Require "evidence with specific time ranges" and include downstream structured extraction (JSON) in the same prompt if needed.
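A downstream sketch for the strategy above: pull "(mm:ss)" markers out of the model's answer and convert them to seconds for clipping or indexing (assumes the prompt asked for that exact format):

```javascript
// Extract every "(mm:ss)" timestamp from a model answer and convert to
// seconds. Returns an empty array if no markers are found.
function extractTimestamps(answer) {
  const seconds = [];
  for (const match of answer.matchAll(/\((\d{1,2}):(\d{2})\)/g)) {
    seconds.push(Number(match[1]) * 60 + Number(match[2]));
  }
  return seconds;
}

console.log(extractTimestamps("Kickoff (00:05); first goal (01:30).")); // [ 5, 90 ]
```

If you asked the model for structured JSON instead, parse that directly and keep this as a fallback for free-text answers.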

---

## 9. Speech generation (Text-to-Speech, TTS)

### 9.1 Positioning
- Native TTS: for "precise reading + controllable style" (podcasts, audiobooks, ad voiceover, etc.).
- Distinguish it from the Live API: the Live API targets interactive, unstructured audio/multimodal conversation, while TTS focuses on controlled narration.

### 9.2 Single-speaker TTS (Node.js) minimal template
```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-tts",
  contents: [{ parts: [{ text: "Say cheerfully: Have a wonderful day!" }] }],
  config: {
    responseModalities: ["AUDIO"],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: "Kore" },
      },
    },
  },
});

const data =
  response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data ?? "";
if (!data) throw new Error("No audio returned");
fs.writeFileSync("out.pcm", Buffer.from(data, "base64"));
```

### 9.3 Multi-speaker TTS (max 2 speakers)
Requirements:
- Use `multiSpeakerVoiceConfig`
- Each speaker name must match the dialogue labels in the prompt (e.g., Joe/Jane).
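A sketch of building that config (the helper is illustrative; the `multiSpeakerVoiceConfig` / `speakerVoiceConfigs` field names follow the JS SDK, and the voice names are examples from the prebuilt list):

```javascript
// Build a multi-speaker TTS generation config. Each speaker's name must
// match the dialogue labels in the prompt text ("Joe: ...", "Jane: ...").
function multiSpeakerTtsConfig(speakers) {
  return {
    responseModalities: ["AUDIO"],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: speakers.map(({ name, voice }) => ({
          speaker: name,
          voiceConfig: { prebuiltVoiceConfig: { voiceName: voice } },
        })),
      },
    },
  };
}

const config = multiSpeakerTtsConfig([
  { name: "Joe", voice: "Kore" },
  { name: "Jane", voice: "Puck" },
]);
console.log(config.speechConfig.multiSpeakerVoiceConfig.speakerVoiceConfigs.length); // 2
```

Pass the result as `config` to `generateContent` alongside a prompt whose lines are labeled `Joe:` and `Jane:`.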

### 9.4 Voice options and language
- 30 prebuilt voices are available (for example Zephyr, Puck, Charon, Kore); select one via `voiceName` (`voice_name` in the Python SDK).
- The model can auto-detect input language and supports 24 languages (see docs for the list).

### 9.5 "Director notes" (strongly recommended for high-quality voice)
Provide controllable directions for style, pace, accent, etc., but avoid over-constraining.

---

## 10. Audio understanding

### 10.1 Typical tasks
- Describe audio content (including non-speech like birds, alarms, etc.)
- Generate transcripts
- Transcribe specific time ranges
- Count tokens (for cost estimates/segmentation)

### 10.2 Files API (Node.js) minimal template
```js
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "sample.mp3" });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    "Describe this audio clip.",
    createPartFromUri(uploaded.uri, uploaded.mimeType),
  ]),
});

console.log(response.text);
```

### 10.3 Key limits and engineering tips
- Supports common formats: WAV/MP3/AIFF/AAC/OGG/FLAC.
- Audio tokenization: about 32 tokens/second (about 1920 tokens per minute; values may change).
- Total audio length per prompt is capped at 9.5 hours; multi-channel audio is downmixed; audio is resampled (see docs for exact parameters).
- If total request size exceeds 20MB, you must use the Files API.
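A back-of-envelope estimator based on the ~32 tokens/second figure above (the constant is approximate and may change; for authoritative numbers call the countTokens endpoint before generating):

```javascript
// Rough audio token estimate for cost planning and deciding whether to
// segment long recordings. Not a substitute for countTokens.
const AUDIO_TOKENS_PER_SECOND = 32; // approximate figure from the docs

function estimateAudioTokens(durationSeconds) {
  return Math.ceil(durationSeconds * AUDIO_TOKENS_PER_SECOND);
}

console.log(estimateAudioTokens(60));      // 1920 tokens for one minute
console.log(estimateAudioTokens(90 * 60)); // 172800 for a 90-minute recording
```

Use the estimate to pick a segmentation size that keeps each request comfortably inside your model's context window.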

---

## 11. End-to-end examples (composition)

### Example A: Image generation -> validation via understanding
1) Generate product images with Nano Banana (require negative space, consistent lighting).
2) Use image understanding for self-check: verify text clarity, brand spelling, and unsafe elements.
3) If not satisfied, feed the generated image into text+image editing and iterate.

### Example B: Video generation -> video understanding -> narration script
1) Generate an 8-second shot with Veo (include dialogue or SFX).
2) Download and save (respect retention window).
3) Upload video to video understanding to produce a storyboard + timestamps + narration copy (then feed to TTS).

### Example C: Audio understanding -> time-range transcription -> TTS redub
1) Upload meeting audio and transcribe full content.
2) Transcribe or summarize specific time ranges.
3) Use TTS to generate a "broadcast" version of the summary.

---

## 12. Compliance and risk (must follow)

- Ensure you have the necessary rights to upload images/video/audio; do not generate infringing, deceptive, harassing, or harmful content.
- Generated images and videos include SynthID watermarking; videos may also have regional/person-based generation constraints.
- Production systems must implement timeouts, retries, failure fallbacks, and human review/post-processing for generated content.

---

## 13. Quick reference (Checklist)

- [ ] Pick the right model: image generation (Flash Image / Pro Image Preview), video generation (Veo 3.1), TTS (Gemini 2.5 TTS), understanding (Gemini Flash/Pro).
- [ ] Pick the right input mode: inline for small files; Files API for large/reuse.
- [ ] Parse binary outputs correctly: image/audio via inline_data decode; video via operation polling + download.
- [ ] For video generation: set aspectRatio / resolution, and download promptly (avoid expiration).
- [ ] For TTS: set response_modalities=["AUDIO"]; max 2 speakers; speaker names must match prompt.
- [ ] For audio understanding: countTokens when needed; segment long audio or use Files API.
