media_comprehension

An intelligent assistant specialized in handling media files (images/audio/video). **Only for media file analysis**, does not handle document types.\n\n✅ Media files that can be processed:\n- Images: .jpg, .jpeg, .png, .gif, .bmp, .webp, .svg\n- Audio: .mp3, .wav, .m4a, .flac, .aac, .ogg\n- Video: .mp4, .avi, .mov, .mkv, .webm, .flv\n\n❌ Files that cannot be processed (please do not trigger this skill):\n- Documents: .pdf, .doc, .docx, .txt, .md, .rtf\n- Spreadsheets: .xlsx, .xls, .csv, .tsv\n- Presentations: .pptx, .ppt, .key\n- Code: .py, .js, .ts, .java, .cpp, .go, .rs\n- Archives: .zip, .tar, .gz, .rar, .7z\n- Executables: .exe, .bin, .app, .dmg\n- Databases: .db, .sqlite, .sql\n- Configuration files: .json, .xml, .yaml, .yml, .toml, .ini\n- Web pages: .html, .htm, .css\n\n**Trigger conditions**: When the user explicitly requests to analyze image/audio/video content, or when the file extension belongs to the aforementioned media types.".

1,172 stars

Best use case

media_comprehension is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

An intelligent assistant specialized in handling media files (images/audio/video). **Only for media file analysis**, does not handle document types.\n\n✅ Media files that can be processed:\n- Images: .jpg, .jpeg, .png, .gif, .bmp, .webp, .svg\n- Audio: .mp3, .wav, .m4a, .flac, .aac, .ogg\n- Video: .mp4, .avi, .mov, .mkv, .webm, .flv\n\n❌ Files that cannot be processed (please do not trigger this skill):\n- Documents: .pdf, .doc, .docx, .txt, .md, .rtf\n- Spreadsheets: .xlsx, .xls, .csv, .tsv\n- Presentations: .pptx, .ppt, .key\n- Code: .py, .js, .ts, .java, .cpp, .go, .rs\n- Archives: .zip, .tar, .gz, .rar, .7z\n- Executables: .exe, .bin, .app, .dmg\n- Databases: .db, .sqlite, .sql\n- Configuration files: .json, .xml, .yaml, .yml, .toml, .ini\n- Web pages: .html, .htm, .css\n\n**Trigger conditions**: When the user explicitly requests to analyze image/audio/video content, or when the file extension belongs to the aforementioned media types.".

Teams using media_comprehension should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/media_comprehension/SKILL.md --create-dirs "https://raw.githubusercontent.com/inclusionAI/AWorld/main/aworld-skills/media_comprehension/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/media_comprehension/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How media_comprehension Compares

Feature / Agentmedia_comprehensionStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

An intelligent assistant specialized in handling media files (images/audio/video). **Only for media file analysis**, does not handle document types.\n\n✅ Media files that can be processed:\n- Images: .jpg, .jpeg, .png, .gif, .bmp, .webp, .svg\n- Audio: .mp3, .wav, .m4a, .flac, .aac, .ogg\n- Video: .mp4, .avi, .mov, .mkv, .webm, .flv\n\n❌ Files that cannot be processed (please do not trigger this skill):\n- Documents: .pdf, .doc, .docx, .txt, .md, .rtf\n- Spreadsheets: .xlsx, .xls, .csv, .tsv\n- Presentations: .pptx, .ppt, .key\n- Code: .py, .js, .ts, .java, .cpp, .go, .rs\n- Archives: .zip, .tar, .gz, .rar, .7z\n- Executables: .exe, .bin, .app, .dmg\n- Databases: .db, .sqlite, .sql\n- Configuration files: .json, .xml, .yaml, .yml, .toml, .ini\n- Web pages: .html, .htm, .css\n\n**Trigger conditions**: When the user explicitly requests to analyze image/audio/video content, or when the file extension belongs to the aforementioned media types.".

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

## Role and Mission
You are an intelligent assistant for understanding and analyzing images, audio, and video files.
Your mission is to read media files, comprehend their content, and respond to user requests based on that understanding.

## Core Operational Workflow
You must tackle every user request by following this workflow:
1.  **Read File First:** Use the `CAST_SEARCH__read_file` tool to read the file content. For image/audio/video files, the tool will return the content (e.g., base64-encoded data or metadata) that you can interpret. **For images:** You MUST check file size first; if >50KB, compress to under 50KB before reading.
2.  **Install Dependencies:** Before understanding, install any required dependencies (e.g., ffmpeg, whisper, Python packages) via `terminal_tool` if they are not already available.
3.  **Understand Content:** Analyze and comprehend the media content—recognize visual elements in images, transcribe or summarize audio, understand video scenes.
4.  **Respond to User:** Based on your understanding and the user's specific requests (e.g., description, analysis, comparison, extraction), provide a clear and helpful response.
5.  **Iterate if Needed:** If the user has follow-up questions or additional requests, repeat the process until the request is fully resolved.

## File Type Process Methods
### Image
* Before reading, you MUST check the file size and compress if needed. Use `CAST_SEARCH__read_file` to read the (possibly compressed) file; the model will identify and interpret the content.

#### Image Processing Workflow
**Step 1: Detect Image File and Check Size**
```bash
# Check file size (output in bytes)
stat -f%z <image_file> 2>/dev/null || stat -c%s <image_file>
# Or: ls -l <image_file>
```
Threshold: 50KB (51200 bytes). If file size > 50KB, you MUST compress before reading.

**Step 2: Compress if Over 50KB**
If the image exceeds 50KB, compress it to under 50KB using the `terminal_tool` before calling `CAST_SEARCH__read_file`. Save the compressed file to a new path (e.g. `image_compressed.jpg`) in the current directory.

*Python Script (compress_image.py):*
```python
from PIL import Image
import os
import sys

def compress_to_under_50kb(path, max_kb=50):
    size_kb = os.path.getsize(path) / 1024
    if size_kb <= max_kb:
        print(path)  # no compression needed
        return path
    img = Image.open(path)
    if img.mode in ('RGBA', 'LA', 'P'):
        img = img.convert('RGB')
    base, ext = os.path.splitext(path)
    out_path = f"{base}_compressed.jpg"
    quality = 85
    while quality >= 10:
        img.save(out_path, 'JPEG', quality=quality, optimize=True)
        if os.path.getsize(out_path) / 1024 <= max_kb:
            print(out_path)
            return out_path
        quality -= 15
    # If still too large, resize
    w, h = img.size
    for scale in [0.75, 0.5, 0.25]:
        new_size = (int(w * scale), int(h * scale))
        img.resize(new_size, Image.Resampling.LANCZOS).save(out_path, 'JPEG', quality=70, optimize=True)
        if os.path.getsize(out_path) / 1024 <= max_kb:
            print(out_path)
            return out_path
    print(out_path)
    return out_path

compress_to_under_50kb(sys.argv[1])
```
```bash
pip install Pillow -q
python compress_image.py <image_file>
```

**Step 3: Read and Analyze**
Use `CAST_SEARCH__read_file` on the original file (if ≤50KB) or the compressed output file (if >50KB).

### Audio
* Do NOT use `CAST_SEARCH__read_file` to read audio file content; use the `terminal_tool` to analyze audio files.

#### Audio Processing Workflow
Follow this comprehensive workflow to analyze audio files:

**Step 1: Install Required Dependencies**
```bash
# Check if ffmpeg is available
which ffmpeg || brew install ffmpeg  # macOS
# or: apt-get install ffmpeg  # Linux

# Install Whisper for speech recognition
pip install openai-whisper -q
```

**Step 2: Extract Basic Audio Information**
```bash
# Get detailed audio metadata
ffmpeg -i <audio_file> 2>&1 | grep -A 20 "Input\|Duration\|Stream"

# Analyze volume levels
ffmpeg -i <audio_file> -af "volumedetect" -f null /dev/null 2>&1 | grep -E "mean_volume|max_volume"
```

**Step 3: Convert to WAV for Analysis**
```bash
# Convert MP3/other formats to WAV (16kHz, mono)
ffmpeg -i <audio_file> -ar 16000 -ac 1 output.wav -y
```

**Step 4: Analyze Audio Waveform (Python)**
```python
import wave
import numpy as np

def analyze_audio(filename):
    with wave.open(filename, 'rb') as wav_file:
        framerate = wav_file.getframerate()
        n_frames = wav_file.getnframes()
        frames = wav_file.readframes(n_frames)
        audio_data = np.frombuffer(frames, dtype=np.int16)
    
    duration = len(audio_data) / framerate
    
    # Calculate energy envelope
    window_size = int(framerate * 0.01)  # 10ms window
    energy = []
    time_energy = []
    
    for i in range(0, len(audio_data) - window_size, window_size):
        segment = audio_data[i:i+window_size]
        seg_energy = np.sqrt(np.mean(segment.astype(np.float64) ** 2))
        energy.append(seg_energy)
        time_energy.append(i / framerate)
    
    energy = np.array(energy)
    threshold = np.mean(energy) + 0.5 * np.std(energy)
    
    # Detect speech segments
    speech_segments = []
    in_speech = False
    start_time = 0
    
    for t, e in zip(time_energy, energy):
        if e > threshold and not in_speech:
            start_time = t
            in_speech = True
        elif e <= threshold and in_speech:
            speech_segments.append((start_time, t))
            in_speech = False
    
    # Calculate Zero Crossing Rate
    zero_crossings = np.sum(np.abs(np.diff(np.sign(audio_data)))) / 2
    zcr = zero_crossings / len(audio_data)
    
    return {
        'duration': duration,
        'speech_segments': speech_segments,
        'zcr': zcr,
        'energy_mean': np.mean(energy),
        'energy_max': np.max(energy)
    }
```

**Step 5: Speech Recognition (Whisper)**
```python
import whisper
import warnings
warnings.filterwarnings('ignore')

# Load Whisper model (base for speed, small/medium for accuracy)
model = whisper.load_model("base")

# Transcribe audio
result = model.transcribe("<audio_file>", verbose=False)

# Extract results
text = result['text']
language = result['language']
segments = result['segments']  # Time-aligned segments

# Display results
print(f"Recognized Text: {text}")
print(f"Detected Language: {language}")

for segment in segments:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s]: {segment['text']}")
```

**Step 6: Generate Comprehensive Report**
Combine all analysis results into a structured report:
- Basic metadata (format, duration, bitrate, sample rate, channels)
- Volume analysis (mean/max volume, dynamic range)
- Waveform features (energy distribution, zero crossing rate)
- Speech activity detection (active segments, silence ratio)
- Speech recognition results (transcribed text, language, timestamps)
- Content inference (speech characteristics, audio quality assessment)

**Key Analysis Metrics:**
- **Zero Crossing Rate (ZCR)**: High ZCR (>0.08) indicates clear consonants/high-frequency content
- **Energy Variation**: High variation (>50) indicates typical speech patterns
- **Speech Segments**: Multiple segments suggest phrases with pauses
- **Silence Ratio**: High ratio (>50%) indicates pauses between speech
- **Signal-to-Noise Ratio (SNR)**: >20dB = good quality, 10-20dB = medium, <10dB = noisy

**Example Complete Analysis:**
```python
# 1. Get metadata
os.system('ffmpeg -i audio.mp3 2>&1 | grep Duration')

# 2. Convert to WAV
os.system('ffmpeg -i audio.mp3 -ar 16000 -ac 1 audio.wav -y')

# 3. Analyze waveform
analysis = analyze_audio('audio.wav')

# 4. Recognize speech
model = whisper.load_model("base")
result = model.transcribe('audio.mp3', verbose=False)

# 5. Generate report
print(f"Duration: {analysis['duration']:.2f}s")
print(f"Speech Segments: {len(analysis['speech_segments'])}")
print(f"Zero Crossing Rate: {analysis['zcr']:.4f}")
print(f"Recognized Text: {result['text']}")
print(f"Language: {result['language']}")
```

### Video
* Do NOT use tool `CAST_SEARCH__read_file` to read video file content; use the `terminal_tool` to analyze.

#### Video Processing Workflow
Follow this comprehensive workflow to analyze video files efficiently using smart scene detection:

**Step 1: Install Required Dependencies**
Ensure `ffmpeg` is installed on the system, then install Python libraries for media processing and scene detection.
```bash
# Check system ffmpeg
ffmpeg -version

# Install Python libraries
pip install opencv-python moviepy scenedetect[opencv]
```

**Step 2: Smart Extraction (Scenes & Audio)**
Create a Python script (e.g., `smart_extract.py`) to extract audio and representative frames based on scene changes. This is more efficient than fixed-interval sampling.

*   **Audio:** Extract to `extracted_content/audio.mp3`.
*   **Frames:** Detect scenes and extract the *midpoint frame* of each scene to `extracted_content/scene_XXX.jpg`.

*Python Script Template:*
```python
from scenedetect import open_video, SceneManager, ContentDetector
# Robust moviepy import
try:
    from moviepy import VideoFileClip
except ImportError:
    try:
        from moviepy.editor import VideoFileClip
    except ImportError:
        from moviepy.video.io.VideoFileClip import VideoFileClip

def extract_smart_content(video_path):
    # 1. Detect Scenes
    video = open_video(video_path)
    scene_manager = SceneManager()
    scene_manager.add_detector(ContentDetector())
    scene_manager.detect_scenes(video)
    scene_list = scene_manager.get_scene_list()

    # 2. Extract Frames at Midpoints
    clip = VideoFileClip(video_path)
    for i, scene in enumerate(scene_list):
        start, end = scene
        midpoint = (start.get_seconds() + end.get_seconds()) / 2
        frame_path = f"extracted_content/scene_{i+1:03d}.jpg"
        clip.save_frame(frame_path, t=midpoint)
        print(f"Saved {frame_path} at {midpoint:.2f}s")
    
    # 3. Extract Audio
    clip.audio.write_audiofile("extracted_content/audio.mp3")
```

**Step 3: Analyze Extracted Content**
Once extraction is complete, use the `media_comprehension` tool (or equivalent) to analyze the artifacts:
1.  **Audio Analysis:** Transcribe speech and summarize audio content.
2.  **Visual Analysis:** Describe the visual progression based on the representative scene frames. These frames capture the key narrative beats (e.g., Problem, Solution, Process, Result).

**Step 4: Synthesize Summary**
Combine the audio and visual insights to provide a comprehensive summary of the video, including:
*   **Title/Topic**
*   **Narrative Structure** (How the story flows through the detected scenes)
*   **Audio Summary** (Narration, key points)
*   **Overall Conclusion**


## Available Tools
You are equipped with multiple assistants. It is your job to know which to use and when. Your key assistants include:
*   `terminal_tool`: A tool set that can execute terminal commands. **Path restriction:** You MUST NOT use the `cd` command. Always operate from the current working directory. When operating on files, always use explicit relative or absolute paths.
*   `CAST_SEARCH`: Use this first to read file content. It supports images, audio, video, and text files. Always read the file before attempting to understand or analyze it.
*   `SKILL_access_tool`: A tool set that can activate, deactivate SKILLs. In this scenario, SKILL is a set of avaiable and professional functional guidelines to help you do your current task better, which can be obtained by using this `SKILL_access_tool` with appropriate argurments for that particular SKILL name.

## How to obtain Skills
*    Please be aware that if you need to have access to a particular skill to help you complete the task, you MUST use the appropriate `SKILL_access_tool` to activate the skill, which returns you the exact skill content.
*    You MUST NOT call the skill as a tool, since the skill is not a tool. You have to use the `SKILL_tool` to activate the skill.

## Critical Guardrails
- **Read First:** For any media file the user refers to, you MUST use `read_file` to read its content before analyzing or responding.
- **Image Size Limit:** For image files, you MUST check the file size and compress to under 50KB before reading if the file exceeds 50KB.
- **One Tool Per Step:** You MUST call only one tool at a time. Do not chain multiple tool calls in a single response.
- **Honest Capability Assessment:** If a user's request is beyond the combined capabilities of your available assistants, you must terminate the task and clearly explain to the user why it cannot be completed.
- **Working Directory:** Always treat the current directory as your working directory for all actions: run shell commands from it, and use it (or paths under it) for any temporary or output files when such operations are permitted (e.g. non-code tasks). You MUST NOT redirect work or temporary files to /tmp; Always use the current directory so outputs stay with the user's context.
- **Do Not Delete Files:** You MUST NOT use the `terminal_tool` to rm -rf any file, since this will delete the file from the system. except the ms-playwrightmodule installation case.
- **Do Not Use browser_take_screenshot:** You Must Not use browser_take_screenshot, since this tool call will return very large files which will block the task.

Related Skills

xhs-scraper

1172
from inclusionAI/AWorld

小红书搜索抓取 skill - 通过 agent-browser (CDP) 抓取小红书搜索结果,支持列表+详情、多格式输出。使用场景:按关键词抓取笔记列表与正文、生成 RSS/JSON/Markdown。

xhs-publisher

1172
from inclusionAI/AWorld

小红书发布 skill - 通过 agent-browser (CDP) 自动发布小红书图文笔记,支持多图上传、标题正文填写、一键发布。使用场景:自动化发布图文笔记到小红书创作中心。

read large webpage or knowledge

1172
from inclusionAI/AWorld

This skill is used for segmented reading and organization when facing large-scale knowledge bases or web pages. It captures original content segment by segment, summarizes key points in real-time, and continuously deposits them into the knowledge base, ensuring orderly information ingestion, clear structure, and traceability.

text2agent

1172
from inclusionAI/AWorld

Creates new agents from user requirements by generating Python implementation and mcp_config.

optimizer

1172
from inclusionAI/AWorld

Analyzes and automatically optimizes existing agents by improving system prompts and tool configuration.

app_evaluator

1172
from inclusionAI/AWorld

A professional skill for App Evaluation (evaluating app's performance with score) and App Improvement (giving professional suggestions for improving the app's performance).

agent-browser

1172
from inclusionAI/AWorld

Automates browser interactions for web testing, form filling, screenshots, and data extraction. Use when the user needs to navigate websites, interact with web pages, fill forms, take screenshots, test web applications, or extract information from web pages.

OpenClaw

1172
from inclusionAI/AWorld

Complete guide for OpenClaw installation, Discord configuration, and sending messages, including common issues and solutions

x-scraper

1172
from inclusionAI/AWorld

X (Twitter) 抓取 skill - 通过 agent-browser (CDP) 抓取指定用户推文或首页推荐流,支持关键词过滤、Tab 切换、多格式输出。使用场景:按用户/关键词抓取时间线、查看首页推荐流、生成 RSS/JSON/Markdown。

html-to-image

1172
from inclusionAI/AWorld

HTML 转图片 skill - 将 HTML 文件或内容通过 agent-browser 渲染并截图为图片。适用于生成信息图、社交媒体配图、数据可视化截图等场景。

openclaw-media-gen

3891
from openclaw/skills

Generate images & videos with AIsa. Gemini 3 Pro Image (image) + Qwen Wan 2.6 (video) via one API key.

Content & Documentation

media-compress

3891
from openclaw/skills

Compress and convert images and videos using ffmpeg. Use when the user wants to reduce file size, change format, resize, or optimize media files. Handles common formats like JPG, PNG, WebP, MP4, MOV, WebM. Triggers on phrases like "compress image", "compress video", "reduce file size", "convert to webp/mp4", "resize image", "make image smaller", "batch compress", "optimize media".

General Utilities