adb-screen-detection

Screen understanding with OCR and template matching for Android device automation

181 stars

Best use case

adb-screen-detection is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using adb-screen-detection should expect more consistent output, faster repeated execution, less prompt rewriting, and better workflow continuity with their supporting tools.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.
You already have the supporting tools or dependencies needed by this skill.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/adb-screen-detection/SKILL.md --create-dirs "https://raw.githubusercontent.com/majiayu000/claude-skill-registry/main/skills/data/adb-screen-detection/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub
2. Place it in .claude/skills/adb-screen-detection/SKILL.md inside your project
3. Restart your AI agent; it will auto-discover the skill

How adb-screen-detection Compares

| Feature / Agent | adb-screen-detection | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Screen understanding with OCR and template matching for Android device automation

Where can I find the source code?

The source code is on GitHub in the majiayu000/claude-skill-registry repository, under skills/data/adb-screen-detection.

SKILL.md Source

---

## Quick Reference (30 seconds)

**Screen Understanding for Android Automation**

**What It Does**: Provides OCR-based text detection and template matching to understand Android device screens. Enables reliable UI automation by verifying screen state before and after actions.

**Core Capabilities**:
- 📸 **Screen Capture**: ADB screencap with local storage
- 🔍 **OCR Detection**: Tesseract-based text extraction
- 🎯 **Template Matching**: OpenCV-based element detection
- 👆 **Coordinate Tapping**: ADB input tap with verification

**When to Use**:
- Need to verify UI state before taking actions
- Finding UI elements by text or appearance
- Building reliable automation workflows
- Screen-dependent decision making

---

## Scripts

### 1. adb-screen-capture.py

Capture Android device screen and save locally.

```bash
# Basic usage
uv run .claude/skills/adb-screen-detection/scripts/adb-screen-capture.py

# Specify device
uv run .claude/skills/adb-screen-detection/scripts/adb-screen-capture.py --device 127.0.0.1:5555

# Custom output path
uv run .claude/skills/adb-screen-detection/scripts/adb-screen-capture.py --output /tmp/screen.png

# JSON output
uv run .claude/skills/adb-screen-detection/scripts/adb-screen-capture.py --json
```

**Output**:
```json
{
  "device": "127.0.0.1:5555",
  "timestamp": "2025-12-01T10:30:45Z",
  "local_path": "/tmp/screenshot.png",
  "size": [1080, 2400],
  "success": true
}
```
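The capture step can be sketched in Python as follows. This is a minimal sketch rather than the script's actual source: it assumes the screenshot is taken with the standard `adb exec-out screencap -p` command, and the function names `build_screencap_command` and `capture_screen` are hypothetical.

```python
import subprocess
from datetime import datetime, timezone
from typing import List, Optional


def build_screencap_command(device: Optional[str] = None) -> List[str]:
    """Build the adb command that writes a PNG screenshot to stdout."""
    cmd = ["adb"]
    if device:
        cmd += ["-s", device]  # target a specific serial or host:port
    return cmd + ["exec-out", "screencap", "-p"]


def capture_screen(output: str, device: Optional[str] = None,
                   timeout: float = 10.0) -> dict:
    """Capture the screen and save it locally (needs adb and a connected device)."""
    proc = subprocess.run(build_screencap_command(device),
                          capture_output=True, timeout=timeout)
    # Treat the capture as successful only if adb exited cleanly and
    # the payload starts with the PNG file signature.
    ok = proc.returncode == 0 and proc.stdout.startswith(b"\x89PNG")
    if ok:
        with open(output, "wb") as fh:
            fh.write(proc.stdout)
    return {
        "device": device,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "local_path": output,
        "success": ok,
    }
```

Passing `timeout` to `subprocess.run` keeps a hung device from blocking the workflow; a `subprocess.TimeoutExpired` exception is raised if the capture takes too long.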

---

### 2. adb-ocr-extract.py

Extract all visible text from device screen using Tesseract OCR.

```bash
# Basic usage (uses most recent screenshot)
uv run .claude/skills/adb-screen-detection/scripts/adb-ocr-extract.py

# Specify screenshot path
uv run .claude/skills/adb-screen-detection/scripts/adb-ocr-extract.py --image /tmp/screen.png

# Search for specific text
uv run .claude/skills/adb-screen-detection/scripts/adb-ocr-extract.py --search "Login"

# JSON output with coordinates
uv run .claude/skills/adb-screen-detection/scripts/adb-ocr-extract.py --json
```

**Output**:
```json
{
  "text": ["Login", "Username", "Password", "Submit"],
  "detected": true,
  "search_found": true,
  "search_term": "Login",
  "coordinates": {
    "Login": [[100, 200, 150, 230]]
  }
}
```
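One way to produce a word-to-coordinates mapping like the one above is `pytesseract.image_to_data`. The sketch below is illustrative only: it assumes each box is `[left, top, right, bottom]` (the original does not state the box layout), and `group_ocr_words` and `ocr_image` are hypothetical names, not functions from the script.

```python
from typing import Dict, List


def group_ocr_words(data: Dict[str, list],
                    min_conf: float = 0.0) -> Dict[str, List[List[int]]]:
    """Map each recognized word to its boxes as [left, top, right, bottom]."""
    boxes: Dict[str, List[List[int]]] = {}
    for i, raw in enumerate(data["text"]):
        word = raw.strip()
        # Tesseract reports conf -1 for layout rows that carry no word;
        # skip those along with blank entries.
        if not word or float(data["conf"][i]) < min_conf:
            continue
        left, top = int(data["left"][i]), int(data["top"][i])
        box = [left, top,
               left + int(data["width"][i]), top + int(data["height"][i])]
        boxes.setdefault(word, []).append(box)
    return boxes


def ocr_image(path: str) -> Dict[str, List[List[int]]]:
    """Run Tesseract on an image file (needs pytesseract, pillow, tesseract)."""
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(Image.open(path),
                                     output_type=pytesseract.Output.DICT)
    return group_ocr_words(data)
```

A word that appears several times on screen gets one box per occurrence, which is why each value is a list of boxes.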

---

### 3. adb-find-element.py

Find UI element by template matching or OCR text search.

```bash
# Find by OCR text
uv run .claude/skills/adb-screen-detection/scripts/adb-find-element.py \
    --method ocr \
    --target "Login Button" \
    --threshold 0.8

# Find by template image
uv run .claude/skills/adb-screen-detection/scripts/adb-find-element.py \
    --method template \
    --template /path/to/template.png \
    --threshold 0.8

# JSON output
uv run .claude/skills/adb-screen-detection/scripts/adb-find-element.py \
    --method ocr \
    --target "Login" \
    --json
```

**Output**:
```json
{
  "found": true,
  "method": "ocr",
  "target": "Login",
  "coordinates": {
    "x": 100,
    "y": 200,
    "width": 150,
    "height": 30
  },
  "confidence": 0.95,
  "message": "Element found at (100, 200)"
}
```
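For the `template` method, a common OpenCV approach is normalized cross-correlation via `cv2.matchTemplate`. The following is a sketch under that assumption, not the script's confirmed implementation; `match_result` and `find_template` are hypothetical names, and the reported `x`/`y` are assumed to be the top-left corner of the match.

```python
from typing import Tuple


def match_result(score: float, top_left: Tuple[int, int],
                 template_size: Tuple[int, int],
                 threshold: float = 0.8) -> dict:
    """Turn a best-match score and location into a result dict."""
    found = score >= threshold
    result = {"found": found, "method": "template",
              "confidence": round(float(score), 3)}
    if found:
        width, height = template_size
        result["coordinates"] = {"x": int(top_left[0]), "y": int(top_left[1]),
                                 "width": int(width), "height": int(height)}
    return result


def find_template(screen_path: str, template_path: str,
                  threshold: float = 0.8) -> dict:
    """Locate a template image inside a screenshot (needs opencv-python)."""
    import cv2

    screen = cv2.imread(screen_path, cv2.IMREAD_GRAYSCALE)
    template = cv2.imread(template_path, cv2.IMREAD_GRAYSCALE)
    if screen is None or template is None:
        raise FileNotFoundError("could not read screenshot or template image")
    # TM_CCOEFF_NORMED scores lie in [-1, 1]; the maximum is the best match.
    scores = cv2.matchTemplate(screen, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(scores)
    height, width = template.shape[:2]
    return match_result(max_val, max_loc, (width, height), threshold)
```

Plain template matching is sensitive to scale, so templates cropped at a different screen resolution than the device's may score below the threshold.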

---

### 4. adb-tap-coordinate.py

Tap device screen at specific coordinates.

```bash
# Tap at coordinates
uv run .claude/skills/adb-screen-detection/scripts/adb-tap-coordinate.py \
    --x 100 \
    --y 200 \
    --device 127.0.0.1:5555

# Tap with verification (check screen after tap)
uv run .claude/skills/adb-screen-detection/scripts/adb-tap-coordinate.py \
    --x 100 \
    --y 200 \
    --verify-text "Next Screen" \
    --timeout 5

# JSON output
uv run .claude/skills/adb-screen-detection/scripts/adb-tap-coordinate.py \
    --x 100 \
    --y 200 \
    --json
```

**Output**:
```json
{
  "device": "127.0.0.1:5555",
  "tap": {
    "x": 100,
    "y": 200
  },
  "success": true,
  "verified": true,
  "verify_text": "Next Screen",
  "verification_match": true
}
```
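Tap-with-verification can be sketched as a tap followed by a polling loop. This is a minimal sketch, assuming the tap is sent with the standard `adb shell input tap` command and that verification re-reads the screen text until the expected string appears; `build_tap_command`, `wait_for_text`, `tap_and_verify`, and the `read_screen_text` callback are hypothetical names.

```python
import subprocess
import time
from typing import Callable, List, Optional


def build_tap_command(x: int, y: int, device: Optional[str] = None) -> List[str]:
    """Build the adb command that taps the screen at (x, y)."""
    cmd = ["adb"]
    if device:
        cmd += ["-s", device]
    return cmd + ["shell", "input", "tap", str(x), str(y)]


def wait_for_text(read_screen_text: Callable[[], str], expected: str,
                  timeout: float = 5.0, interval: float = 0.5) -> bool:
    """Poll the screen text until `expected` appears or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if expected in read_screen_text():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def tap_and_verify(x: int, y: int, read_screen_text: Callable[[], str],
                   verify_text: str, device: Optional[str] = None,
                   timeout: float = 5.0) -> dict:
    """Tap, then confirm the expected text shows up (needs adb and a device)."""
    proc = subprocess.run(build_tap_command(x, y, device),
                          capture_output=True, timeout=10)
    success = proc.returncode == 0
    matched = success and wait_for_text(read_screen_text, verify_text, timeout)
    return {"device": device, "tap": {"x": x, "y": y}, "success": success,
            "verified": True, "verify_text": verify_text,
            "verification_match": matched}
```

In practice `read_screen_text` would capture a fresh screenshot and run OCR on it each time it is called, so the loop sees the screen as it changes.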

---

## Usage Patterns

### Pattern 1: Verify Screen State Before Action

```bash
# 1. Capture current screen
adb-screen-capture.py

# 2. Check for expected element
adb-find-element.py --method ocr --target "Login Button"

# 3. If found, tap it (use the coordinates reported in step 2)
adb-tap-coordinate.py --x 100 --y 200 --verify-text "Welcome"
```

### Pattern 2: OCR-Based Automation

```bash
# 1. Capture screen
adb-screen-capture.py

# 2. Extract text and confirm "Settings" is on screen
adb-ocr-extract.py --search "Settings"

# 3. Get coordinates and tap (use the coordinates reported by find-element)
adb-find-element.py --method ocr --target "Settings"
adb-tap-coordinate.py --x 150 --y 300
```

### Pattern 3: Template-Based Element Detection

```bash
# 1. Have known UI template images in ./templates/
# 2. Capture screen
adb-screen-capture.py

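# 3. Match against templates and keep the JSON result
match=$(adb-find-element.py --method template --template ./templates/button.png --json)

# 4. Tap matched location (coordinates read from the JSON result)
adb-tap-coordinate.py \
    --x "$(jq -r '.coordinates.x' <<<"$match")" \
    --y "$(jq -r '.coordinates.y' <<<"$match")"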
```
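When chaining find-element into a tap, it can help to tap the center of the matched box rather than its corner. The sketch below assumes that `x` and `y` in the find-element JSON are the top-left corner of the box (the original does not say), and `tap_point` is a hypothetical helper.

```python
import json
from typing import Tuple


def tap_point(find_result_json: str) -> Tuple[int, int]:
    """Return the center (x, y) of a found element, or raise if not found."""
    result = json.loads(find_result_json)
    if not result.get("found"):
        raise LookupError(result.get("message", "element not found"))
    box = result["coordinates"]
    # Assumption: x/y are the top-left corner, so add half the size.
    return (box["x"] + box["width"] // 2, box["y"] + box["height"] // 2)
```

Raising on a missing element stops the chain before a tap is sent to stale or default coordinates.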

---

## Architecture

**Design Principles**:
- **Independent**: Each script can run standalone
- **Chainable**: Scripts output JSON for piping
- **Stateless**: No dependencies between executions
- **Verifiable**: Always verify screen state before proceeding
- **Timeout Protected**: All network operations have timeouts

**Dependency Relationship**:
```
adb-screen-capture.py (foundation)
    ↓
adb-ocr-extract.py (uses capture)
adb-find-element.py (uses capture or templates)
    ↓
adb-tap-coordinate.py (uses find-element for verification)
```

---

## Integration Points

**Used By**:
- `adb-navigation-base` - Wait for elements between actions
- `adb-magisk` - Verify Magisk UI state
- `adb-karrot` - Verify app state during automation
- `adb-workflow-orchestrator` - Screen verification in workflows

**Dependencies**:
- System: `adb` command-line tool
- Python: pytesseract, opencv-python, pillow, numpy

---

## Troubleshooting

### OCR Not Working
- Install Tesseract: `brew install tesseract` (macOS) or `apt-get install tesseract-ocr` (Linux)
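- Set TESSDATA_PREFIX to your Tesseract data directory, e.g. `export TESSDATA_PREFIX=/usr/local/share/tessdata` (the path varies by install; Homebrew on Apple Silicon typically uses `/opt/homebrew/share/tessdata`)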

### Template Matching Too Strict/Loose
- Adjust `--threshold` parameter (0.0-1.0)
- Higher threshold = stricter matching
- Recommended: 0.8-0.9 for reliable detection

### Device Offline
- Check ADB connection: `adb devices`
- Reconnect: `adb connect <device>`
- Restart ADB: `adb kill-server && adb start-server`
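The connection check can also be scripted by parsing the output of `adb devices`. This is a small sketch; `parse_adb_devices` is a hypothetical helper, and it relies on the usual output layout of a header line followed by one `serial<TAB>state` line per device.

```python
from typing import Dict


def parse_adb_devices(output: str) -> Dict[str, str]:
    """Parse `adb devices` output into a {serial: state} mapping."""
    devices: Dict[str, str] = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks, the header, and adb daemon notices (lines starting "*").
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices
```

A device is ready for automation when its state is `device`; states such as `offline` or `unauthorized` mean the steps above are needed first.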

---

## Workflows

This skill includes TOON-based workflow definitions for automation.

### What is TOON?
TOON (Task-Oriented Orchestration Notation) is a structured workflow definition language that pairs with Markdown documentation. Each workflow consists of:
- **[name].toon** - Orchestration logic and execution steps
- **[name].md** - Complete documentation and usage guide

This TOON+MD pairing approach is inspired by the BMAD METHOD pattern, adapted to use TOON instead of YAML for better orchestration support.

### Available Workflows

Workflow files are located in the `workflow/` directory:

**Example Workflows (adb-screen-detection):**
- `workflow/screen-verification.toon` - Capture and verify screen state
- `workflow/element-detection.toon` - Find elements via OCR or template matching
- `workflow/screen-monitoring.toon` - Continuous screen monitoring and analysis

### Running a Workflow

Execute any workflow using the ADB workflow orchestrator:

```bash
uv run .claude/skills/adb-workflow-orchestrator/scripts/adb-run-workflow.py \
  --workflow .claude/skills/adb-screen-detection/workflow/screen-verification.toon \
  --param device="127.0.0.1:5555"
```

### Workflow Documentation

Each workflow includes comprehensive documentation in the corresponding `.md` file:
- Purpose and use case
- Prerequisites and requirements
- Available parameters
- Execution phases and steps
- Success criteria
- Error handling and recovery
- Example commands

See the `workflow/` directory for complete TOON file definitions and documentation.

### Creating New Workflows

To create custom workflows for this skill:
1. Create a new `.toon` file in the `workflow/` directory
2. Define phases, steps, and parameters using TOON v4.0 syntax
3. Create corresponding `.md` file with comprehensive documentation
4. Test with the workflow orchestrator

For more information, refer to the TOON specification and the workflow orchestrator documentation.

---

**Version**: 1.0.0
**Status**: ✅ Foundation Tier
**Scripts**: 4 (all MCP-ready)
**Last Updated**: 2025-12-01
**Tier**: 2 (Foundation)
