adb-screen-detection
Screen understanding with OCR and template matching for Android device automation
Best use case
adb-screen-detection is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using adb-screen-detection should expect more consistent output, faster repeated execution, less prompt rewriting, and better workflow continuity with their supporting tools.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
- You already have the supporting tools or dependencies needed by this skill.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/adb-screen-detection/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How adb-screen-detection Compares
| Feature / Agent | adb-screen-detection | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
SKILL.md Source
---
## Quick Reference (30 seconds)
**Screen Understanding for Android Automation**
**What It Does**: Provides OCR-based text detection and template matching to understand Android device screens. Enables reliable UI automation by verifying screen state before and after actions.
**Core Capabilities**:
- 📸 **Screen Capture**: ADB screencap with local storage
- 🔍 **OCR Detection**: Tesseract-based text extraction
- 🎯 **Template Matching**: OpenCV-based element detection
- 👆 **Coordinate Tapping**: ADB input tap with verification
**When to Use**:
- Need to verify UI state before taking actions
- Finding UI elements by text or appearance
- Building reliable automation workflows
- Screen-dependent decision making
---
## Scripts
### 1. adb-screen-capture.py
Capture Android device screen and save locally.
```bash
# Basic usage
uv run .claude/skills/adb-screen-detection/scripts/adb-screen-capture.py
# Specify device
uv run .claude/skills/adb-screen-detection/scripts/adb-screen-capture.py --device 127.0.0.1:5555
# Custom output path
uv run .claude/skills/adb-screen-detection/scripts/adb-screen-capture.py --output /tmp/screen.png
# JSON output
uv run .claude/skills/adb-screen-detection/scripts/adb-screen-capture.py --json
```
**Output**:
```json
{
"device": "127.0.0.1:5555",
"timestamp": "2025-12-01T10:30:45Z",
"local_path": "/tmp/screenshot.png",
"size": [1080, 2400],
"success": true
}
```
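As a rough illustration of what a capture step like this involves, the sketch below builds an `adb exec-out screencap -p` command and reads the image size from the PNG header. The function names (`build_screencap_cmd`, `png_size`, `capture`) are hypothetical and this is not the script's actual source; only the `adb` invocation itself is standard.

```python
import struct
import subprocess

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def build_screencap_cmd(device=None):
    """Return the adb argv that streams a PNG screenshot to stdout."""
    cmd = ["adb"]
    if device:
        cmd += ["-s", device]  # target one device, e.g. 127.0.0.1:5555
    return cmd + ["exec-out", "screencap", "-p"]


def png_size(data):
    """Read [width, height] from the IHDR chunk at the start of PNG bytes."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    width, height = struct.unpack(">II", data[16:24])
    return [width, height]


def capture(device=None, output="/tmp/screenshot.png", timeout=10):
    """Capture the screen, save it locally, and return a small result dict."""
    proc = subprocess.run(build_screencap_cmd(device), capture_output=True,
                          timeout=timeout, check=True)
    with open(output, "wb") as fh:
        fh.write(proc.stdout)
    return {"device": device, "local_path": output,
            "size": png_size(proc.stdout), "success": True}
```

Reading the size from the header avoids an image-library dependency for this one field; a real implementation may well use pillow instead.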
---
### 2. adb-ocr-extract.py
Extract all visible text from device screen using Tesseract OCR.
```bash
# Basic usage (uses most recent screenshot)
uv run .claude/skills/adb-screen-detection/scripts/adb-ocr-extract.py
# Specify screenshot path
uv run .claude/skills/adb-screen-detection/scripts/adb-ocr-extract.py --image /tmp/screen.png
# Search for specific text
uv run .claude/skills/adb-screen-detection/scripts/adb-ocr-extract.py --search "Login"
# JSON output with coordinates
uv run .claude/skills/adb-screen-detection/scripts/adb-ocr-extract.py --json
```
**Output**:
```json
{
"text": ["Login", "Username", "Password", "Submit"],
"detected": true,
"search_found": true,
"search_term": "Login",
"coordinates": {
"Login": [[100, 200, 150, 230]]
}
}
```
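One way such output could be produced is sketched below, assuming the script relies on `pytesseract.image_to_data`. The helper names, the `min_conf` cutoff, and the `[left, top, right, bottom]` box layout are assumptions made for illustration, not confirmed details of `adb-ocr-extract.py`.

```python
def words_from_tesseract(data, min_conf=60.0):
    """Convert a pytesseract image_to_data dict into word boxes.

    Each box is [left, top, right, bottom] in pixels (assumed layout).
    """
    words = []
    for i, text in enumerate(data["text"]):
        text = text.strip()
        conf = float(data["conf"][i])  # -1 marks rows that are not words
        if not text or conf < min_conf:
            continue
        left, top = data["left"][i], data["top"][i]
        box = [left, top, left + data["width"][i], top + data["height"][i]]
        words.append({"text": text, "conf": conf, "box": box})
    return words


def search_words(words, term):
    """Group the boxes of words containing `term`, ignoring case."""
    hits = {}
    for word in words:
        if term.lower() in word["text"].lower():
            hits.setdefault(word["text"], []).append(word["box"])
    return hits


def ocr_image(path):
    """Run Tesseract on an image file (needs pytesseract and pillow)."""
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(Image.open(path),
                                     output_type=pytesseract.Output.DICT)
    return words_from_tesseract(data)
```

Tesseract reports one row per word, so this sketch only matches single words; a multi-word target such as "Login Button" would need adjacent words on the same line to be joined first.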
---
### 3. adb-find-element.py
Find UI element by template matching or OCR text search.
```bash
# Find by OCR text
uv run .claude/skills/adb-screen-detection/scripts/adb-find-element.py \
--method ocr \
--target "Login Button" \
--threshold 0.8
# Find by template image
uv run .claude/skills/adb-screen-detection/scripts/adb-find-element.py \
--method template \
--template /path/to/template.png \
--threshold 0.8
# JSON output
uv run .claude/skills/adb-screen-detection/scripts/adb-find-element.py \
--method ocr \
--target "Login" \
--json
```
**Output**:
```json
{
"found": true,
"method": "ocr",
"target": "Login",
"coordinates": {
"x": 100,
"y": 200,
"width": 150,
"height": 30
},
"confidence": 0.95,
"message": "Element found at (100, 200)"
}
```
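When consuming this JSON, it can help to tap the middle of the element and not its corner. The sketch below assumes that `x` and `y` are the top-left corner of the bounding box, which the output above does not state explicitly; `tap_point` and `min_confidence` are hypothetical names.

```python
import json


def tap_point(result, min_confidence=0.8):
    """Return the (x, y) centre of a found element, or None if unusable."""
    if not result.get("found") or result.get("confidence", 0.0) < min_confidence:
        return None
    box = result["coordinates"]
    return (box["x"] + box["width"] // 2, box["y"] + box["height"] // 2)


sample = json.loads("""
{"found": true, "method": "ocr", "target": "Login",
 "coordinates": {"x": 100, "y": 200, "width": 150, "height": 30},
 "confidence": 0.95}
""")
print(tap_point(sample))  # (175, 215)
```

Returning `None` for a missing or low-confidence match keeps the decision to tap with the caller.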
---
### 4. adb-tap-coordinate.py
Tap device screen at specific coordinates.
```bash
# Tap at coordinates
uv run .claude/skills/adb-screen-detection/scripts/adb-tap-coordinate.py \
--x 100 \
--y 200 \
--device 127.0.0.1:5555
# Tap with verification (check screen after tap)
uv run .claude/skills/adb-screen-detection/scripts/adb-tap-coordinate.py \
--x 100 \
--y 200 \
--verify-text "Next Screen" \
--timeout 5
# JSON output
uv run .claude/skills/adb-screen-detection/scripts/adb-tap-coordinate.py \
--x 100 \
--y 200 \
--json
```
**Output**:
```json
{
"device": "127.0.0.1:5555",
"tap": {
"x": 100,
"y": 200
},
"success": true,
"verified": true,
"verify_text": "Next Screen",
"verification_match": true
}
```
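The verification step can be pictured as a tap followed by a polling loop. This is a minimal sketch under stated assumptions: `read_screen_text` is a caller-supplied function (for example capture plus OCR), and `tap_and_verify` with its `interval` parameter is hypothetical. Only `adb shell input tap` is a standard command.

```python
import subprocess
import time


def build_tap_cmd(x, y, device=None):
    """Return the adb argv for a single tap at (x, y)."""
    cmd = ["adb"]
    if device:
        cmd += ["-s", device]
    return cmd + ["shell", "input", "tap", str(x), str(y)]


def tap_and_verify(x, y, verify_text, read_screen_text, device=None,
                   timeout=5.0, interval=0.5, run=subprocess.run):
    """Tap, then poll the screen text until verify_text shows up or time runs out."""
    run(build_tap_cmd(x, y, device), check=True, timeout=timeout)
    deadline = time.monotonic() + timeout
    matched = False
    while True:
        if verify_text in read_screen_text():
            matched = True
            break
        if time.monotonic() >= deadline:
            break
        time.sleep(interval)  # give the UI time to settle before re-reading
    return {"device": device, "tap": {"x": x, "y": y},
            "verify_text": verify_text, "verification_match": matched}
```

Passing `run` and `read_screen_text` as parameters keeps the loop testable without a connected device.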
---
## Usage Patterns
### Pattern 1: Verify Screen State Before Action
```bash
# 1. Capture current screen
adb-screen-capture.py
# 2. Check for expected element
adb-find-element.py --method ocr --target "Login Button"
# 3. If found, tap it
adb-tap-coordinate.py --x 100 --y 200 --verify-text "Welcome"
```
### Pattern 2: OCR-Based Automation
```bash
# 1. Capture screen
adb-screen-capture.py
# 2. Extract text and check that "Settings" is present
adb-ocr-extract.py --search "Settings"
# 3. Get coordinates and tap
adb-find-element.py --method ocr --target "Settings"
adb-tap-coordinate.py --x 150 --y 300
```
### Pattern 3: Template-Based Element Detection
```bash
# 1. Have known UI template images in ./templates/
# 2. Capture screen
adb-screen-capture.py
# 3. Match against templates and keep the JSON result
match=$(adb-find-element.py --method template --template ./templates/button.png --json)
# 4. Tap matched location (jq reads the saved JSON)
adb-tap-coordinate.py \
  --x "$(jq -r '.coordinates.x' <<<"$match")" \
  --y "$(jq -r '.coordinates.y' <<<"$match")"
```
---
## Architecture
**Design Principles**:
- **Independent**: Each script can run standalone
- **Chainable**: Scripts output JSON for piping
- **Stateless**: No dependencies between executions
- **Verifiable**: Always verify screen state before proceeding
- **Timeout Protected**: All network operations have timeouts
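
The chainable principle can be illustrated from Python as well as from a shell. The sketch below assumes that each script, when given `--json`, prints exactly one JSON document on stdout; `script_cmd`, `run_script`, and `find_and_tap` are hypothetical helpers, while the script names and flags come from the sections above.

```python
import json
import subprocess

SCRIPTS_DIR = ".claude/skills/adb-screen-detection/scripts"


def script_cmd(name, **options):
    """Build the `uv run` argv for one script, always requesting JSON."""
    cmd = ["uv", "run", f"{SCRIPTS_DIR}/{name}"]
    for key, value in options.items():
        cmd += [f"--{key.replace('_', '-')}", str(value)]
    return cmd + ["--json"]


def run_script(name, run=subprocess.run, **options):
    """Run a script and parse the JSON it prints on stdout."""
    proc = run(script_cmd(name, **options), capture_output=True, text=True,
               check=True, timeout=30)
    return json.loads(proc.stdout)


def find_and_tap(target, run=subprocess.run):
    """Capture, locate `target` by OCR, and tap it only if it was found."""
    run_script("adb-screen-capture.py", run=run)
    found = run_script("adb-find-element.py", run=run, method="ocr", target=target)
    if not found["found"]:
        return None
    box = found["coordinates"]
    return run_script("adb-tap-coordinate.py", run=run, x=box["x"], y=box["y"])
```

Because each call is a separate process with its own timeout, a failure in one step surfaces as an exception before the next step runs.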
**Dependency Relationship**:
```
adb-screen-capture.py (foundation)
↓
adb-ocr-extract.py (uses capture)
adb-find-element.py (uses capture or templates)
↓
adb-tap-coordinate.py (uses find-element for verification)
```
---
## Integration Points
**Used By**:
- `adb-navigation-base` - Wait for elements between actions
- `adb-magisk` - Verify Magisk UI state
- `adb-karrot` - Verify app state during automation
- `adb-workflow-orchestrator` - Screen verification in workflows
**Dependencies**:
- System: `adb` command-line tool
- Python: pytesseract, opencv-python, pillow, numpy
---
## Troubleshooting
### OCR Not Working
- Install Tesseract: `brew install tesseract` (macOS) or `apt-get install tesseract-ocr` (Linux)
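- Set TESSDATA_PREFIX: `export TESSDATA_PREFIX=/usr/local/share/tessdata` (adjust the path to wherever your installation keeps its `tessdata` directory; Homebrew on Apple Silicon, for example, installs under `/opt/homebrew`)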
### Template Matching Too Strict/Loose
- Adjust `--threshold` parameter (0.0-1.0)
- Higher threshold = stricter matching
- Recommended: 0.8-0.9 for reliable detection
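
To make the threshold concrete, here is a minimal sketch of template matching with OpenCV's normalised cross-correlation, assuming that is roughly what `--method template` does. `best_match` and `describe_match` are hypothetical names, and the result shape only mirrors the JSON shown earlier.

```python
def describe_match(score, x, y, width, height, threshold=0.8):
    """Turn a raw match score into a found / not-found result."""
    score = float(score)
    found = score >= threshold
    return {
        "found": found,
        "method": "template",
        "confidence": round(score, 3),
        "coordinates": ({"x": x, "y": y, "width": width, "height": height}
                        if found else None),
    }


def best_match(screen_path, template_path, threshold=0.8):
    """Locate a template in a screenshot (needs opencv-python)."""
    import cv2  # imported here so describe_match works without OpenCV

    screen = cv2.imread(screen_path, cv2.IMREAD_GRAYSCALE)
    template = cv2.imread(template_path, cv2.IMREAD_GRAYSCALE)
    if screen is None or template is None:
        raise FileNotFoundError("could not read screenshot or template")
    scores = cv2.matchTemplate(screen, template, cv2.TM_CCOEFF_NORMED)
    _, score, _, (x, y) = cv2.minMaxLoc(scores)  # best score and its top-left corner
    height, width = template.shape
    return describe_match(score, x, y, width, height, threshold)
```

The same score of 0.85 counts as a match at a threshold of 0.8 and as a miss at 0.9. Plain template matching is not scale-invariant, so a template cropped from a device with a different resolution will often score low regardless of the threshold.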
### Device Offline
- Check ADB connection: `adb devices`
- Reconnect: `adb connect <device>`
- Restart ADB: `adb kill-server && adb start-server`
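
The connection check can also be scripted. The sketch below parses `adb devices` output and treats only the `device` state as ready; `parse_adb_devices` and `device_ready` are hypothetical helpers, not part of this skill.

```python
import subprocess


def parse_adb_devices(output):
    """Map each serial to its state from `adb devices` output."""
    states = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks, daemon start-up notices, and the header line.
        if not line or line.startswith("*") or line.startswith("List of devices"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def device_ready(serial, timeout=10):
    """Return True only when the device is listed in the `device` state."""
    proc = subprocess.run(["adb", "devices"], capture_output=True, text=True,
                          timeout=timeout, check=True)
    return parse_adb_devices(proc.stdout).get(serial) == "device"
```

States such as `offline` or `unauthorized` are reported as not ready, which is the cue to reconnect or restart the server as described above.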
---
## Workflows
This skill includes TOON-based workflow definitions for automation.
### What is TOON?
TOON (Task-Oriented Orchestration Notation) is a structured workflow definition language that pairs with Markdown documentation. Each workflow consists of:
- **[name].toon** - Orchestration logic and execution steps
- **[name].md** - Complete documentation and usage guide
This TOON+MD pairing approach is inspired by the BMAD METHOD pattern, adapted to use TOON instead of YAML for better orchestration support.
### Available Workflows
Workflow files are located in the `workflow/` directory:
**Example Workflows (adb-screen-detection):**
- `workflow/screen-verification.toon` - Capture and verify screen state
- `workflow/element-detection.toon` - Find elements via OCR or template matching
- `workflow/screen-monitoring.toon` - Continuous screen monitoring and analysis
### Running a Workflow
Execute any workflow using the ADB workflow orchestrator:
```bash
uv run .claude/skills/adb-workflow-orchestrator/scripts/adb-run-workflow.py \
--workflow .claude/skills/adb-screen-detection/workflow/screen-verification.toon \
--param device="127.0.0.1:5555"
```
### Workflow Documentation
Each workflow includes comprehensive documentation in the corresponding `.md` file:
- Purpose and use case
- Prerequisites and requirements
- Available parameters
- Execution phases and steps
- Success criteria
- Error handling and recovery
- Example commands
See the `workflow/` directory for complete TOON file definitions and documentation.
### Creating New Workflows
To create custom workflows for this skill:
1. Create a new `.toon` file in the `workflow/` directory
2. Define phases, steps, and parameters using TOON v4.0 syntax
3. Create a corresponding `.md` file with comprehensive documentation
4. Test with the workflow orchestrator
For more information, refer to the TOON specification and the workflow orchestrator documentation.
---
**Version**: 1.0.0
**Status**: ✅ Foundation Tier
**Scripts**: 4 (all MCP-ready)
**Last Updated**: 2025-12-01
**Tier**: 2 (Foundation)