homoglyph-detector
Byte-level Unicode homoglyph detection for identifying invisible character substitutions in code
Best use case
homoglyph-detector is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Byte-level Unicode homoglyph detection for identifying invisible character substitutions in code
Teams using homoglyph-detector should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/homoglyph-detector/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How homoglyph-detector Compares
| Feature / Agent | homoglyph-detector | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Byte-level Unicode homoglyph detection for identifying invisible character substitutions in code
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Homoglyph Detector
Byte-level forensic analysis of code changes to detect Unicode homoglyph substitutions — characters that look identical to ASCII in every editor and diff tool but have different codepoints, silently breaking string comparisons, dictionary lookups, and identifier resolution.
## Purpose
Homoglyph attacks (related to CVE-2021-42574 "Trojan Source") are the highest-stealth trojan technique. A Cyrillic `р` (U+0440) looks identical to a Latin `p` (U+0070) in every font, editor, and diff viewer. The only way to detect it is byte-level analysis via `hexdump`.
This skill pipes git diffs through `hexdump -C` and scans for multi-byte UTF-8 sequences where single-byte ASCII is expected, particularly in string literals used as dictionary keys, variable names, and identifiers.
## Capabilities
### Confusable Character Detection
Scans for these high-risk Unicode confusables:
| Latin | Cyrillic | Greek | UTF-8 Bytes |
|-------|----------|-------|-------------|
| a (61) | а (D0 B0) | α (CE B1) | 1 vs 2 bytes |
| c (63) | с (D1 81) | — | 1 vs 2 bytes |
| e (65) | е (D0 B5) | ε (CE B5) | 1 vs 2 bytes |
| o (6F) | о (D0 BE) | ο (CE BF) | 1 vs 2 bytes |
| p (70) | р (D1 80) | ρ (CF 81) | 1 vs 2 bytes |
| x (78) | х (D1 85) | χ (CF 87) | 1 vs 2 bytes |
| y (79) | у (D1 83) | — | 1 vs 2 bytes |
### Zero-Width Character Detection
- U+200B — Zero-width space
- U+200C — Zero-width non-joiner
- U+200D — Zero-width joiner
- U+FEFF — Byte order mark (in non-BOM position)
### Bidi Control Character Detection (Trojan Source)
- U+200F — Right-to-left mark
- U+200E — Left-to-right mark
- U+202A — Left-to-right embedding
- U+202B — Right-to-left embedding
- U+202C — Pop directional formatting
- U+2066 — Left-to-right isolate
- U+2067 — Right-to-left isolate
### Context-Aware Analysis
- Focuses on **string literals** (dictionary keys, config values)
- Focuses on **identifiers** (variable names, function names, class names)
- Ignores legitimate Unicode in comments, docstrings, and i18n strings
- Compares byte patterns between removed (-) and added (+) diff lines
## Input Schema
```json
{
"type": "object",
"required": ["projectRoot", "changedFiles"],
"properties": {
"projectRoot": {
"type": "string",
"description": "Absolute path to the git repository"
},
"changedFiles": {
"type": "array",
"items": { "type": "string" },
"description": "List of changed file paths to scan"
},
"scanMode": {
"type": "string",
"enum": ["uncommitted", "commit-range", "branch-diff"],
"default": "uncommitted"
},
"baseRef": { "type": "string" },
"headRef": { "type": "string" }
}
}
```
## Output Schema
```json
{
"type": "object",
"required": ["filesScanned", "homoglyphsFound", "verdict"],
"properties": {
"filesScanned": { "type": "number" },
"homoglyphsFound": {
"type": "array",
"items": {
"type": "object",
"properties": {
"file": { "type": "string" },
"line": { "type": "number" },
"byteOffset": { "type": "string" },
"context": { "type": "string" },
"expectedAscii": { "type": "string" },
"actualBytes": { "type": "string" },
"unicodeCodepoint": { "type": "string" },
"scriptName": { "type": "string" },
"impact": { "type": "string" }
}
}
},
"bidiControlChars": { "type": "array" },
"verdict": {
"type": "string",
"enum": ["CLEAN", "HOMOGLYPH_DETECTED"]
}
}
}
```
## Detection Method
```bash
# Step 1: Pipe git diff through hexdump
git diff <file> | hexdump -C
# Step 2: In added (+) lines, look for multi-byte sequences
# where the removed (-) line had single-byte ASCII
#
# Example — Latin 'p' vs Cyrillic 'р':
# Removed: 22 70 70 67 22 | "ppg" | ← 70 = Latin 'p'
# Added: 22 d1 80 70 67 | "..pg" | ← d1 80 = Cyrillic 'р'
#
# The d1 80 bytes where 70 should be = HOMOGLYPH DETECTED
```
## Usage Example
```javascript
skill: {
name: 'homoglyph-detector',
context: {
projectRoot: '/path/to/project',
changedFiles: ['backend/app/prediction/temporal.py'],
scanMode: 'uncommitted'
}
}
```
## Real-World Example
From adversarial drill #6:
- **Attack**: Dictionary key `"ppg"` changed to `"рpg"` (Cyrillic р + Latin pg)
- **Camouflage**: 4 lines of harmless `round()` wrappers added as decoy
- **Impact**: All `dict.get("ppg")` lookups return default `0`, disabling trend detection
- **Detection**: `hexdump -C` revealed bytes `d1 80` where `70` was expected
## Process Files
- `nation-state-trojan-detection.js` — Phase 2: Homoglyph Detection (parallel with semantic analysis)Related Skills
geant4-detector-simulator
Geant4 detector simulation skill for particle transport, detector geometry, and physics process modeling
structural-variant-detector
Structural variant detection skill for identifying CNVs, inversions, translocations, and complex rearrangements
fusion-gene-detector
Gene fusion detection skill for oncology applications with multiple caller integration
memory-leak-detector
Detect memory leaks in desktop applications through heap analysis and object tracking
fairlearn-bias-detector
Fairness assessment skill using Fairlearn for bias detection, mitigation, and compliance reporting.
evidently-drift-detector
Evidently AI skill for data drift detection, model performance monitoring, target drift analysis, and automated reporting for ML systems in production.
code-smell-detector
Automated detection of code smells and anti-patterns to identify refactoring opportunities
terminal-capability-detector
Detect terminal capabilities including color support, TTY status, size, and Unicode support for adaptive CLI output.
prompt-injection-detector
Prompt injection detection and prevention for secure LLM applications
process-builder
Scaffold new babysitter process definitions following SDK patterns, proper structure, and best practices. Guides the 3-phase workflow from research to implementation.
babysitter
Orchestrate via @babysitter. Use this skill when asked to babysit a run, orchestrate a process or whenever it is called explicitly. (babysit, babysitter, orchestrate, orchestrate a run, workflow, etc.)
yolo
Run Babysitter autonomously with minimal manual interruption.