homoglyph-detector

Byte-level Unicode homoglyph detection for identifying invisible character substitutions in code

509 stars

Best use case

homoglyph-detector is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using homoglyph-detector should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/homoglyph-detector/SKILL.md --create-dirs "https://raw.githubusercontent.com/a5c-ai/babysitter/main/library/specializations/security-compliance/skills/homoglyph-detector/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/homoglyph-detector/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How homoglyph-detector Compares

| Feature / Agent | homoglyph-detector | Standard Approach |
|-----------------|--------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Byte-level Unicode homoglyph detection for identifying invisible character substitutions in code

Where can I find the source code?

The source code is on GitHub in the a5c-ai/babysitter repository, under library/specializations/security-compliance/skills/homoglyph-detector.

SKILL.md Source

# Homoglyph Detector

Byte-level forensic analysis of code changes to detect Unicode homoglyph substitutions: characters that render identically or near-identically to ASCII in most editors and diff tools but have different codepoints, which silently breaks string comparisons, dictionary lookups, and identifier resolution.

## Purpose

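Homoglyph attacks (CVE-2021-42694, disclosed alongside the bidi-based CVE-2021-42574 "Trojan Source") are among the stealthiest trojan techniques. A Cyrillic `р` (U+0440) is visually indistinguishable from a Latin `p` (U+0070) in most fonts, editors, and diff viewers. Visual review cannot catch the substitution; it only shows up when the underlying bytes or codepoints are inspected, for example with `hexdump`.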

This skill pipes git diffs through `hexdump -C` and scans for multi-byte UTF-8 sequences where single-byte ASCII is expected, particularly in string literals used as dictionary keys, variable names, and identifiers.
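
The scan described above can be approximated without reading hexdump output by eye. The following is a minimal Python sketch, not the skill's actual implementation: `multibyte_sequences` is a hypothetical helper name, and it assumes the diff bytes are valid UTF-8.

```python
def multibyte_sequences(data: bytes):
    """Return (byte_offset, hex_bytes, char) for every character that
    needs more than one byte in UTF-8.

    Assumption: `data` is valid UTF-8; invalid input raises
    UnicodeDecodeError instead of being silently repaired.
    """
    hits = []
    offset = 0
    for ch in data.decode("utf-8"):
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:
            # Multi-byte sequence where plain ASCII would take one byte
            hits.append((offset, encoded.hex(" "), ch))
        offset += len(encoded)
    return hits


# "\u0440" is Cyrillic 'р'; the escape keeps the example unambiguous.
sample = '"\u0440pg"'.encode("utf-8")
print(multibyte_sequences(sample))
```

To feed it a real diff, the bytes could come from `subprocess.run(["git", "diff", "--", path], capture_output=True, check=True).stdout`.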

## Capabilities

### Confusable Character Detection
Scans for these high-risk Unicode confusables:

| Latin (hex) | Cyrillic (UTF-8 hex) | Greek (UTF-8 hex) | Encoded length (Latin vs lookalike) |
|-------|----------|-------|-------------|
| a (61) | а (D0 B0) | α (CE B1) | 1 vs 2 bytes |
| c (63) | с (D1 81) | — | 1 vs 2 bytes |
| e (65) | е (D0 B5) | ε (CE B5) | 1 vs 2 bytes |
| o (6F) | о (D0 BE) | ο (CE BF) | 1 vs 2 bytes |
| p (70) | р (D1 80) | ρ (CF 81) | 1 vs 2 bytes |
| x (78) | х (D1 85) | χ (CF 87) | 1 vs 2 bytes |
| y (79) | у (D1 83) | — | 1 vs 2 bytes |
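
The table above can be expressed as a lookup. This is a sketch under the assumption that a simple per-character map is enough; `CONFUSABLES` and `find_confusables` are illustrative names, and the map covers only the rows listed in the table.

```python
# Lookalike character -> the ASCII letter it imitates (rows from the table).
CONFUSABLES = {
    # Cyrillic
    "\u0430": "a", "\u0441": "c", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0445": "x", "\u0443": "y",
    # Greek
    "\u03b1": "a", "\u03b5": "e", "\u03bf": "o",
    "\u03c1": "p", "\u03c7": "x",
}


def find_confusables(text):
    """Return (index, codepoint, ascii_lookalike, utf8_hex) per hit."""
    return [
        (i, f"U+{ord(ch):04X}", CONFUSABLES[ch],
         ch.encode("utf-8").hex(" ").upper())
        for i, ch in enumerate(text)
        if ch in CONFUSABLES
    ]


print(find_confusables("\u0440pg"))  # Cyrillic 'р' followed by Latin "pg"
```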

### Zero-Width Character Detection
- U+200B — Zero-width space
- U+200C — Zero-width non-joiner
- U+200D — Zero-width joiner
- U+FEFF — Byte order mark / zero-width no-break space (anywhere other than the start of a file)
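
A minimal sketch of this check follows. It assumes that a U+FEFF at index 0 of the file is a legitimate byte order mark and that every other occurrence is suspicious; `find_zero_width` is a hypothetical name.

```python
ZERO_WIDTH = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\ufeff": "BYTE ORDER MARK",
}


def find_zero_width(text):
    """Return (index, codepoint, name) for each zero-width character,
    skipping a byte order mark at the very start of the text."""
    hits = []
    for i, ch in enumerate(text):
        if ch not in ZERO_WIDTH:
            continue
        if ch == "\ufeff" and i == 0:
            continue  # assumed to be a legitimate leading BOM
        hits.append((i, f"U+{ord(ch):04X}", ZERO_WIDTH[ch]))
    return hits


print(find_zero_width("\ufeffuser\u200bname"))
```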

### Bidi Control Character Detection (Trojan Source)
- U+200F — Right-to-left mark
- U+200E — Left-to-right mark
- U+202A — Left-to-right embedding
- U+202B — Right-to-left embedding
- U+202C — Pop directional formatting
- U+2066 — Left-to-right isolate
- U+2067 — Right-to-left isolate
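
The list above can be checked line by line so that each hit carries a position a reviewer can jump to. This sketch covers only the seven codepoints listed here; `find_bidi_controls` and the shape of the returned records are assumptions.

```python
BIDI_CONTROLS = {
    "\u200e": "LEFT-TO-RIGHT MARK",
    "\u200f": "RIGHT-TO-LEFT MARK",
    "\u202a": "LEFT-TO-RIGHT EMBEDDING",
    "\u202b": "RIGHT-TO-LEFT EMBEDDING",
    "\u202c": "POP DIRECTIONAL FORMATTING",
    "\u2066": "LEFT-TO-RIGHT ISOLATE",
    "\u2067": "RIGHT-TO-LEFT ISOLATE",
}


def find_bidi_controls(source):
    """Return one record per bidi control character, with 1-based
    line and column numbers."""
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append({
                    "line": line_no,
                    "column": col,
                    "codepoint": f"U+{ord(ch):04X}",
                    "name": BIDI_CONTROLS[ch],
                })
    return hits


print(find_bidi_controls('ok = 1\nif role == "admin\u2067":\n'))
```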

### Context-Aware Analysis
- Focuses on **string literals** (dictionary keys, config values)
- Focuses on **identifiers** (variable names, function names, class names)
- Ignores legitimate Unicode in comments, docstrings, and i18n strings
- Compares byte patterns between removed (-) and added (+) diff lines
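
For Python sources, the focus on identifiers and string literals can be sketched with the standard `tokenize` module, which separates `NAME` and `STRING` tokens from comments. This is an illustration rather than the skill's own logic: `non_ascii_tokens` is a hypothetical name, and the sketch does not exempt docstrings or i18n strings, which also arrive as `STRING` tokens.

```python
import io
import tokenize


def non_ascii_tokens(source):
    """Return (line, token_kind, token_text) for identifiers and string
    literals that contain any non-ASCII character. Comments are ignored."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NAME, tokenize.STRING) and not tok.string.isascii():
            hits.append((tok.start[0], tokenize.tok_name[tok.type], tok.string))
    return hits


# The string key starts with Cyrillic 'р' (U+0440); the comment holds
# legitimate Cyrillic text and is not reported.
code = 'value = stats.get("\u0440pg", 0)  # \u043a\u043e\u043c\n'
print(non_ascii_tokens(code))
```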

## Input Schema

```json
{
  "type": "object",
  "required": ["projectRoot", "changedFiles"],
  "properties": {
    "projectRoot": {
      "type": "string",
      "description": "Absolute path to the git repository"
    },
    "changedFiles": {
      "type": "array",
      "items": { "type": "string" },
      "description": "List of changed file paths to scan"
    },
    "scanMode": {
      "type": "string",
      "enum": ["uncommitted", "commit-range", "branch-diff"],
      "default": "uncommitted"
    },
    "baseRef": { "type": "string" },
    "headRef": { "type": "string" }
  }
}
```
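
The schema does not say how each `scanMode` translates into a git command. One plausible mapping, offered purely as an assumption, is sketched below; `git_diff_args` is a hypothetical helper, and the choice of `HEAD`, `A..B`, and `A...B` is mine rather than the skill's.

```python
def git_diff_args(scan_mode="uncommitted", changed_files=(),
                  base_ref=None, head_ref=None):
    """Build an argv list for `git diff` from the input fields.

    Assumed mapping (not confirmed by the skill):
      uncommitted  -> working tree and index vs HEAD
      commit-range -> base..head
      branch-diff  -> base...head (diff from the merge base)
    """
    if scan_mode == "uncommitted":
        revs = ["HEAD"]
    elif scan_mode in ("commit-range", "branch-diff"):
        if not (base_ref and head_ref):
            raise ValueError(f"{scan_mode} requires baseRef and headRef")
        dots = ".." if scan_mode == "commit-range" else "..."
        revs = [f"{base_ref}{dots}{head_ref}"]
    else:
        raise ValueError(f"unknown scanMode: {scan_mode!r}")
    return ["git", "diff", *revs, "--", *changed_files]


print(git_diff_args("branch-diff", ["backend/app/prediction/temporal.py"],
                    base_ref="main", head_ref="feature"))
```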

## Output Schema

```json
{
  "type": "object",
  "required": ["filesScanned", "homoglyphsFound", "verdict"],
  "properties": {
    "filesScanned": { "type": "number" },
    "homoglyphsFound": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "file": { "type": "string" },
          "line": { "type": "number" },
          "byteOffset": { "type": "string" },
          "context": { "type": "string" },
          "expectedAscii": { "type": "string" },
          "actualBytes": { "type": "string" },
          "unicodeCodepoint": { "type": "string" },
          "scriptName": { "type": "string" },
          "impact": { "type": "string" }
        }
      }
    },
    "bidiControlChars": { "type": "array" },
    "verdict": {
      "type": "string",
      "enum": ["CLEAN", "HOMOGLYPH_DETECTED"]
    }
  }
}
```
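
A record matching this schema could be assembled as follows. The field formats are assumptions, since the schema only gives types: `byteOffset` is rendered as a hex string, `scriptName` is taken from the first word of the Unicode character name, and the verdict flips to `HOMOGLYPH_DETECTED` when either list is non-empty. `make_finding` and `build_report` are hypothetical names.

```python
import unicodedata


def make_finding(file, line, byte_offset, context, char, expected_ascii, impact):
    """Build one `homoglyphsFound` item for a single suspicious character."""
    name = unicodedata.name(char, "UNKNOWN")
    return {
        "file": file,
        "line": line,
        "byteOffset": f"0x{byte_offset:08x}",        # assumed format
        "context": context,
        "expectedAscii": expected_ascii,
        "actualBytes": char.encode("utf-8").hex(" "),
        "unicodeCodepoint": f"U+{ord(char):04X}",
        "scriptName": name.split(" ")[0].title(),     # e.g. "Cyrillic"
        "impact": impact,
    }


def build_report(files_scanned, findings, bidi_control_chars=()):
    """Wrap findings in the top-level output object."""
    dirty = bool(findings) or bool(bidi_control_chars)  # assumed rule
    return {
        "filesScanned": files_scanned,
        "homoglyphsFound": list(findings),
        "bidiControlChars": list(bidi_control_chars),
        "verdict": "HOMOGLYPH_DETECTED" if dirty else "CLEAN",
    }
```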

## Detection Method

```bash
# Step 1: Pipe git diff through hexdump
git diff <file> | hexdump -C

# Step 2: In added (+) lines, look for multi-byte sequences
# where the removed (-) line had single-byte ASCII
#
# Example: Latin 'p' vs Cyrillic 'р':
# Removed: 22 70 70 67 22     |  "ppg"  |   ← 70 = Latin 'p'
# Added:   22 d1 80 70 67 22  |  "..pg" |   ← d1 80 = Cyrillic 'р'
#
# The d1 80 bytes where 70 should be = HOMOGLYPH DETECTED
```
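
The removed/added comparison in Step 2 can also be done per character, which keeps the two lines aligned even though the lookalike takes more bytes. This is a sketch under simplifying assumptions: each removed line is paired with the next added line in the same hunk, and any line starting with `---` or `+++` is treated as a file header. `compare_diff_pairs` is a hypothetical name.

```python
def compare_diff_pairs(diff_text):
    """Return (column, old_char, new_char, new_utf8_hex) wherever an ASCII
    character in a removed line became non-ASCII in the paired added line."""
    pending_removed = []
    hits = []
    for raw in diff_text.splitlines():
        if raw.startswith(("---", "+++")):
            continue  # assumed to be file headers
        if raw.startswith("-"):
            pending_removed.append(raw[1:])
        elif raw.startswith("+") and pending_removed:
            old, new = pending_removed.pop(0), raw[1:]
            for col, (a, b) in enumerate(zip(old, new), start=1):
                if a != b and a.isascii() and not b.isascii():
                    hits.append((col, a, b, b.encode("utf-8").hex(" ")))
        elif not raw.startswith("+"):
            pending_removed.clear()  # context or hunk header ends the pairing
    return hits


diff = '@@ -1 +1 @@\n-x = d.get("ppg")\n+x = d.get("\u0440pg")\n'
print(compare_diff_pairs(diff))
```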

## Usage Example

```javascript
skill: {
  name: 'homoglyph-detector',
  context: {
    projectRoot: '/path/to/project',
    changedFiles: ['backend/app/prediction/temporal.py'],
    scanMode: 'uncommitted'
  }
}
```

## Real-World Example

From adversarial drill #6:
- **Attack**: Dictionary key `"ppg"` changed to `"рpg"` (Cyrillic р + Latin pg)
- **Camouflage**: 4 lines of harmless `round()` wrappers added as decoy
- **Impact**: All `dict.get("ppg")` lookups return default `0`, disabling trend detection
- **Detection**: `hexdump -C` revealed bytes `d1 80` where `70` was expected

## Process Files

- `nation-state-trojan-detection.js` — Phase 2: Homoglyph Detection (parallel with semantic analysis)
