axiom-vision-ref
Use when needing Vision framework API details for hand/body pose, segmentation, text recognition, barcode detection, document scanning, or Visual Intelligence integration. Covers VNRequest types, coordinate conversion, DataScannerViewController, RecognizeDocumentsRequest, SemanticContentDescriptor, IntentValueQuery.
Best use case
axiom-vision-ref is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using axiom-vision-ref can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/axiom-vision-ref/SKILL.md` inside your project
- Restart your AI agent — it will auto-discover the skill
How axiom-vision-ref Compares
| Feature / Agent | axiom-vision-ref | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Use when needing Vision framework API details for hand/body pose, segmentation, text recognition, barcode detection, document scanning, or Visual Intelligence integration. Covers VNRequest types, coordinate conversion, DataScannerViewController, RecognizeDocumentsRequest, SemanticContentDescriptor, IntentValueQuery.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Vision Framework API Reference
Comprehensive reference for Vision framework computer vision: subject segmentation, hand/body pose detection, person detection, face analysis, text recognition (OCR), barcode detection, and document scanning.
## When to Use This Reference
- **Implementing subject lifting** using VisionKit or Vision
- **Detecting hand/body poses** for gesture recognition or fitness apps
- **Segmenting people** from backgrounds or separating multiple individuals
- **Face detection and landmarks** for AR effects or authentication
- **Combining Vision APIs** to solve complex computer vision problems
- **Looking up specific API signatures** and parameter meanings
- **Recognizing text** in images (OCR) with VNRecognizeTextRequest
- **Detecting barcodes** and QR codes with VNDetectBarcodesRequest
- **Building live scanners** with DataScannerViewController
- **Scanning documents** with VNDocumentCameraViewController
- **Extracting structured document data** with RecognizeDocumentsRequest (iOS 26+)
**Related skills**: See `axiom-vision` for decision trees and patterns, `axiom-vision-diag` for troubleshooting
## Vision Framework Overview
Vision provides computer vision algorithms for still images and video:
**Core workflow**:
1. Create request (e.g., `VNDetectHumanHandPoseRequest()`)
2. Create handler with image (`VNImageRequestHandler(cgImage: image)`)
3. Perform request (`try handler.perform([request])`)
4. Access observations from `request.results`
**Coordinate system**: Lower-left origin, normalized (0.0-1.0) coordinates
**Performance**: Run requests on a background queue. Vision work is resource-intensive and will block the UI if performed on the main thread.
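Converting out of this coordinate space is a common first step before drawing results. A minimal sketch for UIKit-style views with a top-left origin (helper names are ours, and it assumes the image fills the view exactly, ignoring aspect-fit letterboxing):
```swift
import UIKit

// Map a normalized Vision point (lower-left origin) into view coordinates
// (top-left origin) by scaling and flipping the y-axis.
func viewPoint(fromVisionPoint p: CGPoint, viewSize: CGSize) -> CGPoint {
    CGPoint(x: p.x * viewSize.width,
            y: (1 - p.y) * viewSize.height)
}

// Same idea for a normalized bounding box such as observation.boundingBox:
// the flipped origin also shifts down by the rect's height.
func viewRect(fromVisionRect r: CGRect, viewSize: CGSize) -> CGRect {
    CGRect(x: r.origin.x * viewSize.width,
           y: (1 - r.origin.y - r.height) * viewSize.height,
           width: r.width * viewSize.width,
           height: r.height * viewSize.height)
}
```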
## Request Handlers
Vision provides two request handlers for different scenarios.
### VNImageRequestHandler
Analyzes a **single image**. Initialize with the image, perform requests against it, discard.
```swift
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request1, request2]) // Multiple requests, one image
```
**Initialize with**: `CGImage`, `CIImage`, `CVPixelBuffer`, `Data`, or `URL`
**Rule**: One handler per image. Reusing a handler with a different image is unsupported.
### VNSequenceRequestHandler
Analyzes a **sequence of frames** (video, camera feed). Initialize empty, pass each frame to `perform()`. Maintains inter-frame state for temporal smoothing.
```swift
let sequenceHandler = VNSequenceRequestHandler()
// In your camera/video frame callback:
func processFrame(_ pixelBuffer: CVPixelBuffer) throws {
    try sequenceHandler.perform([request], on: pixelBuffer)
}
```
**Rule**: Create once, reuse across frames. The handler tracks state between calls.
### When to Use Which
| Use Case | Handler |
|----------|---------|
| Single photo or screenshot | `VNImageRequestHandler` |
| Video stream or camera frames | `VNSequenceRequestHandler` |
| Temporal smoothing (pose, segmentation) | `VNSequenceRequestHandler` |
| One-off analysis of a CVPixelBuffer | `VNImageRequestHandler` |
### Requests That Benefit from Sequence Handling
These requests use inter-frame state when run through `VNSequenceRequestHandler`:
- `VNDetectHumanBodyPoseRequest` — Smoother joint tracking
- `VNDetectHumanHandPoseRequest` — Smoother landmark tracking
- `VNGeneratePersonSegmentationRequest` — Temporally consistent masks
- `VNGeneratePersonInstanceMaskRequest` — Stable person identity across frames
- `VNDetectDocumentSegmentationRequest` — Stable document edges
- Any `VNStatefulRequest` subclass — Designed for sequences
### Common Mistake
Creating a new `VNImageRequestHandler` per video frame discards temporal context. Pose landmarks jitter, segmentation masks flicker, and you lose the smoothing that sequence handling provides.
```swift
// Wrong — loses temporal context every frame
func processFrame(_ buffer: CVPixelBuffer) throws {
    let handler = VNImageRequestHandler(cvPixelBuffer: buffer)
    try handler.perform([poseRequest])
}

// Right — maintains inter-frame state
let sequenceHandler = VNSequenceRequestHandler()
func processFrame(_ buffer: CVPixelBuffer) throws {
    try sequenceHandler.perform([poseRequest], on: buffer)
}
```
## Subject Segmentation APIs
### VNGenerateForegroundInstanceMaskRequest
**Availability**: iOS 17+, macOS 14+, tvOS 17+, visionOS 1+
Generates class-agnostic instance mask of foreground objects (people, pets, buildings, food, shoes, etc.)
#### Basic Usage
```swift
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
    return
}
```
#### InstanceMaskObservation
**allInstances**: `IndexSet` containing all foreground instance indices (excludes background 0)
**instanceMask**: `CVPixelBuffer` with UInt8 labels (0 = background, 1+ = instance indices)
**instanceAtPoint(_:)**: Returns instance index at normalized point
```swift
let point = CGPoint(x: 0.5, y: 0.5) // Center of image
let instance = observation.instanceAtPoint(point)
if instance == 0 {
    print("Background tapped")
} else {
    print("Instance \(instance) tapped")
}
```
#### Generating Masks
**createScaledMask(for:croppedToInstancesContent:)**
Parameters:
- `for`: `IndexSet` of instances to include
- `croppedToInstancesContent`:
- `false` = Output matches input resolution (for compositing)
- `true` = Tight crop around selected instances
Returns: Single-channel floating-point `CVPixelBuffer` (soft segmentation mask)
```swift
// All instances, full resolution
let mask = try observation.createScaledMask(
    for: observation.allInstances,
    croppedToInstancesContent: false
)

// Single instance, cropped
let instances = IndexSet(integer: 1)
let croppedMask = try observation.createScaledMask(
    for: instances,
    croppedToInstancesContent: true
)
```
#### Instance Mask Hit Testing
Access raw pixel buffer to map tap coordinates to instance labels:
```swift
let instanceMask = observation.instanceMask
CVPixelBufferLockBaseAddress(instanceMask, .readOnly)
defer { CVPixelBufferUnlockBaseAddress(instanceMask, .readOnly) }
let baseAddress = CVPixelBufferGetBaseAddress(instanceMask)
let width = CVPixelBufferGetWidth(instanceMask)
let bytesPerRow = CVPixelBufferGetBytesPerRow(instanceMask)
// Convert normalized tap to pixel coordinates
let pixelPoint = VNImagePointForNormalizedPoint(
    CGPoint(x: normalizedX, y: normalizedY),
    imageWidth,
    imageHeight
)
// Calculate byte offset
let offset = Int(pixelPoint.y) * bytesPerRow + Int(pixelPoint.x)
// Read instance label
let label = UnsafeRawPointer(baseAddress!).load(
    fromByteOffset: offset,
    as: UInt8.self
)
let instances = label == 0 ? observation.allInstances : IndexSet(integer: Int(label))
```
## VisionKit Subject Lifting
### ImageAnalysisInteraction (iOS)
**Availability**: iOS 16+, iPadOS 16+
Adds system-like subject lifting UI to views:
```swift
let interaction = ImageAnalysisInteraction()
interaction.preferredInteractionTypes = .imageSubject // Or .automatic
imageView.addInteraction(interaction)
```
**Interaction types**:
- `.automatic`: Subject lifting + Live Text + data detectors
- `.imageSubject`: Subject lifting only (no interactive text)
### ImageAnalysisOverlayView (macOS)
**Availability**: macOS 13+
```swift
let overlayView = ImageAnalysisOverlayView()
overlayView.preferredInteractionTypes = .imageSubject
nsView.addSubview(overlayView)
```
### Programmatic Access
#### ImageAnalyzer
```swift
let analyzer = ImageAnalyzer()
let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])
let analysis = try await analyzer.analyze(image, configuration: configuration)
```
#### ImageAnalysis
**subjects**: `[Subject]` - All subjects in image
**highlightedSubjects**: `Set<Subject>` - Currently highlighted (user long-pressed)
**subject(at:)**: Async lookup of subject at normalized point (returns `nil` if none)
```swift
// Get all subjects
let subjects = analysis.subjects
// Look up subject at tap
if let subject = try await analysis.subject(at: tapPoint) {
    // Process subject
}
// Change highlight state
analysis.highlightedSubjects = Set([subjects[0], subjects[1]])
```
#### Subject Struct
**image**: `UIImage`/`NSImage` - Extracted subject with transparency
**bounds**: `CGRect` - Subject boundaries in image coordinates
```swift
// Single subject image
let subjectImage = subject.image
// Composite multiple subjects
let compositeImage = try await analysis.image(for: [subject1, subject2])
```
**Out-of-process**: VisionKit analysis happens out-of-process (performance benefit, image size limited)
## Person Segmentation APIs
### VNGeneratePersonSegmentationRequest
**Availability**: iOS 15+, macOS 12+
Returns single mask containing **all people** in image:
```swift
let request = VNGeneratePersonSegmentationRequest()
// Configure quality level if needed
try handler.perform([request])
guard let observation = request.results?.first as? VNPixelBufferObservation else {
    return
}
let personMask = observation.pixelBuffer // CVPixelBuffer
```
### VNGeneratePersonInstanceMaskRequest
**Availability**: iOS 17+, macOS 14+
Returns **separate masks for up to 4 people**:
```swift
let request = VNGeneratePersonInstanceMaskRequest()
try handler.perform([request])
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
    return
}
// Same InstanceMaskObservation API as foreground instance masks
let allPeople = observation.allInstances // Up to 4 people (1-4)
// Get mask for person 1
let person1Mask = try observation.createScaledMask(
    for: IndexSet(integer: 1),
    croppedToInstancesContent: false
)
```
**Limitations**:
- Segments up to 4 people
- With >4 people: may miss people or combine them (typically background people)
- Use `VNDetectFaceRectanglesRequest` to count faces if you need to handle crowded scenes
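To act on the last point above, one workable pattern (a sketch; the fall-back policy is our assumption, not Apple guidance) is to count faces first and only request per-person masks when the scene is not crowded:
```swift
import Vision

// Sketch: instance masks top out at 4 people, so fall back to the single
// combined mask when a face count suggests a crowded scene.
func personMasks(in image: CGImage) throws -> [CVPixelBuffer] {
    let handler = VNImageRequestHandler(cgImage: image)

    let faceRequest = VNDetectFaceRectanglesRequest()
    try handler.perform([faceRequest])
    let faceCount = faceRequest.results?.count ?? 0

    if faceCount > 4 {
        // Crowded: one mask covering everyone is more reliable
        let request = VNGeneratePersonSegmentationRequest()
        try handler.perform([request])
        guard let obs = request.results?.first as? VNPixelBufferObservation else { return [] }
        return [obs.pixelBuffer]
    } else {
        // Up to 4 people: separate per-person masks
        let request = VNGeneratePersonInstanceMaskRequest()
        try handler.perform([request])
        guard let obs = request.results?.first as? VNInstanceMaskObservation else { return [] }
        return try obs.allInstances.map {
            try obs.createScaledMask(for: IndexSet(integer: $0),
                                     croppedToInstancesContent: false)
        }
    }
}
```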
## Hand Pose Detection
### VNDetectHumanHandPoseRequest
**Availability**: iOS 14+, macOS 11+
Detects **21 hand landmarks** per hand:
```swift
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 2 // Default: 2, increase if needed
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
for observation in request.results as? [VNHumanHandPoseObservation] ?? [] {
    // Process each hand
}
```
**Performance note**: `maximumHandCount` affects latency. Pose computed only for hands ≤ maximum. Set to lowest acceptable value.
### Hand Landmarks (21 points)
**Wrist**: 1 landmark
**Thumb** (4 landmarks):
- `.thumbTip`
- `.thumbIP` (interphalangeal joint)
- `.thumbMP` (metacarpophalangeal joint)
- `.thumbCMC` (carpometacarpal joint)
**Fingers** (4 landmarks each):
- Tip (`.indexTip`, `.middleTip`, `.ringTip`, `.littleTip`)
- DIP (distal interphalangeal joint)
- PIP (proximal interphalangeal joint)
- MCP (metacarpophalangeal joint)
### Group Keys
Access landmark groups:
| Group Key | Points |
|-----------|--------|
| `.all` | All 21 landmarks |
| `.thumb` | 4 thumb joints |
| `.indexFinger` | 4 index finger joints |
| `.middleFinger` | 4 middle finger joints |
| `.ringFinger` | 4 ring finger joints |
| `.littleFinger` | 4 little finger joints |
```swift
// Get all points
let allPoints = try observation.recognizedPoints(.all)
// Get index finger points only
let indexPoints = try observation.recognizedPoints(.indexFinger)
// Get specific point
let thumbTip = try observation.recognizedPoint(.thumbTip)
let indexTip = try observation.recognizedPoint(.indexTip)
// Check confidence
guard thumbTip.confidence > 0.5 else { return }
// Access location (normalized coordinates, lower-left origin)
let location = thumbTip.location // CGPoint
```
### Gesture Recognition Example (Pinch)
```swift
let thumbTip = try observation.recognizedPoint(.thumbTip)
let indexTip = try observation.recognizedPoint(.indexTip)
guard thumbTip.confidence > 0.5, indexTip.confidence > 0.5 else {
    return
}
let distance = hypot(
    thumbTip.location.x - indexTip.location.x,
    thumbTip.location.y - indexTip.location.y
)
let isPinching = distance < 0.05 // Normalized threshold
```
### Chirality (Handedness)
```swift
let chirality = observation.chirality // .left or .right or .unknown
```
## Body Pose Detection
### VNDetectHumanBodyPoseRequest (2D)
**Availability**: iOS 14+, macOS 11+
Detects **19 body landmarks** (2D normalized coordinates):
```swift
let request = VNDetectHumanBodyPoseRequest()
try handler.perform([request])
for observation in request.results as? [VNHumanBodyPoseObservation] ?? [] {
    // Process each person
}
```
### Body Landmarks (19 points)
**Face** (5 landmarks):
- `.nose`, `.leftEye`, `.rightEye`, `.leftEar`, `.rightEar`
**Arms** (6 landmarks):
- Left: `.leftShoulder`, `.leftElbow`, `.leftWrist`
- Right: `.rightShoulder`, `.rightElbow`, `.rightWrist`
**Torso** (7 landmarks):
- `.neck` (between shoulders)
- `.leftShoulder`, `.rightShoulder` (also in arm groups)
- `.leftHip`, `.rightHip`
- `.root` (between hips)
**Legs** (6 landmarks):
- Left: `.leftHip`, `.leftKnee`, `.leftAnkle`
- Right: `.rightHip`, `.rightKnee`, `.rightAnkle`
**Note**: Shoulders and hips appear in multiple groups
### Group Keys (Body)
| Group Key | Points |
|-----------|--------|
| `.all` | All 19 landmarks |
| `.face` | 5 face landmarks |
| `.leftArm` | shoulder, elbow, wrist |
| `.rightArm` | shoulder, elbow, wrist |
| `.torso` | neck, shoulders, hips, root |
| `.leftLeg` | hip, knee, ankle |
| `.rightLeg` | hip, knee, ankle |
```swift
// Get all body points
let allPoints = try observation.recognizedPoints(.all)
// Get left arm only
let leftArmPoints = try observation.recognizedPoints(.leftArm)
// Get specific joint
let leftWrist = try observation.recognizedPoint(.leftWrist)
```
### VNDetectHumanBodyPose3DRequest (3D)
**Availability**: iOS 17+, macOS 14+
Returns **3D skeleton with 17 joints** in meters (real-world coordinates):
```swift
let request = VNDetectHumanBodyPose3DRequest()
try handler.perform([request])
guard let observation = request.results?.first as? VNHumanBodyPose3DObservation else {
    return
}
// Get 3D joint position
let leftWrist = try observation.recognizedPoint(.leftWrist)
let position = leftWrist.position // simd_float4x4 matrix
let localPosition = leftWrist.localPosition // Relative to parent joint
```
**3D Body Landmarks** (17 points): Same limb and torso skeleton as 2D, but with 17 joints instead of 19; the five face landmarks are replaced by head and spine joints
#### 3D Observation Properties
**bodyHeight**: Estimated height in meters
- With depth data: Measured height
- Without depth data: Reference height (1.8m)
**heightEstimation**: `.measured` or `.reference`
**cameraOriginMatrix**: `simd_float4x4` camera position/orientation relative to subject
**pointInImage(\_:)**: Project 3D joint back to 2D image coordinates
```swift
let wrist2D = try observation.pointInImage(leftWrist)
```
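A small sketch combining the height properties, so callers only treat `bodyHeight` as a real measurement when depth was actually available:
```swift
// Only trust bodyHeight as a measurement when depth data was supplied
switch observation.heightEstimation {
case .measured:
    print("Measured height: \(observation.bodyHeight) m")
case .reference:
    print("No depth data; bodyHeight is the 1.8 m reference value")
@unknown default:
    break
}
```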
#### 3D Point Classes
**VNPoint3D**: Base class with `simd_float4x4` position matrix
**VNRecognizedPoint3D**: Adds identifier (joint name)
**VNHumanBodyRecognizedPoint3D**: Adds `localPosition` and `parentJoint`
```swift
// Position relative to skeleton root (center of hip)
let modelPosition = leftWrist.position
// Position relative to parent joint (left elbow)
let relativePosition = leftWrist.localPosition
```
#### Depth Input
Vision accepts depth data alongside images:
```swift
// From AVDepthData
let handler = VNImageRequestHandler(
    cvPixelBuffer: imageBuffer,
    depthData: depthData,
    orientation: orientation
)
// From file (automatic depth extraction)
let handler = VNImageRequestHandler(url: imageURL) // Depth auto-fetched
```
**Depth formats**: Disparity or Depth (interchangeable via AVFoundation)
**LiDAR**: Use in live capture sessions for accurate scale/measurement
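Per the note above, AVFoundation converts between the two representations. A one-line sketch (`disparityData` stands in for an `AVDepthData` from your capture pipeline):
```swift
import AVFoundation

// Convert disparity to 32-bit depth before handing it to Vision
let depthData = disparityData.converting(toDepthDataType: kCVPixelFormatType_DepthFloat32)
```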
## Face Detection & Landmarks
### VNDetectFaceRectanglesRequest
**Availability**: iOS 11+
Detects face bounding boxes:
```swift
let request = VNDetectFaceRectanglesRequest()
try handler.perform([request])
for observation in request.results as? [VNFaceObservation] ?? [] {
    let faceBounds = observation.boundingBox // Normalized rect
}
```
### VNDetectFaceLandmarksRequest
**Availability**: iOS 11+
Detects face with detailed landmarks:
```swift
let request = VNDetectFaceLandmarksRequest()
try handler.perform([request])
for observation in request.results as? [VNFaceObservation] ?? [] {
    if let landmarks = observation.landmarks {
        let leftEye = landmarks.leftEye
        let nose = landmarks.nose
        let leftPupil = landmarks.leftPupil // Revision 3+
    }
}
```
**Revisions**:
- Revision 1: Basic landmarks
- Revision 2: Detects upside-down faces
- Revision 3+: Pupil locations
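Revisions are opted into per request; pinning one keeps behavior stable across OS updates. A short sketch using the revision constants:
```swift
let request = VNDetectFaceLandmarksRequest()
// Pin revision 3 to get pupil landmarks; omit to accept the platform default
request.revision = VNDetectFaceLandmarksRequestRevision3
```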
## Person Detection
### VNDetectHumanRectanglesRequest
**Availability**: iOS 13+
Detects human bounding boxes (torso detection):
```swift
let request = VNDetectHumanRectanglesRequest()
try handler.perform([request])
for observation in request.results as? [VNHumanObservation] ?? [] {
    let humanBounds = observation.boundingBox // Normalized rect
}
```
**Use case**: Faster than pose detection when you only need location
## CoreImage Integration
### CIBlendWithMask Filter
Composite subject on new background using Vision mask:
```swift
// 1. Get mask from Vision
guard let observation = request.results?.first as? VNInstanceMaskObservation else { return }
let visionMask = try observation.createScaledMask(
    for: observation.allInstances,
    croppedToInstancesContent: false
)
// 2. Convert to CIImage
let maskImage = CIImage(cvPixelBuffer: visionMask)
// 3. Apply filter
let filter = CIFilter(name: "CIBlendWithMask")!
filter.setValue(sourceImage, forKey: kCIInputImageKey)
filter.setValue(maskImage, forKey: kCIInputMaskImageKey)
filter.setValue(newBackground, forKey: kCIInputBackgroundImageKey)
let output = filter.outputImage // Composited result
```
**Parameters**:
- **Input image**: Original image to mask
- **Mask image**: Vision's soft segmentation mask
- **Background image**: New background (or empty image for transparency)
**HDR preservation**: CoreImage preserves high dynamic range from input (Vision/VisionKit output is SDR)
## Text Recognition APIs
### VNRecognizeTextRequest
**Availability**: iOS 13+, macOS 10.15+
Recognizes text in images with configurable accuracy/speed trade-off.
#### Basic Usage
```swift
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate // Or .fast
request.recognitionLanguages = ["en-US", "de-DE"] // Order matters
request.usesLanguageCorrection = true
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
for observation in request.results as? [VNRecognizedTextObservation] ?? [] {
    // Get top candidates
    let candidates = observation.topCandidates(3)
    let bestText = candidates.first?.string ?? ""
}
```
#### Recognition Levels
| Level | Performance | Accuracy | Best For |
|-------|-------------|----------|----------|
| `.fast` | Real-time | Good | Camera feed, large text, signs |
| `.accurate` | Slower | Excellent | Documents, receipts, handwriting |
**Fast path**: Character-by-character recognition (Neural Network → Character Detection)
**Accurate path**: Full-line ML recognition (Neural Network → Line/Word Recognition)
#### Properties
| Property | Type | Description |
|----------|------|-------------|
| `recognitionLevel` | `VNRequestTextRecognitionLevel` | `.fast` or `.accurate` |
| `recognitionLanguages` | `[String]` | BCP 47 language codes, order = priority |
| `usesLanguageCorrection` | `Bool` | Use language model for correction |
| `customWords` | `[String]` | Domain-specific vocabulary |
| `automaticallyDetectsLanguage` | `Bool` | Auto-detect language (iOS 16+) |
| `minimumTextHeight` | `Float` | Min text height as fraction of image (0-1) |
| `revision` | `Int` | API version (affects supported languages) |
#### Language Support
```swift
// Check supported languages for current settings
let languages = try VNRecognizeTextRequest.supportedRecognitionLanguages(
    for: .accurate,
    revision: VNRecognizeTextRequestRevision3
)
```
**Language correction**: Improves accuracy but takes processing time. Disable for codes/serial numbers.
**Custom words**: Add domain-specific vocabulary for better recognition (medical terms, product codes).
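For example, a receipt scanner might keep correction on but teach it product names, while a serial-number scanner disables correction entirely (the vocabulary below is made up):
```swift
// Product scans: correction on, supplemented with domain vocabulary
let productRequest = VNRecognizeTextRequest()
productRequest.usesLanguageCorrection = true
productRequest.customWords = ["AxiomCam", "QX-5000"] // Hypothetical product names

// Serial numbers: correction would "fix" codes into dictionary words
let serialRequest = VNRecognizeTextRequest()
serialRequest.usesLanguageCorrection = false
```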
#### VNRecognizedTextObservation
**boundingBox**: Normalized rect containing recognized text
**topCandidates(_:)**: Returns `[VNRecognizedText]` ordered by confidence
#### VNRecognizedText
| Property | Type | Description |
|----------|------|-------------|
| `string` | `String` | Recognized text |
| `confidence` | `VNConfidence` | 0.0-1.0 |
| `boundingBox(for:)` | `VNRectangleObservation?` | Box for substring range |
```swift
// Get bounding box for substring
let text = candidate.string
if let range = text.range(of: "invoice") {
    let box = try candidate.boundingBox(for: range)
}
```
## Barcode Detection APIs
### VNDetectBarcodesRequest
**Availability**: iOS 11+, macOS 10.13+
Detects and decodes barcodes and QR codes.
#### Basic Usage
```swift
let request = VNDetectBarcodesRequest()
request.symbologies = [.qr, .ean13, .code128] // Specific codes
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
for barcode in request.results as? [VNBarcodeObservation] ?? [] {
    let payload = barcode.payloadStringValue
    let type = barcode.symbology
    let bounds = barcode.boundingBox
}
```
#### Symbologies
**1D Barcodes**:
- `.codabar` (iOS 15+)
- `.code39`, `.code39Checksum`, `.code39FullASCII`, `.code39FullASCIIChecksum`
- `.code93`, `.code93i`
- `.code128`
- `.ean8`, `.ean13`
- `.gs1DataBar`, `.gs1DataBarExpanded`, `.gs1DataBarLimited` (iOS 15+)
- `.i2of5`, `.i2of5Checksum`
- `.itf14`
- `.upce`
**2D Codes**:
- `.aztec`
- `.dataMatrix`
- `.microPDF417` (iOS 15+)
- `.microQR` (iOS 15+)
- `.pdf417`
- `.qr`
**Performance**: Specifying fewer symbologies = faster detection
#### Revisions
| Revision | iOS | Features |
|----------|-----|----------|
| 1 | 11+ | Basic detection, one code at a time |
| 2 | 15+ | Codabar, GS1, MicroPDF, MicroQR, better ROI |
| 3 | 16+ | ML-based, multiple codes, better bounding boxes |
#### VNBarcodeObservation
| Property | Type | Description |
|----------|------|-------------|
| `payloadStringValue` | `String?` | Decoded content |
| `symbology` | `VNBarcodeSymbology` | Barcode type |
| `boundingBox` | `CGRect` | Normalized bounds |
| `topLeft/topRight/bottomLeft/bottomRight` | `CGPoint` | Corner points |
## VisionKit Scanner APIs
### DataScannerViewController
**Availability**: iOS 16+
Camera-based live scanner with built-in UI for text and barcodes.
#### Check Availability
```swift
// Hardware support
DataScannerViewController.isSupported
// Runtime availability (camera access, parental controls)
DataScannerViewController.isAvailable
```
#### Configuration
```swift
import VisionKit
let dataTypes: Set<DataScannerViewController.RecognizedDataType> = [
    .barcode(symbologies: [.qr, .ean13]),
    .text(textContentType: .URL), // Or nil for all text
    // .text(languages: ["ja"]) // Filter by language
]

let scanner = DataScannerViewController(
    recognizedDataTypes: dataTypes,
    qualityLevel: .balanced, // .fast, .balanced, .accurate
    recognizesMultipleItems: true,
    isHighFrameRateTrackingEnabled: true,
    isPinchToZoomEnabled: true,
    isGuidanceEnabled: true,
    isHighlightingEnabled: true
)
scanner.delegate = self

present(scanner, animated: true) {
    try? scanner.startScanning()
}
```
#### RecognizedDataType
| Type | Description |
|------|-------------|
| `.barcode(symbologies:)` | Specific barcode types |
| `.text()` | All text |
| `.text(languages:)` | Text filtered by language |
| `.text(textContentType:)` | Text filtered by type (URL, phone, email) |
#### Delegate Protocol
```swift
protocol DataScannerViewControllerDelegate {
    func dataScanner(_ dataScanner: DataScannerViewController,
                     didTapOn item: RecognizedItem)
    func dataScanner(_ dataScanner: DataScannerViewController,
                     didAdd addedItems: [RecognizedItem],
                     allItems: [RecognizedItem])
    func dataScanner(_ dataScanner: DataScannerViewController,
                     didUpdate updatedItems: [RecognizedItem],
                     allItems: [RecognizedItem])
    func dataScanner(_ dataScanner: DataScannerViewController,
                     didRemove removedItems: [RecognizedItem],
                     allItems: [RecognizedItem])
    func dataScanner(_ dataScanner: DataScannerViewController,
                     becameUnavailableWithError error: DataScannerViewController.ScanningUnavailable)
}
```
#### RecognizedItem
```swift
enum RecognizedItem {
    case text(RecognizedItem.Text)
    case barcode(RecognizedItem.Barcode)

    var id: UUID { get }
    var bounds: RecognizedItem.Bounds { get }
}

// Text item
struct Text {
    let transcript: String
}

// Barcode item
struct Barcode {
    let payloadStringValue: String?
    let observation: VNBarcodeObservation
}
```
#### Async Stream
```swift
// Alternative to delegate
for await items in scanner.recognizedItems {
    // Current recognized items
}
```
#### Custom Highlights
```swift
// Add custom views over recognized items
scanner.overlayContainerView.addSubview(customHighlight)
// Capture still photo
let photo = try await scanner.capturePhoto()
```
### VNDocumentCameraViewController
**Availability**: iOS 13+
Document scanning with automatic edge detection, perspective correction, and lighting adjustment.
#### Basic Usage
```swift
import VisionKit
let camera = VNDocumentCameraViewController()
camera.delegate = self
present(camera, animated: true)
```
#### Delegate Protocol
```swift
protocol VNDocumentCameraViewControllerDelegate {
    func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                      didFinishWith scan: VNDocumentCameraScan)
    func documentCameraViewControllerDidCancel(_ controller: VNDocumentCameraViewController)
    func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                      didFailWithError error: Error)
}
```
#### VNDocumentCameraScan
| Property | Type | Description |
|----------|------|-------------|
| `pageCount` | `Int` | Number of scanned pages |
| `imageOfPage(at:)` | `UIImage` | Get page image at index |
| `title` | `String` | User-editable title |
```swift
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                  didFinishWith scan: VNDocumentCameraScan) {
    controller.dismiss(animated: true)
    for i in 0..<scan.pageCount {
        let pageImage = scan.imageOfPage(at: i)
        // Process with VNRecognizeTextRequest
    }
}
```
## Document Analysis APIs
### VNDetectDocumentSegmentationRequest
**Availability**: iOS 15+, macOS 12+
Detects document boundaries for custom camera UIs or post-processing.
```swift
let request = VNDetectDocumentSegmentationRequest()
let handler = VNImageRequestHandler(ciImage: image)
try handler.perform([request])
guard let observation = request.results?.first as? VNRectangleObservation else {
    return // No document found
}

// Get corner points (normalized)
let corners = [
    observation.topLeft,
    observation.topRight,
    observation.bottomLeft,
    observation.bottomRight
]
```
**vs VNDetectRectanglesRequest**:
- Document: ML-based, trained specifically on documents
- Rectangle: Edge-based, finds any quadrilateral
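A common follow-up is rectifying the detected document with Core Image. A sketch (corner points are scaled from normalized to pixel coordinates; `CIPerspectiveCorrection` uses the same lower-left origin as Vision, so no flip is needed):
```swift
import CoreImage

// Deskew a detected document using its four corner points
func rectifiedDocument(from image: CIImage,
                       observation: VNRectangleObservation) -> CIImage {
    let size = image.extent.size
    func scaled(_ p: CGPoint) -> CIVector {
        CIVector(x: p.x * size.width, y: p.y * size.height)
    }
    return image.applyingFilter("CIPerspectiveCorrection", parameters: [
        "inputTopLeft": scaled(observation.topLeft),
        "inputTopRight": scaled(observation.topRight),
        "inputBottomLeft": scaled(observation.bottomLeft),
        "inputBottomRight": scaled(observation.bottomRight)
    ])
}
```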
### RecognizeDocumentsRequest (iOS 26+)
**Availability**: iOS 26+, macOS 26+
Structured document understanding with semantic parsing.
#### Basic Usage
```swift
let request = RecognizeDocumentsRequest()
let observations = try await request.perform(on: imageData)
guard let document = observations.first?.document else {
    return
}
```
#### DocumentObservation Hierarchy
```
DocumentObservation
└── document: DocumentObservation.Document
├── text: TextObservation
├── tables: [Container.Table]
├── lists: [Container.List]
└── barcodes: [Container.Barcode]
```
#### Table Extraction
```swift
for table in document.tables {
    for row in table.rows {
        for cell in row {
            let text = cell.content.text.transcript
            let detectedData = cell.content.text.detectedData
        }
    }
}
```
#### Detected Data Types
```swift
for data in document.text.detectedData {
    switch data.match.details {
    case .emailAddress(let email):
        let address = email.emailAddress
    case .phoneNumber(let phone):
        let number = phone.phoneNumber
    case .link(let url):
        let link = url
    case .address(let address):
        let components = address
    case .date(let date):
        let dateValue = date
    default:
        break
    }
}
```
#### TextObservation Hierarchy
```
TextObservation
├── transcript: String
├── lines: [TextObservation.Line]
├── paragraphs: [TextObservation.Paragraph]
├── words: [TextObservation.Word]
└── detectedData: [DetectedDataObservation]
```
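A short sketch of walking this hierarchy (iOS 26 API; the top-level `transcript` comes from the tree above, while per-line transcripts are an assumption about the API shape):
```swift
let text = document.text
print(text.transcript) // Full recognized text

for line in text.lines {
    // Assumed: each Line exposes its own transcript
    print(line.transcript)
}
```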
## Visual Intelligence Integration
Visual Intelligence is a **system-level feature** (iOS 26+) that lets users point their camera at real-world objects and find matching content across apps. This is distinct from the Vision framework (VNRequest-based image analysis) covered above. Vision analyzes images within your app; Visual Intelligence lets the system invoke your app when users search with the camera or screenshots.
### How It Works
1. User activates Visual Intelligence camera or takes a screenshot
2. System analyzes what the user is looking at
3. System queries participating apps via `IntentValueQuery`
4. Your app receives a `SemanticContentDescriptor` with labels and/or pixel data
5. Your app searches its content and returns matching `AppEntity` results
6. Results appear in the Visual Intelligence UI with your app's branding
### Required Frameworks
```swift
import VisualIntelligence
import AppIntents
```
### SemanticContentDescriptor
The core object the system provides to describe what the user is looking at.
| Property | Type | Description |
|----------|------|-------------|
| `labels` | `[String]` | Classification labels for the detected item |
| `pixelBuffer` | `CVReadOnlyPixelBuffer?` | Visual data of the detected item |
Use labels for fast keyword matching against your content catalog. Use the pixel buffer for image-similarity search when labels are insufficient.
### IntentValueQuery
The entry point for Visual Intelligence to communicate with your app. Implement `values(for:)` to receive search requests and return matching entities.
```swift
struct LandmarkIntentValueQuery: IntentValueQuery {
    @Dependency var modelData: ModelData

    func values(for input: SemanticContentDescriptor) async throws -> [LandmarkEntity] {
        if !input.labels.isEmpty {
            return try await modelData.search(matching: input.labels)
        }
        guard let pixelBuffer = input.pixelBuffer else { return [] }
        return try await modelData.search(matching: pixelBuffer)
    }
}
```
### Returning Multiple Result Types
Use `@UnionValue` when your app can return different entity types from a single search.
```swift
@UnionValue
enum VisualSearchResult {
    case landmark(LandmarkEntity)
    case collection(CollectionEntity)
}
```
### Display Representation
Visual Intelligence uses your entity's `DisplayRepresentation` to show results. Provide a title, subtitle, and image for each result.
```swift
struct LandmarkEntity: AppEntity {
    var id: String
    var name: String
    var location: String

    static var typeDisplayRepresentation: TypeDisplayRepresentation {
        TypeDisplayRepresentation(
            name: LocalizedStringResource("Landmark", table: "AppIntents"),
            numericFormat: "\(placeholder: .int) landmarks"
        )
    }

    var displayRepresentation: DisplayRepresentation {
        DisplayRepresentation(
            title: "\(name)",
            subtitle: "\(location)",
            image: .init(named: thumbnailImageName)
        )
    }
}
```
### Deep Linking from Results
When a user taps a result, your app should open to the relevant content. Provide an `appLinkURL` on your entity.
```swift
var appLinkURL: URL? {
    URL(string: "yourapp://landmark/\(id)")
}
```
### "More Results" Intent
For large result sets, provide a `VisualIntelligenceSearchIntent` that opens your app's full search UI.
```swift
struct ViewMoreLandmarksIntent: AppIntent, VisualIntelligenceSearchIntent {
    static var title: LocalizedStringResource = "View More Landmarks"

    @Parameter(title: "Semantic Content")
    var semanticContent: SemanticContentDescriptor

    func perform() async throws -> some IntentResult {
        // Open your app's search view with the semantic content
        return .result()
    }
}
```
### Best Practices
- **Return results quickly** — Visual Intelligence expects low-latency responses. Limit to 10-20 most relevant results
- **Prefer labels first** — Label matching is faster than pixel buffer analysis. Fall back to pixel buffer when labels are empty or insufficient
- **Localize everything** — Display representations appear in the system UI. Use `LocalizedStringResource` for all user-facing text
- **Include images** — Results with thumbnails are more recognizable in the Visual Intelligence overlay
### Testing
1. Build and run on a physical device
2. Activate Visual Intelligence camera or take a screenshot of relevant content
3. Perform a visual search and verify your app's results appear
4. Tap results to verify deep linking opens the correct content
## API Quick Reference
### Subject Segmentation
| API | Platform | Purpose |
|-----|----------|---------|
| `VNGenerateForegroundInstanceMaskRequest` | iOS 17+ | Class-agnostic subject instances |
| `VNGeneratePersonInstanceMaskRequest` | iOS 17+ | Up to 4 people separately |
| `VNGeneratePersonSegmentationRequest` | iOS 15+ | All people (single mask) |
| `ImageAnalysisInteraction` (VisionKit) | iOS 16+ | UI for subject lifting |
### Pose Detection
| API | Platform | Landmarks | Coordinates |
|-----|----------|-----------|-------------|
| `VNDetectHumanHandPoseRequest` | iOS 14+ | 21 per hand | 2D normalized |
| `VNDetectHumanBodyPoseRequest` | iOS 14+ | 19 body joints | 2D normalized |
| `VNDetectHumanBodyPose3DRequest` | iOS 17+ | 17 body joints | 3D meters |
### Face & Person Detection
| API | Platform | Purpose |
|-----|----------|---------|
| `VNDetectFaceRectanglesRequest` | iOS 11+ | Face bounding boxes |
| `VNDetectFaceLandmarksRequest` | iOS 11+ | Face with detailed landmarks |
| `VNDetectHumanRectanglesRequest` | iOS 13+ | Human torso bounding boxes |
### Text & Barcode
| API | Platform | Purpose |
|-----|----------|---------|
| `VNRecognizeTextRequest` | iOS 13+ | Text recognition (OCR) |
| `VNDetectBarcodesRequest` | iOS 11+ | Barcode/QR detection |
| `DataScannerViewController` | iOS 16+ | Live camera scanner (text + barcodes) |
| `VNDocumentCameraViewController` | iOS 13+ | Document scanning with perspective correction |
| `VNDetectDocumentSegmentationRequest` | iOS 15+ | Programmatic document edge detection |
| `RecognizeDocumentsRequest` | iOS 26+ | Structured document extraction |
### Visual Intelligence
| API | Platform | Purpose |
|-----|----------|---------|
| `SemanticContentDescriptor` | iOS 26+ | Describes what the user is looking at (labels + pixel buffer) |
| `IntentValueQuery` | iOS 26+ | Entry point for receiving visual search requests |
| `VisualIntelligenceSearchIntent` | iOS 26+ | "More results" deep link to your app |
### Observation Types
| Observation | Returned By |
|-------------|-------------|
| `VNInstanceMaskObservation` | Foreground/person instance masks |
| `VNPixelBufferObservation` | Person segmentation (single mask) |
| `VNHumanHandPoseObservation` | Hand pose |
| `VNHumanBodyPoseObservation` | Body pose (2D) |
| `VNHumanBodyPose3DObservation` | Body pose (3D) |
| `VNFaceObservation` | Face detection/landmarks |
| `VNHumanObservation` | Human rectangles |
| `VNRecognizedTextObservation` | Text recognition |
| `VNBarcodeObservation` | Barcode detection |
| `VNRectangleObservation` | Document segmentation |
| `DocumentObservation` | Structured document (iOS 26+) |
## Resources
**WWDC**: 2019-234, 2021-10041, 2022-10024, 2022-10025, 2025-272, 2023-10176, 2023-111241, 2023-10048, 2020-10653, 2020-10043, 2020-10099
**Docs**: /vision, /visionkit, /visualintelligence, /visualintelligence/semanticcontentdescriptor, /vision/vnrecognizetextrequest, /vision/vndetectbarcodesrequest
**Skills**: axiom-vision, axiom-vision-diag
Related Skills
processing-computer-vision-tasks
Process images using object detection, classification, and segmentation. Use when requesting "analyze image", "object detection", "image classification", or "computer vision". Trigger with relevant phrases based on skill purpose.
vision-exploration
Endgame vision exploration. The user throws out a vague idea and the AI leads the way, guiding them through a chain of "probe the value → dig into the motivation → derive the evolution → sketch the endgame" to help them see the furthest future possibility. No limits, no convergence, pure divergence.
computer-vision-expert
SOTA Computer Vision Expert (2026). Specialized in YOLO26, Segment Anything 3 (SAM 3), Vision Language Models, and real-time spatial analysis.
azure-ai-vision-imageanalysis-py
Azure AI Vision Image Analysis SDK for captions, tags, objects, OCR, people detection, and smart cropping. Use for computer vision and image understanding tasks. Triggers: "image analysis", "computer vision", "OCR", "object detection", "ImageAnalysisClient", "image caption".
azure-ai-vision-imageanalysis-java
Build image analysis applications with Azure AI Vision SDK for Java. Use when implementing image captioning, OCR text extraction, object detection, tagging, or smart cropping.
axiom-audit
Audit Axiom logs to identify and prioritize errors and warnings, research probable causes, and flag log smells. Use when user asks to check Axiom logs, analyze production errors, investigate log issues, or audit logging patterns.
vision
Analyze images, screenshots, diagrams, and visual content - Use when you need to understand visual content like screenshots, architecture diagrams, UI mockups, or error screenshots.
Senior Computer Vision
Product Strategy — Vision, Positioning, and Roadmap
OpenCV — Computer Vision Library
You are an expert in OpenCV (Open Source Computer Vision Library), the most popular library for real-time computer vision. You help developers build image processing pipelines, object detection systems, video analysis tools, augmented reality, and document processing using OpenCV's 2,500+ algorithms for image manipulation, feature detection, camera calibration, 3D reconstruction, and DNN inference — in Python, C++, or JavaScript.
Axiom — Serverless Log Analytics
LLaVA - Large Language and Vision Assistant
Open-source vision-language model for conversational image understanding.