computer-vision-pipeline

Build production computer vision pipelines for object detection, tracking, and video analysis. Handles drone footage, wildlife monitoring, and real-time detection. Supports YOLO, Detectron2, TensorFlow, PyTorch. Use for archaeological surveys, conservation, security. Activate on "object detection", "video analysis", "YOLO", "tracking", "drone footage". NOT for simple image filters, photo editing, or face recognition APIs.

85 stars

bycuriositech

View on GitHub Installation ↓

Best use case

computer-vision-pipeline is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using computer-vision-pipeline should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/computer-vision-pipeline/SKILL.md --create-dirs "https://raw.githubusercontent.com/curiositech/some_claude_skills/main/.claude/skills/computer-vision-pipeline/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/computer-vision-pipeline/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How computer-vision-pipeline Compares

Feature / Agent	computer-vision-pipeline	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Computer Vision Pipeline

Expert in building production-ready computer vision systems for object detection, tracking, and video analysis.

## When to Use

✅ **Use for**:
- Drone footage analysis (archaeological surveys, conservation)
- Wildlife monitoring and tracking
- Real-time object detection systems
- Video preprocessing and analysis
- Custom model training and inference
- Multi-object tracking (MOT)

❌ **NOT for**:
- Simple image filters (use Pillow/PIL)
- Photo editing (use Photoshop/GIMP)
- Face recognition APIs (use AWS Rekognition)
- Basic OCR (use Tesseract)

---

## Technology Selection

### Object Detection Models

| Model | Speed (FPS) | Accuracy (mAP) | Use Case |
|-------|-------------|----------------|----------|
| YOLOv8 | 140 | 53.9% | Real-time detection |
| Detectron2 | 25 | 58.7% | High accuracy, research |
| EfficientDet | 35 | 55.1% | Mobile deployment |
| Faster R-CNN | 10 | 42.0% | Legacy systems |

**Timeline**:
- 2015: Faster R-CNN (two-stage detection)
- 2016: YOLO v1 (one-stage, real-time)
- 2020: YOLOv5 (PyTorch, production-ready)
- 2023: YOLOv8 (state-of-the-art)
- 2024: YOLOv8 is industry standard for real-time

**Decision tree**:
```
Need real-time (&gt;30 FPS)? → YOLOv8
Need highest accuracy? → Detectron2 Mask R-CNN
Need mobile deployment? → YOLOv8-nano or EfficientDet
Need instance segmentation? → Detectron2 or YOLOv8-seg
Need custom objects? → Fine-tune YOLOv8
```

---

## Common Anti-Patterns

### Anti-Pattern 1: Not Preprocessing Frames Before Detection

**Novice thinking**: "Just run detection on raw video frames"

**Problem**: Poor detection accuracy, wasted GPU cycles.

**Wrong approach**:
```python
# ❌ No preprocessing - poor results
import cv2
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
video = cv2.VideoCapture('drone_footage.mp4')

while True:
    ret, frame = video.read()
    if not ret:
        break

    # Raw frame detection - no normalization, no resizing
    results = model(frame)
    # Poor accuracy, slow inference
```

**Why wrong**:
- Video resolution too high (4K = 8.3 megapixels per frame)
- No normalization (pixel values 0-255 instead of 0-1)
- Aspect ratio not maintained
- GPU memory overflow on high-res frames

**Correct approach**:
```python
# ✅ Proper preprocessing pipeline
import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
video = cv2.VideoCapture('drone_footage.mp4')

# Model expects 640x640 input
TARGET_SIZE = 640

def preprocess_frame(frame):
    # Resize while maintaining aspect ratio
    h, w = frame.shape[:2]
    scale = TARGET_SIZE / max(h, w)
    new_w, new_h = int(w * scale), int(h * scale)

    resized = cv2.resize(frame, (new_w, new_h), interpolation=cv2.INTER_LINEAR)

    # Pad to square
    pad_w = (TARGET_SIZE - new_w) // 2
    pad_h = (TARGET_SIZE - new_h) // 2

    padded = cv2.copyMakeBorder(
        resized,
        pad_h, TARGET_SIZE - new_h - pad_h,
        pad_w, TARGET_SIZE - new_w - pad_w,
        cv2.BORDER_CONSTANT,
        value=(114, 114, 114)  # Gray padding
    )

    # Normalize to 0-1 (if model expects it)
    # normalized = padded.astype(np.float32) / 255.0

    return padded, scale

while True:
    ret, frame = video.read()
    if not ret:
        break

    preprocessed, scale = preprocess_frame(frame)
    results = model(preprocessed)

    # Scale bounding boxes back to original coordinates
    for box in results[0].boxes:
        x1, y1, x2, y2 = box.xyxy[0]
        x1, y1, x2, y2 = x1/scale, y1/scale, x2/scale, y2/scale
```

**Performance comparison**:
- Raw 4K frames: 5 FPS, 72% mAP
- Preprocessed 640x640: 45 FPS, 89% mAP

**Timeline context**:
- 2015: Manual preprocessing required
- 2020: YOLOv5 added auto-resize
- 2023: YOLOv8 has smart preprocessing but explicit control is better

---

### Anti-Pattern 2: Processing Every Frame in Video

**Novice thinking**: "Run detection on every single frame"

**Problem**: 99% of frames are redundant, wasting compute.

**Wrong approach**:
```python
# ❌ Process every frame (30 FPS video = 1800 frames/min)
import cv2
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
video = cv2.VideoCapture('drone_footage.mp4')

detections = []

while True:
    ret, frame = video.read()
    if not ret:
        break

    # Run detection on EVERY frame
    results = model(frame)
    detections.append(results)

# 10-minute video = 18,000 inferences (15 minutes on GPU)
```

**Why wrong**:
- Adjacent frames are nearly identical
- Wasting 95% of compute on duplicate work
- Slow processing time
- Massive storage for results

**Correct approach 1**: Frame sampling
```python
# ✅ Sample every Nth frame
import cv2
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
video = cv2.VideoCapture('drone_footage.mp4')

SAMPLE_RATE = 30  # Process 1 frame per second (if 30 FPS video)

frame_count = 0
detections = []

while True:
    ret, frame = video.read()
    if not ret:
        break

    frame_count += 1

    # Only process every 30th frame
    if frame_count % SAMPLE_RATE == 0:
        results = model(frame)
        detections.append({
            'frame': frame_count,
            'timestamp': frame_count / 30.0,
            'results': results
        })

# 10-minute video = 600 inferences (30 seconds on GPU)
```

**Correct approach 2**: Adaptive sampling with scene change detection
```python
# ✅ Only process when scene changes significantly
import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
video = cv2.VideoCapture('drone_footage.mp4')

def scene_changed(prev_frame, curr_frame, threshold=0.3):
    """Detect scene change using histogram comparison"""
    if prev_frame is None:
        return True

    # Convert to grayscale
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)

    # Calculate histograms
    prev_hist = cv2.calcHist([prev_gray], [0], None, [256], [0, 256])
    curr_hist = cv2.calcHist([curr_gray], [0], None, [256], [0, 256])

    # Compare histograms
    correlation = cv2.compareHist(prev_hist, curr_hist, cv2.HISTCMP_CORREL)

    return correlation < (1 - threshold)

prev_frame = None
detections = []

while True:
    ret, frame = video.read()
    if not ret:
        break

    # Only run detection if scene changed
    if scene_changed(prev_frame, frame):
        results = model(frame)
        detections.append(results)

    prev_frame = frame.copy()

# Adapts to video content - static shots skip frames, action scenes process more
```

**Savings**:
- Every frame: 18,000 inferences
- Sample 1 FPS: 600 inferences (97% reduction)
- Adaptive: ~1,200 inferences (93% reduction)

---

### Anti-Pattern 3: Not Using Batch Inference

**Novice thinking**: "Process one image at a time"

**Problem**: GPU sits idle 80% of the time waiting for data.

**Wrong approach**:
```python
# ❌ Sequential processing - GPU underutilized
import cv2
from ultralytics import YOLO
import time

model = YOLO('yolov8n.pt')

# 100 images to process
image_paths = [f'frame_{i:04d}.jpg' for i in range(100)]

start = time.time()

for path in image_paths:
    frame = cv2.imread(path)
    results = model(frame)  # Process one at a time
    # GPU utilization: ~20%

elapsed = time.time() - start
print(f"Processed {len(image_paths)} images in {elapsed:.2f}s")
# Output: 45 seconds
```

**Why wrong**:
- GPU has to wait for CPU to load each image
- No parallelization
- GPU utilization ~20%
- Slow throughput

**Correct approach**:
```python
# ✅ Batch inference - GPU fully utilized
import cv2
from ultralytics import YOLO
import time

model = YOLO('yolov8n.pt')

image_paths = [f'frame_{i:04d}.jpg' for i in range(100)]

BATCH_SIZE = 16  # Process 16 images at once

start = time.time()

for i in range(0, len(image_paths), BATCH_SIZE):
    batch_paths = image_paths[i:i+BATCH_SIZE]

    # Load batch
    frames = [cv2.imread(path) for path in batch_paths]

    # Batch inference (single GPU call)
    results = model(frames)  # Pass list of images
    # GPU utilization: ~85%

elapsed = time.time() - start
print(f"Processed {len(image_paths)} images in {elapsed:.2f}s")
# Output: 8 seconds (5.6x faster!)
```

**Performance comparison**:
| Method | Time (100 images) | GPU Util | Throughput |
|--------|-------------------|----------|------------|
| Sequential | 45s | 20% | 2.2 img/s |
| Batch (16) | 8s | 85% | 12.5 img/s |
| Batch (32) | 6s | 92% | 16.7 img/s |

**Batch size tuning**:
```python
# Find optimal batch size for your GPU
import torch

def find_optimal_batch_size(model, image_size=(640, 640)):
    for batch_size in [1, 2, 4, 8, 16, 32, 64]:
        try:
            dummy_input = torch.randn(batch_size, 3, *image_size).cuda()

            start = time.time()
            with torch.no_grad():
                _ = model(dummy_input)
            elapsed = time.time() - start

            throughput = batch_size / elapsed
            print(f"Batch {batch_size}: {throughput:.1f} img/s")
        except RuntimeError as e:
            print(f"Batch {batch_size}: OOM (out of memory)")
            break

# Find optimal batch size before production
find_optimal_batch_size(model)
```

---

### Anti-Pattern 4: Ignoring Non-Maximum Suppression (NMS) Tuning

**Problem**: Duplicate detections, missed objects, slow post-processing.

**Wrong approach**:
```python
# ❌ Use default NMS settings for everything
from ultralytics import YOLO

model = YOLO('yolov8n.pt')

# Default settings (iou_threshold=0.45, conf_threshold=0.25)
results = model('crowded_scene.jpg')

# Result: 50 bounding boxes, 30 are duplicates!
```

**Why wrong**:
- Default IoU=0.45 is too permissive for dense objects
- Default conf=0.25 includes low-quality detections
- No adaptation to use case

**Correct approach**:
```python
# ✅ Tune NMS for your use case
from ultralytics import YOLO

model = YOLO('yolov8n.pt')

# Sparse objects (dolphins in ocean)
sparse_results = model(
    'ocean_footage.jpg',
    iou=0.5,    # Higher IoU = allow closer boxes
    conf=0.4    # Higher confidence = fewer false positives
)

# Dense objects (crowd, flock of birds)
dense_results = model(
    'crowded_scene.jpg',
    iou=0.3,    # Lower IoU = suppress more duplicates
    conf=0.5    # Higher confidence = filter noise
)

# High precision needed (legal evidence)
precise_results = model(
    'evidence.jpg',
    iou=0.5,
    conf=0.7,   # Very high confidence
    max_det=50  # Limit max detections
)
```

**NMS parameter guide**:
| Use Case | IoU | Conf | Max Det |
|----------|-----|------|---------|
| Sparse objects (wildlife) | 0.5 | 0.4 | 100 |
| Dense objects (crowd) | 0.3 | 0.5 | 300 |
| High precision (evidence) | 0.5 | 0.7 | 50 |
| Real-time (speed priority) | 0.45 | 0.3 | 100 |

---

### Anti-Pattern 5: No Tracking Between Frames

**Novice thinking**: "Run detection on each frame independently"

**Problem**: Can't count unique objects, track movement, or build trajectories.

**Wrong approach**:
```python
# ❌ Independent frame detection - no object identity
from ultralytics import YOLO
import cv2

model = YOLO('yolov8n.pt')
video = cv2.VideoCapture('dolphins.mp4')

detections = []

while True:
    ret, frame = video.read()
    if not ret:
        break

    results = model(frame)
    detections.append(results)

# Result: Can't tell if frame 10 dolphin is same as frame 20 dolphin
# Can't count unique dolphins
# Can't track trajectories
```

**Why wrong**:
- No object identity across frames
- Can't count unique objects
- Can't analyze movement patterns
- Can't build trajectories

**Correct approach**: Use tracking (ByteTrack)
```python
# ✅ Multi-object tracking with ByteTrack
from ultralytics import YOLO
import cv2

# YOLO with tracking
model = YOLO('yolov8n.pt')
video = cv2.VideoCapture('dolphins.mp4')

# Track objects across frames
tracks = {}

while True:
    ret, frame = video.read()
    if not ret:
        break

    # Run detection + tracking
    results = model.track(
        frame,
        persist=True,     # Maintain IDs across frames
        tracker='bytetrack.yaml'  # ByteTrack algorithm
    )

    # Each detection now has persistent ID
    for box in results[0].boxes:
        track_id = int(box.id[0])  # Unique ID across frames
        x1, y1, x2, y2 = box.xyxy[0]

        # Store trajectory
        if track_id not in tracks:
            tracks[track_id] = []

        tracks[track_id].append({
            'frame': len(tracks[track_id]),
            'bbox': (x1, y1, x2, y2),
            'conf': box.conf[0]
        })

# Now we can analyze:
print(f"Unique dolphins detected: {len(tracks)}")

# Trajectory analysis
for track_id, trajectory in tracks.items():
    if len(trajectory) > 30:  # Only long tracks
        print(f"Dolphin {track_id} appeared in {len(trajectory)} frames")
        # Calculate movement, speed, etc.
```

**Tracking benefits**:
- Count unique objects (not just detections per frame)
- Build trajectories and movement patterns
- Analyze behavior over time
- Filter out brief false positives

**Tracking algorithms**:
| Algorithm | Speed | Robustness | Occlusion Handling |
|-----------|-------|------------|---------------------|
| ByteTrack | Fast | Good | Excellent |
| SORT | Very Fast | Fair | Fair |
| DeepSORT | Medium | Excellent | Good |
| BotSORT | Medium | Excellent | Excellent |

---

## Production Checklist

```
□ Preprocess frames (resize, pad, normalize)
□ Sample frames intelligently (1 FPS or scene change detection)
□ Use batch inference (16-32 images per batch)
□ Tune NMS thresholds for your use case
□ Implement tracking if analyzing video
□ Log inference time and GPU utilization
□ Handle edge cases (empty frames, corrupted video)
□ Save results in structured format (JSON, CSV)
□ Visualize detections for debugging
□ Benchmark on representative data
```

---

## When to Use vs Avoid

| Scenario | Appropriate? |
|----------|--------------|
| Analyze drone footage for archaeology | ✅ Yes - custom object detection |
| Track wildlife in video | ✅ Yes - detection + tracking |
| Count people in crowd | ✅ Yes - dense object detection |
| Real-time security camera | ✅ Yes - YOLOv8 real-time |
| Filter vacation photos | ❌ No - use photo management apps |
| Face recognition login | ❌ No - use AWS Rekognition API |
| Read license plates | ❌ No - use specialized OCR |

---

## References

- `/references/yolo-guide.md` - YOLOv8 setup, training, inference patterns
- `/references/video-processing.md` - Frame extraction, scene detection, optimization
- `/references/tracking-algorithms.md` - ByteTrack, SORT, DeepSORT comparison

## Scripts

- `scripts/video_analyzer.py` - Extract frames, run detection, generate timeline
- `scripts/model_trainer.py` - Fine-tune YOLO on custom dataset, export weights

---

**This skill guides**: Computer vision | Object detection | Video analysis | YOLO | Tracking | Drone footage | Wildlife monitoring

Related Skills

github-actions-pipeline-builder

from curiositech/some_claude_skills

Build production CI/CD pipelines with GitHub Actions. Implements matrix builds, caching, deployments, testing, security scanning. Use for automated testing, deployments, release workflows. Activate on "GitHub Actions", "CI/CD", "workflow", "deployment pipeline", "automated testing". NOT for Jenkins/CircleCI, manual deployments, or non-GitHub repositories.

geospatial-data-pipeline

from curiositech/some_claude_skills

Process, analyze, and visualize geospatial data at scale. Handles drone imagery, GPS tracks, GeoJSON optimization, coordinate transformations, and tile generation. Use for mapping apps, drone data processing, location-based services. Activate on "geospatial", "GIS", "PostGIS", "GeoJSON", "map tiles", "coordinate systems". NOT for simple address validation, basic distance calculations, or static map embeds.

data-pipeline-engineer

from curiositech/some_claude_skills

Expert data engineer for ETL/ELT pipelines, streaming, data warehousing. Activate on: data pipeline, ETL, ELT, data warehouse, Spark, Kafka, Airflow, dbt, data modeling, star schema, streaming data, batch processing, data quality. NOT for: API design (use api-architect), ML training (use ML skills), dashboards (use design skills).

skill-coach

from curiositech/some_claude_skills

Guides creation of high-quality Agent Skills with domain expertise, anti-pattern detection, and progressive disclosure best practices. Use when creating skills, reviewing existing skills, or when users mention improving skill quality, encoding expertise, or avoiding common AI tooling mistakes. Activate on keywords: create skill, review skill, skill quality, skill best practices, skill anti-patterns. NOT for general coding advice or non-skill Claude Code features.

3d-cv-labeling-2026

from curiositech/some_claude_skills

Expert in 3D computer vision labeling tools, workflows, and AI-assisted annotation for LiDAR, point clouds, and sensor fusion. Covers SAM4D/Point-SAM, human-in-the-loop architectures, and vertical-specific training strategies. Activate on '3D labeling', 'point cloud annotation', 'LiDAR labeling', 'SAM 3D', 'SAM4D', 'sensor fusion annotation', '3D bounding box', 'semantic segmentation point cloud'. NOT for 2D image labeling (use clip-aware-embeddings), general ML training (use ml-engineer), video annotation without 3D (use computer-vision-pipeline), or VLM prompt engineering (use prompt-engineer).

wisdom-accountability-coach

from curiositech/some_claude_skills

Longitudinal memory tracking, philosophy teaching, and personal accountability with compassion. Expert in pattern recognition, Stoicism/Buddhism, and growth guidance. Activate on 'accountability', 'philosophy', 'Stoicism', 'Buddhism', 'personal growth', 'commitment tracking', 'wisdom teaching'. NOT for therapy or mental health treatment (refer to professionals), crisis intervention, or replacing professional coaching credentials.

windows-95-web-designer

from curiositech/some_claude_skills

Modern web applications with authentic Windows 95 aesthetic. Gradient title bars, Start menu paradigm, taskbar patterns, 3D beveled chrome. Extrapolates Win95 to AI chatbots, mobile UIs, responsive layouts. Activate on 'windows 95', 'win95', 'start menu', 'taskbar', 'retro desktop', '95 aesthetic', 'clippy'. NOT for Windows 3.1 (use windows-3-1-web-designer), vaporwave/synthwave, macOS, flat design.

windows-3-1-web-designer

from curiositech/some_claude_skills

Modern web applications with authentic Windows 3.1 aesthetic. Solid navy title bars, Program Manager navigation, beveled borders, single window controls. Extrapolates Win31 to AI chatbots (Cue Card paradigm), mobile UIs (pocket computing). Activate on 'windows 3.1', 'win31', 'program manager', 'retro desktop', '90s aesthetic', 'beveled'. NOT for Windows 95 (use windows-95-web-designer - has gradients, Start menu), vaporwave/synthwave, macOS, flat design.

win31-pixel-art-designer

from curiositech/some_claude_skills

Expert in Windows 3.1 era pixel art and graphics. Creates icons, banners, splash screens, and UI assets with authentic 16/256-color palettes, dithering patterns, and Program Manager styling. Activate on 'win31 icons', 'pixel art 90s', 'retro icons', '16-color', 'dithering', 'program manager icons', 'VGA palette'. NOT for modern flat icons, vaporwave art, or high-res illustrations.

win31-audio-design

from curiositech/some_claude_skills

Expert in Windows 3.1 era sound vocabulary for modern web/mobile apps. Creates satisfying retro UI sounds using CC-licensed 8-bit audio, Web Audio API, and haptic coordination. Activate on 'win31 sounds', 'retro audio', '90s sound effects', 'chimes', 'tada', 'ding', 'satisfying UI sounds'. NOT for modern flat UI sounds, voice synthesis, or music composition.

wedding-immortalist

from curiositech/some_claude_skills

Transform thousands of wedding photos and hours of footage into an immersive 3D Gaussian Splatting experience with theatre mode replay, face-clustered guest roster, and AI-curated best photos per person. Expert in 3DGS pipelines, face clustering, aesthetic scoring, and adaptive design matching the couple's wedding theme (disco, rustic, modern, LGBTQ+ celebrations). Activate on "wedding photos", "wedding video", "3D wedding", "Gaussian Splatting wedding", "wedding memory", "wedding immortalize", "face clustering wedding", "best wedding photos". NOT for general photo editing (use native-app-designer), non-wedding 3DGS (use drone-inspection-specialist), or event planning (not a wedding planner).

websocket-streaming

from curiositech/some_claude_skills

Implements real-time bidirectional communication between DAG execution engines and visualization dashboards via WebSocket. Covers connection management, typed event protocols, reconnection with backoff, and React hook integration. Activate on "WebSocket", "real-time updates", "live streaming", "execution events", "state streaming", "push notifications". NOT for HTTP REST APIs, server-sent events (SSE), or general networking.