computer-vision-guide

Apply computer vision research methods, models, and evaluation tools

191 stars

Best use case

computer-vision-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Apply computer vision research methods, models, and evaluation tools

Teams using computer-vision-guide should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/computer-vision-guide/SKILL.md --create-dirs "https://raw.githubusercontent.com/wentorai/research-plugins/main/skills/domains/ai-ml/computer-vision-guide/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/computer-vision-guide/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How computer-vision-guide Compares

Feature / Agentcomputer-vision-guideStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Apply computer vision research methods, models, and evaluation tools

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Computer Vision Guide

A skill for conducting computer vision research, covering model architectures, dataset preparation, training pipelines, evaluation metrics, and common experimental protocols for image classification, object detection, and segmentation tasks.

## Core Tasks and Architectures

### Computer Vision Task Taxonomy

```
Image Classification:
  Input: Single image
  Output: Class label(s)
  Models: ResNet, EfficientNet, ViT, ConvNeXt

Object Detection:
  Input: Single image
  Output: Bounding boxes + class labels
  Models: YOLO (v5-v9), Faster R-CNN, DETR, RT-DETR

Semantic Segmentation:
  Input: Single image
  Output: Per-pixel class label
  Models: U-Net, DeepLab, SegFormer, Mask2Former

Instance Segmentation:
  Input: Single image
  Output: Per-pixel labels distinguishing individual objects
  Models: Mask R-CNN, Mask2Former, SAM

Image Generation:
  Input: Text prompt or noise
  Output: Generated image
  Models: Stable Diffusion, DALL-E, Imagen
```

### Model Architecture Evolution

```
CNNs (Convolutional Neural Networks):
  LeNet (1998) -> AlexNet (2012) -> VGG (2014) -> ResNet (2015)
  -> EfficientNet (2019) -> ConvNeXt (2022)

Vision Transformers:
  ViT (2020) -> DeiT (2021) -> Swin Transformer (2021)
  -> BEiT (2021) -> DINOv2 (2023)

Trend: Transformers are competitive with CNNs at scale.
Hybrid architectures combining convolutions and attention are common.
```

## Dataset Preparation

### Building a Research Dataset

```python
import os
from pathlib import Path


def organize_image_dataset(source_dir: str,
                            split_ratios: dict = None) -> dict:
    """
    Organize images into train/val/test splits.

    Args:
        source_dir: Directory containing class subdirectories
        split_ratios: Dict with 'train', 'val', 'test' ratios
    """
    if split_ratios is None:
        split_ratios = {"train": 0.7, "val": 0.15, "test": 0.15}

    import random
    random.seed(42)

    stats = {}
    for class_dir in sorted(Path(source_dir).iterdir()):
        if not class_dir.is_dir():
            continue

        images = list(class_dir.glob("*.jpg")) + list(class_dir.glob("*.png"))
        random.shuffle(images)

        n = len(images)
        n_train = int(n * split_ratios["train"])
        n_val = int(n * split_ratios["val"])

        stats[class_dir.name] = {
            "total": n,
            "train": n_train,
            "val": n_val,
            "test": n - n_train - n_val
        }

    return stats
```

### Data Augmentation

```python
from torchvision import transforms


def get_training_transforms(img_size: int = 224) -> transforms.Compose:
    """
    Standard data augmentation pipeline for training.

    Args:
        img_size: Target image size
    """
    return transforms.Compose([
        transforms.RandomResizedCrop(img_size, scale=(0.8, 1.0)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ColorJitter(brightness=0.2, contrast=0.2,
                               saturation=0.2, hue=0.1),
        transforms.RandomRotation(15),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])
```

## Training Pipeline

### Transfer Learning Workflow

```python
import torch
import torch.nn as nn
from torchvision import models


def create_classifier(num_classes: int,
                      backbone: str = "resnet50",
                      pretrained: bool = True) -> nn.Module:
    """
    Create an image classifier using transfer learning.

    Args:
        num_classes: Number of target classes
        backbone: Model architecture name
        pretrained: Whether to use ImageNet-pretrained weights
    """
    if backbone == "resnet50":
        weights = models.ResNet50_Weights.DEFAULT if pretrained else None
        model = models.resnet50(weights=weights)
        model.fc = nn.Linear(model.fc.in_features, num_classes)
    elif backbone == "vit_b_16":
        weights = models.ViT_B_16_Weights.DEFAULT if pretrained else None
        model = models.vit_b_16(weights=weights)
        model.heads.head = nn.Linear(
            model.heads.head.in_features, num_classes
        )
    else:
        raise ValueError(f"Unknown backbone: {backbone}")

    return model
```

## Evaluation Metrics

### Metrics by Task

```
Classification:
  - Top-1 Accuracy: Fraction of correct predictions
  - Top-5 Accuracy: Correct class in top 5 predictions
  - Precision, Recall, F1: Per-class and macro-averaged
  - Confusion Matrix: Visualize class-level errors

Object Detection:
  - mAP (mean Average Precision): Standard COCO metric
  - mAP@0.5: AP at IoU threshold 0.5
  - mAP@0.5:0.95: AP averaged over IoU thresholds 0.5 to 0.95
  - AP per class: Identifies weak categories

Segmentation:
  - mIoU (mean Intersection over Union): Standard metric
  - Pixel Accuracy: Fraction of correctly classified pixels
  - Dice Coefficient: F1 score at the pixel level
```

## Reproducibility Checklist

### What to Report in Papers

```
1. Architecture: Exact model name, number of parameters
2. Pretraining: Dataset and weights used for initialization
3. Training: Optimizer, learning rate schedule, batch size, epochs
4. Augmentation: Full list of augmentations with parameters
5. Hardware: GPU type, number, training time
6. Evaluation: Exact metrics, test set version, evaluation protocol
7. Code: Link to repository with training and evaluation scripts
8. Random seeds: Report seeds used; ideally report mean over 3+ seeds
```

## Ethical Considerations

When collecting or using image datasets, consider consent (especially for images of people), geographic and demographic representation, potential for bias amplification, and dual-use concerns. Document the dataset's composition and limitations. Follow the Datasheets for Datasets framework. For generative models, implement safeguards against generating harmful content.