3d-cv-labeling-2026

Expert in 3D computer vision labeling tools, workflows, and AI-assisted annotation for LiDAR, point clouds, and sensor fusion. Covers SAM4D/Point-SAM, human-in-the-loop architectures, and vertical-specific training strategies. Activate on '3D labeling', 'point cloud annotation', 'LiDAR labeling', 'SAM 3D', 'SAM4D', 'sensor fusion annotation', '3D bounding box', 'semantic segmentation point cloud'. NOT for 2D image labeling (use clip-aware-embeddings), general ML training (use ml-engineer), video annotation without 3D (use computer-vision-pipeline), or VLM prompt engineering (use prompt-engineer).

181 stars

by majiayu000


Best use case

3d-cv-labeling-2026 is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using 3d-cv-labeling-2026 should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/3d-cv-labeling-2026/SKILL.md --create-dirs "https://raw.githubusercontent.com/majiayu000/claude-skill-registry/main/skills/data/3d-cv-labeling-2026/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/3d-cv-labeling-2026/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How 3d-cv-labeling-2026 Compares

Feature / Agent	3d-cv-labeling-2026	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.


SKILL.md Source

# 3D Computer Vision Labeling Expert (2026)

Expert guidance on 3D annotation tools, AI-assisted labeling workflows, and training architectures for LiDAR/point cloud computer vision in autonomous vehicles, robotics, infrastructure inspection, and geospatial applications.

## When to Use This Skill

✅ **Use for:**
- Selecting 3D point cloud annotation tools (BasicAI, Supervisely, Segments.ai, Deepen AI)
- Implementing SAM4D/Point-SAM for auto-labeling workflows
- Designing human-in-the-loop annotation pipelines
- Sensor fusion annotation (camera + LiDAR + radar)
- Training architecture decisions: specialized models vs VLMs
- Vertical-specific 3D detection (autonomous driving, inspection, agriculture, wildfire)

❌ **NOT for:**
- 2D image labeling without 3D context (use clip-aware-embeddings or Label Studio docs)
- General ML model training (use ml-engineer)
- Video annotation without point clouds (use computer-vision-pipeline)
- VLM prompt engineering (use prompt-engineer)
- Photogrammetry/3D reconstruction (use geo processing tools)

---

## 2026 Tool Landscape Overview

### Commercial Leaders

| Tool | Strength | Best For | Key AI Feature |
|------|----------|----------|----------------|
| **BasicAI** | One-click detection | Autonomous driving | Pre-labeling models fine-tuned for AV |
| **Supervisely** | Customization | R&D teams | AI tracking, 2D→3D single-click |
| **Segments.ai** | 2D+3D sync | Robotics perception | Sequential propagation |
| **Deepen AI** | Sensor calibration | In-house perception | Pixel-perfect multi-sensor |
| **Dataloop** | Enterprise MLOps | Large annotation teams | Model-assisted + Point Cloud Focus |
| **Encord** | Full workflow | Multi-modal projects | Track-ID management |
| **Ango Hub (iMerit)** | Dense annotation | Complex multi-modal | Frame-to-frame propagation |

### Open Source Options

| Tool | Maturity | Capabilities and Limitations |
|------|----------|-------------|
| **CVAT** | Stable | 3D bounding boxes only, limited interpolation |
| **3D BAT** | Good | Full-surround annotation, semi-auto tracking |
| **Label Studio** | Partial 3D | Better for multi-format, not specialized 3D |

---

## SAM Evolution for 3D (2024-2026)

### SAM4D (ICCV 2025) - Multi-Modal + Temporal

**Key innovation**: Unified Multi-modal Positional Encoding (UMPE) aligns camera and LiDAR in shared 3D space.

```
Camera Stream → Feature Extraction → ┐
                                      ├→ UMPE Alignment → Promptable 3D Segmentation
LiDAR Stream → Point Encoding     → ┘
```

**Data engine breakthrough**: Automatic pseudo-label generation, reported as 100x+ faster than human annotation, using:
1. VFM-driven video masklets
2. Spatiotemporal 4D reconstruction
3. Cross-modal masklet fusion

**Dataset**: Waymo-4DSeg (300k+ camera-LiDAR aligned masklets)

### Point-SAM (ICLR 2025) - Native 3D Prompting

**Architecture**: Efficient transformer designed specifically for point clouds (not adapted from 2D).

**Knowledge distillation**: 2D SAM → 3D Point-SAM via data engine that generates:
- Part-level pseudo-labels
- Object-level pseudo-labels

**Benchmarks**: The authors report outperforming prior state-of-the-art models on several indoor and outdoor benchmarks.

### SAMNet++ (2025) - Hybrid Pipeline

Two-stage approach:
1. SAM performs unsupervised segmentation
2. Adapted PointNet++ refines for semantic accuracy

**Best for**: UAV/drone workflows where colorized point clouds from L1 LiDAR + RGB cameras are available.

---

## Human-in-the-Loop Architecture

### The Model-in-the-Loop Paradigm (2023-2026)

**Old approach**: Human labels → Train model → Deploy
**New approach**: Model assists → Human validates → Rapid iteration

```
┌─────────────────────────────────────────────────────────┐
│                    LABELING PIPELINE                     │
├─────────────────────────────────────────────────────────┤
│  Raw Data → AI Pre-label → Human Review → QA Check      │
│     │           │              │             │          │
│     │     SAM4D/VLM       Corrections   Consensus      │
│     │     generates       only where    sampling        │
│     │     proposals       AI uncertain                  │
└─────────────────────────────────────────────────────────┘
```
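
The consensus-sampling step in the QA stage can be quantified by comparing two annotators' boxes on the same frame. The sketch below is one minimal way to do that, assuming axis-aligned boxes (no yaw) and a hypothetical IoU threshold of 0.7. `Box3D`, `iou_3d`, and `agreement_rate` are illustrative names, not part of any tool listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned 3D box: center and size in metres (yaw ignored for simplicity)."""
    cx: float
    cy: float
    cz: float
    dx: float
    dy: float
    dz: float

    def bounds(self):
        """Return [(lo, hi)] intervals along x, y, z."""
        center = (self.cx, self.cy, self.cz)
        size = (self.dx, self.dy, self.dz)
        return [(c - s / 2, c + s / 2) for c, s in zip(center, size)]

    def volume(self):
        return self.dx * self.dy * self.dz


def iou_3d(a: Box3D, b: Box3D) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    intersection = 1.0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(a.bounds(), b.bounds()):
        overlap = min(a_hi, b_hi) - max(a_lo, b_lo)
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        intersection *= overlap
    return intersection / (a.volume() + b.volume() - intersection)


def agreement_rate(boxes_a, boxes_b, iou_threshold=0.7):
    """Fraction of annotator A's boxes that some box from annotator B overlaps.

    This is a simple any-match check, not a one-to-one assignment.
    The 0.7 default is an assumed threshold; tune it per object class.
    """
    if not boxes_a:
        return 1.0
    matched = sum(
        1 for a in boxes_a
        if any(iou_3d(a, b) >= iou_threshold for b in boxes_b)
    )
    return matched / len(boxes_a)
```

Frames whose agreement rate falls below a project-defined floor are natural candidates for expert re-review. Production labels usually carry a yaw angle, which needs a rotated-box IoU instead of this axis-aligned version.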

### Efficiency Gains

| Approach | Time for 10k frames | Annotation Quality |
|----------|--------------------|--------------------|
| Manual only | 400 hours | 95% (expert) |
| AI pre-label + review | 50 hours | 97% (AI+human) |
| SAM4D data engine | 4 hours | 92% (pseudo) |

**The 80/20 rule**: ~80% of ML project time is data prep. Model-in-the-loop cuts this dramatically.
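
To make the table easier to compare, the hours can be converted into speedup factors and per-frame time. The snippet below only re-expresses the table's own figures; the variable names are illustrative.

```python
# Hours per 10k frames, copied from the table above.
hours_per_10k_frames = {
    "Manual only": 400,
    "AI pre-label + review": 50,
    "SAM4D data engine": 4,
}

FRAMES = 10_000
baseline_hours = hours_per_10k_frames["Manual only"]

# Speedup relative to fully manual labeling.
speedup = {
    name: baseline_hours / hours
    for name, hours in hours_per_10k_frames.items()
}

# Average wall-clock seconds spent per frame.
seconds_per_frame = {
    name: hours * 3600 / FRAMES
    for name, hours in hours_per_10k_frames.items()
}

for name in hours_per_10k_frames:
    print(f"{name}: {speedup[name]:.0f}x vs manual, "
          f"{seconds_per_frame[name]:.2f} s/frame")
# Speedups are 1x, 8x, and 100x; per-frame times are 144 s, 18 s, and 1.44 s.
```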

### Quality Assurance Strategies

1. **Consensus sampling**: Multiple annotators on subset, measure agreement
2. **Active learning**: Route uncertain predictions to experts
3. **Tiered review**: Tier 1 (critical objects) gets SME validation; Tier 2/3 use AI confidence thresholds
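
The active-learning and tiered-review strategies above amount to a routing rule applied to each pre-label. The following is a minimal sketch, assuming a hypothetical set of Tier 1 classes and a hypothetical auto-accept threshold of 0.9; neither value comes from a specific tool.

```python
from dataclasses import dataclass

# Hypothetical configuration: adjust both to your own taxonomy and risk tolerance.
TIER_1_CLASSES = {"pedestrian", "cyclist"}
AUTO_ACCEPT_THRESHOLD = 0.9


@dataclass
class PreLabel:
    object_id: str
    category: str
    confidence: float  # model confidence in [0, 1]


def route(label: PreLabel) -> str:
    """Decide which review queue a model-generated label goes to."""
    if label.category in TIER_1_CLASSES:
        # Critical objects always get subject-matter-expert validation.
        return "sme_review"
    if label.confidence >= AUTO_ACCEPT_THRESHOLD:
        # Accepted without per-label review; still eligible for consensus sampling.
        return "auto_accept"
    # Uncertain predictions are routed to a human annotator.
    return "annotator_review"
```

For example, `route(PreLabel("obj-1", "traffic_sign", 0.42))` returns `"annotator_review"`, while any `pedestrian` label goes to `"sme_review"` regardless of confidence.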

---

## Why Specialized Training > VLMs for 3D

### The Core Trade-off

| Aspect | Specialized (YOLO, PointPillars) | VLMs (GPT-4V, Gemini) |
|--------|----------------------------------|----------------------|
| **Latency** | 10-50ms (real-time) | 500-2000ms |
| **3D precision** | Strong geometric priors | Noisy text-3D alignment |
| **Novel objects** | Closed-set (what you train) | Open-vocabulary |
| **Compute** | Edge-deployable | Datacenter GPU or hosted API for large models |
| **Hallucinations** | No free-form hallucination (false positives and misses still occur) | Yes (safety-critical risk) |
| **Domain shift** | Struggles (fog, night) | Better generalization |
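
The latency row is easiest to read against a per-frame budget. As a rough sketch, assuming a 10 Hz sensor (a common spin rate for automotive LiDAR, used here as an assumption) and a hypothetical helper name:

```python
def fits_realtime(latency_ms: float, sensor_hz: float = 10.0) -> bool:
    """True if one inference finishes before the next frame arrives.

    sensor_hz defaults to 10 Hz, an assumed frame rate; substitute your sensor's.
    """
    frame_budget_ms = 1000.0 / sensor_hz  # 100 ms per frame at 10 Hz
    return latency_ms <= frame_budget_ms


def max_frames_per_second(latency_ms: float) -> float:
    """Upper bound on throughput for strictly sequential inference."""
    return 1000.0 / latency_ms
```

Under that assumption the specialized range from the table (10-50 ms) fits inside the 100 ms budget, while the VLM range (500-2000 ms) caps sequential throughput at 2 frames per second or less.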

### When to Use Each

**Use Specialized Models When:**
- Real-time inference required (autonomous vehicles, robotics)
- Known object classes (infrastructure defects, crop types)
- Safety-critical deployment (can't tolerate hallucinations)
- Edge deployment (drones, embedded systems)

**Use VLMs/Foundation Models When:**
- Zero-shot exploration of new domains
- Generating training data (weak labels)
- Open-vocabulary requirements ("find anything damaged")
- Domain adaptation bootstrapping

### The Hybrid Architecture (2025+ Best Practice)

```
                    ┌───────────────────────┐
                    │    VLM (Slow Brain)   │
                    │  • Scene understanding│
                    │  • Open vocabulary    │
                    │  • Anomaly detection  │
                    └──────────┬────────────┘
                               │ High-level context
                               ▼
┌──────────────────────────────────────────────────────────┐
│              Specialized Detector (Fast Brain)           │
│  • Real-time inference (YOLO, PointPillars, CenterPoint)│
│  • Known object detection & tracking                    │
│  • Safety-critical decisions                            │
└──────────────────────────────────────────────────────────┘
```

**Examples**:
- VOLTRON: YOLOv8 + LLaMA2 for hazard identification
- DrivePI: Point clouds + multi-view + language instructions (0.5B Qwen2.5)
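
The slow-brain/fast-brain split can be sketched as a loop in which the detector runs on every frame and the VLM refreshes shared context only occasionally. This is a schematic illustration with stub functions, not the VOLTRON or DrivePI implementation; `fast_detector`, `slow_vlm`, `KNOWN_CLASSES`, and `vlm_every_n` are all hypothetical.

```python
# Hypothetical closed-set vocabulary of the specialized detector.
KNOWN_CLASSES = {"car", "pedestrian", "cyclist"}


def fast_detector(frame):
    """Stub for a real-time detector: reports only classes it was trained on."""
    return [obj for obj in frame["objects"] if obj in KNOWN_CLASSES]


def slow_vlm(frame):
    """Stub for a VLM: flags anything outside the closed set as scene context."""
    unknown = [obj for obj in frame["objects"] if obj not in KNOWN_CLASSES]
    return {"unknown_objects": unknown}


def run_hybrid(frames, vlm_every_n=10):
    """Run the detector on every frame; refresh VLM context every N frames.

    A real system would call the VLM asynchronously so it never blocks the
    detector. It is synchronous here only to keep the sketch short.
    """
    context = {"unknown_objects": []}
    outputs = []
    for index, frame in enumerate(frames):
        if index % vlm_every_n == 0:
            context = slow_vlm(frame)
        outputs.append({
            "frame": index,
            "detections": fast_detector(frame),
            "context": context,  # most recent high-level context
        })
    return outputs
```

The design point is that safety-critical detections never wait on the VLM: stale context is tolerated, a late detection is not.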

---

## Vertical-Specific Training Architecture

### Infrastructure Inspection

**Objects**: Utility poles, insulators, conductors, vegetation, damage types
**Sensor fusion**: RGB + thermal + LiDAR
**Training data needs**:
- Thermal anomaly samples (varied temperatures)
- Damage taxonomy (cracks, corrosion, rust grades)
- Vegetation clearance measurements

**Architecture**:
```
LiDAR → Point cloud encoder → ┐
Thermal → 2D encoder       → ├→ Fusion → Multi-task head
RGB → 2D encoder           → ┘          ├→ Object detection
                                         ├→ Defect classification
                                         └→ Clearance regression
```
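
Ground-truth targets for the clearance-regression head can be derived geometrically from labeled points rather than annotated by hand. A minimal sketch, assuming points are already labeled as conductor or vegetation and expressed in metres; `min_clearance` and the 4.0 m requirement are hypothetical, and the brute-force search is only suitable for small point sets.

```python
import math


def min_clearance(conductor_points, vegetation_points):
    """Smallest Euclidean distance between any conductor and vegetation point.

    Brute force, O(n * m). Large clouds would need a spatial index (e.g. a KD-tree).
    """
    if not conductor_points or not vegetation_points:
        raise ValueError("both point sets must be non-empty")
    return min(
        math.dist(c, v)
        for c in conductor_points
        for v in vegetation_points
    )


def violates_clearance(conductor_points, vegetation_points, required_m=4.0):
    """Flag a span whose vegetation is closer than the required distance.

    required_m is a placeholder; real values depend on voltage and regulation.
    """
    return min_clearance(conductor_points, vegetation_points) < required_m
```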

### Autonomous Driving

**Objects**: Vehicles, pedestrians, cyclists, traffic signs, lane markings
**Key requirement**: Temporal consistency (track-IDs across frames)
**Training data needs**:
- Long-tail scenarios (emergency vehicles, animals, debris)
- Adverse weather (fog, rain, snow, night)
- Edge cases (construction zones, accidents)

**Architecture**: CenterPoint, PointPillars, or other voxel-based detectors with a BEV (Bird's Eye View) representation.
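
The BEV representation these detectors share starts by binning points into a ground-plane grid. The sketch below shows only that binning step under assumed ranges and cell size (the values are hypothetical, not a published configuration); the learned per-pillar encoder is out of scope.

```python
from collections import defaultdict


def pillarize(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0), cell=0.5):
    """Group (x, y, z) points into vertical pillars on a BEV grid.

    Returns {(row, col): [points]}. Points outside the ranges are dropped.
    The ranges and cell size are illustrative defaults.
    """
    pillars = defaultdict(list)
    for x, y, z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        col = int((x - x_range[0]) // cell)
        row = int((y - y_range[0]) // cell)
        pillars[(row, col)].append((x, y, z))
    return dict(pillars)


def bev_height_map(pillars):
    """One simple hand-crafted BEV feature: maximum z per occupied cell."""
    return {cell: max(z for _, _, z in pts) for cell, pts in pillars.items()}
```

Annotation tools often render a similar top-down height map, which is one reason BEV-space boxes are convenient to review.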

### Agriculture/Wildfire

**Objects**: Crop rows, canopy height, fuel load, fire spread boundaries
**Sensor fusion**: RGB + multispectral + LiDAR
**Training data needs**:
- Crop growth stages
- Disease/pest visual signatures
- Fuel load density from LiDAR CHM (Canopy Height Model)
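
A Canopy Height Model is the difference between the surface elevation and the terrain elevation in each grid cell. A minimal sketch, assuming each return already carries a ground/non-ground flag and that the lowest ground return in a cell approximates the terrain (real pipelines usually interpolate a terrain model instead); the function name and 1 m cell size are hypothetical.

```python
import math


def canopy_height_model(points, cell=1.0):
    """Compute per-cell canopy height from (x, y, z, is_ground) returns.

    surface = highest return in the cell
    terrain = lowest ground-classified return in the cell (simplifying assumption)
    Cells with no ground return are skipped rather than guessed.
    """
    surface = {}
    terrain = {}
    for x, y, z, is_ground in points:
        key = (math.floor(x / cell), math.floor(y / cell))
        surface[key] = max(surface.get(key, z), z)
        if is_ground:
            terrain[key] = min(terrain.get(key, z), z)
    return {key: surface[key] - terrain[key] for key in surface if key in terrain}
```

The resulting per-cell heights are the kind of regression target a text-only model cannot supply, since they come directly from measured geometry.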

**Why not just VLM?** VLMs can't:
- Measure precise heights (LiDAR regression)
- Classify at multispectral wavelengths
- Maintain spatial precision for prescription maps

---

## Common Anti-Patterns

### Anti-Pattern: "Just Use SAM on Everything"

**Novice thinking**: "SAM segments anything, so I'll just run it on my LiDAR data"

**Reality**:
- SAM 1 and SAM 2 are 2D image/video models; they don't understand 3D geometry
- Point clouds need Point-SAM or SAM4D specifically
- Raw application produces noisy masks without geometric priors

**Correct approach**: Use Point-SAM for native 3D, or project to 2D for SAM → lift back to 3D.
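
The project-and-lift route can be sketched with a pinhole camera model: project each LiDAR point into the image, look up the 2D mask at that pixel, and copy the result back to the point. This sketch assumes the points are already transformed into the camera frame (z forward) and ignores occlusion and lens distortion; `lift_mask_to_points` is a hypothetical helper, and the mask could come from any 2D segmenter.

```python
import math


def lift_mask_to_points(points_cam, mask, fx, fy, cx, cy):
    """Assign each 3D point the value of the 2D mask pixel it projects onto.

    points_cam: iterable of (x, y, z) in the camera frame, z pointing forward.
    mask: 2D list of booleans indexed as mask[row][col].
    fx, fy, cx, cy: pinhole intrinsics in pixels.
    Returns a list of booleans, one per point.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    labels = []
    for x, y, z in points_cam:
        if z <= 0:
            labels.append(False)  # behind the camera, cannot be in the image
            continue
        col = math.floor(fx * x / z + cx)
        row = math.floor(fy * y / z + cy)
        inside = 0 <= col < width and 0 <= row < height
        labels.append(bool(inside and mask[row][col]))
    return labels
```

Because occlusion is not handled, points behind a masked object inherit its label; depth-based filtering or clustering in 3D is a common cleanup step after lifting.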

### Anti-Pattern: Skipping Human Validation

**Novice thinking**: "AI pre-labels are 95% accurate, we can skip review"

**Reality**:
- 5% error on 100k objects = 5,000 wrong labels
- Errors compound in edge cases (exactly where you need accuracy)
- Model learns to reproduce annotation mistakes

**Correct approach**: Tier 1 (safety-critical) always human-validated. Use confidence thresholds for Tier 2/3.

### Anti-Pattern: VLM for Real-Time Inference

**Novice thinking**: "GPT-4V can identify damage in my photos, so I'll run it on every frame in production"

**Reality**:
- 500-2000ms latency per frame
- Can't run on edge devices
- Hallucination risk in safety-critical contexts

**Correct approach**: Use VLM for data generation/exploration, specialized model for deployment.

### Anti-Pattern: Single-Modal Training

**Novice thinking**: "LiDAR is enough for 3D detection"

**Reality**:
- LiDAR: Precise geometry, no color/texture
- Camera: Rich semantics, no depth
- Fusion typically outperforms single-modal detection (roughly 5-15% mAP, depending on dataset and object class)

**Correct approach**: Sensor fusion from day one. SAM4D shows fusion pseudo-labels > single-modal.

---

## Decision Tree: Choosing Your Approach

```
                        Do you need real-time inference?
                              /                  \
                           YES                    NO
                            |                      |
                    Use specialized           Is this exploration?
                    detector (YOLO,              /        \
                    CenterPoint)               YES         NO
                            |                  |           |
                    Have labeled data?     Use VLM      Generate
                      /        \           for zero-    pseudo-labels
                   YES          NO         shot         with SAM4D
                    |            |
              Train model    Use SAM4D/
                             Point-SAM for
                             auto-labeling
```
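
The same tree can be written as a small function, which is handy when the choice needs to be recorded in a project config or checked in review. The function name and parameters are illustrative; the returned strings mirror the leaves of the diagram above.

```python
def choose_approach(needs_realtime: bool,
                    has_labeled_data: bool = False,
                    is_exploration: bool = False) -> str:
    """Encode the decision tree above as nested conditions."""
    if needs_realtime:
        # Real-time branch: always ends in a specialized detector.
        if has_labeled_data:
            return "Train a specialized detector (YOLO, CenterPoint)"
        return "Use SAM4D/Point-SAM for auto-labeling"
    # Offline branch: exploration vs. dataset building.
    if is_exploration:
        return "Use a VLM for zero-shot exploration"
    return "Generate pseudo-labels with SAM4D"
```

For instance, `choose_approach(needs_realtime=True, has_labeled_data=False)` returns `"Use SAM4D/Point-SAM for auto-labeling"`.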

---

## Tool Selection Decision Matrix

| Requirement | Recommended Tool |
|-------------|------------------|
| Autonomous driving at scale | Deepen AI or BasicAI |
| R&D/research flexibility | Supervisely or Segments.ai |
| Multi-modal (camera+LiDAR+radar) | Ango Hub or Dataloop |
| Self-hosted/open source | CVAT + 3D plugins or 3D BAT |
| Robotics perception | Segments.ai (2D+3D sync) |
| Budget-conscious | Label Studio + custom scripts |

---

## References

- `/references/sam4d-architecture.md` - Deep dive on SAM4D UMPE and data engine
- `/references/tool-comparison-matrix.md` - Detailed feature comparison of all tools
- `/references/hybrid-architecture-examples.md` - VOLTRON, DrivePI implementation patterns
- `/references/vertical-training-recipes.md` - Infrastructure, AV, agriculture specifics

---

## Sources

- [SAM4D: Segment Anything in Camera and LiDAR Streams](https://sam4d-project.github.io/) (ICCV 2025)
- [Point-SAM: Promptable 3D Segmentation Model](https://point-sam.github.io/) (ICLR 2025)
- [Segments.ai: 8 Best Point Cloud Labeling Tools](https://segments.ai/blog/the-8-best-point-cloud-labeling-tools/)
- [A Review of 3D Object Detection with Vision-Language Models](https://arxiv.org/html/2504.18738v1)
- [Vision-Language Models in Autonomous Driving Survey](https://arxiv.org/html/2310.14414v2)
