3d-cv-labeling-2026
Expert in 3D computer vision labeling tools, workflows, and AI-assisted annotation for LiDAR, point clouds, and sensor fusion. Covers SAM4D/Point-SAM, human-in-the-loop architectures, and vertical-specific training strategies. Activate on '3D labeling', 'point cloud annotation', 'LiDAR labeling', 'SAM 3D', 'SAM4D', 'sensor fusion annotation', '3D bounding box', 'semantic segmentation point cloud'. NOT for 2D image labeling (use clip-aware-embeddings), general ML training (use ml-engineer), video annotation without 3D (use computer-vision-pipeline), or VLM prompt engineering (use prompt-engineer).
Best use case
3d-cv-labeling-2026 is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Expert in 3D computer vision labeling tools, workflows, and AI-assisted annotation for LiDAR, point clouds, and sensor fusion. Covers SAM4D/Point-SAM, human-in-the-loop architectures, and vertical-specific training strategies. Activate on '3D labeling', 'point cloud annotation', 'LiDAR labeling', 'SAM 3D', 'SAM4D', 'sensor fusion annotation', '3D bounding box', 'semantic segmentation point cloud'. NOT for 2D image labeling (use clip-aware-embeddings), general ML training (use ml-engineer), video annotation without 3D (use computer-vision-pipeline), or VLM prompt engineering (use prompt-engineer).
Teams using 3d-cv-labeling-2026 should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/3d-cv-labeling-2026/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How 3d-cv-labeling-2026 Compares
| Feature / Agent | 3d-cv-labeling-2026 | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Expert in 3D computer vision labeling tools, workflows, and AI-assisted annotation for LiDAR, point clouds, and sensor fusion. Covers SAM4D/Point-SAM, human-in-the-loop architectures, and vertical-specific training strategies. Activate on '3D labeling', 'point cloud annotation', 'LiDAR labeling', 'SAM 3D', 'SAM4D', 'sensor fusion annotation', '3D bounding box', 'semantic segmentation point cloud'. NOT for 2D image labeling (use clip-aware-embeddings), general ML training (use ml-engineer), video annotation without 3D (use computer-vision-pipeline), or VLM prompt engineering (use prompt-engineer).
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
ChatGPT vs Claude for Agent Skills
Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.
SKILL.md Source
# 3D Computer Vision Labeling Expert (2026)
Expert guidance on 3D annotation tools, AI-assisted labeling workflows, and training architectures for LiDAR/point cloud computer vision in autonomous vehicles, robotics, infrastructure inspection, and geospatial applications.
## When to Use This Skill
✅ **Use for:**
- Selecting 3D point cloud annotation tools (BasicAI, Supervisely, Segments.ai, Deepen AI)
- Implementing SAM4D/Point-SAM for auto-labeling workflows
- Designing human-in-the-loop annotation pipelines
- Sensor fusion annotation (camera + LiDAR + radar)
- Training architecture decisions: specialized models vs VLMs
- Vertical-specific 3D detection (autonomous driving, inspection, agriculture, wildfire)
❌ **NOT for:**
- 2D image labeling without 3D context (use clip-aware-embeddings or Label Studio docs)
- General ML model training (use ml-engineer)
- Video annotation without point clouds (use computer-vision-pipeline)
- VLM prompt engineering (use prompt-engineer)
- Photogrammetry/3D reconstruction (use geo processing tools)
---
## 2026 Tool Landscape Overview
### Commercial Leaders
| Tool | Strength | Best For | Key AI Feature |
|------|----------|----------|----------------|
| **BasicAI** | One-click detection | Autonomous driving | Pre-labeling models fine-tuned for AV |
| **Supervisely** | Customization | R&D teams | AI tracking, 2D→3D single-click |
| **Segments.ai** | 2D+3D sync | Robotics perception | Sequential propagation |
| **Deepen AI** | Sensor calibration | In-house perception | Pixel-perfect multi-sensor |
| **Dataloop** | Enterprise MLOps | Large annotation teams | Model-assisted + Point Cloud Focus |
| **Encord** | Full workflow | Multi-modal projects | Track-ID management |
| **Ango Hub (iMerit)** | Dense annotation | Complex multi-modal | Frame-to-frame propagation |
### Open Source Options
| Tool | Maturity | Limitations |
|------|----------|-------------|
| **CVAT** | Stable | 3D bounding boxes only, limited interpolation |
| **3D BAT** | Good | Full-surround annotation, semi-auto tracking |
| **Label Studio** | Partial 3D | Better for multi-format, not specialized 3D |
---
## SAM Evolution for 3D (2024-2026)
### SAM4D (ICCV 2025) - Multi-Modal + Temporal
**Key innovation**: Unified Multi-modal Positional Encoding (UMPE) aligns camera and LiDAR in shared 3D space.
```
Camera Stream → Feature Extraction → ┐
├→ UMPE Alignment → Promptable 3D Segmentation
LiDAR Stream → Point Encoding → ┘
```
**Data engine breakthrough**: Automatic pseudo-label generation at 100x+ faster than human annotation using:
1. VFM-driven video masklets
2. Spatiotemporal 4D reconstruction
3. Cross-modal masklet fusion
**Dataset**: Waymo-4DSeg (300k+ camera-LiDAR aligned masklets)
### Point-SAM (ICLR 2025) - Native 3D Prompting
**Architecture**: Efficient transformer designed specifically for point clouds (not adapted from 2D).
**Knowledge distillation**: 2D SAM → 3D Point-SAM via data engine that generates:
- Part-level pseudo-labels
- Object-level pseudo-labels
**Benchmarks**: Outperforms state-of-the-art on indoor (ScanNet) and outdoor (nuScenes, Waymo) datasets.
### SAMNet++ (2025) - Hybrid Pipeline
Two-stage approach:
1. SAM performs unsupervised segmentation
2. Adapted PointNet++ refines for semantic accuracy
**Best for**: UAV/drone workflows where colorized point clouds from L1 LiDAR + RGB cameras are available.
---
## Human-in-the-Loop Architecture
### The Model-in-the-Loop Paradigm (2023-2026)
**Old approach**: Human labels → Train model → Deploy
**New approach**: Model assists → Human validates → Rapid iteration
```
┌─────────────────────────────────────────────────────────┐
│ LABELING PIPELINE │
├─────────────────────────────────────────────────────────┤
│ Raw Data → AI Pre-label → Human Review → QA Check │
│ │ │ │ │ │
│ │ SAM4D/VLM Corrections Consensus │
│ │ generates only where sampling │
│ │ proposals AI uncertain │
└─────────────────────────────────────────────────────────┘
```
### Efficiency Gains
| Approach | Time for 10k frames | Annotation Quality |
|----------|--------------------|--------------------|
| Manual only | 400 hours | 95% (expert) |
| AI pre-label + review | 50 hours | 97% (AI+human) |
| SAM4D data engine | 4 hours | 92% (pseudo) |
**The 80/20 rule**: ~80% of ML project time is data prep. Model-in-the-loop cuts this dramatically.
### Quality Assurance Strategies
1. **Consensus sampling**: Multiple annotators on subset, measure agreement
2. **Active learning**: Route uncertain predictions to experts
3. **Tiered review**: Tier 1 (critical objects) get SME validation, Tier 2/3 use AI confidence thresholds
---
## Why Specialized Training > VLMs for 3D
### The Core Trade-off
| Aspect | Specialized (YOLO, PointPillars) | VLMs (GPT-4V, Gemini) |
|--------|----------------------------------|----------------------|
| **Latency** | 10-50ms (real-time) | 500-2000ms |
| **3D precision** | Strong geometric priors | Noisy text-3D alignment |
| **Novel objects** | Closed-set (what you train) | Open-vocabulary |
| **Compute** | Edge-deployable | GPU cluster required |
| **Hallucinations** | None (deterministic) | Yes (safety-critical risk) |
| **Domain shift** | Struggles (fog, night) | Better generalization |
### When to Use Each
**Use Specialized Models When:**
- Real-time inference required (autonomous vehicles, robotics)
- Known object classes (infrastructure defects, crop types)
- Safety-critical deployment (can't tolerate hallucinations)
- Edge deployment (drones, embedded systems)
**Use VLMs/Foundation Models When:**
- Zero-shot exploration of new domains
- Generating training data (weak labels)
- Open-vocabulary requirements ("find anything damaged")
- Domain adaptation bootstrapping
### The Hybrid Architecture (2025+ Best Practice)
```
┌───────────────────────┐
│ VLM (Slow Brain) │
│ • Scene understanding│
│ • Open vocabulary │
│ • Anomaly detection │
└──────────┬────────────┘
│ High-level context
▼
┌──────────────────────────────────────────────────────────┐
│ Specialized Detector (Fast Brain) │
│ • Real-time inference (YOLO, PointPillars, CenterPoint)│
│ • Known object detection & tracking │
│ • Safety-critical decisions │
└──────────────────────────────────────────────────────────┘
```
**Examples**:
- VOLTRON: YOLOv8 + LLaMA2 for hazard identification
- DrivePI: Point clouds + multi-view + language instructions (0.5B Qwen2.5)
---
## Vertical-Specific Training Architecture
### Infrastructure Inspection
**Objects**: Utility poles, insulators, conductors, vegetation, damage types
**Sensor fusion**: RGB + thermal + LiDAR
**Training data needs**:
- Thermal anomaly samples (varied temperatures)
- Damage taxonomy (cracks, corrosion, rust grades)
- Vegetation clearance measurements
**Architecture**:
```
LiDAR → Point cloud encoder → ┐
Thermal → 2D encoder → ├→ Fusion → Multi-task head
RGB → 2D encoder → ┘ ├→ Object detection
├→ Defect classification
└→ Clearance regression
```
### Autonomous Driving
**Objects**: Vehicles, pedestrians, cyclists, traffic signs, lane markings
**Key requirement**: Temporal consistency (track-IDs across frames)
**Training data needs**:
- Long-tail scenarios (emergency vehicles, animals, debris)
- Adverse weather (fog, rain, snow, night)
- Edge cases (construction zones, accidents)
**Architecture**: CenterPoint, PointPillars, or Voxel-based detectors with BEV (Bird's Eye View) representation.
### Agriculture/Wildfire
**Objects**: Crop rows, canopy height, fuel load, fire spread boundaries
**Sensor fusion**: RGB + multispectral + LiDAR
**Training data needs**:
- Crop growth stages
- Disease/pest visual signatures
- Fuel load density from LiDAR CHM (Canopy Height Model)
**Why not just VLM?** VLMs can't:
- Measure precise heights (LiDAR regression)
- Classify at hyperspectral wavelengths
- Maintain spatial precision for prescription maps
---
## Common Anti-Patterns
### Anti-Pattern: "Just Use SAM on Everything"
**Novice thinking**: "SAM segments anything, so I'll just run it on my LiDAR data"
**Reality**:
- SAM 1/2 are 2D models—they don't understand 3D geometry
- Point clouds need Point-SAM or SAM4D specifically
- Raw application produces noisy masks without geometric priors
**Correct approach**: Use Point-SAM for native 3D, or project to 2D for SAM → lift back to 3D.
### Anti-Pattern: Skipping Human Validation
**Novice thinking**: "AI pre-labels are 95% accurate, we can skip review"
**Reality**:
- 5% error on 100k objects = 5,000 wrong labels
- Errors compound in edge cases (exactly where you need accuracy)
- Model learns to reproduce annotation mistakes
**Correct approach**: Tier 1 (safety-critical) always human-validated. Use confidence thresholds for Tier 2/3.
### Anti-Pattern: VLM for Real-Time Inference
**Novice thinking**: "GPT-4V can identify damage in my photos"
**Reality**:
- 500-2000ms latency per frame
- Can't run on edge devices
- Hallucination risk in safety-critical contexts
**Correct approach**: Use VLM for data generation/exploration, specialized model for deployment.
### Anti-Pattern: Single-Modal Training
**Novice thinking**: "LiDAR is enough for 3D detection"
**Reality**:
- LiDAR: Precise geometry, no color/texture
- Camera: Rich semantics, no depth
- Fusion outperforms single-modal by 5-15% mAP
**Correct approach**: Sensor fusion from day one. SAM4D shows fusion pseudo-labels > single-modal.
---
## Decision Tree: Choosing Your Approach
```
Do you need real-time inference?
/ \
YES NO
| |
Use specialized Is this exploration?
detector (YOLO, / \
CenterPoint) YES NO
| | |
Have labeled data? Use VLM Generate
/ \ for zero- pseudo-labels
YES NO shot with SAM4D
| |
Train model Use SAM4D/
Point-SAM for
auto-labeling
```
---
## Tool Selection Decision Matrix
| Requirement | Recommended Tool |
|-------------|------------------|
| Autonomous driving at scale | Deepen AI or BasicAI |
| R&D/research flexibility | Supervisely or Segments.ai |
| Multi-modal (camera+LiDAR+radar) | Ango Hub or Dataloop |
| Self-hosted/open source | CVAT + 3D plugins or 3D BAT |
| Robotics perception | Segments.ai (2D+3D sync) |
| Budget-conscious | Label Studio + custom scripts |
---
## References
- `/references/sam4d-architecture.md` - Deep dive on SAM4D UMPE and data engine
- `/references/tool-comparison-matrix.md` - Detailed feature comparison of all tools
- `/references/hybrid-architecture-examples.md` - VOLTRON, DrivePI implementation patterns
- `/references/vertical-training-recipes.md` - Infrastructure, AV, agriculture specifics
---
## Sources
- [SAM4D: Segment Anything in Camera and LiDAR Streams](https://sam4d-project.github.io/) (ICCV 2025)
- [Point-SAM: Promptable 3D Segmentation Model](https://point-sam.github.io/) (ICLR 2025)
- [Segments.ai: 8 Best Point Cloud Labeling Tools](https://segments.ai/blog/the-8-best-point-cloud-labeling-tools/)
- [A Review of 3D Object Detection with Vision-Language Models](https://arxiv.org/html/2504.18738v1)
- [Vision-Language Models in Autonomous Driving Survey](https://arxiv.org/html/2310.14414v2)Related Skills
2026-legal-research-agent
Expert legal research agent for finding and scraping expungement data state by state. Knows authoritative sources, URL patterns, Firecrawl configuration, and 2026 legal landscape. Activate on "find expungement data", "scrape state laws", "legal research", "court URLs", "statute sources", "Clean Slate laws", "automatic expungement research". NOT for interpreting laws (use national-expungement-expert), building UI, or legal advice.
lets-go-rss
A lightweight, full-platform RSS subscription manager that aggregates content from YouTube, Vimeo, Behance, Twitter/X, and Chinese platforms like Bilibili, Weibo, and Douyin, featuring deduplication and AI smart classification.
ux
This AI agent skill provides comprehensive guidance for creating professional and insightful User Experience (UX) designs, covering user research, information architecture, interaction design, visual guidance, and usability evaluation. It aims to produce actionable, user-centered solutions that avoid generic AI aesthetics.
chrome-debug
This skill empowers AI agents to debug web applications and inspect browser behavior using the Chrome DevTools Protocol (CDP), offering both collaborative (headful) and automated (headless) modes.
vly-money
Generate crypto payment links for supported tokens and networks, manage access to X402 payment-protected content, and provide direct access to the vly.money wallet interface.
thor-skills
An entry point and router for AI agents to manage various THOR-related cybersecurity tasks, including running scans, analyzing logs, troubleshooting, and maintenance.
grail-miner
This skill assists in setting up, managing, and optimizing Grail miners on Bittensor Subnet 81, handling tasks like environment configuration, R2 storage, model checkpoint management, and performance tuning.
ontopo
An AI agent skill to search for Israeli restaurants, check table availability, view menus, and retrieve booking links via the Ontopo platform, acting as an unofficial interface to its data.
tech-blog
Generates comprehensive technical blog posts, offering detailed explanations of system internals, architecture, and implementation, either through source code analysis or document-driven research.
astro
This skill provides essential Astro framework patterns, focusing on server-side rendering (SSR), static site generation (SSG), middleware, and TypeScript best practices. It helps AI agents implement secure authentication, manage API routes, and debug rendering behaviors within Astro projects.
modal-deployment
Run Python code in the cloud with serverless containers, GPUs, and autoscaling using Modal. This skill enables agents to generate code for deploying ML models, running batch jobs, serving APIs, and scaling compute-intensive workloads.
whisper-transcribe
Transcribes audio and video files to text using OpenAI's Whisper CLI, enhanced with contextual grounding from local markdown files for improved accuracy.