
computer-vision-expert skill


This skill guides you to design and optimize real-time vision systems using YOLO26, SAM 3, and VLMs for edge-aware spatial analysis.

This is most likely a fork of the `computer-vision-expert` skill from openclaw.

```
npx playbooks add skill sickn33/antigravity-awesome-skills --skill computer-vision-expert
```

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (3.8 KB)
---
name: computer-vision-expert
description: SOTA Computer Vision Expert (2026). Specialized in YOLO26, Segment Anything 3 (SAM 3), Vision Language Models, and real-time spatial analysis.
---

# Computer Vision Expert (SOTA 2026)

**Role**: Advanced Vision Systems Architect & Spatial Intelligence Expert

## Purpose
To provide expert guidance on designing, implementing, and optimizing state-of-the-art computer vision pipelines, from real-time object detection with YOLO26 to foundation-model segmentation with SAM 3 and visual reasoning with VLMs.

## When to Use
- Designing high-performance real-time detection systems (YOLO26).
- Implementing zero-shot or text-guided segmentation tasks (SAM 3).
- Building spatial awareness, depth estimation, or 3D reconstruction systems.
- Optimizing vision models for edge device deployment (ONNX, TensorRT, NPU).
- Bridging classical geometry (camera calibration) with modern deep learning.

## Capabilities

### 1. Unified Real-Time Detection (YOLO26)
- **NMS-Free Architecture**: Mastery of end-to-end inference without Non-Maximum Suppression (reducing latency and complexity).
- **Edge Deployment**: Optimization for low-power hardware using Distribution Focal Loss (DFL) removal and the MuSGD optimizer.
- **Improved Small-Object Recognition**: Expertise in using ProgLoss and STAL assignment for high precision in IoT and industrial settings.
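Whatever the detector version, YOLO-family models expect a fixed square input with the aspect ratio preserved by padding ("letterboxing"). The sketch below is a numpy-only illustration of that preprocessing step, using nearest-neighbour resizing to avoid extra dependencies; it is a generic YOLO convention, not a documented YOLO26 API.

```python
import numpy as np

def letterbox(img: np.ndarray, size: int = 640, pad_value: int = 114):
    """Resize an HxWx3 image to fit a size x size square, preserving
    aspect ratio, then pad the borders (YOLO-style letterboxing)."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Nearest-neighbour resize via index maps (keeps this numpy-only).
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    canvas = np.full((size, size, 3), pad_value, dtype=img.dtype)
    top = (size - new_h) // 2
    left = (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas, scale, (top, left)

frame = np.zeros((480, 640, 3), dtype=np.uint8)
padded, scale, offsets = letterbox(frame)
print(padded.shape, scale, offsets)  # (640, 640, 3) 1.0 (80, 0)
```

Keep `scale` and the offsets: they are needed to map predicted boxes back to original image coordinates.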

### 2. Promptable Segmentation (SAM 3)
- **Text-to-Mask**: Ability to segment objects using natural language descriptions (e.g., "the blue container on the right").
- **SAM 3D**: Reconstructing objects, scenes, and human bodies in 3D from single/multi-view images.
- **Unified Logic**: One model for detection, segmentation, and tracking, with roughly double the reported accuracy of SAM 2.
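Accuracy claims like the one above are typically measured with mask IoU between predicted and ground-truth masks. A minimal numpy implementation, useful for evaluating any promptable segmenter's output regardless of model version:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)

gt = np.zeros((8, 8), bool); gt[2:6, 2:6] = True      # 16-px square
pred = np.zeros((8, 8), bool); pred[3:7, 3:7] = True  # shifted by 1 px
print(round(mask_iou(pred, gt), 3))  # 9 / 23 -> 0.391
```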

### 3. Vision Language Models (VLMs)
- **Visual Grounding**: Leveraging Florence-2, PaliGemma 2, or Qwen2-VL for semantic scene understanding.
- **Visual Question Answering (VQA)**: Extracting structured data from visual inputs through conversational reasoning.

### 4. Geometry & Reconstruction
- **Depth Anything V2**: State-of-the-art monocular depth estimation for spatial awareness.
- **Sub-pixel Calibration**: Chessboard/Charuco pipelines for high-precision stereo/multi-camera rigs.
- **Visual SLAM**: Real-time localization and mapping for autonomous systems.
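The "sub-pixel" step in calibration pipelines can be illustrated with parabolic peak interpolation on a corner-response map: fit a parabola through the three samples around the integer maximum along each axis. This is a toy numpy sketch of the idea; production pipelines use gradient-based refinement such as OpenCV's `cv2.cornerSubPix`.

```python
import numpy as np

def subpixel_peak(resp: np.ndarray):
    """Refine the integer argmax of a 2-D response map to sub-pixel
    precision via a 3-point parabolic fit along each axis."""
    y, x = np.unravel_index(np.argmax(resp), resp.shape)

    def refine(minus, center, plus):
        denom = minus - 2 * center + plus
        return 0.0 if denom == 0 else 0.5 * (minus - plus) / denom

    dy = refine(resp[y - 1, x], resp[y, x], resp[y + 1, x])
    dx = refine(resp[y, x - 1], resp[y, x], resp[y, x + 1])
    return y + dy, x + dx

# Synthetic response map with its true peak at (10.3, 20.7).
ys, xs = np.mgrid[0:32, 0:32]
resp = -((ys - 10.3) ** 2 + (xs - 20.7) ** 2)
print(subpixel_peak(resp))  # ~(10.3, 20.7)
```

Because the synthetic response is exactly quadratic, the parabolic fit recovers the true peak; on real corner maps it reduces, not eliminates, localization error.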

## Patterns

### 1. Text-Guided Vision Pipelines
- Use SAM 3's text-to-mask capability to isolate specific parts during inspection without needing custom detectors for every variation.
- Combine YOLO26 for fast "candidate proposal" and SAM 3 for "precise mask refinement".
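The proposal/refinement cascade can be shown without either model's (unverified) API: a toy numpy stage that takes a detector's coarse box as a "prompt" and promotes it to a pixel mask by thresholding inside the box. The thresholding stands in for the promptable segmenter; everything here is illustrative, not YOLO26 or SAM 3 code.

```python
import numpy as np

def refine_box_to_mask(img: np.ndarray, box, thresh: float = 0.5):
    """Toy refinement stage: inside the detector's (x1, y1, x2, y2)
    box, replace the coarse rectangle with a pixel mask obtained by
    intensity thresholding. Stands in for a box-prompted segmenter."""
    x1, y1, x2, y2 = box
    mask = np.zeros(img.shape[:2], dtype=bool)
    mask[y1:y2, x1:x2] = img[y1:y2, x1:x2] > thresh
    return mask

# Synthetic frame: a bright blob inside a loose detector box.
img = np.zeros((32, 32))
img[10:14, 12:18] = 1.0        # the actual 4 x 6 object
box = (8, 6, 24, 20)           # coarse (x1, y1, x2, y2) proposal
mask = refine_box_to_mask(img, box)
print(mask.sum())  # 24 -- only the blob's pixels survive refinement
```

The pattern's payoff is the same in the real pipeline: the fast stage only needs to be roughly right, because the refinement stage recovers precise boundaries.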

### 2. Deployment-First Design
- Leverage YOLO26's simplified ONNX/TensorRT exports (NMS-free).
- Use MuSGD for significantly faster training convergence on custom datasets.

### 3. Progressive 3D Scene Reconstruction
- Integrate monocular depth maps with geometric homographies to build accurate 2.5D/3D representations of scenes.
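The core of any depth-to-3D step is pinhole back-projection: each pixel (u, v) with depth Z lifts to X = (u − cx)·Z/fx, Y = (v − cy)·Z/fy. A self-contained numpy sketch (intrinsics here are made-up example values):

```python
import numpy as np

def backproject(depth: np.ndarray, fx: float, fy: float,
                cx: float, cy: float) -> np.ndarray:
    """Lift an HxW depth map (metres per pixel) into an N x 3 point
    cloud using the pinhole camera model."""
    h, w = depth.shape
    vs, us = np.mgrid[0:h, 0:w]
    xs = (us - cx) * depth / fx
    ys = (vs - cy) * depth / fy
    return np.stack([xs, ys, depth], axis=-1).reshape(-1, 3)

depth = np.full((4, 4), 2.0)  # flat wall 2 m from the camera
pts = backproject(depth, fx=100.0, fy=100.0, cx=2.0, cy=2.0)
print(pts.shape)  # (16, 3)
```

Note that monocular depth (e.g. Depth Anything V2) is relative unless metrically calibrated, so scale must be fixed from geometry (known baseline, calibration target) before fusing into a metric reconstruction.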

## Anti-Patterns

- **Manual NMS Post-processing**: Stick to NMS-free architectures (YOLO26/v10+) for lower overhead.
- **Click-Only Segmentation**: Forgetting that SAM 3 eliminates the need for manual point prompts in many scenarios via text grounding.
- **Legacy DFL Exports**: Using outdated export pipelines that don't take advantage of YOLO26's simplified module structure.

## Sharp Edges (2026)

| Issue | Severity | Solution |
|-------|----------|----------|
| SAM 3 VRAM Usage | Medium | Use quantized/distilled versions for local GPU inference. |
| Text Ambiguity | Low | Use descriptive prompts ("the 5mm bolt" instead of just "bolt"). |
| Motion Blur | Medium | Optimize shutter speed or use SAM 3's temporal tracking consistency. |
| Hardware Compatibility | Low | YOLO26 simplified architecture is highly compatible with NPU/TPUs. |

## Related Skills
`ai-engineer`, `robotics-expert`, `research-engineer`, `embedded-systems`

Overview

This skill is a state-of-the-art Computer Vision Expert specializing in YOLO26, Segment Anything 3 (SAM 3), Vision Language Models, and real-time spatial analysis. It provides actionable guidance to design, optimize, and deploy high-performance vision pipelines from edge devices to large-scale perception systems. The focus is on practical trade-offs, deployment patterns, and integration between detection, segmentation, and 3D reconstruction.

How this skill works

I inspect problem requirements (latency, accuracy, hardware) and recommend an architecture that blends YOLO26 for fast proposals, SAM 3 for promptable mask refinement, and VLMs for semantic grounding. I outline model optimization paths (ONNX/TensorRT/NPU), training strategies (MuSGD, ProgLoss), and geometry pipelines (monocular depth, calibration, SLAM) to produce production-ready systems. I also provide mitigation steps for VRAM, motion blur, and ambiguous text prompts.

When to use it

  • Building real-time object detection systems with strict latency and power budgets (YOLO26).
  • Performing text-guided or zero-shot segmentation and mask refinement (SAM 3).
  • Adding visual reasoning or VQA with multimodal models for structured outputs.
  • Deploying vision models to edge NPUs/TPUs with ONNX or TensorRT exports.
  • Implementing spatial awareness: depth estimation, 3D reconstruction, or SLAM.

Best practices

  • Design deployment-first: validate ONNX/TensorRT export early and test quantized models on target hardware.
  • Use YOLO26 as a candidate proposal stage and SAM 3 for precise mask refinement to balance speed and accuracy.
  • Prefer text-guided prompts for SAM 3; make prompts specific to reduce ambiguity.
  • Optimize training with MuSGD and modern assignment/loss schemes (STAL, ProgLoss) for small-object precision.
  • Combine monocular depth maps with geometric homographies for progressive 2.5D/3D reconstruction pipelines.

Example use cases

  • Industrial inspection: detect and segment small defects on conveyor belts with YOLO26 + SAM 3 refinement.
  • Robotics: real-time visual SLAM with Depth Anything V2 monocular depth and sub-pixel stereo calibration.
  • Retail analytics: text-guided segmentation to isolate products and extract attributes via a VLM.
  • Edge camera: deploy a quantized YOLO26 model on an NPU for low-power perimeter monitoring.
  • AR/3D capture: fuse multi-view SAM 3 masks with depth estimation to reconstruct objects for virtual staging.

FAQ

How do I reduce SAM 3 VRAM usage for local inference?

Use quantized or distilled SAM 3 variants, offload the prompt/text encoder to CPU where possible, and size inference batches to stay within peak GPU memory.
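Why quantization helps is simple arithmetic: weight memory scales linearly with bits per weight. The parameter count below is a placeholder, not SAM 3's actual size, and activations, caches, and framework overhead come on top of this figure.

```python
def weight_memory_gb(params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone."""
    return params * bits_per_weight / 8 / 1024**3

params = 650e6  # placeholder parameter count, NOT SAM 3's real size
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(params, bits):.2f} GB")
```

Halving the bit width halves the weight footprint, which is often the difference between fitting and not fitting on a consumer GPU.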

Should I keep NMS in a YOLO26 pipeline?

Prefer NMS-free YOLO26 architectures to lower latency and simplify exports; only add post-NMS if your application needs specific suppression behavior.
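If you do need custom suppression behavior, it is worth knowing exactly what you are re-adding. This is the classic greedy NMS algorithm in plain numpy: keep the highest-scoring box, drop every remaining box that overlaps it beyond an IoU threshold, repeat.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes.
    Returns the indices of the boxes that survive."""
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * \
                (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] -- the near-duplicate box 1 is dropped
```

The data-dependent loop and the IoU threshold tuning are exactly the latency and export complexity that NMS-free architectures remove.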