This skill helps you build, evaluate, and apply computer vision models for image classification, detection, and segmentation with practical examples.
```bash
npx playbooks add skill pluginagentmarketplace/custom-plugin-ai-data-scientist --skill computer-vision
```
---
name: computer-vision
description: Image processing, object detection, segmentation, and vision models. Use for image classification, object detection, or visual analysis tasks.
sasmp_version: "1.3.0"
bonded_agent: 04-machine-learning-ai
bond_type: SECONDARY_BOND
---
# Computer Vision
Build models to analyze and understand visual data.
## Quick Start
### Image Classification
```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
# Load a pre-trained model (torchvision >= 0.13 uses the `weights` API
# instead of the deprecated `pretrained=True`)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

# Preprocess with standard ImageNet statistics
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

img = Image.open('image.jpg').convert('RGB')  # ensure 3 channels
img_tensor = transform(img).unsqueeze(0)      # add batch dimension

# Predict
with torch.no_grad():
    output = model(img_tensor)
probabilities = torch.nn.functional.softmax(output[0], dim=0)
top5_prob, top5_ids = torch.topk(probabilities, 5)
print(top5_prob, top5_ids)  # indices map to ImageNet class labels
```
### Custom CNN
```python
import torch.nn as nn
class SimpleCNN(nn.Module):
    """Small CNN for 32x32 inputs (e.g. CIFAR-10)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),   # 16x16 -> 8x8
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)    # 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 512),  # 128 channels x 4x4 spatial
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x
```
## Data Augmentation
```python
from torchvision import transforms
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(
        brightness=0.2,
        contrast=0.2,
        saturation=0.2,
        hue=0.1
    ),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])
```
## Object Detection with YOLO
```python
from ultralytics import YOLO
# Load model
model = YOLO('yolov8n.pt')
# Predict
results = model('image.jpg')
# Process results
for result in results:
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0]   # corner coordinates
        confidence = box.conf[0]
        class_id = int(box.cls[0])
        print(f"Class: {class_id}, Confidence: {confidence:.2f}")
        print(f"Box: ({x1}, {y1}, {x2}, {y2})")

# Save annotated image
results[0].save('output.jpg')
```
## Image Segmentation
```python
# Semantic segmentation with DeepLabV3 (reuses torch, transforms, and the
# RGB PIL image `img` from Quick Start)
model = torch.hub.load(
    'pytorch/vision:v0.10.0',
    'deeplabv3_resnet50',
    pretrained=True
)
model.eval()

# Preprocess
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])
input_tensor = preprocess(img).unsqueeze(0)

# Predict per-pixel class labels
with torch.no_grad():
    output = model(input_tensor)['out'][0]
output_predictions = output.argmax(0)  # (H, W) tensor of class indices
```
## Transfer Learning
```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

num_classes = 10  # set to your dataset's number of classes

# Load pre-trained ResNet
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer (its new parameters train from scratch)
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, num_classes)

# Train only the final layer
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
```
## Image Processing with OpenCV
```python
import cv2
# Read image
img = cv2.imread('image.jpg')
# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Edge detection
edges = cv2.Canny(gray, 100, 200)
# Blur
blurred = cv2.GaussianBlur(img, (5, 5), 0)
# Resize
resized = cv2.resize(img, (224, 224))
# Draw a rectangle (example corner coordinates)
x1, y1, x2, y2 = 50, 50, 200, 200
cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
# Save
cv2.imwrite('output.jpg', img)
```
## Face Detection
```python
# Haar cascade (reuses `cv2` and `img` from the OpenCV section above)
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)
```
## Common Architectures
**Image Classification:**
- ResNet: Skip connections, deep networks
- EfficientNet: Compound scaling, efficient
- Vision Transformer (ViT): Attention-based
**Object Detection:**
- YOLO: Real-time, one-stage
- Faster R-CNN: Two-stage, accurate
- RetinaNet: Focal loss, handles class imbalance
**Segmentation:**
- U-Net: Encoder-decoder, medical imaging
- DeepLab: Atrous convolution, semantic segmentation
- Mask R-CNN: Instance segmentation
## Tips
1. Use pre-trained models for transfer learning
2. Apply data augmentation to prevent overfitting
3. Normalize images (ImageNet statistics)
4. Use appropriate loss functions (CrossEntropy, Focal Loss)
5. Monitor training with visualization
6. Test on diverse images
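Tip 3 is worth unpacking: `transforms.Normalize` with ImageNet statistics scales each channel to [0, 1], subtracts the per-channel mean, and divides by the per-channel standard deviation. A dependency-free sketch of what that does to a single pixel:

```python
# ImageNet channel statistics used throughout this skill
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Normalize one 8-bit RGB pixel the way transforms.Normalize does:
    scale to [0, 1], then subtract the channel mean and divide by std."""
    return [((c / 255.0) - m) / s for c, m, s in zip(rgb, MEAN, STD)]

# A pixel near the ImageNet average lands close to zero in every channel
print(normalize_pixel((124, 116, 104)))
```

Pre-trained models were trained on inputs distributed this way, so skipping normalization (or using different statistics) quietly degrades their accuracy.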
This skill consolidates practical computer vision recipes: classification with pre-trained models, custom CNNs, transfer learning, data augmentation, object detection with YOLO, semantic segmentation with DeepLab, and classical image processing with OpenCV. The snippets show how to load pre-trained models, preprocess inputs, run inference, customize models (freezing layers, replacing classifier heads), and parse and save detection and segmentation outputs. Use it to prototype, train, and deploy vision pipelines with PyTorch, Ultralytics YOLO, and OpenCV.
## FAQ

**Can I use these examples on GPU?**
Yes. The PyTorch and YOLO examples support CUDA: move the model and its input tensors to the same device (e.g. `.to('cuda')`), and make sure you have NVIDIA drivers and a CUDA-enabled PyTorch build installed.
**Which architecture is best for accuracy vs. speed?**
YOLO (one-stage) favors real-time speed; Faster R-CNN and Mask R-CNN prioritize accuracy for detection and instance segmentation; EfficientNet and ResNet variants balance accuracy and efficiency for classification.