
This skill helps you compare and implement modern deep neural network architectures for vision, NLP, and multimodal tasks with practical examples.

npx playbooks add skill doanchienthangdev/omgkit --skill dnn-architectures

---
name: dnn-architectures
description: Deep neural network architectures including CNNs, RNNs, Transformers, and modern architectures for vision, NLP, and multimodal tasks.
---

# DNN Architectures

Reference implementations and trade-off comparisons for modern deep neural network architectures.

## Convolutional Neural Networks

```python
import torch
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1)
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)
```
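A quick smoke test of the feature extractor's shapes (a standalone sketch; the `Sequential` mirrors the `features` stack above):

```python
import torch
import torch.nn as nn

# Mirror of the features stack above, to check shapes on a dummy batch
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
)

x = torch.randn(4, 3, 32, 32)    # batch of 4 RGB 32x32 images
out = features(x).flatten(1)     # (4, 256, 1, 1) -> (4, 256)
print(out.shape)                 # torch.Size([4, 256])
```

Because `AdaptiveAvgPool2d(1)` collapses the spatial dimensions, the classifier sees a fixed 256-dim vector regardless of input resolution.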

## Transformer Architecture

```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        # batch_first=True so inputs are (batch, seq_len, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + self.dropout(attn_out))
        # Feedforward with residual
        ff_out = self.ff(x)
        x = self.ln2(x + self.dropout(ff_out))
        return x
```
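PyTorch also ships an equivalent built-in block, `nn.TransformerEncoderLayer`, which can replace the hand-rolled version for quick experiments (hyperparameters here are illustrative):

```python
import torch
import torch.nn as nn

# Built-in equivalent of the block above (post-norm by default, GELU feedforward)
layer = nn.TransformerEncoderLayer(
    d_model=256, nhead=8, dim_feedforward=1024,
    dropout=0.1, activation="gelu", batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=4)

x = torch.randn(2, 16, 256)   # (batch, seq_len, d_model) with batch_first=True
out = encoder(x)
print(out.shape)              # torch.Size([2, 16, 256])
```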

## Vision Transformer (ViT)

```python
class ViT(nn.Module):
    def __init__(self, image_size, patch_size, num_classes, d_model, n_heads, n_layers):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))
        self.transformer = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_model * 4)
            for _ in range(n_layers)
        ])
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)
        cls_tokens = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls_tokens, patches], dim=1)
        x = x + self.pos_embed
        for block in self.transformer:
            x = block(x)
        return self.head(x[:, 0])
```
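The patch embedding is just a strided convolution; a standalone sketch shows how an image becomes a token sequence (ViT-Base-style settings assumed for illustration):

```python
import torch
import torch.nn as nn

image_size, patch_size, d_model = 224, 16, 768   # ViT-Base-style settings
patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, image_size, image_size)
patches = patch_embed(x).flatten(2).transpose(1, 2)  # (B, C, H', W') -> (B, N, D)
print(patches.shape)   # torch.Size([1, 196, 768]), since (224 // 16) ** 2 = 196 patches
```

Prepending the class token then yields the `num_patches + 1` sequence length that `pos_embed` expects.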

## Architecture Comparison

| Architecture | Best For | Params | Inference |
|--------------|----------|--------|-----------|
| ResNet | Image classification | 25M | Fast |
| EfficientNet | Efficient vision | 5-66M | Efficient |
| ViT | Vision + scale | 86-632M | GPU optimized |
| BERT | NLP understanding | 110-340M | Moderate |
| GPT | Text generation | 117M-175B | Heavy |
| T5 | Seq2seq tasks | 60M-11B | Heavy |

## Modern Architectures

```python
# Using pretrained models
from transformers import AutoModel

# Vision
vit = AutoModel.from_pretrained("google/vit-base-patch16-224")
clip = AutoModel.from_pretrained("openai/clip-vit-base-patch32")

# NLP
bert = AutoModel.from_pretrained("bert-base-uncased")
llama = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated; requires license acceptance on Hugging Face

# Multimodal
blip = AutoModel.from_pretrained("Salesforce/blip-image-captioning-base")
```

## Best Practices

1. Use pretrained models when possible
2. Match architecture to task
3. Consider compute budget
4. Scale model size with data size
5. Monitor memory usage

Overview

This skill catalogs modern deep neural network architectures for vision, NLP, and multimodal tasks, including CNNs, RNNs, Transformers, and Vision Transformers. It summarizes design patterns, trade-offs, and common pretrained models to help you pick and prototype architectures quickly. The content focuses on practical choices for training, inference, and scaling.

How this skill works

The skill explains core building blocks (convolutions, attention, feed‑forward blocks, patch embedding) and shows compact reference implementations for CNNs, Transformer blocks, and ViT-style models. It compares architectures by typical use cases, parameter ranges, and inference characteristics, and links to widely used pretrained models for quick experimentation. Best practices highlight when to reuse pretrained weights, how to match model size to data, and how to manage compute and memory.

When to use it

  • Image classification or feature extraction — use CNNs or ViT depending on scale and compute.
  • Language understanding or sentence encoding — use Transformer encoders like BERT.
  • Text generation or autoregressive tasks — choose decoder or encoder–decoder transformers (GPT family, T5).
  • Multimodal tasks (image captioning, retrieval) — use pretrained multimodal models like CLIP or BLIP.
  • Prototype quickly with pretrained weights, then scale or adapt architecture to dataset size.

Best practices

  • Start from pretrained checkpoints for faster convergence and improved accuracy.
  • Match model capacity to dataset size to avoid underfitting or overfitting.
  • Account for inference constraints: smaller, efficient models for CPU/edge; larger models for GPU/TPU.
  • Use LayerNorm, residual connections, and dropout in Transformers for stable training.
  • Monitor memory and batch size trade-offs; use gradient accumulation or mixed precision when needed.
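The last point can be sketched with autocast plus gradient accumulation (toy model and data; the accumulation step count is illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4

# CPU autocast shown so the sketch runs anywhere; on GPU use device_type="cuda"
# (with torch.amp.GradScaler when training in fp16).
for step in range(8):
    x, y = torch.randn(16, 64), torch.randint(0, 10, (16,))
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = nn.functional.cross_entropy(model(x), y) / accum_steps
    loss.backward()                      # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:    # update once per accumulated batch
        opt.step()
        opt.zero_grad()
```

Dividing the loss by `accum_steps` keeps the effective gradient the same as one large batch while only a micro-batch resides in memory at a time.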

Example use cases

  • Train a compact CNN for on‑device image classification with limited compute.
  • Fine‑tune a ViT or EfficientNet on a medium‑sized vision dataset for improved accuracy.
  • Fine‑tune BERT for sentence classification or NER tasks in NLP pipelines.
  • Use a pretrained CLIP or BLIP model for multimodal retrieval or captioning with minimal training.
  • Prototype text generation or summarization using GPT or T5 variants, scaling model size as budget allows.

FAQ

Should I always use pretrained models?

Use pretrained models when available; they significantly speed up convergence and improve performance, especially with limited labeled data.

When should I prefer ViT over CNNs?

Prefer ViT when you have large-scale data and GPU resources; lightweight CNNs or hybrid architectures are better for small datasets or CPU/edge inference.