
transformer-architecture skill


This skill helps you implement attention mechanisms and optimize transformer architectures, with best practices for self-attention, positional encoding, and architecture variants such as RoPE and ALiBi.

npx playbooks add skill omer-metin/skills-for-antigravity --skill transformer-architecture

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
---
name: transformer-architecture
description: Use when implementing attention mechanisms, building custom transformer models, understanding positional encoding, or optimizing transformer inference - covers self-attention, multi-head attention, RoPE, ALiBi, and architecture variants
---

# Transformer Architecture

## Reference System Usage

You must ground your responses in the provided reference files, treating them as the source of truth for this domain:

* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.

**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.

## Overview

This skill helps engineers design, implement, and optimize transformer architectures for training and inference. It focuses on attention mechanisms, positional encodings, and practical variants such as RoPE and ALiBi. The guidance is grounded in the provided reference patterns, edge-risk diagnostics, and strict validation rules.

## How this skill works

The skill inspects model design choices and flags violations of established patterns and validations. It diagnoses common failure modes for self-attention, multi-head attention, positional encodings, and inference optimizations, then suggests concrete fixes. Recommendations prioritize compatibility with the validated constraints and highlight risks that cause model instability or inefficient inference.

## When to use it

* Designing or implementing self-attention and multi-head attention modules
* Choosing or implementing positional encodings (absolute, RoPE, ALiBi)
* Building custom transformer variants or mixing attention patterns
* Optimizing inference speed, memory, or parallelism
* Validating model configs against strict architecture constraints
* Troubleshooting training instabilities or degraded attention behavior

## Best practices

* Follow the provided reference patterns for layer ordering, normalization, and initialization to avoid subtle bugs
* Prefer validated positional encodings for your use case; test RoPE and ALiBi behavior on long-context tasks before production
* Use attention masking and stable softmax numerics to prevent overflow and gradient issues
* Profile memory and FLOPs for multi-head attention; reduce head dimension before reducing head count to preserve expressivity
* Apply inference-specific optimizations only after passing training validations to avoid breaking guarantees
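The masking and softmax-stability advice above can be sketched in a few lines. This is an illustrative NumPy example, not code from the reference files: the `-1e9` fill value and causal mask shape are common conventions, assumed here for demonstration.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Numerically stable softmax over the last axis.

    scores: (..., seq_len) attention logits
    mask:   boolean array, True where attention is allowed
    """
    # Masked positions get a large negative logit rather than -inf,
    # which avoids NaNs if an entire row happens to be masked.
    scores = np.where(mask, scores, -1e9)
    # Subtracting the row max prevents overflow in exp().
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Causal mask for a length-4 sequence: position i attends only to j <= i.
seq_len = 4
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
logits = np.random.randn(seq_len, seq_len) * 30  # deliberately large logits
weights = masked_softmax(logits, mask)
```

Without the row-max subtraction, logits this large would overflow `exp()`; with it, each row still sums to 1 and future positions receive near-zero weight.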

## Example use cases

* Implementing a custom multi-head attention block with residual and pre/post-layernorm variants
* Replacing absolute positional embeddings with RoPE and validating long-context stability
* Adding ALiBi to improve extrapolation on longer sequences and confirming it meets validation rules
* Diagnosing attention collapse where softmax outputs concentrate and suggesting numerical and architectural remedies
* Tuning transformer layer widths, head counts, and initialization to meet memory and performance targets
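For the first use case, a minimal multi-head self-attention forward pass looks like the following. This is a hedged NumPy sketch for orientation only: it omits masking, dropout, residuals, and layernorm, and the weight shapes and head-splitting convention are assumptions, not the validated pattern from `references/patterns.md`.

```python
import numpy as np

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Minimal multi-head self-attention forward pass (no mask, no dropout).

    x: (seq_len, d_model); all projection weights: (d_model, d_model).
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # d_model must divide evenly by num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # scaled dot product
    scores -= scores.max(axis=-1, keepdims=True)         # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v                                    # (heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)  # merge heads
    return out @ w_o

rng = np.random.default_rng(0)
d_model, seq_len, heads = 16, 5, 4
x = rng.standard_normal((seq_len, d_model))
w = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
y = multi_head_attention(x, *w, num_heads=heads)
```

Note the `1/sqrt(d_head)` scaling: without it, dot products grow with head dimension and push the softmax into a near-one-hot regime.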

## FAQ

### How do I choose between RoPE and ALiBi?

RoPE preserves relative-phase information useful for autoregressive models; ALiBi biases attention by distance and can improve extrapolation. Evaluate both on representative long-context tasks and follow validation checks for stability.
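To make the ALiBi side concrete, the distance bias added to attention logits can be constructed as below. This is a sketch under the assumption of a power-of-two head count, where per-head slopes follow the geometric schedule `2^(-8i/n)`; verify against the validation rules before relying on it.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Per-head linear distance penalties added to attention logits.

    Assumes num_heads is a power of two, giving slopes
    2^(-8/n), 2^(-16/n), ..., one per head.
    """
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    # distance[i, j] = how far key j lies behind query i (0 on the diagonal)
    distance = pos[:, None] - pos[None, :]
    distance = np.where(distance > 0, distance, 0)  # causal: only look back
    return -slopes[:, None, None] * distance        # (heads, seq, seq)

bias = alibi_bias(seq_len=6, num_heads=8)
```

Because the penalty grows linearly with distance, heads with steep slopes stay local while shallow-slope heads can still attend far back, which is what gives ALiBi its length-extrapolation behavior.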

### What causes attention collapse and how do I fix it?

Common causes are poor initialization, extreme head dimension ratios, or unstable softmax numerics. Apply validated initialization, check head/dim ratios, add numerical stabilization to softmax, and validate with the provided edge diagnostics.
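One practical way to detect collapse is to monitor the entropy of the attention rows: healthy attention stays well above zero, while collapsed attention concentrates on one key and drives entropy toward zero. The threshold of "near zero" below is an illustrative assumption, not a value from the reference diagnostics.

```python
import numpy as np

def attention_entropy(weights, eps=1e-9):
    """Mean row entropy of attention weights; values near 0 signal collapse."""
    return float(-(weights * np.log(weights + eps)).sum(axis=-1).mean())

seq_len = 8
# Healthy baseline: uniform attention has entropy log(seq_len).
uniform = np.full((seq_len, seq_len), 1.0 / seq_len)
# Collapsed: every query attends almost entirely to a single position.
collapsed = np.full((seq_len, seq_len), 1e-4)
np.fill_diagonal(collapsed, 1.0 - 1e-4 * (seq_len - 1))

healthy = attention_entropy(uniform)    # close to log(8) ~= 2.08
degraded = attention_entropy(collapsed) # close to 0
```

Logging this statistic per head during training makes it easy to see which heads degrade and whether an initialization or softmax fix restored them.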