architecture-review skill

/59-architecture-decision/architecture-review

This skill helps teams validate architectural decisions early by facilitating structured reviews, RFCs, and ARB processes before implementation.

npx playbooks add skill amnadtaowsoam/cerebraskills --skill architecture-review

---
name: Architecture Review
description: Structured process for reviewing and validating architectural decisions before implementation to catch issues early.
---

# Architecture Review

## Overview

Architecture Review is a formal process in which peers and stakeholders evaluate proposed architectural changes before implementation. It catches design flaws early, when they are still cheap to fix.

**Core Principle**: "Review architecture before writing code. Catching design flaws early saves months of rework."

---

## 1. When to Conduct Architecture Review

### Required Reviews
- New system or major component
- Significant technology change
- Cross-team integration
- Security-sensitive features
- Performance-critical paths
- Data model changes
- API design

### Optional Reviews
- Minor feature additions
- Bug fixes
- Refactoring (unless major)
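
The two lists above can be encoded as a small triage helper. This is a minimal sketch, not part of the skill itself: the `ChangeKind` names, the `ProposedChange` shape, and the `major` flag are hypothetical labels chosen to mirror the bullets.

```typescript
// Hypothetical change categories mirroring the "Required" and "Optional" lists.
type ChangeKind =
  | "new-system"
  | "technology-change"
  | "cross-team-integration"
  | "security-sensitive"
  | "performance-critical"
  | "data-model-change"
  | "api-design"
  | "minor-feature"
  | "bug-fix"
  | "refactoring";

const REQUIRED_KINDS: ReadonlySet<ChangeKind> = new Set<ChangeKind>([
  "new-system",
  "technology-change",
  "cross-team-integration",
  "security-sensitive",
  "performance-critical",
  "data-model-change",
  "api-design",
]);

interface ProposedChange {
  kind: ChangeKind;
  // Assumption: only consulted for refactoring, per "Refactoring (unless major)".
  major?: boolean;
}

function reviewLevel(change: ProposedChange): "required" | "optional" {
  if (REQUIRED_KINDS.has(change.kind)) return "required";
  if (change.kind === "refactoring" && change.major === true) return "required";
  return "optional";
}
```

For example, `reviewLevel({ kind: "refactoring", major: true })` escalates a major refactor to a required review, while a plain bug fix stays optional.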

---

## 2. Architecture Review Board (ARB)

### Composition
```markdown
## Architecture Review Board

**Core Members** (always present):
- Principal Engineer (chair)
- Tech Lead from affected team
- Security Engineer
- Platform Engineer

**Optional Members** (as needed):
- Product Manager (for business context)
- DBA (for database changes)
- DevOps Engineer (for infrastructure)
- External Architect (for major decisions)

**Quorum**: Minimum 3 core members
```
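
The quorum rule can be checked mechanically before a meeting starts. The sketch below assumes attendees are recorded by role string; the role identifiers and the `hasQuorum` name are illustrative, not taken from any tool.

```typescript
// Core roles from the composition above; the string identifiers are assumptions.
type CoreRole =
  | "principal-engineer"
  | "tech-lead"
  | "security-engineer"
  | "platform-engineer";

const CORE_ROLES: readonly CoreRole[] = [
  "principal-engineer",
  "tech-lead",
  "security-engineer",
  "platform-engineer",
];

// "Quorum: Minimum 3 core members"
const QUORUM = 3;

function hasQuorum(attendeeRoles: readonly string[]): boolean {
  const present = new Set(attendeeRoles);
  // Count distinct core roles present; optional members do not count toward quorum.
  const corePresent = CORE_ROLES.filter((role) => present.has(role)).length;
  return corePresent >= QUORUM;
}
```

Counting distinct roles means two tech leads attending together still count as one core seat, which is one possible reading of the rule.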

---

## 3. Architecture Review Process

```mermaid
graph TD
    A[Submit RFC] --> B[Initial Review]
    B --> C{Complete?}
    C -->|No| D[Request Changes]
    D --> A
    C -->|Yes| E[Schedule Review]
    E --> F[Review Meeting]
    F --> G{Approved?}
    G -->|Yes| H[Implement]
    G -->|No| I[Revise]
    I --> A
    G -->|Conditional| J[Address Concerns]
    J --> F
```
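
The flow in the diagram can also be expressed as a transition table, which is handy if RFC status is tracked in a tool. This is a sketch under the assumption that the two decision nodes are folded into the steps that precede them; the step names are hypothetical.

```typescript
// Steps from the diagram; decision nodes are folded into the preceding step.
type RfcStep =
  | "submit-rfc"
  | "initial-review"
  | "request-changes"
  | "schedule-review"
  | "review-meeting"
  | "implement"
  | "revise"
  | "address-concerns";

// Each entry lists the steps reachable from the key, following the diagram's edges.
const TRANSITIONS: Record<RfcStep, readonly RfcStep[]> = {
  "submit-rfc": ["initial-review"],
  "initial-review": ["request-changes", "schedule-review"],
  "request-changes": ["submit-rfc"],
  "schedule-review": ["review-meeting"],
  "review-meeting": ["implement", "revise", "address-concerns"],
  "revise": ["submit-rfc"],
  "address-concerns": ["review-meeting"],
  "implement": [], // terminal step
};

function canMove(from: RfcStep, to: RfcStep): boolean {
  return TRANSITIONS[from].includes(to);
}

function advance(from: RfcStep, to: RfcStep): RfcStep {
  if (!canMove(from, to)) {
    throw new Error(`Invalid transition: ${from} -> ${to}`);
  }
  return to;
}
```

Note that a conditional outcome loops back to the review meeting rather than to implementation, so `canMove("address-concerns", "implement")` is false.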

---

## 4. Request for Comments (RFC) Template

```markdown
# RFC-042: Migrate to Event-Driven Architecture

## Metadata
- **Author**: @alice
- **Status**: Under Review
- **Created**: 2024-01-15
- **Review Date**: 2024-01-22

## Summary
Migrate order processing from synchronous API calls to event-driven architecture using Kafka.

## Problem Statement
Current synchronous architecture causes:
- Tight coupling between services
- Cascading failures when services are down
- Difficulty scaling individual components
- Long request timeouts (>5s)

## Proposed Solution

### Architecture Diagram
~~~
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Orders    │────▶│    Kafka    │────▶│  Inventory  │
│   Service   │     │   (Events)  │     │   Service   │
└─────────────┘     └─────────────┘     └─────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │  Shipping   │
                    │   Service   │
                    └─────────────┘
~~~

### Key Components
1. **Event Bus**: Kafka cluster (3 brokers)
2. **Event Schema**: Avro schemas in Schema Registry
3. **Event Types**: 
   - `order.created`
   - `order.paid`
   - `order.shipped`

### Implementation Plan
- **Phase 1** (2 weeks): Setup Kafka cluster
- **Phase 2** (3 weeks): Migrate order creation
- **Phase 3** (2 weeks): Migrate payment processing
- **Phase 4** (1 week): Decommission old sync APIs

## Alternatives Considered

### Alternative 1: Keep Synchronous
**Pros**: Simpler, no new infrastructure
**Cons**: Doesn't solve coupling/scaling issues
**Why rejected**: Doesn't address core problems

### Alternative 2: Use RabbitMQ
**Pros**: Simpler than Kafka
**Cons**: Lower throughput, smaller ecosystem
**Why rejected**: Need Kafka's throughput for future scale

## Trade-offs

### Pros
- ✅ Loose coupling between services
- ✅ Better fault tolerance
- ✅ Easier to scale
- ✅ Event log provides an audit trail

### Cons
- ❌ Increased complexity (new infrastructure)
- ❌ Eventual consistency (not immediate)
- ❌ Debugging distributed events is harder
- ❌ Team needs to learn Kafka

## Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Kafka cluster failure | Low | High | Multi-AZ deployment, monitoring |
| Event ordering issues | Medium | Medium | Use partition keys correctly |
| Schema evolution breaks consumers | Medium | High | Use Schema Registry, versioning |
| Team learning curve | High | Medium | Training, documentation, pair programming |

## Performance Impact
- **Current**: 200ms average latency (synchronous)
- **Expected**: 50ms publish + eventual processing
- **Throughput**: 10K events/sec (vs 1K requests/sec currently)

## Security Considerations
- Kafka ACLs for topic access control
- TLS encryption for data in transit
- Event payload encryption for sensitive data

## Operational Impact
- **New Infrastructure**: Kafka cluster (3 brokers, 3 ZooKeeper nodes)
- **Monitoring**: Kafka metrics, consumer lag
- **Cost**: ~$500/month for managed Kafka

## Success Metrics
- Order processing latency < 100ms (p95)
- Zero data loss
- 99.9% event delivery success
- Service independence (can deploy without coordination)

## Open Questions
1. How to handle failed event processing? (DLQ strategy)
2. Should we use exactly-once semantics? (adds complexity)
3. Event retention period? (Propose 7 days)

## Appendix
- [Kafka Setup Guide](link)
- [Event Schema Definitions](link)
- [POC Results](link)
```
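
The "Complete?" gate in the initial review could be partly automated by checking that an RFC draft contains the template's sections. This is a rough sketch: which sections count as required is an assumption, and `missingSections` is a hypothetical helper that only looks at lines starting with `## `.

```typescript
// Section headings taken from the template above; treating all of them as
// required is an assumption, not something the template states.
const REQUIRED_SECTIONS = [
  "Summary",
  "Problem Statement",
  "Proposed Solution",
  "Alternatives Considered",
  "Trade-offs",
  "Risks & Mitigations",
  "Security Considerations",
  "Operational Impact",
  "Success Metrics",
] as const;

function missingSections(rfcMarkdown: string): string[] {
  // Collect level-2 headings, normalized to lower case for comparison.
  const headings = new Set(
    rfcMarkdown
      .split("\n")
      .filter((line) => line.startsWith("## "))
      .map((line) => line.slice(3).trim().toLowerCase()),
  );
  return REQUIRED_SECTIONS.filter((s) => !headings.has(s.toLowerCase()));
}
```

An empty result would mean the draft is structurally complete; a reviewer still has to judge whether each section says enough. The check does not skip headings that appear inside fenced code blocks, so a real tool would likely use a Markdown parser instead.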

---

## 5. Review Meeting Agenda

```markdown
## Architecture Review Meeting Agenda (60 min)

**Attendees**: ARB members + RFC author

### 1. Introduction (5 min)
- Author presents problem and solution summary
- Q&A on context

### 2. Deep Dive (25 min)
- Architecture walkthrough
- Trade-offs discussion
- Alternatives review

### 3. Concerns & Questions (20 min)
- Security review
- Performance review
- Operational review
- Cost review

### 4. Decision (10 min)
- Vote: Approve / Conditional Approve / Reject
- Document action items
- Set follow-up date (if conditional)

### Decision Criteria
- ✅ Solves the stated problem
- ✅ Trade-offs acceptable
- ✅ Risks identified and mitigated
- ✅ Implementation plan realistic
- ✅ Team has capacity
```

---

## 6. Review Checklist

```markdown
## Architecture Review Checklist

### Functional Requirements
- [ ] Solves the stated problem
- [ ] Meets performance requirements
- [ ] Handles edge cases
- [ ] Scalable to expected load

### Non-Functional Requirements
- [ ] Security reviewed
- [ ] Monitoring plan defined
- [ ] Disaster recovery plan
- [ ] Cost analyzed

### Design Quality
- [ ] Follows company standards
- [ ] Appropriate complexity (not over-engineered)
- [ ] Clear interfaces/contracts
- [ ] Testable design

### Implementation
- [ ] Realistic timeline
- [ ] Team has necessary skills
- [ ] Dependencies identified
- [ ] Rollback plan exists

### Documentation
- [ ] Architecture diagrams clear
- [ ] Trade-offs documented
- [ ] Risks identified
- [ ] ADR will be created
```

---

## 7. Review Outcomes

### Approved
```markdown
**Status**: ✅ Approved

**Decision**: Proceed with implementation as proposed.

**Action Items**: None

**Next Steps**:
1. Create ADR-042
2. Begin Phase 1 implementation
3. Report progress to the ARB in 2 weeks
```

### Conditional Approval
```markdown
**Status**: ⚠️ Conditionally Approved

**Decision**: Approved with the following conditions:

**Required Changes**:
1. Add DLQ (Dead Letter Queue) strategy
2. Define event retention policy
3. Create runbook for Kafka failures

**Action Items**:
- @alice: Update RFC with DLQ strategy by Jan 20
- @bob: Review security implications of event encryption

**Next Review**: Jan 25 (conditional approval review)
```

### Rejected
```markdown
**Status**: ❌ Rejected

**Reason**: Complexity doesn't justify benefits for current scale.

**Recommendation**: 
Start with an async job queue (Sidekiq/Bull) instead of full Kafka.
Revisit when we reach 10K orders/day.

**Next Steps**:
- Author to create new RFC for job queue approach
```
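
If outcomes are recorded in a tracker, the three templates above map naturally onto a discriminated union. The field names below are assumptions derived from the template headings, and `canStartImplementation` reflects the process diagram, where only an approval leads to implementation.

```typescript
interface ActionItem {
  owner: string; // e.g. "@alice"
  task: string;
  due?: string;  // free-form date, as in the templates
}

// One variant per outcome template; fields mirror the template headings.
type ReviewOutcome =
  | { status: "approved"; nextSteps: string[] }
  | {
      status: "conditional";
      requiredChanges: string[];
      actionItems: ActionItem[];
      nextReview: string;
    }
  | { status: "rejected"; reason: string; recommendation: string };

// Conditional approvals return to the review meeting first.
function canStartImplementation(outcome: ReviewOutcome): boolean {
  return outcome.status === "approved";
}

function describeOutcome(outcome: ReviewOutcome): string {
  switch (outcome.status) {
    case "approved":
      return `Approved: ${outcome.nextSteps.length} next step(s)`;
    case "conditional":
      return `Conditionally approved: ${outcome.requiredChanges.length} required change(s), next review ${outcome.nextReview}`;
    case "rejected":
      return `Rejected: ${outcome.reason}`;
  }
}
```

Because the `switch` covers every `status`, the compiler flags any outcome type added later without a matching case.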

---

## 8. Lightweight Review (for smaller changes)

```markdown
## Lightweight Architecture Review

**For**: Minor architectural changes

**Process**:
1. Author creates brief design doc (1-2 pages)
2. Share in #architecture Slack channel
3. Async review (48 hours)
4. If no objections, approved
5. If concerns, schedule meeting

**Example**:
> **Proposal**: Add Redis cache for product catalog
> **Rationale**: Reduce DB load, improve latency
> **Impact**: Low (isolated change)
> **Cost**: $50/month
> 
> Any concerns? Will implement Friday if no objections.
```
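
The silence-means-consent rule in the lightweight path can be sketched as a small status function, for instance for a chat bot that tracks proposals. The `LightweightProposal` shape and status names are hypothetical; the 48-hour window comes from the process above.

```typescript
interface LightweightProposal {
  postedAt: Date;
  objections: string[]; // concerns raised during the async review
}

type LightweightStatus = "pending" | "approved" | "needs-meeting";

const REVIEW_WINDOW_HOURS = 48;
const MS_PER_HOUR = 60 * 60 * 1000;

function lightweightStatus(
  proposal: LightweightProposal,
  now: Date,
): LightweightStatus {
  // Any concern escalates to a meeting, regardless of elapsed time.
  if (proposal.objections.length > 0) return "needs-meeting";
  const elapsedHours =
    (now.getTime() - proposal.postedAt.getTime()) / MS_PER_HOUR;
  // No objections once the window has closed counts as approval.
  return elapsedHours >= REVIEW_WINDOW_HOURS ? "approved" : "pending";
}
```

Passing `now` explicitly keeps the function deterministic and easy to test.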

---

## 9. Post-Review Follow-up

```markdown
## Post-Review Actions

### Immediately After Review
- [ ] Update RFC with decision
- [ ] Create ADR if approved
- [ ] Schedule follow-up (if conditional)
- [ ] Communicate decision to stakeholders

### During Implementation
- [ ] Weekly updates to ARB
- [ ] Flag deviations from approved design
- [ ] Document lessons learned

### After Implementation
- [ ] Retrospective on architecture decision
- [ ] Update RFC with actual vs planned
- [ ] Share learnings with team
```

---

## 10. Architecture Review Metrics

```typescript
interface ReviewMetrics {
  totalReviews: number;
  approvalRate: number;
  avgReviewTime: number;  // days from submission to decision
  conditionalApprovals: number;
  rejections: number;
  majorIssuesFound: number;  // issues caught before implementation
}

// Track effectiveness. The hour figures are illustrative estimates.
function calculateReviewROI(metrics: ReviewMetrics) {
  const avgCostToFixInProduction = 40;  // hours per issue
  const avgCostOfReview = 4;  // hours per review

  const saved = metrics.majorIssuesFound * avgCostToFixInProduction;
  const spent = metrics.totalReviews * avgCostOfReview;

  return {
    hoursSaved: saved - spent,
    // Net gain relative to cost, as a percentage; 0 if no reviews were held
    roi: spent === 0 ? 0 : ((saved - spent) / spent) * 100
  };
}
```
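
The metric values themselves have to come from somewhere. One possible approach, sketched here, derives them from a log of completed reviews. `ReviewRecord` and `summarizeReviews` are hypothetical names, and counting only unconditional approvals toward `approvalRate` (as a fraction between 0 and 1) is an assumption.

```typescript
type Decision = "approved" | "conditional" | "rejected";

// Hypothetical log entry for one completed review.
interface ReviewRecord {
  submittedAt: Date;
  decidedAt: Date;
  decision: Decision;
  majorIssuesFound: number;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function summarizeReviews(records: readonly ReviewRecord[]) {
  const total = records.length;
  const count = (d: Decision) =>
    records.filter((r) => r.decision === d).length;
  // Sum of days from submission to decision across all reviews.
  const totalDays = records.reduce(
    (sum, r) => sum + (r.decidedAt.getTime() - r.submittedAt.getTime()) / MS_PER_DAY,
    0,
  );
  return {
    totalReviews: total,
    approvalRate: total === 0 ? 0 : count("approved") / total,
    avgReviewTime: total === 0 ? 0 : totalDays / total,
    conditionalApprovals: count("conditional"),
    rejections: count("rejected"),
    majorIssuesFound: records.reduce((sum, r) => sum + r.majorIssuesFound, 0),
  };
}
```

The returned object uses the same field names as the `ReviewMetrics` interface, so it can be passed wherever that shape is expected.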

---

## 11. Review Process Setup Checklist

- [ ] **Process Defined**: Review process documented?
- [ ] **ARB Established**: Review board members identified?
- [ ] **RFC Template**: Template available for proposals?
- [ ] **Review Criteria**: Clear approval criteria?
- [ ] **Meeting Cadence**: Regular review meetings scheduled?
- [ ] **Lightweight Path**: Fast track for minor changes?
- [ ] **Follow-up Process**: Post-review tracking in place?
- [ ] **Metrics**: Tracking review effectiveness?

---

## Related Skills
* `59-architecture-decision/adr-templates`
* `59-architecture-decision/tech-stack-selection`
* `59-architecture-decision/tradeoff-analysis`
* `00-meta-skills/decision-making`

Overview

This skill defines a structured process for reviewing and validating architectural decisions before implementation to catch issues early. It formalizes who should participate, what to include in a Request for Comments (RFC), meeting agendas, and decision outcomes. The goal is to reduce rework, technical debt, and operational surprises by surfacing risks and trade-offs up front.

How this skill works

Authors submit an RFC that outlines the problem, the proposed design, alternatives, risks, and an implementation plan. An Architecture Review Board (core and optional members) runs an initial completeness check, schedules a review meeting once the RFC is complete, and votes to Approve, Conditionally Approve, or Reject. Post-review actions include creating an ADR, tracking action items, and monitoring implementation against the approved design.

When to use it

  • Designing a new system or major component
  • Significant technology or platform changes
  • Cross-team integrations or public APIs
  • Security-sensitive or performance-critical features
  • Data model or schema changes affecting many services

Best practices

  • Use a concise RFC template covering summary, diagram, trade-offs, risks, and implementation plan
  • Include core ARB members: principal engineer (chair), tech lead, security and platform engineers
  • Run a lightweight async review path for small, low-risk changes to avoid overhead
  • Record explicit decision criteria and action items; create an ADR for approved designs
  • Track review metrics (approval rate, avg review time, issues caught) to improve the process

Example use cases

  • Migrating synchronous order processing to an event-driven architecture (Kafka) with trade-offs and mitigation plans
  • Adding a global cache layer for product catalog after lightweight review and cost analysis
  • Replacing a database engine or changing schema that affects multiple services
  • Introducing a new authentication mechanism with security and operational runbooks
  • Designing an autoscaling strategy for a performance-critical API

FAQ

Who must attend an architecture review?

Core attendance includes the principal engineer (chair), affected team tech lead, security engineer, and platform engineer; optional members join as needed.

What if a proposal is time-sensitive?

Use the lightweight review path for low-risk changes or request an expedited meeting; document conditional approvals and required follow-ups.