
llm-architect-skill

/llm-architect-skill

This skill designs scalable LLM architectures, selects models, and optimizes deployment and safety for production-grade, cost-efficient AI systems.

npx playbooks add skill 404kidwiz/claude-supercode-skills --skill llm-architect-skill

Review the files below or copy the command above to add this skill to your agents.

Files (12)
SKILL.md
---
name: llm-architect
description: Use when the user needs LLM system architecture, model deployment, optimization strategies, or production serving infrastructure. Designs scalable large language model applications with a focus on performance, cost efficiency, and safety.
---

# LLM Architect

## Purpose

Provides expert large language model system architecture for designing, deploying, and optimizing LLM applications at scale. Specializes in model selection, RAG (Retrieval Augmented Generation) pipelines, fine-tuning strategies, serving infrastructure, cost optimization, and safety guardrails for production LLM systems.

## When to Use

- Designing end-to-end LLM systems from requirements to production
- Selecting models and serving infrastructure for specific use cases
- Implementing RAG (Retrieval Augmented Generation) pipelines
- Optimizing LLM costs while maintaining quality thresholds
- Building safety guardrails and compliance mechanisms
- Planning fine-tuning vs RAG vs prompt engineering strategies
- Scaling LLM inference for high-throughput applications

## Quick Start

**Invoke this skill when:**
- Designing end-to-end LLM systems from requirements to production
- Selecting models and serving infrastructure for specific use cases
- Implementing RAG (Retrieval Augmented Generation) pipelines
- Optimizing LLM costs while maintaining quality thresholds
- Building safety guardrails and compliance mechanisms

**Do NOT invoke when:**
- Simple API integration exists (use backend-developer instead)
- Only prompt engineering needed without architecture decisions
- Training foundation models from scratch (almost always the wrong approach)
- Generic ML tasks unrelated to language models (use ml-engineer)

## Decision Framework

### Model Selection Quick Guide

| Requirement | Recommended Approach |
|-------------|---------------------|
| Latency <100ms | Small fine-tuned model (7B quantized) |
| Latency <2s, budget unlimited | Claude 3 Opus / GPT-4 |
| Latency <2s, domain-specific | Claude 3 Sonnet fine-tuned |
| Latency <2s, cost-sensitive | Claude 3 Haiku |
| Batch/async acceptable | Batch API, cheapest tier |

### RAG vs Fine-Tuning Decision Tree

```
Need to customize LLM behavior?
│
├─ Need domain-specific knowledge?
│  ├─ Knowledge changes frequently?
│  │  └─ RAG (Retrieval Augmented Generation)
│  └─ Knowledge is static?
│     └─ Fine-tuning OR RAG (test both)
│
├─ Need specific output format/style?
│  ├─ Can describe in prompt?
│  │  └─ Prompt engineering (try first)
│  └─ Format too complex for prompt?
│     └─ Fine-tuning
│
└─ Need latency <100ms?
   └─ Fine-tuned small model (7B-13B)
```

### Architecture Pattern

```
[Client] → [API Gateway + Rate Limiting]
              ↓
         [Request Router]
          (Route by intent/complexity)
              ↓
    ┌────────┴────────┐
    ↓                 ↓
[Fast Model]    [Powerful Model]
(Haiku/Small)   (Sonnet/Large)
    ↓                 ↓
[Cache Layer] ← [Response Aggregator]
    ↓
[Logging & Monitoring]
    ↓
[Response to Client]
```
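
Below is a minimal sketch of the routing layer in this diagram. It assumes a hypothetical `call_model(model, prompt)` client (whatever SDK or gateway you use) and a deliberately crude complexity heuristic; the model names mirror the Haiku/Sonnet split shown above.

```python
from typing import Callable

# Placeholder model identifiers mirroring the diagram's fast/powerful split.
FAST_MODEL = "claude-3-haiku"       # cheap, low-latency tier for simple queries
POWERFUL_MODEL = "claude-3-sonnet"  # stronger tier for complex queries

def is_simple(prompt: str) -> bool:
    """Crude complexity heuristic; swap in a real intent classifier in production."""
    return len(prompt) < 500

def route_request(prompt: str, call_model: Callable[[str, str], str]) -> str:
    """Route by estimated complexity: fast model when possible, powerful model otherwise."""
    model = FAST_MODEL if is_simple(prompt) else POWERFUL_MODEL
    return call_model(model, prompt)  # call_model is an assumed placeholder, not a real API
```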

## Core Workflow: Design LLM System

### 1. Requirements Gathering

Ask these questions:
- **Latency**: What's the P95 response time requirement?
- **Scale**: Expected requests/day and growth trajectory?
- **Accuracy**: What's the minimum acceptable quality? (measurable metric)
- **Cost**: Budget constraints? ($/request or $/month)
- **Data**: Existing datasets for evaluation? Sensitivity level?
- **Compliance**: Regulatory requirements? (HIPAA, GDPR, SOC2, etc.)

### 2. Model Selection
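
The selection heuristic below reads a handful of requirement fields. A minimal, illustrative container for those fields (the names are an assumption, not a fixed schema) might look like:

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    """Illustrative holder for the requirements gathered in step 1."""
    latency_p95: int                  # milliseconds
    task_complexity: str = "simple"   # "simple" or "complex"
    budget: str = "constrained"       # "constrained" or "unlimited"
    domain_specific: bool = False
    accuracy_critical: bool = False

# Example: select_model(Requirements(latency_p95=1500, domain_specific=True))
# returns "claude-3-sonnet-finetuned" under the heuristic below.
```

With a requirements object in hand, the selection logic: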

```python
def select_model(requirements):
    if requirements.latency_p95 < 100:  # milliseconds
        if requirements.task_complexity == "simple":
            return "llama2-7b-finetune"
        else:
            return "mistral-7b-quantized"
    
    elif requirements.latency_p95 < 2000:
        if requirements.budget == "unlimited":
            return "claude-3-opus"
        elif requirements.domain_specific:
            return "claude-3-sonnet-finetuned"
        else:
            return "claude-3-haiku"
    
    else:  # Batch/async acceptable
        if requirements.accuracy_critical:
            return "gpt-4-with-ensemble"
        else:
            return "batch-api-cheapest-tier"
```

### 3. Prototype & Evaluate

```bash
# Run benchmark on eval dataset
python scripts/evaluate_model.py \
  --model claude-3-sonnet \
  --dataset data/eval_1000_examples.jsonl \
  --metrics accuracy,latency,cost

# Expected output:
# Accuracy: 94.3%
# P95 Latency: 1,245ms
# Cost per 1K requests: $2.15
```
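
The skill's `scripts/evaluate_model.py` handles this step; as a rough sketch of what such a loop could look like (assuming a JSONL file with `prompt` and `expected` fields, a placeholder `call_model` function, and exact-match accuracy as a stand-in for your real quality metric):

```python
import json
import time
from statistics import quantiles

def evaluate(dataset_path: str, call_model, usd_per_request: float) -> dict:
    """Toy benchmark loop: exact-match accuracy, P95 latency, and cost per 1K requests."""
    examples = [json.loads(line) for line in open(dataset_path)]
    correct, latencies_ms = 0, []
    for ex in examples:
        start = time.perf_counter()
        answer = call_model(ex["prompt"])                      # placeholder LLM call
        latencies_ms.append((time.perf_counter() - start) * 1000)
        correct += int(answer.strip() == ex["expected"].strip())
    return {
        "accuracy": correct / len(examples),
        "p95_latency_ms": quantiles(latencies_ms, n=20)[-1],   # 95th percentile
        "cost_per_1k_requests": usd_per_request * 1000,
    }
```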

### 4. Iteration Checklist

- [ ] Latency P95 meets requirement? If no → optimize serving (quantization, caching)
- [ ] Accuracy meets threshold? If no → improve prompts, fine-tune, or upgrade model
- [ ] Cost within budget? If no → aggressive caching, smaller model routing, batching
- [ ] Safety guardrails tested? If no → add content filters, PII detection
- [ ] Monitoring dashboards live? If no → set up Prometheus + Grafana
- [ ] Runbook documented? If no → document common failures and fixes

## Cost Optimization Strategies

| Strategy | Savings | When to Use |
|----------|---------|-------------|
| Semantic caching | 40-80% | 60%+ similar queries |
| Multi-model routing | 30-50% | Mixed complexity queries |
| Prompt compression | 10-20% | Long context inputs |
| Batching | 20-40% | Async-tolerant workloads |
| Smaller model cascade | 40-60% | Simple queries first |
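
Semantic caching is covered in more depth in REFERENCE.md; a rough sketch of the core idea, assuming query embeddings are computed elsewhere and unit-normalized so cosine similarity reduces to a dot product:

```python
import numpy as np

class SemanticCache:
    """Toy semantic cache: reuse a stored response when a new query's embedding is
    close enough to one already answered. Production systems typically back this
    with a vector index (FAISS, pgvector, and similar)."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (unit-normalized embedding, response)

    def get(self, query_embedding: np.ndarray) -> str | None:
        for embedding, response in self.entries:
            if float(np.dot(embedding, query_embedding)) >= self.threshold:
                return response
        return None

    def put(self, query_embedding: np.ndarray, response: str) -> None:
        self.entries.append((query_embedding, response))
```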

## Safety Checklist

- [ ] Content filtering tested against adversarial examples
- [ ] PII detection and redaction validated
- [ ] Prompt injection defenses in place
- [ ] Output validation rules implemented
- [ ] Audit logging configured for all requests
- [ ] Compliance requirements documented and validated
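
As one small example of the layered checks this list implies, a regex-based PII redaction pass run before requests are logged or cached (illustrative patterns only, not a complete detector; real deployments should use a dedicated PII/NER service):

```python
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace matched PII spans with typed placeholders before logging or caching."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```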

## Red Flags - When to Escalate

| Observation | Action |
|-------------|--------|
| Accuracy <80% after prompt iteration | Consider fine-tuning |
| Latency >2x requirement | Review infrastructure |
| Cost >2x budget | Aggressive caching/routing |
| Hallucination rate >5% | Add RAG or stronger guardrails |
| Safety bypass detected | Immediate security review |

## Quick Reference: Performance Targets

| Metric | Target | Critical |
|--------|--------|----------|
| P95 Latency | <2x requirement | <3x requirement |
| Accuracy | >90% | >80% |
| Cache Hit Rate | >60% | >40% |
| Error Rate | <1% | <5% |
| Cost/1K requests | Within budget | <150% budget |

## Additional Resources

- **Detailed Technical Reference**: See [REFERENCE.md](REFERENCE.md)
  - RAG implementation workflow
  - Semantic caching patterns
  - Deployment configurations
  
- **Code Examples & Patterns**: See [EXAMPLES.md](EXAMPLES.md)
  - Anti-patterns (fine-tuning when prompting suffices, no fallback)
  - Quality checklist for LLM systems
  - Resilient LLM call patterns
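
As a taste of the resilient-call patterns EXAMPLES.md covers, here is a bare-bones retry-then-fallback sketch, again assuming a hypothetical `call_model(model, prompt)` client:

```python
import time

def resilient_call(prompt: str, call_model, retries: int = 2,
                   primary: str = "claude-3-sonnet", fallback: str = "claude-3-haiku") -> str:
    """Retry the primary model on transient failures, then degrade to a cheaper fallback."""
    for attempt in range(retries):
        try:
            return call_model(primary, prompt)
        except Exception:              # in practice, catch only transient/retryable errors
            time.sleep(2 ** attempt)   # simple exponential backoff
    return call_model(fallback, prompt)
```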

Overview

This skill provides expert guidance for designing, deploying, and optimizing large language model systems in production. It focuses on model selection, RAG pipelines, serving infrastructure, cost optimization, and safety guardrails. Use it to turn requirements into scalable architectures that balance performance, cost, and compliance.

How this skill works

I inspect system requirements (latency, scale, accuracy, cost, data sensitivity) and map them to architectural patterns, model choices, and deployment strategies. I evaluate trade-offs between RAG, fine-tuning, and prompt engineering, then recommend routing, caching, and monitoring approaches. I also provide checklists for prototyping, benchmarking, cost reduction, and safety compliance.

When to use it

  • Designing an end-to-end LLM system from requirements to production
  • Choosing models and serving infrastructure for specific latency and cost targets
  • Implementing or improving a RAG pipeline and retrieval stack
  • Optimizing inference cost and throughput for high-volume applications
  • Defining safety guardrails, PII handling, and compliance mechanisms
  • Deciding between fine-tuning, RAG, and prompt engineering strategies

Best practices

  • Start with measurable requirements: P95 latency, accuracy targets, and cost per request
  • Prototype quickly and benchmark on representative datasets before committing to a model
  • Use multi-model routing with a fast small model for simple queries and a powerful model for complex ones
  • Apply semantic caching and batching where possible to reduce requests and cost
  • Implement layered safety: content filters, PII detection, prompt injection defenses, and audit logs
  • Monitor P95 latency, cache hit rate, error rate, and hallucination metrics; maintain runbooks for common failures

Example use cases

  • Customer support assistant: route simple FAQs to a cheap model and complex tickets to a powerful model with RAG for up-to-date docs
  • Legal document assistant: RAG over internal corpora with strict PII redaction and audit logging for compliance
  • Real-time chat with tight latency: deploy a quantized 7B model on optimized inference hardware and use aggressive caching
  • Knowledge base search: semantic retrieval + reranking, with frequent index updates and a fine-tuned fallback for stable facts
  • Batch annotation pipeline: asynchronous batch API with cheapest tier and ensemble checks for high-accuracy labeling

FAQ

When should I choose RAG over fine-tuning?

Choose RAG when knowledge changes frequently or you must avoid repeated retraining. Use fine-tuning if the knowledge is static, output formatting is complex, and latency/consistency requirements favor an embedded model.

How do I reduce inference cost without sacrificing quality?

Combine semantic caching, multi-model routing, prompt compression, and batching. Route simple queries to smaller models, cache common responses, compress long contexts, and batch async workloads to achieve significant savings.
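
As a small illustration of the batching point, a sketch that fans a batch of prompts out with bounded concurrency, assuming an async `call_model` coroutine (provider batch APIs are usually cheaper still):

```python
import asyncio

async def run_batch(prompts: list[str], call_model, concurrency: int = 8) -> list[str]:
    """Fan out prompts with bounded concurrency for async-tolerant workloads."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(prompt: str) -> str:
        async with semaphore:
            return await call_model(prompt)

    return list(await asyncio.gather(*(worker(p) for p in prompts)))
```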