
production-readiness skill


This skill helps you assess whether a service is ready for go-live, using a checklist that covers reliability, observability, security, operations, and documentation.

npx playbooks add skill nik-kale/sre-skills --skill production-readiness

---
name: production-readiness
description: Comprehensive checklist for production deployment readiness covering reliability, observability, security, and operational requirements. Use when preparing for go-live, launch readiness review, production deployment checklist, or assessing if a service is ready for production.
---

# Production Readiness

Systematic checklist to ensure services are ready for production deployment.

## When to Use This Skill

- Preparing a new service for production launch
- Go-live readiness review
- Building a production deployment checklist
- Assessing service maturity
- Pre-launch security review

## Quick Readiness Assessment

Copy and complete this checklist:

```
Production Readiness Assessment:
Service: _______________
Date: _________________
Reviewer: _____________

Reliability:     [ ] Pass  [ ] Partial  [ ] Fail
Observability:   [ ] Pass  [ ] Partial  [ ] Fail
Security:        [ ] Pass  [ ] Partial  [ ] Fail
Operations:      [ ] Pass  [ ] Partial  [ ] Fail
Documentation:   [ ] Pass  [ ] Partial  [ ] Fail

Overall Status:  [ ] Ready  [ ] Conditional  [ ] Not Ready
```
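
The template does not say how the five category ratings roll up into the overall status. One possible rule is sketched below in Python: any Fail means Not Ready, any Partial means Conditional, otherwise Ready. The rule itself and the names `Rating` and `overall_status` are assumptions for illustration, not part of the skill; adjust them to your own review policy.

```python
from enum import Enum


class Rating(Enum):
    PASS = "Pass"
    PARTIAL = "Partial"
    FAIL = "Fail"


def overall_status(ratings: dict[str, Rating]) -> str:
    """Roll category ratings up into an overall status.

    Assumed rule (hypothetical, not defined by the checklist):
    any Fail -> Not Ready, any Partial -> Conditional, else Ready.
    """
    values = list(ratings.values())
    if any(r is Rating.FAIL for r in values):
        return "Not Ready"
    if any(r is Rating.PARTIAL for r in values):
        return "Conditional"
    return "Ready"


assessment = {
    "Reliability": Rating.PASS,
    "Observability": Rating.PARTIAL,
    "Security": Rating.PASS,
    "Operations": Rating.PASS,
    "Documentation": Rating.PASS,
}
print(overall_status(assessment))
```

A stricter policy could treat a Partial in Security as Not Ready; the function is small enough to adapt.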

## Reliability Checklist

### SLOs Defined

```
SLO Checklist:
- [ ] Availability SLO defined (e.g., 99.9%)
- [ ] Latency SLO defined (e.g., p99 < 200ms)
- [ ] Error rate SLO defined (e.g., < 0.1%)
- [ ] SLOs documented and communicated
- [ ] Error budget policy established
```
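
An error budget follows directly from the SLO target: it is the fraction of time or requests that is allowed to fail. The sketch below works through that arithmetic, assuming a 30-day window and a request-based budget. The function names are hypothetical and the window length is an assumption; use whatever your SLO document specifies.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures == 0:
        return 0.0  # a 100% SLO leaves no budget to spend
    return 1 - failed_requests / allowed_failures


# A 99.9% availability SLO over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(99.9), 1))

# 250 failures out of 1,000,000 requests uses about a quarter of a 99.9% budget.
print(round(budget_remaining(99.9, 1_000_000, 250), 2))
```

An error budget policy can then key decisions, such as pausing risky deployments, off the remaining fraction.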

### Fault Tolerance

```
Fault Tolerance:
- [ ] No single points of failure
- [ ] Graceful degradation implemented
- [ ] Circuit breakers for dependencies
- [ ] Retry logic with exponential backoff
- [ ] Timeouts configured for all external calls
- [ ] Rate limiting in place
```
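
As an illustration of the "retry logic with exponential backoff" item, here is a minimal Python sketch. It assumes the operation is safe to retry (idempotent) and adds random jitter, which the checklist does not mention but which is commonly paired with backoff to avoid synchronized retries. The name `retry_with_backoff` and all default values are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on any exception with capped exponential backoff.

    The delay before retry n is drawn uniformly from
    [0, min(max_delay, base_delay * 2**n)], i.e. backoff with full jitter.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

In a real service you would usually retry only on errors known to be transient, and pair this with the per-call timeouts and circuit breakers listed above so that retries do not pile load onto a failing dependency.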

### Capacity

```
Capacity Planning:
- [ ] Load tested to 2x expected peak
- [ ] Auto-scaling configured (if applicable)
- [ ] Resource limits set (CPU, memory)
- [ ] Connection pool sizes appropriate
- [ ] Queue capacity sufficient
```

### Data Resilience

```
Data Protection:
- [ ] Backups configured and scheduled
- [ ] Backup restoration tested
- [ ] Data replication in place
- [ ] RPO/RTO defined and achievable
- [ ] No data loss on service restart
```

## Observability Checklist

### Metrics

```
Metrics:
- [ ] RED metrics exposed (Rate, Errors, Duration)
- [ ] Resource metrics available (CPU, memory, disk)
- [ ] Business metrics tracked
- [ ] Dependency health metrics
- [ ] Custom metrics for key operations
```

### Logging

```
Logging:
- [ ] Structured logging (JSON)
- [ ] Request/trace IDs in all logs
- [ ] Log levels appropriate (no excessive DEBUG)
- [ ] Sensitive data not logged
- [ ] Log retention configured
```
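
The first two logging items can be sketched with Python's standard `logging` module: a formatter that emits one JSON object per line and carries a request ID. The field names, the `JsonFormatter` class, and the `build_logger` helper are assumptions for illustration; many teams use an existing structured-logging library instead.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # request_id is attached per call via `extra`; None if absent
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "service") -> logging.Logger:
    """Return a logger that writes JSON lines at INFO and above."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)  # keeps DEBUG noise out of production logs
    logger.propagate = False
    return logger


logger = build_logger()
logger.info("order created", extra={"request_id": "req-123"})
```

Passing the ID through `extra` on every call is verbose; a `logging.LoggerAdapter` or a context variable is a common way to attach it once per request.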

### Tracing

```
Distributed Tracing:
- [ ] Trace context propagated
- [ ] Spans for external calls
- [ ] Key operations instrumented
- [ ] Sampling rate configured
- [ ] Trace storage/retention set
```

### Alerting

```
Alerts:
- [ ] SLO-based alerts configured
- [ ] Alert thresholds tuned (not noisy)
- [ ] Runbooks linked to alerts
- [ ] Escalation paths defined
- [ ] On-call rotation assigned
```
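
One common way to build SLO-based alerts is to alert on burn rate, meaning how fast the error budget is being consumed relative to the rate that would exactly exhaust it over the SLO window. The sketch below is a simplified two-window check in Python. The function names are hypothetical, and the 14.4 threshold is an assumption taken from commonly cited guidance for a 30-day SLO (it corresponds to spending 2% of the budget in one hour); tune it for your own service.

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    budget_ratio = 1 - slo_percent / 100
    if budget_ratio <= 0:
        raise ValueError("a 100% SLO has no error budget to burn")
    return error_ratio / budget_ratio


def should_page(
    short_window_error_ratio: float,
    long_window_error_ratio: float,
    slo_percent: float,
    threshold: float = 14.4,  # assumed value; see lead-in
) -> bool:
    """Page only if both windows burn fast, which filters out brief spikes."""
    return (
        burn_rate(short_window_error_ratio, slo_percent) >= threshold
        and burn_rate(long_window_error_ratio, slo_percent) >= threshold
    )


# 2% errors in both windows against a 99.9% SLO is roughly a 20x burn: page.
print(should_page(0.02, 0.02, 99.9))

# A short spike with a calm long window does not page.
print(should_page(0.02, 0.005, 99.9))
```

Requiring both windows to exceed the threshold is one way to address the "not noisy" item: the long window confirms the problem is sustained, and the short window confirms it is still happening.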

### Dashboards

```
Dashboards:
- [ ] Service health dashboard exists
- [ ] Key metrics visualized
- [ ] Dashboard accessible to team
- [ ] Dependencies shown
- [ ] Historical data available
```

## Security Checklist

### Authentication & Authorization

```
Auth:
- [ ] Authentication required for all endpoints
- [ ] Authorization checks implemented
- [ ] Service-to-service auth configured
- [ ] No hardcoded credentials
- [ ] Secrets in secret manager
```

### Network Security

```
Network:
- [ ] TLS for all connections
- [ ] Network policies/firewall rules
- [ ] Internal services not publicly exposed
- [ ] Egress traffic controlled
- [ ] DDoS protection (if public)
```

### Data Security

```
Data:
- [ ] Sensitive data encrypted at rest
- [ ] PII handling documented
- [ ] Data retention policy applied
- [ ] Audit logging for sensitive operations
- [ ] GDPR/compliance requirements met
```

### Vulnerability Management

```
Vulnerabilities:
- [ ] Dependencies scanned for CVEs
- [ ] Container images scanned
- [ ] No critical vulnerabilities
- [ ] Security review completed
- [ ] Penetration testing (if required)
```

## Operations Checklist

### Deployment

```
Deployment:
- [ ] CI/CD pipeline configured
- [ ] Deployment is automated
- [ ] Rollback procedure documented
- [ ] Rollback tested
- [ ] Blue-green or canary supported
- [ ] Feature flags for risky changes
```

### Runbooks

```
Runbooks:
- [ ] Startup/shutdown procedures
- [ ] Common troubleshooting steps
- [ ] Escalation procedures
- [ ] Disaster recovery steps
- [ ] Maintenance procedures
```

### On-Call

```
On-Call Readiness:
- [ ] On-call rotation scheduled
- [ ] Team trained on service
- [ ] Escalation paths clear
- [ ] Contact information current
- [ ] Handoff procedures defined
```

## Documentation Checklist

```
Documentation:
- [ ] Architecture diagram current
- [ ] API documentation complete
- [ ] README with setup instructions
- [ ] Dependencies documented
- [ ] Configuration documented
- [ ] Known issues/limitations listed
```

## Rollback Plan

Every production deployment needs a rollback plan:

```
Rollback Plan:
- Rollback trigger: [What conditions trigger rollback]
- Rollback method: [How to roll back - automated/manual]
- Rollback time: [Expected time to complete]
- Data considerations: [Any data migration concerns]
- Verification: [How to verify rollback success]
```

## Pre-Launch Final Checklist

Complete immediately before go-live:

```
Final Pre-Launch:
- [ ] All checklist items above addressed
- [ ] Stakeholders notified of launch
- [ ] War room/incident channel ready
- [ ] Key personnel available
- [ ] Monitoring dashboards open
- [ ] Rollback ready to execute
- [ ] Communication templates prepared
```

## Common Blockers

See [references/common-blockers.md](references/common-blockers.md) for typical issues that block production readiness.

## Additional Resources

- [Common Production Blockers](references/common-blockers.md)
- [SLO Design Guide](references/slo-guide.md)

Overview

This skill provides a practical, systematic production-readiness checklist covering reliability, observability, security, operations, and documentation. It guides teams through pre-launch validation, rollout safeguards, and post-deploy monitoring to reduce risk at go-live. Use it to confirm readiness, drive launch reviews, or prepare rollback and incident plans.

How this skill works

The skill breaks readiness into focused checklists: SLOs, fault tolerance, capacity, data resilience, metrics, logging, tracing, alerts, dashboards, authentication, network and data security, vulnerability scanning, CI/CD, runbooks, on-call, and documentation. Each top-level area is rated pass, partial, or fail, and the checklists call for concrete artifacts such as SLO definitions, runbooks, dashboards, and rollback plans. It ends with a final pre-launch checklist and a rollback plan template to verify operational preparedness.

When to use it

  • Preparing a new service for its first production launch
  • Conducting a go-live readiness review before deployment
  • Assessing service maturity during a launch or audit
  • Completing a pre-deployment security and compliance review
  • Validating runbooks, monitoring, and rollback plans before a release

Best practices

  • Define measurable SLOs and an error budget policy before launch
  • Automate deployments and test rollback procedures regularly
  • Instrument RED metrics, structured logs, and distributed traces from day one
  • Keep secrets in a manager and avoid hardcoded credentials
  • Run load and failover tests to validate capacity and fault tolerance

Example use cases

  • Run a checklist-driven launch readiness review for a microservice
  • Prepare an e-commerce API for Black Friday traffic with capacity and SLO checks
  • Validate observability and on-call readiness for a new feature rollout
  • Create a documented rollback plan and test it during staging
  • Perform a security and dependency vulnerability assessment before go-live

FAQ

How detailed should SLOs and error budgets be?

SLOs should be measurable, tied to user experience (availability, latency, error rate), and realistic. Define error budgets that permit controlled risk and link them to escalation and deployment policies.

What is the minimum observability needed at launch?

At minimum, expose RED metrics, structured logs with trace/request IDs, a service health dashboard, and SLO-based alerts with linked runbooks and an assigned on-call rotation.