This skill guides safe agent deployment through canaries, feature flags, and kill switches, enabling rapid rollouts with automated rollback safeguards.
Install with: `npx playbooks add skill amnadtaowsoam/cerebraskills --skill rollout-and-kill-switch`
---
name: Rollout and Kill Switch
description: Comprehensive guide to safe agent deployment strategies including canary releases, feature flags, kill switches, and automated rollback mechanisms
---
# Rollout and Kill Switch
## Why Controlled Rollouts?
**Problem:** Deploying agent changes to all users at once is risky: a single defect reaches everyone before monitoring can catch it.
### Risks
```
Bug affects all users
Performance issues at scale
Unexpected behavior
No easy rollback
```
### Solution: Gradual Rollout
```
1% → Monitor → 10% → Monitor → 50% → Monitor → 100%
Issues detected early → Affect fewer users → Easy rollback
```
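To make the "affect fewer users" point concrete, the sketch below computes how many users a defect would reach at each stage of the ramp. The user count of 100,000 and the helper name `exposed_users` are illustrative assumptions, not figures from a real system.

```python
def exposed_users(total_users, rollout_percentage):
    """Approximate number of users who receive the new version at a given stage."""
    return total_users * rollout_percentage // 100


# Hypothetical user base; stages match the 1% -> 10% -> 50% -> 100% ramp
TOTAL_USERS = 100_000
for pct in (1, 10, 50, 100):
    print(f"{pct:>3}% rollout exposes {exposed_users(TOTAL_USERS, pct):,} users")
```

A bug caught at the 1% stage reaches about 1,000 of these users instead of all 100,000.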
---
## Rollout Strategies
### Canary Deployment
```
Deploy new version to small % of users
Monitor metrics
If good, increase %
If bad, rollback
Timeline:
Day 1: 1% of users
Day 2: 5% of users
Day 3: 10% of users
Day 4: 25% of users
Day 5: 50% of users
Day 6: 100% of users
```
### Blue-Green Deployment
```
Blue: Current version (100% traffic)
Green: New version (0% traffic)
Test green → Switch traffic → Green becomes blue
Instant rollback: Switch back to blue
```
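A blue-green switch can be pictured as a single pointer that decides which environment serves traffic. The following is a minimal in-process sketch, assuming each environment is just a callable; `BlueGreenRouter` and the lambda handlers are hypothetical names, and a real setup would usually flip a load balancer or DNS target instead.

```python
class BlueGreenRouter:
    """Holds two environments and a pointer to the one serving traffic."""

    def __init__(self, blue, green):
        self.environments = {"blue": blue, "green": green}
        self.live = "blue"  # Blue serves 100% of traffic at the start

    def idle(self):
        """Name of the environment that is not serving traffic."""
        return "green" if self.live == "blue" else "blue"

    def switch(self):
        """Cut all traffic over to the idle environment; returns the new live name."""
        self.live = self.idle()
        return self.live

    def handle(self, request):
        return self.environments[self.live](request)


# Hypothetical handlers standing in for two deployed agent versions
router = BlueGreenRouter(blue=lambda r: f"v1:{r}", green=lambda r: f"v2:{r}")
before = router.handle("hi")   # served by blue
router.switch()                # cut over to green
after = router.handle("hi")    # served by green
router.switch()                # instant rollback to blue
```

Because rollback is the same operation as the cutover, it is as fast as the switch itself.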
### Feature Flags
```
Deploy code to all users
Feature disabled by default
Enable for specific users/% of traffic
Monitor
Enable for all
```
---
## Implementation
### Feature Flags
```python
import hashlib


def rollout_bucket(flag_name, user_id):
    """Map a user to a stable bucket in [0, 100) for this flag."""
    # Built-in hash() is salted per process for strings, so it is not
    # stable across restarts or workers; use a real digest instead.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


class FeatureFlags:
    def __init__(self):
        self.flags = {}

    def is_enabled(self, flag_name, user_id=None, default=False):
        flag = self.flags.get(flag_name, {})
        # Check if globally enabled
        if flag.get("enabled", default):
            return True
        if user_id is None:
            return False  # Whitelist and percentage checks need a user
        # Check user whitelist
        if user_id in flag.get("whitelist", []):
            return True
        # Check rollout percentage (same user always gets the same result)
        return rollout_bucket(flag_name, user_id) < flag.get("rollout_percentage", 0)


# Usage
flags = FeatureFlags()
flags.flags = {
    "new_agent_version": {
        "enabled": False,
        "rollout_percentage": 10,  # 10% of users
        "whitelist": ["user_123", "user_456"],  # Always enabled for these users
    }
}

# AgentV1 and AgentV2 stand in for your own agent classes
if flags.is_enabled("new_agent_version", user_id="user_789"):
    # Use new agent version
    agent = AgentV2()
else:
    # Use old agent version
    agent = AgentV1()
```
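Percentage rollouts only work if each user lands in the same bucket on every request and in every process. Python's built-in `hash()` is salted per process for strings, so the sketch below uses a SHA-256 digest instead and checks two properties: the enabled share is close to the configured percentage, and users enabled at 10% stay enabled at 25%. The helper names and the synthetic user IDs are assumptions for illustration.

```python
import hashlib


def bucket(flag_name, user_id):
    """Stable bucket in [0, 100) derived from a SHA-256 digest."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def enabled_users(flag_name, user_ids, rollout_percentage):
    """Set of users whose bucket falls below the rollout percentage."""
    return {u for u in user_ids if bucket(flag_name, u) < rollout_percentage}


users = [f"user_{i}" for i in range(10_000)]  # synthetic IDs for illustration
at_10 = enabled_users("new_agent_version", users, 10)
at_25 = enabled_users("new_agent_version", users, 25)

print(f"{len(at_10) / len(users):.1%} of users enabled at 10%")
print("10% cohort kept at 25%:", at_10 <= at_25)
```

Including the flag name in the hashed key is a design choice: it gives each flag its own cohort, so the same users are not always the first to receive every new feature.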
### Database-Backed Feature Flags
```sql
CREATE TABLE feature_flags (
    name VARCHAR(255) PRIMARY KEY,
    enabled BOOLEAN DEFAULT FALSE,
    rollout_percentage INT DEFAULT 0,
    whitelist JSONB DEFAULT '[]',
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
```
```python
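import hashlib


def is_feature_enabled(flag_name, user_id):
    # `db` is your database client; query_one returns a dict-like row or None
    flag = db.query_one("""
        SELECT enabled, rollout_percentage, whitelist
        FROM feature_flags
        WHERE name = %s
    """, (flag_name,))
    if not flag:
        return False
    if flag["enabled"]:
        return True
    if user_id in flag["whitelist"]:
        return True
    # Stable bucket in [0, 100); built-in hash() is salted per process
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < flag["rollout_percentage"]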
```
---
## Kill Switch
### Emergency Stop
```python
class KillSwitch:
    def __init__(self):
        self.killed = False

    def activate(self, reason):
        self.killed = True
        log_event(f"Kill switch activated: {reason}")
        send_alert(f"🚨 Kill switch activated: {reason}")

    def deactivate(self):
        self.killed = False
        log_event("Kill switch deactivated")

    def is_active(self):
        return self.killed


# Global kill switch (in-process only: each worker holds its own copy)
kill_switch = KillSwitch()


# In agent code
def run_agent(user_input):
    if kill_switch.is_active():
        return "Service temporarily unavailable. Please try again later."
    # Normal agent logic
    return agent.run(user_input)


# Activate kill switch
kill_switch.activate("High error rate detected")
```
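Repeating the `is_active()` check in every entry point is easy to forget. One possible way to centralize it, sketched here with a hypothetical `guarded_by` decorator and a stripped-down `KillSwitch` (no logging or alerting), is to wrap each entry point so it returns a fallback while the switch is on.

```python
import functools


class KillSwitch:
    """Minimal in-process switch; logging and alerting omitted for brevity."""

    def __init__(self):
        self.killed = False
        self.reason = None

    def activate(self, reason):
        self.killed, self.reason = True, reason

    def deactivate(self):
        self.killed, self.reason = False, None


def guarded_by(switch, fallback):
    """Decorator: return `fallback` instead of calling the function while the switch is on."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if switch.killed:
                return fallback
            return func(*args, **kwargs)
        return wrapper
    return decorator


kill_switch = KillSwitch()
UNAVAILABLE = "Service temporarily unavailable. Please try again later."


@guarded_by(kill_switch, UNAVAILABLE)
def run_agent(user_input):
    return f"agent reply to: {user_input}"  # placeholder for real agent logic
```

The switch is read on every call, so activating it takes effect on the next request without redeploying.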
### Database-Backed Kill Switch
```sql
CREATE TABLE kill_switches (
    name VARCHAR(255) PRIMARY KEY,
    active BOOLEAN DEFAULT FALSE,
    reason TEXT,
    activated_by VARCHAR(100),
    activated_at TIMESTAMPTZ,
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
```
```python
def is_kill_switch_active(name):
    result = db.query_one("""
        SELECT active FROM kill_switches WHERE name = %s
    """, (name,))
    return result["active"] if result else False


def activate_kill_switch(name, reason, activated_by):
    # Upsert: EXCLUDED refers to the row proposed for insertion (PostgreSQL)
    db.execute("""
        INSERT INTO kill_switches (name, active, reason, activated_by, activated_at)
        VALUES (%s, TRUE, %s, %s, NOW())
        ON CONFLICT (name) DO UPDATE
        SET active = TRUE,
            reason = EXCLUDED.reason,
            activated_by = EXCLUDED.activated_by,
            activated_at = NOW(),
            updated_at = NOW()
    """, (name, reason, activated_by))
    send_alert(f"🚨 Kill switch '{name}' activated: {reason}")
```
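For experimenting with the database-backed pattern without a PostgreSQL server, here is a self-contained sketch that uses Python's built-in `sqlite3` with an in-memory database. It is an assumption-laden stand-in: the column types are simplified, the upsert syntax requires SQLite 3.24 or newer, and an in-memory database is not shared between processes, so a real deployment would point every worker at one shared database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE kill_switches (
        name TEXT PRIMARY KEY,
        active INTEGER DEFAULT 0,
        reason TEXT,
        activated_by TEXT
    )
""")


def activate(name, reason, activated_by):
    """Insert the switch or, if it already exists, turn it on."""
    conn.execute(
        """
        INSERT INTO kill_switches (name, active, reason, activated_by)
        VALUES (?, 1, ?, ?)
        ON CONFLICT (name) DO UPDATE
        SET active = 1, reason = excluded.reason, activated_by = excluded.activated_by
        """,
        (name, reason, activated_by),
    )
    conn.commit()


def deactivate(name):
    conn.execute("UPDATE kill_switches SET active = 0 WHERE name = ?", (name,))
    conn.commit()


def is_active(name):
    row = conn.execute(
        "SELECT active FROM kill_switches WHERE name = ?", (name,)
    ).fetchone()
    return bool(row[0]) if row else False  # Unknown switch counts as off


def run_agent(user_input):
    if is_active("agent"):
        return "Service temporarily unavailable. Please try again later."
    return f"echo: {user_input}"  # placeholder for real agent logic
```

Calling `activate("agent", "High error rate", "oncall")` makes `run_agent` return the unavailable message until `deactivate("agent")` is called.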
---
## Monitoring and Auto-Rollback
### Monitor Metrics
```python
def monitor_agent_metrics(version):
    # Get metrics for last hour
    metrics = db.query_one("""
        SELECT
            COUNT(*) AS total_requests,
            COALESCE(SUM(CASE WHEN success THEN 1 ELSE 0 END), 0) AS successes,
            AVG(latency_ms) AS avg_latency,
            COALESCE(SUM(CASE WHEN error THEN 1 ELSE 0 END), 0) AS errors
        FROM agent_logs
        WHERE version = %s
            AND timestamp > NOW() - INTERVAL '1 hour'
    """, (version,))
    total = metrics["total_requests"]
    if total == 0:
        # No traffic in the window: report neutral values instead of dividing
        # by zero; callers can inspect total_requests to tell the difference
        return {"success_rate": 1.0, "error_rate": 0.0, "avg_latency": 0.0,
                "total_requests": 0}
    return {
        "success_rate": metrics["successes"] / total,
        "error_rate": metrics["errors"] / total,
        "avg_latency": metrics["avg_latency"],
        "total_requests": total,
    }
```
### Auto-Rollback on Failures
```python
def auto_rollback_check(current_version, previous_version):
    metrics = monitor_agent_metrics(current_version)
    # Thresholds: roll back once, on the first breach
    if metrics["success_rate"] < 0.95:  # < 95% success
        rollback(current_version, previous_version, "Low success rate")
    elif metrics["error_rate"] > 0.05:  # > 5% errors
        rollback(current_version, previous_version, "High error rate")
    elif metrics["avg_latency"] > 5000:  # > 5 seconds
        rollback(current_version, previous_version, "High latency")


def rollback(from_version, to_version, reason):
    # Deactivate current version
    db.execute("""
        UPDATE feature_flags
        SET enabled = FALSE
        WHERE name = %s
    """, (f"agent_{from_version}",))
    # Activate previous version
    db.execute("""
        UPDATE feature_flags
        SET enabled = TRUE
        WHERE name = %s
    """, (f"agent_{to_version}",))
    log_event(f"Auto-rolled back from {from_version} to {to_version}: {reason}")
    send_alert(f"🔄 Auto-rollback: {from_version} → {to_version} ({reason})")
```
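Keeping the threshold comparison separate from the database and alerting calls makes it easy to unit test. The sketch below is one way to do that; `RollbackThresholds` and `rollback_reasons` are hypothetical names, and the default values simply mirror the 95% / 5% / 5-second thresholds used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackThresholds:
    min_success_rate: float = 0.95     # roll back below 95% success
    max_error_rate: float = 0.05       # roll back above 5% errors
    max_avg_latency_ms: float = 5000   # roll back above 5 seconds


def rollback_reasons(metrics, thresholds=RollbackThresholds()):
    """Return every breached threshold as a readable reason (empty list = healthy)."""
    reasons = []
    if metrics["success_rate"] < thresholds.min_success_rate:
        reasons.append("Low success rate")
    if metrics["error_rate"] > thresholds.max_error_rate:
        reasons.append("High error rate")
    if metrics["avg_latency"] > thresholds.max_avg_latency_ms:
        reasons.append("High latency")
    return reasons


# Made-up metric snapshots for illustration
healthy = {"success_rate": 0.99, "error_rate": 0.01, "avg_latency": 800}
degraded = {"success_rate": 0.90, "error_rate": 0.10, "avg_latency": 800}
print(rollback_reasons(healthy))
print(rollback_reasons(degraded))
```

A caller can then roll back once if the list is non-empty and include all reasons in a single alert.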
---
## Gradual Rollout Automation
### Increase Rollout Percentage
```python
import time


def gradual_rollout(flag_name, target_percentage=100, step=10, interval_hours=24):
    """
    Gradually increase rollout percentage

    Args:
        flag_name: Feature flag name
        target_percentage: Final percentage (default 100%)
        step: Increase by this % each interval (default 10%)
        interval_hours: Hours between increases (default 24)
    """
    # get_rollout_percentage / set_rollout_percentage read and write the
    # flag's rollout_percentage in your flag store
    current_pct = get_rollout_percentage(flag_name)
    while current_pct < target_percentage:
        # Check metrics before increasing (assumes agent_logs.version
        # records the flag name for requests served by the new version)
        metrics = monitor_agent_metrics(flag_name)
        if metrics["success_rate"] < 0.95:
            send_alert(f"⚠️ Rollout paused: Low success rate ({metrics['success_rate']:.2%})")
            break
        # Increase percentage
        new_pct = min(current_pct + step, target_percentage)
        set_rollout_percentage(flag_name, new_pct)
        log_event(f"Increased {flag_name} rollout to {new_pct}%")
        current_pct = new_pct
        # Wait before the next check, but not after reaching the target
        if current_pct < target_percentage:
            time.sleep(interval_hours * 3600)


# Usage
gradual_rollout("new_agent_version", target_percentage=100, step=10, interval_hours=24)
```
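The loop above holds a process open for days with `time.sleep`, and a restart loses its place. One alternative, sketched here under the assumption that a scheduler (cron, a job queue) calls it once per interval, is a pure step function that decides the next percentage from the current one. The name `next_rollout_step` and the sample success rates are made up for illustration.

```python
def next_rollout_step(current_pct, success_rate, step=10, target=100,
                      min_success_rate=0.95):
    """Decide the next percentage for one scheduler tick; returns (new_pct, action)."""
    if success_rate < min_success_rate:
        return current_pct, "pause"      # Hold the current exposure
    if current_pct >= target:
        return current_pct, "done"       # Nothing left to ramp
    return min(current_pct + step, target), "increase"


# Simulate a ramp with made-up success rates, one per scheduler tick
pct = 10
history = []
for rate in (0.99, 0.98, 0.93, 0.99):
    pct, action = next_rollout_step(pct, rate, step=20)
    history.append((pct, action))
print(history)
```

Because the function keeps no state of its own, the current percentage can live in the flag store and the ramp survives restarts.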
---
## Feature Flag Services
### LaunchDarkly
```python
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key-123"))
client = ldclient.get()

# Check flag (SDK 8+ evaluates against a Context; older SDKs took a user dict)
context = Context.create("user_123")
show_new_feature = client.variation("new-agent-version", context, False)
if show_new_feature:
    agent = AgentV2()
else:
    agent = AgentV1()
```
### Split.io
```python
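from splitio import get_factory
from splitio.exceptions import TimeoutException

factory = get_factory("api-key-123")
try:
    factory.block_until_ready(5)  # Wait up to 5 seconds for the SDK to sync
except TimeoutException:
    pass  # Not ready: get_treatment returns "control", so the old agent is used
client = factory.client()

# Check flag
treatment = client.get_treatment("user_123", "new-agent-version")
if treatment == "on":
    agent = AgentV2()
else:
    agent = AgentV1()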
```
### Unleash (Open Source)
```python
from UnleashClient import UnleashClient

client = UnleashClient(
    url="http://unleash.example.com/api",
    app_name="my-agent",
    custom_headers={"Authorization": "..."}
)
client.initialize_client()

# Check flag
if client.is_enabled("new-agent-version", {"userId": "user_123"}):
    agent = AgentV2()
else:
    agent = AgentV1()
```
---
## Best Practices
### 1. Start Small (1-5%)
```python
# Good
set_rollout_percentage("new_feature", 1) # Start with 1%
# Bad
set_rollout_percentage("new_feature", 50) # Too aggressive
```
### 2. Monitor Closely
```python
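import time

# Monitor every 5 minutes during rollout
# (rollout_in_progress and threshold are set by your rollout tooling)
while rollout_in_progress:
    metrics = monitor_agent_metrics("new_version")
    if metrics["error_rate"] > threshold:
        rollback("new_version", "old_version", "High error rate")
        break
    time.sleep(300)  # 5 minutes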
```
### 3. Have Rollback Plan
```python
# Always know how to rollback
rollback_plan = {
"method": "Feature flag toggle",
"steps": [
"1. Set feature_flag.enabled = False",
"2. Verify traffic switched to old version",
"3. Monitor for 1 hour"
],
"contact": "[email protected]"
}
```
### 4. Test Rollback
```python
# Regularly test rollback procedure
def test_rollback():
# Enable new version
enable_feature("new_version")
assert is_feature_enabled("new_version")
# Rollback
disable_feature("new_version")
assert not is_feature_enabled("new_version")
# Verify old version works
response = agent_v1.run("test input")
assert response is not None
```
### 5. Communicate Changes
```python
# Notify team before rollout
send_notification(
channel="#agent-ops",
message=f"Starting rollout of new agent version to 10% of users. Monitoring dashboard: {dashboard_url}"
)
```
---
## Rollout Checklist
### Pre-Rollout
```
☐ Code reviewed and approved
☐ Tests passing (unit, integration, e2e)
☐ Monitoring dashboard ready
☐ Rollback plan documented
☐ Team notified
☐ Oncall engineer assigned
```
### During Rollout
```
☐ Start at 1-5%
☐ Monitor metrics every 5-15 minutes
☐ Check error logs
☐ Verify user feedback
☐ Gradually increase % (10%, 25%, 50%, 100%)
☐ Wait 24 hours between increases
```
### Post-Rollout
```
☐ Verify 100% rollout successful
☐ Monitor for 48 hours
☐ Remove feature flag (if permanent)
☐ Document lessons learned
☐ Update runbooks
```
---
## Summary
**Rollout Strategies:**
- Canary (gradual % increase)
- Blue-green (instant switch)
- Feature flags (selective enable)
**Kill Switch:**
- Emergency stop
- Database-backed
- Alert on activation
**Auto-Rollback:**
- Monitor metrics
- Rollback on failures
- Alert team
**Feature Flag Services:**
- LaunchDarkly
- Split.io
- Unleash (open source)
**Best Practices:**
- Start small (1-5%)
- Monitor closely
- Have rollback plan
- Test rollback
- Communicate changes
**Rollout Timeline:**
- Day 1: 1%
- Day 2: 5%
- Day 3: 10%
- Day 4: 25%
- Day 5: 50%
- Day 6: 100%
---
## Overview
This skill is a practical guide for safely deploying agent updates using canary releases, feature flags, blue-green switches, kill switches, and automated rollback. It focuses on controlled rollouts, monitoring, and clear rollback procedures to reduce blast radius and speed up incident response. The content includes code patterns, database schemas, and a rollout checklist you can apply directly.

It describes deployment patterns that limit exposure: incremental percentage rollouts (canary), parallel environments (blue-green), and runtime toggles (feature flags). It also provides kill switch implementations for emergency stops and automated monitoring that triggers rollback when thresholds (error rate, success rate, latency) are breached. Examples use lightweight Python and SQL snippets and integrate with feature-flag services.

---
## FAQ
### How fast should I increase rollout percentage?
Start at 1–5%, monitor closely for a few hours, then step up by 5–25% depending on risk and stability. Wait between increments to observe the effects.
### What thresholds trigger auto-rollback?
Typical thresholds are a success rate below 95%, an error rate above 5%, or an average latency above an acceptable limit (for example, 5 seconds). Tune them to your service SLOs.
### When should I use blue-green instead of canary?
Use blue-green for an instant cutover or when you need full-environment testing before switching traffic. Use canary for gradual exposure and metric-driven ramping.