
verify-fix skill

/skills/verify-fix

This skill verifies production fix effectiveness by validating logs, metrics, and database state before declaring incidents resolved.

npx playbooks add skill phrazzld/claude-config --skill verify-fix

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: verify-fix
description: "Mandatory incident fix verification with observables. Invoke after: applying production fixes, before declaring incidents resolved, when someone says 'I think that fixed it'. Requires log entries, metric changes, and database state confirmation."
effort: max
---

# /verify-fix - Verify Incident Fix with Observables

**MANDATORY** verification step after any production incident fix.

## Philosophy

A fix is just a hypothesis until proven by metrics. "That should fix it" is not verification.

## When to Use

- After applying ANY fix to a production incident
- Before declaring an incident resolved
- When someone says "I think that fixed it"

## Verification Protocol

### 1. Define Observable Success Criteria

Before testing, explicitly state what we expect to see:

```
SUCCESS CRITERIA:
- [ ] Log entry: "[specific log message]"
- [ ] Metric change: [metric] goes from [X] to [Y]
- [ ] Database state: [field] = [expected value]
- [ ] API response: [endpoint] returns [expected response]
```
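
For instance, for the webhook scenario used later in this skill, filled-in criteria might look like this (the event ID, metric values, and field names are illustrative):

```
SUCCESS CRITERIA (example):
- [ ] Log entry: "Webhook received" for evt_xxx
- [ ] Metric change: pending_webhooks goes from 4 to 3
- [ ] Database state: subscriptionStatus = "active"
- [ ] API response: webhook endpoint returns 200
```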

### 2. Trigger Test Event

```bash
# For webhook issues:
stripe events resend [event_id] --webhook-endpoint [endpoint_id]

# For API issues:
curl -X POST [endpoint] -d '[test payload]'

# For auth issues:
# Log in as test user, perform action
```

### 3. Observe Results

```bash
# Watch logs in real-time
vercel logs [app] --json | grep [pattern]

# Or for Convex:
npx convex logs --prod | grep [pattern]

# Check metrics
stripe events retrieve [event_id] | jq '.pending_webhooks'
```
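
To reduce manual eyeballing, the log check can be scripted. A minimal sketch, assuming a Vercel app named `app` and the literal success message "Webhook received" (both placeholders):

```bash
# Fail the check if the expected log line does not appear within 60 seconds
if timeout 60 vercel logs app --json | grep -m1 "Webhook received"; then
  echo "LOG CHECK: PASS"
else
  echo "LOG CHECK: FAIL - expected log entry not observed"
fi
```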

### 4. Verify Database State

```bash
# Check the affected record
npx convex run --prod [query] '{"id": "[affected_id]"}'
```

### 5. Document Evidence

```
VERIFICATION EVIDENCE:
- Timestamp: [when]
- Test performed: [what we did]
- Log entry observed: [paste relevant log]
- Metric before: [value]
- Metric after: [value]
- Database state confirmed: [yes/no]

VERDICT: [VERIFIED / NOT VERIFIED]
```
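
Evidence can also be captured mechanically so it is timestamped and reproducible. A minimal sketch using the commands above (the event ID and output file are placeholders):

```bash
# Append a timestamped evidence snapshot for the incident record
{
  echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "Metric after: pending_webhooks = $(stripe events retrieve evt_xxx | jq '.pending_webhooks')"
} >> incident-evidence.txt
```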

## Red Flags (Fix NOT Verified)

- "The code looks right now"
- "The config is correct"
- "It should work"
- "Let's wait and see"
- No log entry observed
- Metrics unchanged
- Can't reproduce the original symptom

## Example: Webhook Fix Verification

```bash
# 1. Resend the failing event
stripe events resend evt_xxx --webhook-endpoint we_xxx

# 2. Watch logs (expect to see "Webhook received")
timeout 15 vercel logs app --json | grep webhook

# 3. Check delivery metric (expect decrease)
stripe events retrieve evt_xxx | jq '.pending_webhooks'
# Before: 4, After: 3 = DELIVERY SUCCEEDED

# 4. Check database state
npx convex run --prod users/queries:getUserByClerkId '{"clerkId": "user_xxx"}'
# Expect: subscriptionStatus = "active"

# VERDICT: VERIFIED - all 4 checks passed
```

## If Verification Fails

1. **Don't panic** - the fix hypothesis was wrong, that's okay
2. **Revert** if the fix made things worse
3. **Loop back** to observation phase (OODA-V)
4. **Question assumptions** - what did we miss?
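
If the fix needs to come out, a plain revert is usually the safest first move. A minimal sketch, assuming the fix was a single commit deployed from `main`:

```bash
# Undo the fix commit without rewriting history, then let the normal pipeline redeploy
git revert --no-edit <fix-commit-sha>
git push origin main
```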

Overview

This skill enforces a mandatory, observable verification step after any production fix. It turns “I think that fixed it” into evidence-driven validation before declaring incidents resolved. The goal is to require logs, metrics, and database confirmation so fixes are proven, not assumed.

How this skill works

Define clear observable success criteria first (specific log messages, metric deltas, and database state). Trigger a controlled test event, observe logs and metrics in real time, query the database for the expected state, and record timestamped evidence. If all observables match the criteria, mark the incident VERIFIED; otherwise treat the fix as failed and iterate.

When to use it

  • After applying any fix to production
  • Before declaring an incident resolved
  • When someone says “I think that fixed it”
  • When reproducing a prior failing event is possible
  • During post-deployment smoke tests

Best practices

  • Write explicit success criteria before running tests (logs, metrics, DB state, API response)
  • Use controlled test events that reproduce the original failure path
  • Capture timestamps, command output, and metric snapshots as verification evidence
  • Prefer automated checks for metrics and logs to reduce human error (a sketch of such a check follows this list)
  • If verification fails, revert if harmful and loop back to diagnostics quickly
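
A minimal sketch of an automated metric check, using the pending_webhooks example from the skill (the event ID and baseline value are placeholders):

```bash
# Compare the delivery backlog before and after the fix
before=4  # value observed before applying the fix
after=$(stripe events retrieve evt_xxx | jq '.pending_webhooks')
if [ "$after" -lt "$before" ]; then
  echo "METRIC CHECK: PASS ($before -> $after)"
else
  echo "METRIC CHECK: FAIL (still $after)"
fi
```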

Example use cases

  • Webhook delivery fix: resend the failing event, watch webhook logs, confirm pending_webhooks decreases, and verify DB record updated
  • Authentication bug: log in as a test user, perform the action, confirm specific auth log entry, and check user session state in the DB
  • API response regression: curl the endpoint with a test payload, assert the expected response, and validate the related metric change and DB record (see the sketch after this list)
  • Configuration change: deploy config, trigger a related request, observe the new config-driven log line and metric improvement
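
For the API regression case, a minimal sketch of the assertion step (the endpoint, payload, and expected field are illustrative placeholders, not part of the skill):

```bash
# POST a known test payload and assert both the status code and a field in the body
status=$(curl -sS -X POST https://example.com/api/orders \
  -H "Content-Type: application/json" \
  -d '{"sku": "test-sku", "qty": 1}' \
  -o /tmp/resp.json -w "%{http_code}")
if [ "$status" = "200" ] && jq -e '.orderId != null' /tmp/resp.json > /dev/null; then
  echo "API CHECK: PASS"
else
  echo "API CHECK: FAIL (status=$status)"
fi
```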

FAQ

What counts as sufficient evidence?

Sufficient evidence is a timestamped log entry matching the expected message, a measurable metric change that moves the system back toward normal, and database state confirming the intended write or flag change. Where applicable, satisfy at least one item from each category: logs, metrics, and database state.

What if the original symptom cannot be reproduced?

If you cannot reproduce the symptom, do not mark the fix verified. Investigate why reproduction is failing, collect related telemetry, and consider staged rollbacks or feature flags while you continue diagnosis.