home / skills / whawkinsiv / solo-founder-superpowers / monitor

monitor skill

Q: How do I avoid alert fatigue?

Limit critical SMS alerts to true outages or big error spikes, route everything else to email or daily digests, and tune thresholds based on normal traffic patterns.

safe

/skills/monitor

This skill helps you set up production monitoring and respond to incidents by tracking uptime, errors, and performance across your SaaS app.

npx playbooks add skill whawkinsiv/solo-founder-superpowers --skill monitor

Review the files below or copy the command above to add this skill to your agents.

Files (3)

SKILL.md

9.0 KB

---
name: monitor
description: "Use this skill when the user needs to set up production monitoring, track app health, configure error alerts, or respond to incidents. Covers error tracking, uptime monitoring, performance metrics, and incident response for SaaS applications."
---

# Monitor

## Monitoring Checklist

```
Basic Monitoring:
- [ ] Uptime monitoring (is site up?)
- [ ] Error tracking (are errors happening?)
- [ ] Performance monitoring (is it slow?)
- [ ] User activity (are people using it?)
- [ ] Critical alerts configured
- [ ] Check dashboard daily
```

See [MONITORING-SETUP.md](MONITORING-SETUP.md) for implementation.

---

## Why Monitor?

**Without monitoring:**
- Users hit errors, you don't know
- Site goes down, you find out from Twitter
- Slow performance, users leave silently
- Security issues, no alert

**With monitoring:**
- Errors show in dashboard immediately
- Get text when site goes down
- See performance degradation
- Catch issues before users complain

**Goal: Know about problems before users tell you.**

---

## Three Essential Monitors

### 1. Is It Up?

**Uptime monitoring** - Pings your app every minute

**Free tools:**
- UptimeRobot (free, 50 monitors)
- Pingdom (limited free tier)
- Vercel/Netlify (built-in for deployed apps)

**Setup:**
```
1. Sign up for UptimeRobot
2. Add monitor for https://yourapp.com
3. Add your email for alerts
4. Get texted if site is down
```

### 2. Are There Errors?

**Error tracking** - Captures JavaScript errors and API failures

**Free tools:**
- Sentry (free tier: 5k errors/month)
- LogRocket (limited free)
- Vercel/Netlify logs (for deployed apps)

**Setup:**
```
Tell AI:
"Add Sentry error tracking:
- Capture all frontend errors
- Capture all API errors
- Include user context
- Send to Sentry dashboard"
```

### 3. Is It Slow?

**Performance monitoring** - Tracks page load times

**Free tools:**
- Vercel Analytics (built-in)
- Google PageSpeed Insights (free)
- Cloudflare Analytics (free tier)

**Setup:**
- Usually automatic with hosting platform
- Check dashboard weekly

---

## What to Monitor

### Critical Metrics

**Must monitor:**
- Site uptime (99%+)
- Error rate (< 1% of requests)
- API response time (< 500ms)
- Page load time (< 3s)

**Nice to have:**
- Active users
- Feature usage
- Conversion rates
- User paths

**For MVP:** Focus on the "must monitor" only.

---

## Setting Up Alerts

**Configure alerts for:**

**Critical (text me immediately):**
- Site is down
- Error rate spike (10x normal)
- Database connection lost
- Payment processing failing

**Important (email within hour):**
- API slow (>2 seconds)
- Error rate elevated (2x normal)
- Disk space low (>80%)

**Informational (daily digest):**
- New errors discovered
- Performance trending down
- Traffic patterns

**Tell AI:**
```
Configure monitoring alerts:
- Critical: Text to [phone]
- Important: Email to [email]
- Send summary: Daily at 9am
```

---

## Daily Monitoring Routine

**5-minute morning check:**

```
Daily Check:
1. Open monitoring dashboard
2. Check uptime (should be 100% yesterday)
3. Check error count (any spikes?)
4. Check performance (slower than usual?)
5. Review any alerts from overnight
```

**If all green:** You're done, 5 minutes.

**If red:** Investigate using debug skill.

---

## Reading Monitoring Dashboards

### Uptime Dashboard

**Green:** Site responding  
**Red:** Site down or slow to respond

**What to check:**
- Uptime percentage (target: 99%+)
- Response time (target: <500ms)
- Recent downtime incidents

### Error Dashboard

**Look for:**
- Error count spikes (sudden jump)
- New error types (didn't see before)
- Affected users (how many hit this?)
- Error frequency (happening a lot?)

**Priority:**
- Affecting many users → High priority
- Blocking key features → High priority
- Edge case error → Lower priority

### Performance Dashboard

**Look for:**
- Load time trending up (getting slower)
- Slow endpoints (which API calls)
- Slow pages (which routes)
- Geographic differences (slow in specific regions)

---

## Error Investigation

**When errors spike:**

```
1. Open error tracking dashboard (Sentry)
2. Find the most frequent error
3. Read error message and stack trace
4. Note: How many users affected?
5. Note: Started when?
6. Check: Did we deploy recently?
```

**Give to AI:**
```
Error in production:
[Paste error message and stack trace]

Affected: [X] users in last [Y] hours
Started: [timestamp]
Recent deploys: [any?]

Please:
1. Explain what's wrong
2. Propose hotfix
3. How to test before deploying
```

---

## User-Reported Issues

**When user reports problem:**

```
User Report Investigation:
1. Can you reproduce it?
2. Check monitoring for errors at that time
3. Check logs for that user
4. Check if others affected
5. Determine severity

Then use debug skill to fix.
```

**Tell AI:**
```
User reported: [issue description]
User: [email or ID]
Timestamp: [when it happened]

Check monitoring and logs for this user at this time.
What errors or issues do you see?
```

---

## Proactive Monitoring

**Catch issues before users:**

**Weekly checks:**
```
Weekly Review:
- [ ] Error trends (going up or down?)
- [ ] Performance trends (slower?)
- [ ] New error types introduced
- [ ] Uptime issues resolved
- [ ] Alert noise (too many false alerts?)
```

**Monthly checks:**
```
Monthly Health:
- [ ] Compare to last month
- [ ] Any degradation?
- [ ] Any improvements?
- [ ] Monitoring gaps (what's not tracked?)
```

---

## Free Monitoring Stack

**Recommended for MVP:**

**Uptime:**
- UptimeRobot (free) - 50 monitors

**Errors:**
- Sentry (free) - 5k errors/month

**Performance:**
- Vercel Analytics (free on Vercel)
- Cloudflare Analytics (free)

**Logs:**
- Platform logs (Vercel, Netlify, Railway)

**Cost: $0/month until you need more.**

---

## When to Upgrade Monitoring

**Upgrade when:**
- Hitting free tier limits
- Need more detailed analytics
- Need faster alert response
- Need advanced features (session replay, etc.)

**Paid tiers (typically $20-50/mo):**
- Sentry Pro ($26/mo)
- LogRocket ($99/mo - session replay)
- DataDog ($15/host/mo)

**For < 1000 users:** Free tiers sufficient.

---

## Common Monitoring Mistakes

| Mistake | Fix |
|---------|-----|
| No monitoring set up | Set up before launch |
| Alert fatigue (too many alerts) | Only alert on critical issues |
| Checking once a month | Check daily (5 minutes) |
| Ignoring trends | Watch for degradation over time |
| No alerts configured | Set up text alerts for downtime |
| Monitoring but not acting | Use monitoring to find and fix issues |

---

## Interpreting Trends

**Good trends:**
- Errors decreasing
- Performance improving
- Uptime stable at 99.9%+

**Warning trends:**
- Errors slowly increasing
- Performance slowly degrading
- Uptime dipping below 99%

**Critical trends:**
- Sudden error spike
- Sudden performance drop
- Multiple downtime incidents

**Action:** Address warning trends before they become critical.

---

## Logging vs Monitoring

**Logging:**
- Records what happened
- For debugging specific issues
- Detailed, verbose
- Review when investigating

**Monitoring:**
- Tracks overall health
- For catching issues early
- High-level metrics
- Review daily

**Both needed:** Monitoring alerts you, logs help debug.

---

## Setting Up Logging

**Tell AI:**
```
Add application logging:
- Log all errors with context
- Log API requests/responses
- Log slow operations (>1s)
- Log authentication events
- Don't log sensitive data

Format: JSON with timestamp, level, message, context
Send to: [Platform logs or external service]
```

**Log levels:**
- ERROR: Something broke
- WARN: Something concerning
- INFO: Normal operations
- DEBUG: Detailed debugging info

**Production:** Log ERROR and WARN only.

---

## Monitoring Integrations

**Third-party services:**

**Payments (Stripe):**
- Failed payments alert
- Refund requests alert
- Subscription cancellations (daily digest)

**Email (SendGrid):**
- Delivery failures alert
- Bounce rate elevated alert
- Spam complaints alert

**Database:**
- Connection pool exhausted
- Slow queries (>1s)
- Disk space low

**Tell AI:**
```
Add monitoring for [service]:
- Alert on failures
- Track success rate
- Log errors with context
```

---

## Incident Response

**When alerts fire:**

```
Incident Response:
1. Acknowledge alert (mark as seen)
2. Assess severity:
   - Critical: Site down, payments failing
   - High: Errors affecting many users
   - Medium: Isolated issues
3. Immediate action:
   - Critical: Hotfix or rollback
   - High: Fix within hours
   - Medium: Fix in next deploy
4. Update users if needed
5. Post-mortem after resolved
```

**Critical incidents:**
```
1. Assess impact (how many affected?)
2. Quick fix or rollback
3. Deploy hotfix
4. Verify fixed
5. Monitor closely for hour
6. Update status page if you have one
```

---

## Success Looks Like

✅ Know about issues before users report them  
✅ Uptime >99.9%  
✅ Errors caught and fixed quickly  
✅ Performance trends stable or improving  
✅ Daily monitoring routine (5 minutes)  
✅ Alerts configured and actionable  
✅ Issues resolved proactively

Overview

This skill helps founders set up and run production monitoring for SaaS apps. It focuses on uptime checks, error tracking, performance metrics, and incident response so you learn about problems before users do. Use it to configure alerts, run a daily monitoring routine, and act on incidents quickly.

How this skill works

The skill guides you through three core monitors: uptime pings, error tracking (frontend and API), and performance monitoring (page/API timings). It defines critical metrics and alert tiers, suggests free tools and integrations, and provides checklists for daily, weekly, and monthly reviews. When an alert fires it gives a clear incident response flow and a template for feeding errors into AI for diagnosis and hotfix suggestions.

When to use it

Before launch to ensure basic health checks are in place
When you need automated alerts for downtime, errors, or slow performance
After deploying changes to detect regressions quickly
When investigating user-reported issues tied to errors or timing
When preparing SLOs and on-call procedures

Best practices

Start with three essentials: uptime, errors, performance — focus on must-monitor metrics for an MVP
Configure tiered alerts: critical (SMS), important (email), informational (daily digest)
Keep alert noise low—only SMS for outages and large error spikes
Run a 5-minute daily check: uptime, error count, performance, overnight alerts
Log errors with context (JSON) but avoid sensitive data; use logs for debugging, monitoring for detection

Example use cases

Set up UptimeRobot to ping your site every minute and SMS you on downtime
Add Sentry to capture frontend and API errors, including user context and frequency
Enable Vercel/Cloudflare analytics to track page load times and slow endpoints
Create alert rules: text on site down, email on elevated error rate, daily digest of new errors
Run a weekly review of error and performance trends and tune alert thresholds

FAQ

What are the minimum metrics to monitor for an MVP?

Uptime, error rate, API response time, and page load time — keep targets simple (99%+ uptime, <1% error rate, API <500ms, page <3s).

How do I avoid alert fatigue?

Limit critical SMS alerts to true outages or big error spikes, route everything else to email or daily digests, and tune thresholds based on normal traffic patterns.