This skill helps you think through server management decisions, balancing process control, monitoring, scaling, and security to run reliable systems.
---
name: server-management
description: Server management principles and decision-making. Process management, monitoring strategy, and scaling decisions. Teaches thinking, not commands.
allowed-tools: Read, Write, Edit, Glob, Grep, Bash
---
# Server Management
> Server management principles for production operations.
> **Learn to THINK, not memorize commands.**
---
## 1. Process Management Principles
### Tool Selection
| Scenario | Tool |
|----------|------|
| **Node.js app** | PM2 (clustering, reload) |
| **Any app** | systemd (Linux native) |
| **Containers** | Docker/Podman |
| **Orchestration** | Kubernetes, Docker Swarm |
### Process Management Goals
| Goal | What It Means |
|------|---------------|
| **Restart on crash** | Auto-recovery |
| **Zero-downtime reload** | No service interruption |
| **Clustering** | Use all CPU cores |
| **Persistence** | Survive server reboot |
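
The "restart on crash" goal can be illustrated with a tiny supervisor loop. This is only a sketch of the idea that systemd or PM2 implement far more robustly; the function name `supervise`, the restart limit, and the linear backoff are assumptions chosen for illustration, not settings taken from either tool.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, backoff_seconds=0.1):
    """Run cmd and restart it whenever it exits with a non-zero code.

    Returns the number of restarts performed. Stops when the process
    exits cleanly or when max_restarts is exhausted.
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(cmd).returncode
        if exit_code == 0:
            return restarts  # clean exit: nothing to recover from
        if restarts >= max_restarts:
            return restarts  # give up rather than crash-loop forever
        restarts += 1
        time.sleep(backoff_seconds * restarts)  # wait longer after each crash


# Hypothetical always-crashing child, used only to exercise the loop:
# supervise([sys.executable, "-c", "import sys; sys.exit(1)"], max_restarts=2)
```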
---
## 2. Monitoring Principles
### What to Monitor
| Category | Key Metrics |
|----------|-------------|
| **Availability** | Uptime, health checks |
| **Performance** | Response time, throughput |
| **Errors** | Error rate, types |
| **Resources** | CPU, memory, disk |
### Alert Severity Strategy
| Level | Response |
|-------|----------|
| **Critical** | Immediate action |
| **Warning** | Investigate soon |
| **Info** | Review daily |
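
One way to make the severity levels concrete is a pair of thresholds per metric. The sketch below assumes hypothetical metric names and threshold values; the right numbers depend on your own baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Hypothetical thresholds; tune them against your own baseline.
THRESHOLDS = {
    "error_rate_pct": Threshold(warning=1.0, critical=5.0),
    "disk_used_pct": Threshold(warning=80.0, critical=95.0),
}


def classify(metric: str, value: float) -> str:
    """Map a metric reading to one of the three severity levels."""
    threshold = THRESHOLDS[metric]
    if value >= threshold.critical:
        return "critical"  # immediate action
    if value >= threshold.warning:
        return "warning"  # investigate soon
    return "info"  # review daily
```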
### Monitoring Tool Selection
| Need | Options |
|------|---------|
| Simple/Free | PM2 metrics, htop |
| Full observability | Grafana, Datadog |
| Error tracking | Sentry |
| Uptime | UptimeRobot, Pingdom |
---
## 3. Log Management Principles
### Log Strategy
| Log Type | Purpose |
|----------|---------|
| **Application logs** | Debug, audit |
| **Access logs** | Traffic analysis |
| **Error logs** | Issue detection |
### Log Principles
1. **Rotate logs** to prevent disk fill
2. **Structured logging** (JSON) for parsing
3. **Appropriate levels** (error/warn/info/debug)
4. **No sensitive data** in logs
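
The four principles above can be combined in a few lines with Python's standard `logging` module. This is a minimal sketch: the `SENSITIVE_KEYS` list, the `context` attribute name, and the size limits are assumptions for illustration. Callers would pass fields with `logger.info("login ok", extra={"context": {"user": "ada"}})`.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

# Hypothetical list of field names that must never reach the log file.
SENSITIVE_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "context" is an assumed convention: a dict passed via extra=.
        for key, value in getattr(record, "context", {}).items():
            redact = key.lower() in SENSITIVE_KEYS
            payload[key] = "[REDACTED]" if redact else value
        return json.dumps(payload)


def build_logger(path, max_bytes=1_000_000, backups=5):
    """Create a logger that rotates its file before it can fill the disk."""
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)  # drop debug noise in production
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```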
---
## 4. Scaling Decisions
### When to Scale
| Symptom | Solution |
|---------|----------|
| High CPU | Add instances (horizontal) |
| High memory | Increase RAM or fix leak |
| Slow response | Profile first, then scale |
| Traffic spikes | Auto-scaling |
### Scaling Strategy
| Type | When to Use |
|------|-------------|
| **Vertical** | Quick fix, single instance |
| **Horizontal** | Sustainable, distributed |
| **Auto** | Variable traffic |
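
The symptom-to-solution mapping can be written down as a small decision function. The `Snapshot` fields and every limit below are hypothetical; the point of the sketch is the ordering, in particular that slow responses with no resource pressure lead to profiling rather than to more hardware.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    cpu_pct: float
    mem_pct: float
    p95_latency_ms: float
    peak_to_avg_traffic: float  # ratio of peak to average request rate


def recommend(s, cpu_limit=80.0, mem_limit=85.0,
              latency_limit_ms=500.0, spike_ratio=3.0):
    """Return the list of suggested actions for one metrics snapshot."""
    actions = []
    if s.peak_to_avg_traffic >= spike_ratio:
        actions.append("enable auto-scaling")
    if s.cpu_pct >= cpu_limit:
        actions.append("add instances (horizontal)")
    if s.mem_pct >= mem_limit:
        actions.append("increase RAM or fix leak")
    # Slow with no obvious resource pressure: measure before spending money.
    if s.p95_latency_ms >= latency_limit_ms and not actions:
        actions.append("profile first, then scale")
    return actions
```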
---
## 5. Health Check Principles
### What Constitutes Healthy
| Check | Meaning |
|-------|---------|
| **HTTP 200** | Service responding |
| **Database connected** | Data accessible |
| **Dependencies OK** | External services reachable |
| **Resources OK** | CPU/memory not exhausted |
### Health Check Implementation
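- Simple: return 200 whenever the process is up; cheap enough to poll frequently
- Deep: check every dependency (database, external services) before reporting healthy
- Choose based on what the load balancer needs: a simple check detects dead instances, a deep check keeps traffic away from instances that cannot serve requests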
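
The difference between a simple and a deep check can be sketched framework-free as two functions that return a status code and a body. The names `liveness` and `readiness` and the shape of the response body are assumptions; each dependency check is a hypothetical callable that returns `True` when the dependency is reachable.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, dict]:
    """Simple check: the process is up and can answer."""
    return 200, {"status": "ok"}


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Deep check: run every dependency check and report each result."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a check that raises counts as unhealthy
    healthy = all(results.values())
    status = "ok" if healthy else "degraded"
    return (200 if healthy else 503), {"status": status, "checks": results}
```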
---
## 6. Security Principles
| Area | Principle |
|------|-----------|
| **Access** | SSH keys only, no passwords |
| **Firewall** | Only needed ports open |
| **Updates** | Regular security patches |
| **Secrets** | Environment vars or a secrets manager, never files committed to the repo |
| **Audit** | Log access and changes |
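
For the secrets principle, a small fail-fast helper makes a missing environment variable obvious at startup rather than at first use. The helper name `require_secret` and the variable name `DB_PASSWORD` are hypothetical; note that the error message names the variable but never echoes a value.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised at startup when a required secret is not configured."""


def require_secret(name, environ=os.environ):
    """Return the secret stored under name, or fail fast if it is absent."""
    value = environ.get(name, "")
    if not value:
        raise MissingSecretError(f"required secret {name} is not set")
    return value


# Hypothetical usage at application startup:
# db_password = require_secret("DB_PASSWORD")
```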
---
## 7. Troubleshooting Priority
When something's wrong:
1. **Check if running** (process status)
2. **Check logs** (error messages)
3. **Check resources** (disk, memory, CPU)
4. **Check network** (ports, DNS)
5. **Check dependencies** (database, APIs)
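
The priority order above amounts to "run the cheapest checks first and stop at the first failure". A minimal sketch, assuming each step is a hypothetical callable that returns `True` when that layer looks fine; `disk_has_space` and `port_open` are two example checks built on the standard library.

```python
import shutil
import socket


def disk_has_space(path=".", min_free_ratio=0.10):
    """Resource check: at least min_free_ratio of the disk is free."""
    usage = shutil.disk_usage(path)
    return usage.total > 0 and usage.free / usage.total >= min_free_ratio


def port_open(host, port, timeout=1.0):
    """Network check: a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_failure(steps):
    """Run (name, check) pairs in order; return the first failing name."""
    for name, check in steps:
        if not check():
            return name
    return None  # every layer passed


# Hypothetical wiring, in the order listed above:
# first_failure([("resources", disk_has_space),
#                ("network", lambda: port_open("127.0.0.1", 5432))])
```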
---
## 8. Anti-Patterns
| ❌ Don't | ✅ Do |
|----------|-------|
| Run as root | Use non-root user |
| Ignore logs | Review logs and set up log rotation |
| Skip monitoring | Monitor from day one |
| Manual restarts | Auto-restart config |
| No backups | Regular backup schedule |
---
> **Remember:** A well-managed server is boring. That's the goal.
This skill teaches server management principles and decision-making for production environments, emphasizing thinking over memorizing commands. It covers process management, monitoring, logging, scaling, health checks, security, and troubleshooting priorities to help you design reliable, maintainable infrastructure.
The skill explains how to choose the right tools and strategies for different runtimes (systemd, PM2, containers, orchestration) and maps goals like auto-recovery, zero-downtime reloads, and clustering to concrete approaches. It describes what to monitor, how to structure logs, when and how to scale, how to implement health checks, and the security controls and troubleshooting steps that keep servers stable.
**How do I decide between vertical and horizontal scaling?**
Profile the bottleneck first. Adding RAM or CPU (vertical) is a quick fix for a single instance; favor horizontal scaling for resilience and sustained growth.

**What should a basic monitoring alert policy include?**
Define severity levels: critical (immediate action), warning (investigate soon), and info (review routinely). Monitor uptime, response time, error rate, and resource saturation.