This skill helps you apply server management principles to plan monitoring, scaling, and reliability decisions, with an emphasis on reasoning rather than memorizing commands.
To add this skill to your agents, run `npx playbooks add skill vudovn/antigravity-kit --skill server-management`.

---
name: server-management
description: Server management principles and decision-making. Process management, monitoring strategy, and scaling decisions. Teaches thinking, not commands.
allowed-tools: Read, Write, Edit, Glob, Grep, Bash
---
# Server Management
> Server management principles for production operations.
> **Learn to THINK, not memorize commands.**
---
## 1. Process Management Principles
### Tool Selection
| Scenario | Tool |
|----------|------|
| **Node.js app** | PM2 (clustering, reload) |
| **Any app** | systemd (Linux native) |
| **Containers** | Docker/Podman |
| **Orchestration** | Kubernetes, Docker Swarm |
### Process Management Goals
| Goal | What It Means |
|------|---------------|
| **Restart on crash** | Auto-recovery |
| **Zero-downtime reload** | No service interruption |
| **Clustering** | Use all CPU cores |
| **Persistence** | Survive server reboot |
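
The "restart on crash" goal can be illustrated with a minimal supervisor loop. This is only a sketch of the idea, not how PM2 or systemd implement it; the names `supervise` and `backoff_delay`, the restart limit, and the backoff values are assumptions chosen for illustration. In production, prefer the tools in the table above, which also handle persistence across reboots.

```python
import subprocess
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def supervise(cmd, max_restarts: int = 5, base: float = 1.0, sleep=time.sleep) -> int:
    """Run `cmd` and restart it whenever it exits with a non-zero code.

    Returns the number of restarts performed. The restart limit is an
    illustrative assumption to avoid an endless crash loop.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)
        if exit_code == 0:
            return restarts  # clean exit: nothing to recover from
        if restarts >= max_restarts:
            return restarts  # give up instead of crash-looping forever
        sleep(backoff_delay(restarts, base))  # wait longer after each crash
        restarts += 1
```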
---
## 2. Monitoring Principles
### What to Monitor
| Category | Key Metrics |
|----------|-------------|
| **Availability** | Uptime, health checks |
| **Performance** | Response time, throughput |
| **Errors** | Error rate, types |
| **Resources** | CPU, memory, disk |
### Alert Severity Strategy
| Level | Response |
|-------|----------|
| **Critical** | Immediate action |
| **Warning** | Investigate soon |
| **Info** | Review daily |
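
As a rough sketch of the severity strategy above, a metric value can be mapped to a level by comparing it against two thresholds. The `Thresholds` class, the `classify` function, and the example error-rate numbers are hypothetical; real thresholds depend on the service and its history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    warning: float   # at or above this value: investigate soon
    critical: float  # at or above this value: act immediately


# Response per level, taken from the table above.
RESPONSE = {
    "critical": "Immediate action",
    "warning": "Investigate soon",
    "info": "Review daily",
}


def classify(value: float, thresholds: Thresholds) -> str:
    """Return the alert level for a metric where higher values are worse."""
    if value >= thresholds.critical:
        return "critical"
    if value >= thresholds.warning:
        return "warning"
    return "info"


# Example with assumed thresholds: error rate as a fraction of requests.
error_rate = Thresholds(warning=0.01, critical=0.05)
level = classify(0.02, error_rate)
print(level, "->", RESPONSE[level])
```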
### Monitoring Tool Selection
| Need | Options |
|------|---------|
| Simple/Free | PM2 metrics, htop |
| Full observability | Grafana, Datadog |
| Error tracking | Sentry |
| Uptime | UptimeRobot, Pingdom |
---
## 3. Log Management Principles
### Log Strategy
| Log Type | Purpose |
|----------|---------|
| **Application logs** | Debug, audit |
| **Access logs** | Traffic analysis |
| **Error logs** | Issue detection |
### Log Principles
1. **Rotate logs** to prevent disk fill
2. **Structured logging** (JSON) for parsing
3. **Appropriate levels** (error/warn/info/debug)
4. **No sensitive data** in logs
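
The four principles above can be combined in one small setup using Python's standard `logging` module. This is a minimal sketch: the logger name `app`, the size limits, the `context` field, and the `SENSITIVE_KEYS` set are assumptions for illustration, and key-based redaction only catches fields you already know about.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

# Hypothetical list of context keys that must never reach the log file.
SENSITIVE_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        context = getattr(record, "context", {})
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "context": {
                key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
                for key, value in context.items()
            },
        }
        return json.dumps(payload, default=str)


def build_logger(path: str, level: int = logging.INFO) -> logging.Logger:
    """Create a logger that writes JSON lines to a size-rotated file."""
    logger = logging.getLogger("app")
    logger.setLevel(level)  # appropriate level: debug is dropped by default
    handler = RotatingFileHandler(path, maxBytes=1_000_000, backupCount=5)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.propagate = False
    return logger
```

Usage: `build_logger("app.log").info("login", extra={"context": {"user": "alice", "password": "hunter2"}})` writes a single JSON line in which the `password` value is replaced by `[REDACTED]`.
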
---
## 4. Scaling Decisions
### When to Scale
| Symptom | Solution |
|---------|----------|
| High CPU | Add instances (horizontal) |
| High memory | Increase RAM or fix leak |
| Slow response | Profile first, then scale |
| Traffic spikes | Auto-scaling |
### Scaling Strategy
| Type | When to Use |
|------|-------------|
| **Vertical** | Quick fix, single instance |
| **Horizontal** | Sustainable, distributed |
| **Auto** | Variable traffic |
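
To make the auto-scaling row concrete, here is a sketch of a proportional scaling rule: scale the instance count by the ratio of observed load to target load, then clamp it. The Kubernetes Horizontal Pod Autoscaler documents a rule of this shape, but the function name, parameters, and limits below are assumptions for illustration, and a real autoscaler also adds tolerances and cooldown periods.

```python
import math


def desired_replicas(current: int, observed_pct: int, target_pct: int,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Return the instance count that would bring utilization near the target.

    observed_pct and target_pct are average utilization percentages
    (for example CPU), with target_pct > 0.
    """
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    wanted = math.ceil(current * observed_pct / target_pct)
    # Clamp so a metric glitch cannot scale to zero or without bound.
    return max(min_replicas, min(max_replicas, wanted))


# 4 instances at 90% CPU with a 60% target -> scale out to 6.
print(desired_replicas(4, 90, 60))
```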
---
## 5. Health Check Principles
### What Constitutes Healthy
| Check | Meaning |
|-------|---------|
| **HTTP 200** | Service responding |
| **Database connected** | Data accessible |
| **Dependencies OK** | External services reachable |
| **Resources OK** | CPU/memory not exhausted |
### Health Check Implementation
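- Simple: Return 200 as soon as the process can answer requests
- Deep: Check all dependencies (database, external services) before reporting healthy
- Choose based on what the load balancer needs: a simple check confirms the process is up, while a deep check keeps traffic away from an instance whose dependencies are down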
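
The difference between the simple and deep variants can be sketched as two functions that return an HTTP status and a body. The function names, the `degraded` label, and the use of 503 for an unhealthy result are assumptions for illustration; wiring them into a web framework is left out.

```python
from typing import Callable, Dict, Tuple


def shallow_health() -> Tuple[int, dict]:
    """Simple check: the process is up and able to answer."""
    return 200, {"status": "ok"}


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Deep check: run every dependency probe and report each result."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = "ok" if probe() else "fail"
        except Exception:
            # A probe that raises is treated the same as one that returns False.
            results[name] = "fail"
    healthy = all(state == "ok" for state in results.values())
    status = 200 if healthy else 503
    return status, {"status": "ok" if healthy else "degraded", "checks": results}
```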
---
## 6. Security Principles
| Area | Principle |
|------|-----------|
| **Access** | SSH keys only, no passwords |
| **Firewall** | Only needed ports open |
| **Updates** | Regular security patches |
| **Secrets** | Environment vars, not hardcoded values or committed files |
| **Audit** | Log access and changes |
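
For the secrets row, a minimal sketch of reading a secret from the environment and failing fast when it is missing. The helper name `require_secret` and the variable name `DATABASE_URL` are hypothetical; note that the error message names the variable but never includes its value.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def require_secret(name: str, environ=os.environ) -> str:
    """Return the secret stored in environment variable `name`.

    Fails at startup rather than later with a confusing error.
    """
    value = environ.get(name, "")
    if not value:
        raise MissingSecretError(f"required secret {name} is not set")
    return value


# Example (hypothetical variable name):
# database_url = require_secret("DATABASE_URL")
```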
---
## 7. Troubleshooting Priority
When something's wrong:
1. **Check if running** (process status)
2. **Check logs** (error messages)
3. **Check resources** (disk, memory, CPU)
4. **Check network** (ports, DNS)
5. **Check dependencies** (database, APIs)
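
The priority order above can be sketched as a list of checks that runs in sequence and stops at the first failure. The helper names, the 90% disk threshold, and the example host and port are assumptions for illustration; the process and log checks are left as placeholders because they depend on how the service is run.

```python
import shutil
import socket
from typing import Callable, List, Optional, Tuple


def disk_ok(path: str = "/", max_used_ratio: float = 0.9) -> bool:
    """Resource check: True if the filesystem holding `path` is below the limit."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total < max_used_ratio


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Network or dependency check: True if a TCP connection succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_failure(steps: List[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run checks in priority order; return the first failing step, or None."""
    for name, check in steps:
        if not check():
            return name
    return None


# Example wiring (hypothetical host and port; earlier steps are placeholders):
# steps = [
#     ("process", lambda: True),
#     ("logs", lambda: True),
#     ("resources", disk_ok),
#     ("network", lambda: port_open("127.0.0.1", 443)),
#     ("dependencies", lambda: port_open("db.internal", 5432)),
# ]
# print(first_failure(steps))
```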
---
## 8. Anti-Patterns
| ❌ Don't | ✅ Do |
|----------|-------|
| Run as root | Use non-root user |
| Ignore logs | Review logs and set up log rotation |
| Skip monitoring | Monitor from day one |
| Manual restarts | Auto-restart config |
| No backups | Regular backup schedule |
---
> **Remember:** A well-managed server is boring. That's the goal.
This skill teaches server management principles and decision-making for production operations. It covers process management, monitoring, logging, scaling, health checks, security, and troubleshooting, emphasizing reasoning over specific commands. The goal is to make systems reliable, observable, and maintainable by learning patterns for choosing tools and responding to incidents.

The skill explains what to inspect and why: process status, resource usage, logs, health endpoints, and external dependencies. It maps symptoms to appropriate actions (when to restart, scale, profile, or patch) and recommends monitoring and alerting strategies by severity. It also gives guidance on log management: structured logging, rotation, and keeping sensitive data out of logs. Security principles and anti-patterns are highlighted to prevent common operational mistakes.

**What should I check first during an incident?**

Start with process status, then check recent logs, resource usage (CPU/memory/disk), network connectivity, and external dependencies, in that order.

**When should I use auto-scaling vs manual scaling?**

Use auto-scaling for variable traffic patterns and spikes; use manual/vertical scaling for quick fixes or when stateful constraints prevent horizontal scaling.