
server-management skill


This skill helps you think through server management decisions, balancing process control, monitoring, scaling, and security to run reliable systems.

This is most likely a fork of the server-management skill from vudovn

npx playbooks add skill sickn33/antigravity-awesome-skills --skill server-management


---
name: server-management
description: Server management principles and decision-making. Process management, monitoring strategy, and scaling decisions. Teaches thinking, not commands.
allowed-tools: Read, Write, Edit, Glob, Grep, Bash
---

# Server Management

> Server management principles for production operations.
> **Learn to THINK, not memorize commands.**

---

## 1. Process Management Principles

### Tool Selection

| Scenario | Tool |
|----------|------|
| **Node.js app** | PM2 (clustering, reload) |
| **Any app** | systemd (Linux native) |
| **Containers** | Docker/Podman |
| **Orchestration** | Kubernetes, Docker Swarm |

### Process Management Goals

| Goal | What It Means |
|------|---------------|
| **Restart on crash** | Auto-recovery |
| **Zero-downtime reload** | No service interruption |
| **Clustering** | Use all CPU cores |
| **Persistence** | Survive server reboot |
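
To make the "restart on crash" goal concrete, here is a minimal sketch of a supervisor loop with exponential backoff. The function name `supervise` and its parameters are hypothetical, chosen for illustration; in production this job normally belongs to systemd, PM2, or a container runtime rather than hand-written code.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, base_delay=0.1):
    """Run cmd and restart it on a non-zero exit, backing off between tries.

    Returns the number of restarts that were needed before a clean exit.
    Hypothetical helper for illustration only.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)
        if exit_code == 0:
            return restarts  # clean exit: nothing to recover
        if restarts >= max_restarts:
            raise RuntimeError(
                f"gave up after {restarts} restarts (exit code {exit_code})"
            )
        # Exponential backoff so a broken process does not spin in a crash loop.
        time.sleep(base_delay * (2 ** restarts))
        restarts += 1


if __name__ == "__main__":
    # A worker that exits cleanly needs no restarts.
    print(supervise([sys.executable, "-c", "print('worker ran')"]))
```

The restart cap matters as much as the restart itself: unlimited restarts hide a crash loop instead of surfacing it.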

---

## 2. Monitoring Principles

### What to Monitor

| Category | Key Metrics |
|----------|-------------|
| **Availability** | Uptime, health checks |
| **Performance** | Response time, throughput |
| **Errors** | Error rate, types |
| **Resources** | CPU, memory, disk |

### Alert Severity Strategy

| Level | Response |
|-------|----------|
| **Critical** | Immediate action |
| **Warning** | Investigate soon |
| **Info** | Review daily |
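
The severity levels above can be sketched as a small classifier. The metric names and threshold values below are assumptions made for illustration, not recommendations; real thresholds should come from each service's own baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Hypothetical metrics and limits; tune these per service.
THRESHOLDS = {
    "error_rate_pct": Threshold(warning=1.0, critical=5.0),
    "disk_used_pct": Threshold(warning=80.0, critical=95.0),
}


def classify(metric: str, value: float) -> str:
    """Map a metric reading to one of the three severity levels."""
    limits = THRESHOLDS[metric]
    if value >= limits.critical:
        return "critical"  # immediate action
    if value >= limits.warning:
        return "warning"  # investigate soon
    return "info"  # review daily
```

Checking the critical limit first keeps the levels mutually exclusive, so each reading triggers exactly one response.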

### Monitoring Tool Selection

| Need | Options |
|------|---------|
| Simple/Free | PM2 metrics, htop |
| Full observability | Prometheus + Grafana, Datadog |
| Error tracking | Sentry |
| Uptime | UptimeRobot, Pingdom |

---

## 3. Log Management Principles

### Log Strategy

| Log Type | Purpose |
|----------|---------|
| **Application logs** | Debug, audit |
| **Access logs** | Traffic analysis |
| **Error logs** | Issue detection |

### Log Principles

1. **Rotate logs** to prevent disk fill
2. **Structured logging** (JSON) for parsing
3. **Appropriate levels** (error/warn/info/debug)
4. **No sensitive data** in logs
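
One possible way to combine these four principles using only Python's standard `logging` module is sketched below. The `JsonFormatter` class, the `context` attribute, the `build_logger` helper, and the list of sensitive keys are all hypothetical names chosen for this example.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

# Illustrative list; extend it to match the fields your application handles.
SENSITIVE_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, redacting secrets."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured fields passed via extra={"context": {...}}.
        for key, value in getattr(record, "context", {}).items():
            redact = key.lower() in SENSITIVE_KEYS
            payload[key] = "[REDACTED]" if redact else value
        return json.dumps(payload)


def build_logger(path: str) -> logging.Logger:
    """Create a logger that rotates at 10 MB and keeps 5 old files."""
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)  # debug messages are dropped at this level
    handler = RotatingFileHandler(path, maxBytes=10 * 1024 * 1024, backupCount=5)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

A call such as `logger.warning("login failed", extra={"context": {"user": "alice", "password": "hunter2"}})` would then write the user name but replace the password with `[REDACTED]`.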

---

## 4. Scaling Decisions

### When to Scale

| Symptom | Solution |
|---------|----------|
| High CPU | Add instances (horizontal) |
| High memory | Increase RAM or fix leak |
| Slow response | Profile first, then scale |
| Traffic spikes | Auto-scaling |

### Scaling Strategy

| Type | When to Use |
|------|-------------|
| **Vertical** | Quick fix, single instance |
| **Horizontal** | Sustainable, distributed |
| **Auto** | Variable traffic |
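
The symptom-to-solution mapping above can be written down as a decision function. This is a sketch only: the function name `recommend` and every numeric threshold are assumptions for illustration, and the order of the checks is one reasonable choice among several.

```python
def recommend(cpu_pct: float, mem_pct: float, p95_ms: float,
              traffic_variable: bool = False) -> str:
    """Suggest a scaling action from coarse symptoms (hypothetical thresholds)."""
    if mem_pct >= 90:
        # Memory pressure is per-instance; more instances will not fix a leak.
        return "increase RAM or fix leak"
    if cpu_pct >= 80:
        if traffic_variable:
            return "enable auto-scaling"
        return "add instances (horizontal)"
    if p95_ms >= 500:
        # Slow responses without resource pressure point at the code or a dependency.
        return "profile first, then scale"
    return "no action"
```

Note that slow responses alone never trigger scaling here, which mirrors the "profile first" rule in the table.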

---

## 5. Health Check Principles

### What Constitutes Healthy

| Check | Meaning |
|-------|---------|
| **HTTP 200** | Service responding |
| **Database connected** | Data accessible |
| **Dependencies OK** | External services reachable |
| **Resources OK** | CPU/memory not exhausted |

### Health Check Implementation

- Simple: Just return 200
- Deep: Check all dependencies
- Choose based on load balancer needs
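
The difference between the two styles can be sketched as two framework-independent functions that return a status code and a body. The names `shallow_health` and `deep_health`, and the shape of the response body, are assumptions for this example; wiring them into an HTTP route is left to whatever framework is in use.

```python
from typing import Callable, Dict, Tuple


def shallow_health() -> Tuple[int, dict]:
    """Simple check: the process is up and able to answer."""
    return 200, {"status": "ok"}


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Deep check: run every dependency probe and report each result."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises counts as a failed dependency.
            results[name] = False
    healthy = all(results.values())
    status = 200 if healthy else 503
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return status, body
```

For example, `deep_health({"database": lambda: True, "payments_api": lambda: False})` returns 503 and names the failing dependency. A deep check is slower and can take an instance out of rotation when a shared dependency is down, which is why the choice depends on how the load balancer uses the result.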

---

## 6. Security Principles

| Area | Principle |
|------|-----------|
| **Access** | SSH keys only, no passwords |
| **Firewall** | Only needed ports open |
| **Updates** | Regular security patches |
| **Secrets** | Environment variables, not files committed to the repo |
| **Audit** | Log access and changes |
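
As one illustration of the secrets principle, a service can read required secrets from the environment and fail fast at startup when one is missing. The helper name `require_secret` and the variable name `DATABASE_URL` are hypothetical.

```python
import os
from typing import Mapping


def require_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return a secret from the environment, or fail fast if it is unset."""
    value = env.get(name)
    if not value:
        # Report only the variable name, never the value, in errors or logs.
        raise RuntimeError(f"missing required secret: {name}")
    return value
```

Calling `require_secret("DATABASE_URL")` during startup turns a missing secret into an immediate, obvious failure instead of a runtime error on the first request.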

---

## 7. Troubleshooting Priority

When something's wrong:

1. **Check if running** (process status)
2. **Check logs** (error messages)
3. **Check resources** (disk, memory, CPU)
4. **Check network** (ports, DNS)
5. **Check dependencies** (database, APIs)
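
The ordering above can be sketched as a runner that stops at the first failing check, so the cheapest and most likely causes are ruled out first. The names `first_failure` and `disk_has_space` are hypothetical, and the individual checks are placeholders to be replaced with real probes.

```python
import shutil
from typing import Callable, List, Optional, Tuple


def first_failure(checks: List[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run checks in priority order and return the name of the first that fails."""
    for name, check in checks:
        if not check():
            return name
    return None  # everything passed


def disk_has_space(path: str = "/", min_free_pct: float = 10.0) -> bool:
    """Example resource check: is enough of the disk at `path` still free?"""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100 >= min_free_pct
```

A call might pass `[("process", is_running), ("logs", logs_clean), ("resources", disk_has_space), ...]`, where `is_running` and `logs_clean` are probes you would supply yourself.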

---

## 8. Anti-Patterns

| ❌ Don't | ✅ Do |
|----------|-------|
| Run as root | Use non-root user |
| Ignore logs | Review logs and set up rotation |
| Skip monitoring | Monitor from day one |
| Manual restarts | Auto-restart config |
| No backups | Regular backup schedule |

---

> **Remember:** A well-managed server is boring. That's the goal.

Overview

This skill teaches server management principles and decision-making for production environments, emphasizing thinking over memorizing commands. It covers process management, monitoring, logging, scaling, health checks, security, and troubleshooting priorities to help you design reliable, maintainable infrastructure.

How this skill works

The skill explains how to choose the right tools and strategies for different runtimes (systemd, PM2, containers, orchestration) and maps goals like auto-recovery, zero-downtime reloads, and clustering to concrete approaches. It describes what to monitor, how to structure logs, when and how to scale, how to implement health checks, and the security controls and troubleshooting steps that keep servers stable.

When to use it

- Designing or revising production process management for an app
- Setting up a monitoring and alerting strategy for new services
- Planning scaling decisions after performance profiling
- Implementing log rotation and structured logging for observability
- Hardening server security and access controls

Best practices

- Pick the right tool for the scenario: PM2 for Node.js clustering, systemd for system services, containers for portability, orchestration for scale
- Monitor availability, performance, errors, and resources with severity-based alerts (critical, warning, info)
- Use structured (JSON) logs, rotate them, and avoid logging sensitive data
- Prefer horizontal scaling for long-term capacity; use vertical only as a short-term fix
- Require SSH keys, keep firewalls minimal, apply security updates, and store secrets in environment variables
- Follow a clear troubleshooting sequence: process → logs → resources → network → dependencies

Example use cases

- Configure a Node.js service for zero-downtime deploys with PM2 clustering and graceful reloads
- Design an observability stack using Prometheus + Grafana for metrics and Sentry for errors
- Implement log rotation and JSON logging to feed a log aggregation pipeline
- Choose autoscaling policies after profiling response time and throughput under load
- Create health checks that satisfy load balancer requirements: simple for speed, deep for readiness

FAQ

How do I decide between vertical and horizontal scaling?

Profile the bottleneck first. Adding RAM or CPU (vertical scaling) is a quick fix for a single instance; favor horizontal scaling for resilience and sustained growth.

What should a basic monitoring alert policy include?

Define severity levels: critical (immediate action), warning (investigate soon), info (review routinely); monitor uptime, response time, error rate, and resource saturation.