server-management skill

This skill helps you apply server management principles to monitoring, scaling, and reliability decisions. It focuses on reasoning about trade-offs rather than memorizing commands.

npx playbooks add skill vudovn/antigravity-kit --skill server-management

---
name: server-management
description: Server management principles and decision-making. Process management, monitoring strategy, and scaling decisions. Teaches thinking, not commands.
allowed-tools: Read, Write, Edit, Glob, Grep, Bash
---

# Server Management

> Server management principles for production operations.
> **Learn to THINK, not memorize commands.**

---

## 1. Process Management Principles

### Tool Selection

| Scenario | Tool |
|----------|------|
| **Node.js app** | PM2 (clustering, reload) |
| **Any app** | systemd (Linux native) |
| **Containers** | Docker/Podman |
| **Orchestration** | Kubernetes, Docker Swarm |

### Process Management Goals

| Goal | What It Means |
|------|---------------|
| **Restart on crash** | Auto-recovery |
| **Zero-downtime reload** | No service interruption |
| **Clustering** | Use all CPU cores |
| **Persistence** | Survive server reboot |
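
To make the "restart on crash" goal concrete, here is a minimal sketch of a supervisor loop with exponential backoff. It only illustrates the idea; in production the tools in the table above (systemd, PM2, a container runtime) already provide this. The function name `supervise` and its default limits are assumptions made for the example.

```python
import subprocess
import time


def supervise(cmd, max_restarts=5, base_delay=0.5, max_delay=30.0):
    """Run cmd, restarting it after a non-zero exit with exponential backoff.

    Returns the number of restarts that were needed before a clean exit.
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(cmd).returncode
        if exit_code == 0:
            return restarts  # clean exit: nothing to recover from
        if restarts >= max_restarts:
            raise RuntimeError(
                f"gave up after {restarts} restarts (last exit code {exit_code})"
            )
        # Back off so a crash loop does not hammer the host.
        time.sleep(min(max_delay, base_delay * 2 ** restarts))
        restarts += 1
```

The restart cap matters as much as the restart itself: a process that crashes on every start should surface as an alert, not loop forever.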

---

## 2. Monitoring Principles

### What to Monitor

| Category | Key Metrics |
|----------|-------------|
| **Availability** | Uptime, health checks |
| **Performance** | Response time, throughput |
| **Errors** | Error rate, types |
| **Resources** | CPU, memory, disk |

### Alert Severity Strategy

| Level | Response |
|-------|----------|
| **Critical** | Immediate action |
| **Warning** | Investigate soon |
| **Info** | Review daily |
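
The severity levels above can be sketched as threshold rules. The metric names and threshold values below are hypothetical; real values should come from the service's own baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str
    warning: float   # at or above this value: investigate soon
    critical: float  # at or above this value: act immediately


# Hypothetical thresholds, for illustration only.
RULES = [
    Rule("error_rate_pct", warning=1.0, critical=5.0),
    Rule("disk_used_pct", warning=80.0, critical=95.0),
]


def classify(rule, value):
    """Return the severity level for one metric value."""
    if value >= rule.critical:
        return "critical"
    if value >= rule.warning:
        return "warning"
    return "info"


def severities(samples):
    """Map {metric: value} to {metric: severity} for metrics that have a rule."""
    return {
        rule.metric: classify(rule, samples[rule.metric])
        for rule in RULES
        if rule.metric in samples
    }
```

For example, `severities({"error_rate_pct": 6.0, "disk_used_pct": 85.0})` classifies the error rate as critical and disk usage as a warning.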

### Monitoring Tool Selection

| Need | Options |
|------|---------|
| Simple/Free | PM2 metrics, htop |
| Full observability | Prometheus + Grafana, Datadog |
| Error tracking | Sentry |
| Uptime | UptimeRobot, Pingdom |

---

## 3. Log Management Principles

### Log Strategy

| Log Type | Purpose |
|----------|---------|
| **Application logs** | Debug, audit |
| **Access logs** | Traffic analysis |
| **Error logs** | Issue detection |

### Log Principles

1. **Rotate logs** to prevent disk fill
2. **Structured logging** (JSON) for parsing
3. **Appropriate levels** (error/warn/info/debug)
4. **No sensitive data** in logs
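
The four principles above can be combined in a small sketch built on Python's standard `logging` module. The `context` extra field, the set of sensitive key names, and the rotation sizes are assumptions chosen for the example, not a prescribed layout.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

# Assumed field names to redact; extend for your own payloads.
SENSITIVE_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields arrive via logger.info(..., extra={"context": {...}}).
        for key, value in getattr(record, "context", {}).items():
            entry[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(entry, default=str)


def build_logger(path, level=logging.INFO):
    """Create a logger that writes JSON lines and rotates at about 10 MB."""
    handler = RotatingFileHandler(path, maxBytes=10 * 1024 * 1024, backupCount=5)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.setLevel(level)
    logger.handlers = [handler]
    logger.propagate = False
    return logger
```

Calling `build_logger("app.log").info("login", extra={"context": {"user": "ada", "password": "..."}})` writes a single JSON line in which the `password` value is replaced by `[REDACTED]`.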

---

## 4. Scaling Decisions

### When to Scale

| Symptom | Solution |
|---------|----------|
| High CPU | Add instances (horizontal) |
| High memory | Increase RAM or fix leak |
| Slow response | Profile first, then scale |
| Traffic spikes | Auto-scaling |
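
For the "increase RAM or fix leak" row, one rough way to tell the two apart is the trend of memory usage over time: steady high usage suggests an undersized host, while usage that keeps climbing suggests a leak. The sketch below fits a least-squares slope to evenly spaced samples; the thresholds and the function names are assumptions for illustration.

```python
def slope(values):
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def memory_advice(samples_mb, limit_mb, leak_slope_mb=1.0, high_ratio=0.85):
    """Suggest a next step from memory samples; thresholds are illustrative."""
    if max(samples_mb) < high_ratio * limit_mb:
        return "ok"
    if slope(samples_mb) >= leak_slope_mb:
        return "suspect leak: profile before adding RAM"
    return "steady high usage: consider more RAM"
```

This is a heuristic, not a diagnosis: caches that warm up also grow, so a positive slope is a prompt to profile rather than proof of a leak.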

### Scaling Strategy

| Type | When to Use |
|------|-------------|
| **Vertical** | Quick fix, single instance |
| **Horizontal** | Sustainable, distributed |
| **Auto** | Variable traffic |
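
Auto-scaling usually reduces to a proportional rule: scale the instance count by the ratio of the observed metric to its target. The sketch below follows the same shape as the Kubernetes Horizontal Pod Autoscaler formula, `ceil(current * metric / target)`; the bounds and the tolerance band are assumed defaults for this example.

```python
import math


def desired_replicas(current, metric, target,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Return the replica count that would bring metric back toward target."""
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    ratio = metric / target
    # Ignore small deviations so the system does not flap.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

With 4 instances at 90% CPU against a 60% target, `desired_replicas(4, 90, 60)` asks for 6 instances. The upper bound protects against runaway cost when the metric itself is misbehaving.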

---

## 5. Health Check Principles

### What Constitutes Healthy

| Check | Meaning |
|-------|---------|
| **HTTP 200** | Service responding |
| **Database connected** | Data accessible |
| **Dependencies OK** | External services reachable |
| **Resources OK** | CPU/memory not exhausted |

### Health Check Implementation

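- Simple (liveness): return 200 whenever the process can answer requests
- Deep (readiness): verify the database and other dependencies before returning 200
- Choose based on how the load balancer uses the result: a simple check to decide on restarts, a deep check to decide on routing traffic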
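
A minimal sketch of the simple and deep checks described above, using only the Python standard library. The paths `/healthz` and `/readyz`, the `DEPENDENCY_CHECKS` registry, and the placeholder `check_database` are assumptions for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database():
    """Placeholder: replace with a real probe such as a SELECT 1 query."""
    return True


# Hypothetical registry of dependency checks used by the deep endpoint.
DEPENDENCY_CHECKS = {"database": check_database}


def readiness(checks):
    """Run every check; return (http_status, per-check report)."""
    report = {}
    for name, check in checks.items():
        try:
            report[name] = bool(check())
        except Exception:
            report[name] = False  # a check that raises counts as failed
    status = 200 if all(report.values()) else 503
    return status, report


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":    # simple: the process can answer
            status, body = 200, {"status": "ok"}
        elif self.path == "/readyz":   # deep: dependencies are reachable
            status, body = readiness(DEPENDENCY_CHECKS)
        else:
            status, body = 404, {"error": "not found"}
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking helper that serves both endpoints on localhost."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Keeping the two endpoints separate avoids a common failure mode: if the liveness check also probes the database, a database outage can cause every instance to be restarted at once.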

---

## 6. Security Principles

| Area | Principle |
|------|-----------|
| **Access** | SSH keys only, no passwords |
| **Firewall** | Only needed ports open |
| **Updates** | Regular security patches |
| **Secrets** | Environment vars or a secrets manager, not files committed to the repo |
| **Audit** | Log access and changes |
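
As one way to apply the secrets principle, an application can read its secrets from the environment at startup and fail fast when any are missing. The variable names `DATABASE_URL` and `API_TOKEN` are hypothetical.

```python
import os


class MissingSecret(RuntimeError):
    """Raised at startup when a required secret is absent."""


# Hypothetical names; list whatever the service actually needs.
REQUIRED = ("DATABASE_URL", "API_TOKEN")


def load_secrets(names=REQUIRED, environ=None):
    """Return {name: value} for every required secret, or raise."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        # Report names only; never echo secret values into errors or logs.
        raise MissingSecret("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Failing at startup turns a misconfiguration into an immediate, visible deploy failure instead of an error on the first request that needs the secret.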

---

## 7. Troubleshooting Priority

When something's wrong:

1. **Check if running** (process status)
2. **Check logs** (error messages)
3. **Check resources** (disk, memory, CPU)
4. **Check network** (ports, DNS)
5. **Check dependencies** (database, APIs)
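
The ordering above can be sketched as a runner that executes checks in sequence and stops at the first failure, so the cheapest and most likely causes are ruled out first. The helper names and the 90% disk threshold are assumptions; the process and log checks are left to the caller because they depend on the process manager in use.

```python
import shutil
import socket


def disk_ok(path="/", max_used_pct=90.0):
    """True if the filesystem holding path is below the usage threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100 < max_used_pct


def port_open(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_failure(steps):
    """Run (name, check) pairs in order; return the first failing name or None."""
    for name, check in steps:
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as failed
        if not ok:
            return name
    return None
```

A caller might pass `[("resources", disk_ok), ("network", lambda: port_open("127.0.0.1", 5432))]`, where port 5432 stands in for whatever dependency the service talks to.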

---

## 8. Anti-Patterns

| ❌ Don't | ✅ Do |
|----------|-------|
| Run as root | Use non-root user |
| Ignore logs | Review logs and set up log rotation |
| Skip monitoring | Monitor from day one |
| Manual restarts | Auto-restart config |
| No backups | Regular backup schedule |
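
The first row can be enforced in code as well as in configuration. This sketch refuses to start when the effective user is root; it assumes a POSIX host, since `os.geteuid` is not available on Windows.

```python
import os


def running_as_root(euid=None):
    """True if the effective user id is 0 (POSIX only)."""
    if euid is None:
        euid = os.geteuid()
    return euid == 0


def refuse_root(euid=None):
    """Exit early instead of serving traffic with root privileges."""
    if running_as_root(euid):
        raise SystemExit("refusing to run as root; start the service as a dedicated user")
```

Calling `refuse_root()` at the top of the entry point makes an accidental `sudo` start fail loudly.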

---

> **Remember:** A well-managed server is boring. That's the goal.

Overview

This skill teaches server management principles and decision-making for production operations. It covers process management, monitoring, logging, scaling, health checks, security, and troubleshooting, and it emphasizes reasoning over specific commands. The goal is to make systems reliable, observable, and maintainable by teaching patterns for choosing tools and responding to incidents.

How this skill works

The skill explains what to inspect and why: process status, resource usage, logs, health endpoints, and external dependencies. It maps symptoms to appropriate actions (when to restart, scale, profile, or patch) and recommends monitoring and alerting strategies by severity. It also gives guidance on log management: structured logging, rotation, and keeping sensitive data out of log output. Security principles and anti-patterns are highlighted to prevent common operational mistakes.

When to use it

- Designing or reviewing production deployment architectures
- Creating monitoring and alerting strategies for services
- Deciding when and how to scale applications
- Responding to incidents and prioritizing troubleshooting steps
- Establishing logging, rotation, and observability pipelines

Best practices

- Choose the right process manager for the context (systemd for system services, PM2 for Node apps, containers for ephemeral workloads)
- Monitor availability, performance, errors, and resource metrics; align alerts with severity and response expectations
- Use structured JSON logs, rotate files, and omit sensitive data
- Prefer horizontal scaling for sustainable growth; use vertical scaling only as a short-term fix
- Implement health checks appropriate to load balancer needs (simple for liveness, deep for readiness)
- Enforce security: SSH keys, minimal open ports, regular patches, and secure secret handling

Example use cases

- Set up a monitoring stack (Prometheus + Grafana) to track latency, error rates, and resource usage
- Design an auto-scaling policy that scales horizontally on request latency and queue depth
- Implement structured logging and log rotation to enable fast debugging and long-term analysis
- Create health endpoints that verify database and dependency connectivity for readiness probes
- Run a post-incident review that maps alerts and logs to root causes and preventive changes

FAQ

What should I check first during an incident?

Start with process status, then check recent logs, resource usage (CPU/memory/disk), network connectivity, and external dependencies in that order.

When should I use auto-scaling vs manual scaling?

Use auto-scaling for variable traffic patterns and spikes; use manual/vertical scaling for quick fixes or when stateful constraints prevent horizontal scaling.