devops-monitor skill

This skill monitors system status, analyzes logs, and checks health to help you detect issues and maintain reliable deployments.

npx playbooks add skill shaul1991/shaul-agents-plugin --skill devops-monitor

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: devops-monitor
description: DevOps Monitor Agent. Handles system monitoring, log analysis, and status checks. Use for requests about monitoring, status, logs, or alerts.
allowed-tools: Bash(docker:*), Bash(curl:*), Bash(cat:*), Bash(journalctl:*), Bash(ss:*), Bash(systemctl:*), Read, Grep
---

# DevOps Monitor Agent

## Role
Monitors system status and analyzes logs.

## Responsibilities

### 1. Container monitoring
```bash
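# Check container status
docker ps --filter "name=nest-api"

# Resource usage (docker stats has no --filter flag, so pass the matching container IDs)
docker stats --no-stream $(docker ps -q --filter "name=nest-api")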
```
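
The resource check above can feed a simple threshold filter. The following is a minimal sketch, not part of the skill itself: `flag_high_memory` is a hypothetical helper, and the 80 percent limit in the usage comment is an assumed value chosen only for illustration.

```bash
# flag_high_memory LIMIT_PERCENT
# Hypothetical helper: reads "NAME PERCENT%" lines on stdin and prints the
# names whose memory percentage exceeds LIMIT_PERCENT.
flag_high_memory() {
  awk -v limit="$1" '{
    pct = $2
    sub(/%$/, "", pct)            # strip the trailing percent sign
    if (pct + 0 > limit + 0) print $1
  }'
}

# Example (80 is an assumed threshold, not a value from this skill):
# docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' \
#   $(docker ps -q --filter "name=nest-api") | flag_high_memory 80
```

Keeping the filter separate from the `docker stats` call means it can be tried on saved output before being pointed at a live host.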

### 2. Log analysis
```bash
# Application logs (slot: blue or green; environment: dev or prod)
docker logs --tail 100 nest-api-blue-dev

# System logs
journalctl -u caddy -n 50
```

### 3. Health checks
```bash
# API health checks (production host; use dev-api-nest.shaul.link for dev)
curl -sf https://api-nest.shaul.link/health/live
curl -sf https://api-nest.shaul.link/health/ready
```

### 4. Network status
```bash
# Check listening ports (anchored on ":" so PIDs and other numbers do not match)
ss -tlnp | grep -E ":(3100|3101|3102|3103)\b"

# Docker networks
docker network ls --filter "name=nest-api"
```
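
When the port check is scripted, it can be more useful to list the ports that are not listening than to scan the matches by eye. The sketch below assumes the usual `ss -tln` column layout, with the local address and port in the fourth column. `missing_ports` is a hypothetical helper. In a blue/green setup some of these ports may be idle by design, so a reported port is a prompt to look, not proof of a fault.

```bash
# missing_ports PORT...
# Hypothetical helper: reads `ss -tln` output on stdin and prints each
# expected port that has no listening socket.
missing_ports() {
  local listening port
  # Take the text after the last ":" in the Local Address:Port column.
  listening="$(awk 'NR > 1 { n = split($4, a, ":"); print a[n] }')"
  for port in "$@"; do
    printf '%s\n' "$listening" | grep -qFx -- "$port" || echo "$port"
  done
}

# Example:
# ss -tln | missing_ports 3100 3101 3102 3103
```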

## Monitoring dashboard

### System status
| Item | Command |
|------|---------|
| Containers | `docker ps --filter "name=nest-api"` |
| Images | `docker images nest-api` |
| Volumes | `docker volume ls --filter "name=nest-api"` |
| Networks | `docker network ls --filter "name=nest-api"` |

### Service status
| Service | How to check |
|---------|--------------|
| Caddy | `systemctl status caddy` |
| Docker | `systemctl status docker` |
| API | `curl -sf https://api-nest.shaul.link/health/live` |

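## Alert criteria

| Level | Condition | Response |
|-------|-----------|----------|
| Critical | Health check fails | Roll back immediately |
| Warning | Response time > 2 s | Analyze the cause |
| Info | Normal operation | Keep monitoring |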
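
The table above can be expressed as a small decision function. This is a sketch under stated assumptions: `classify_alert` is a hypothetical name, the first argument is taken to be the exit code of a `curl -sf` health check, and the second is the measured response time in seconds.

```bash
# classify_alert HEALTH_EXIT_CODE RESPONSE_SECONDS
# Hypothetical helper: maps one health check result to an alert level.
classify_alert() {
  local exit_code="$1" seconds="$2"
  if [ "$exit_code" -ne 0 ]; then
    echo "Critical"   # health check failed: roll back immediately
  elif awk -v t="$seconds" 'BEGIN { exit !(t > 2) }'; then
    echo "Warning"    # response slower than 2 seconds: analyze the cause
  else
    echo "Info"       # normal: keep monitoring
  fi
}

# Example against the readiness endpoint:
# seconds="$(curl -sf -o /dev/null -w '%{time_total}' https://api-nest.shaul.link/health/ready)"
# classify_alert "$?" "$seconds"
```

The comparison runs in `awk` because the shell's `[ ... -gt ... ]` test only handles integers, while curl reports fractional seconds.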

## Log analysis guide

### Error patterns
```bash
# Filter error logs (replace blue/dev with the target slot and environment)
docker logs nest-api-blue-dev 2>&1 | grep -i error

# Warning logs
docker logs nest-api-blue-dev 2>&1 | grep -i warn
```

### Key things to check
1. Database connection errors
2. Redis connection errors
3. Out-of-memory conditions
4. Request timeouts
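
One way to triage against the four categories above is to count matching log lines per category. The sketch below is illustrative only: `count_known_errors` is a hypothetical helper, and every regular expression in it is a guess at what such messages might look like, not the service's actual log format. Adjust the patterns to the real messages before relying on the counts.

```bash
# count_known_errors
# Hypothetical helper: reads log text on stdin and prints "category=count"
# for each category. All patterns below are assumed, not taken from the service.
count_known_errors() {
  local log name pattern
  log="$(cat)"
  while IFS='=' read -r name pattern; do
    printf '%s=%s\n' "$name" \
      "$(printf '%s\n' "$log" | grep -Eic -- "$pattern")"
  done <<'EOF'
database=(database|postgres|mysql).*(error|refused|fail)
redis=redis.*(error|refused|fail)
memory=out of memory|OOMKilled
timeout=timeout|timed out|ETIMEDOUT
EOF
}

# Example:
# docker logs --tail 100 nest-api-blue-dev 2>&1 | count_known_errors
```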

Overview

This skill provides a focused DevOps monitoring agent for containerized services, logs, healthchecks, and alerting. It centralizes commands and checks to quickly assess container state, service health, network exposure, and common error patterns. Use it to triage incidents, confirm rollbacks, and maintain uptime SLAs.

How this skill works

The agent inspects Docker containers, images, volumes, and networks using targeted CLI commands and extracts recent logs for analysis. It runs healthcheck requests against live and ready endpoints, queries systemd service status for key services, and scans ports and network resources. Alert levels are mapped to concrete responses: immediate rollback for critical failures, investigation for warnings, and routine monitoring for informational states.
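
The sequence described above can be sketched as a single snapshot function. This is a minimal illustration and not the agent's actual implementation: `run_check` and `monitor_snapshot` are hypothetical names, only the systemd and health endpoint checks are covered, and the production host is assumed.

```bash
# run_check LABEL COMMAND...
# Hypothetical helper: runs one check quietly and reports ok or FAIL.
run_check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "ok   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# monitor_snapshot: runs every check even if an earlier one fails,
# then returns non-zero when at least one check failed.
monitor_snapshot() {
  local failures=0
  run_check "caddy service"  systemctl is-active --quiet caddy  || failures=$((failures + 1))
  run_check "docker service" systemctl is-active --quiet docker || failures=$((failures + 1))
  run_check "liveness"  curl -sf https://api-nest.shaul.link/health/live  || failures=$((failures + 1))
  run_check "readiness" curl -sf https://api-nest.shaul.link/health/ready || failures=$((failures + 1))
  echo "failures=$failures"
  [ "$failures" -eq 0 ]
}
```

Calling `monitor_snapshot` prints one line per check plus a failure count, so a failed health check stands out without hiding the state of the other services.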

When to use it

  • Triage production incidents involving containers or APIs
  • Validate deployment health after a release or rollback
  • Investigate error and warning patterns in application logs
  • Confirm network port bindings and Docker network configuration
  • Automate basic monitoring checks in runbooks or CI hooks

Best practices

  • Run container status and resource checks (docker ps, docker stats) before log analysis to scope issues
  • Use health endpoints (/health/live and /health/ready) as the primary availability signal
  • Filter logs for error and warn patterns before deep dives to reduce noise
  • Correlate systemd service status with container state for service-level failures
  • Apply alert thresholds: treat healthcheck failures as critical and response delays >2s as warnings

Example use cases

  • Check running containers and resource usage for the nest-api service with docker ps and docker stats
  • Fetch the last 100 lines of the application log and grep for error/warn to identify recent failures
  • Run curl against the health endpoints to decide whether to roll back or continue a deployment
  • Inspect ports with ss and confirm Docker network presence when troubleshooting connectivity issues
  • Query systemctl status for Caddy and Docker when external routing or container runtime issues surface

FAQ

What commands find recent application errors?

Use docker logs for the target container and pipe to grep -i error or grep -i warn to surface error and warning patterns.

How do I decide when to roll back?

Treat failed liveness/readiness checks as critical and initiate a rollback immediately; treat response delays over 2 seconds as warnings requiring root-cause analysis first.