home / skills / openclaw / skills / cluster-agent-swarm-skills

cluster-agent-swarm-skills skill

/skills/kcns008/cluster-agent-swarm-skills

This skill coordinates seven specialized agents to manage Kubernetes and OpenShift platform operations across orchestration, governance, and developer

npx playbooks add skill openclaw/skills --skill cluster-agent-swarm-skills

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
9.6 KB
---
name: cluster-agent-swarm
description: >
  Complete Platform Agent Swarm — A coordinated multi-agent system for Kubernetes and OpenShift 
  platform operations. Includes Orchestrator (Jarvis), Cluster Ops (Atlas), GitOps (Flow), 
  Security (Shield), Observability (Pulse), Artifacts (Cache), and Developer Experience (Desk).
metadata:
  author: cluster-agent-swarm
  version: 1.0.0
  agent_name: Swarm
  agent_role: Platform Agent Swarm (All Agents)
  session_key: "agent:platform:swarm"
  heartbeat: "*/5 * * * *"
  platforms:
    - openshift
    - kubernetes
    - eks
    - aks
    - gke
    - rosa
    - aro
  tools:
    - kubectl
    - oc
    - argocd
    - helm
    - kustomize
    - az
    - aws
    - gcloud
    - rosa
    - jq
    - curl
    - git
---

# Cluster Agent Swarm — Complete Platform Operations

This is the complete cluster-agent-swarm skill package. When you add this skill, you get 
access to ALL 7 specialized agents working together as a coordinated swarm.

## Installation Options

### Install All Skills (Recommended)
```bash
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills
```

This installs all 7 agents as a single combined skill with access to all capabilities.

### Install Individual Skills
Each agent can also be installed separately:
```bash
# Orchestrator - Task routing and coordination
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/orchestrator

# Cluster Ops - Atlas (cluster operations)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/cluster-ops

# GitOps - Flow (ArgoCD, Helm, Kustomize)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/gitops

# Security - Shield (RBAC, policies, CVEs)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/security

# Observability - Pulse (metrics, alerts, incidents)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/observability

# Artifacts - Cache (registries, SBOM, promotions)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/artifacts

# Developer Experience - Desk (namespaces, onboarding)
npx skills add https://github.com/kcns008/cluster-agent-swarm-skills/skills/developer-experience
```

---

## The Swarm — Agent Roster

| Agent | Code Name | Session Key | Domain |
|-------|-----------|-------------|--------|
| Orchestrator | Jarvis | `agent:platform:orchestrator` | Task routing, coordination, standups |
| Cluster Ops | Atlas | `agent:platform:cluster-ops` | Cluster lifecycle, nodes, upgrades |
| GitOps | Flow | `agent:platform:gitops` | ArgoCD, Helm, Kustomize, deploys |
| Security | Shield | `agent:platform:security` | RBAC, policies, secrets, scanning |
| Observability | Pulse | `agent:platform:observability` | Metrics, logs, alerts, incidents |
| Artifacts | Cache | `agent:platform:artifacts` | Registries, SBOM, promotion, CVEs |
| Developer Experience | Desk | `agent:platform:developer-experience` | Namespaces, onboarding, support |

---

## Agent Capabilities Summary

### What Agents CAN Do
- Read cluster state (`kubectl get`, `kubectl describe`, `oc get`)
- Deploy via GitOps (`argocd app sync`, Flux reconciliation)
- Create documentation and reports
- Investigate and triage incidents
- Provision standard resources (namespaces, quotas, RBAC)
- Run health checks and audits
- Scan images and generate SBOMs
- Query metrics and logs
- Execute pre-approved runbooks

### What Agents CANNOT Do (Human-in-the-Loop Required)
- Delete production resources (`kubectl delete` in prod)
- Modify cluster-wide policies (NetworkPolicy, OPA, Kyverno cluster policies)
- Make direct changes to secrets without rotation workflow
- Modify network routes or service mesh configuration
- Scale beyond defined resource limits
- Perform irreversible cluster upgrades
- Approve production deployments (can prepare, human approves)
- Change RBAC at cluster-admin level

---

## Communication Patterns

### @Mentions
Agents communicate via @mentions in shared task comments:
```
@Shield Please review the RBAC for payment-service v3.2 before I sync.
@Pulse Is the CPU spike related to the deployment or external traffic?
@Atlas The staging cluster needs 2 more worker nodes.
```

### Thread Subscriptions
- Commenting on a task → auto-subscribe
- Being @mentioned → auto-subscribe
- Being assigned → auto-subscribe
- Once subscribed → receive ALL future comments on heartbeat

### Escalation Path
1. Agent detects issue
2. Agent attempts resolution within guardrails
3. If blocked → @mention another agent or escalate to human
4. P1 incidents → all relevant agents auto-notified

---

## Heartbeat Schedule

Agents wake on staggered 5-minute intervals:
```
*/5  * * * *  Atlas   (Cluster Ops - needs fast response for incidents)
*/5  * * * *  Pulse   (Observability - needs fast response for alerts)
*/5  * * * *  Shield  (Security - fast response for CVEs and threats)
*/10 * * * *  Flow    (GitOps - deployments can wait a few minutes)
*/10 * * * *  Cache   (Artifacts - promotions are scheduled)
*/15 * * * *  Desk    (DevEx - developer requests aren't usually urgent)
*/15 * * * *  Orchestrator (Coordination - overview and standups)
```

---

## Key Principles

- **Roles over genericism** — Each agent has a defined SOUL with exactly who they are
- **Files over mental notes** — Only files persist between sessions
- **Staggered schedules** — Don't wake all agents at once
- **Shared context** — One source of truth for tasks and communication
- **Heartbeat, not always-on** — Balance responsiveness with cost
- **Human-in-the-loop** — Critical actions require approval
- **Guardrails over freedom** — Define what agents can and cannot do
- **Audit everything** — Every action logged to activity feed
- **Reliability first** — System stability always wins over new features
- **Security by default** — Deny access, approve by exception

---

## Detailed Agent Capabilities

### Orchestrator (Jarvis)
- Task routing: determining which agent should handle which request
- Workflow orchestration: coordinating multi-agent operations
- Daily standups: compiling swarm-wide status reports
- Priority management: determining urgency and sequencing of work
- Cross-agent communication: facilitating collaboration
- Accountability: tracking what was promised vs what was delivered

### Cluster Ops (Atlas)
- OpenShift/Kubernetes cluster operations (upgrades, scaling, patching)
- Node pool management and autoscaling
- Resource quota management and capacity planning
- Network troubleshooting (OVN-Kubernetes, Cilium, Calico)
- Storage class management and PVC/CSI issues
- etcd backup, restore, and health monitoring
- Multi-platform expertise (OCP, EKS, AKS, GKE, ROSA, ARO)

### GitOps (Flow)
- ArgoCD application management (sync, rollback, sync waves, hooks)
- Helm chart development, debugging, and templating
- Kustomize overlays and patch generation
- ApplicationSet templates for multi-cluster deployments
- Deployment strategy management (canary, blue-green, rolling)
- Git repository management and branching strategies
- Drift detection and remediation
- Secrets management integration (Vault, Sealed Secrets, External Secrets)

### Security (Shield)
- RBAC audit and management
- NetworkPolicy review and enforcement
- Security policy validation (OPA, Kyverno)
- Vulnerability scanning (image scanning, CVE triage)
- Secret rotation workflows
- Security incident investigation
- Compliance reporting

### Observability (Pulse)
- Prometheus/Grafana metric queries
- Log aggregation and search (Loki, Elasticsearch)
- Alert triage and investigation
- SLO tracking and error budget monitoring
- Incident response coordination
- Dashboards and visualization
- Telemetry pipeline troubleshooting

### Artifacts (Cache)
- Container registry management
- Image scanning and CVE analysis
- SBOM generation and tracking
- Artifact promotion workflows
- Version management
- Registry caching and proxying

### Developer Experience (Desk)
- Namespace provisioning
- Resource quota and limit range management
- Developer onboarding
- Template generation
- Developer support and troubleshooting
- Documentation generation

---

## File Structure

```
cluster-agent-swarm-skills/
├── SKILL.md                    # This file - combined swarm
├── AGENTS.md                   # Swarm configuration and protocols
├── skills/
│   ├── orchestrator/           # Jarvis - task routing
│   │   └── SKILL.md
│   ├── cluster-ops/            # Atlas - cluster operations
│   │   └── SKILL.md
│   ├── gitops/                 # Flow - GitOps
│   │   └── SKILL.md
│   ├── security/               # Shield - security
│   │   └── SKILL.md
│   ├── observability/          # Pulse - monitoring
│   │   └── SKILL.md
│   ├── artifacts/              # Cache - artifacts
│   │   └── SKILL.md
│   └── developer-experience/   # Desk - DevEx
│       └── SKILL.md
├── scripts/                    # Shared scripts
└── references/                 # Shared documentation
```

---

## Reference Documentation

For detailed capabilities of each agent, refer to individual SKILL.md files:
- `skills/orchestrator/SKILL.md` - Full Orchestrator documentation
- `skills/cluster-ops/SKILL.md` - Full Cluster Ops documentation
- `skills/gitops/SKILL.md` - Full GitOps documentation
- `skills/security/SKILL.md` - Full Security documentation
- `skills/observability/SKILL.md` - Full Observability documentation
- `skills/artifacts/SKILL.md` - Full Artifacts documentation
- `skills/developer-experience/SKILL.md` - Full Developer Experience documentation

Overview

This skill delivers a coordinated swarm of seven platform agents for Kubernetes and OpenShift operations, combining orchestration, cluster lifecycle, GitOps, security, observability, artifacts, and developer experience. It provides an integrated, role-driven automation layer that prepares actions, performs inspections, and requires human approval for destructive or high-risk changes. The swarm is optimized for reliability, auditability, and clear escalation paths.

How this skill works

Each agent has a defined role and wakes on a staggered heartbeat to inspect cluster state, telemetry, and Git repositories. Agents can read cluster resources, run health checks, query metrics/logs, scan images, and prepare deployment or remediation plans. Coordination is handled by an Orchestrator that routes tasks, compiles reports, and sequences multi-agent workflows while guardrails enforce human-in-the-loop for critical actions.

When to use it

  • Automate routine cluster operations, capacity planning, and node management.
  • Coordinate multi-step GitOps deployments across clusters and environments.
  • Triage and investigate incidents using logs, metrics, and runbooks.
  • Enforce security posture: RBAC reviews, policy validation, and CVE triage.
  • Provision namespaces, quotas, and onboarding support for developers.
  • Manage artifact lifecycle: registries, SBOMs, promotions and scans.

Best practices

  • Define clear ownership and escalation paths for each agent role.
  • Keep sensitive modifications gated behind explicit human approval workflows.
  • Maintain a single source of truth for tasks, manifests, and runbooks.
  • Stagger heartbeats and rate-limit agent actions to reduce blast radius.
  • Audit every action and ensure logs are retained for incident review.
  • Use GitOps for declarative deployments and keep secrets in a managed vault.

Example use cases

  • Run a controlled ArgoCD sync with pre-checks by observability and security agents before human approval.
  • Auto-detect a node pressure incident, have Cluster Ops perform diagnostics, and escalate to human if remediation exceeds guardrails.
  • Generate an SBOM and run image scans across a registry, then create a promotion plan for vetted images.
  • Provision developer namespaces with quotas and RBAC templates, and produce onboarding docs automatically.
  • Triage a high-severity alert: Pulse identifies root signals, Atlas runs cluster checks, Shield validates policy implications, and Jarvis coordinates the response.

FAQ

Can agents make destructive changes automatically?

No. Destructive or high-risk actions (deleting prod resources, cluster-wide policy changes, secret modifications) require explicit human approval and are blocked by guardrails.

How do agents communicate and escalate?

Agents communicate via mentions in shared task threads and auto-subscribe to updates; escalation happens through cross-agent mentions and a defined human escalation path for P1 incidents.

What integrations are supported for deployments and telemetry?

The swarm integrates with GitOps tools (ArgoCD, Flux), Helm, Kustomize, Prometheus/Grafana, Loki/Elastic, and common registries and SBOM scanners.