
ops-platform-onboarding skill

/.archive/ops-team/skills/ops-platform-onboarding

This skill guides platform onboarding for new services, standardizing infrastructure, observability, security, and documentation to ensure consistent, compliant deployments.

npx playbooks add skill lerianstudio/ring --skill ops-platform-onboarding

---
name: ops-platform-onboarding
description: |
  Structured workflow for onboarding new services to the internal platform
  including infrastructure provisioning, observability setup, and documentation.

trigger: |
  - New service deployment to platform
  - Service migration to platform
  - Platform capability adoption
  - Team onboarding to platform

skip_when: |
  - Application development -> use ring-dev-team specialists
  - Existing service configuration changes -> standard change management
  - Non-platform infrastructure -> use ops-infrastructure-architect

related:
  similar: [ring:dev-cycle]
  uses: [platform-engineer]
---

# Platform Onboarding Workflow

This skill defines the structured process for onboarding services to the internal developer platform. Use it to ensure consistent, compliant service deployments.

---

## Onboarding Phases

| Phase | Focus | Output |
|-------|-------|--------|
| **1. Requirements** | Gather service requirements | Requirements doc |
| **2. Golden Path Selection** | Choose deployment pattern | Selected template |
| **3. Infrastructure Provisioning** | Create service resources | Infrastructure ready |
| **4. Observability Setup** | Configure monitoring | Dashboards/alerts |
| **5. Security Configuration** | Apply security controls | Security validated |
| **6. Documentation** | Complete service docs | Runbook ready |
| **7. Handoff** | Transfer to service team | Ownership confirmed |

---

## Phase 1: Requirements Gathering

### Service Requirements Checklist

```markdown
## Service Onboarding Request

**Service Name:** [name]
**Team:** [owning team]
**Requested By:** [name]
**Target Date:** YYYY-MM-DD

### Service Information

| Attribute | Value |
|-----------|-------|
| Service type | [API / Worker / Batch / Frontend] |
| Language/runtime | [Go / Node.js / Python / etc.] |
| Criticality | [Tier 1/2/3/4] |
| External traffic | [Yes / No] |
| Data sensitivity | [PII / Financial / Public] |

### Resource Requirements

| Resource | Requirement | Notes |
|----------|-------------|-------|
| CPU | [cores] | [peak/average] |
| Memory | [GB] | [peak/average] |
| Storage | [GB] | [type: SSD/HDD] |
| Database | [type] | [shared/dedicated] |
| Cache | [type] | [shared/dedicated] |

### Dependencies

| Dependency | Type | SLA Required |
|------------|------|--------------|
| [service] | Internal | [Yes/No] |
| [external] | External | [Yes/No] |

### Compliance Requirements

- [ ] SOC2
- [ ] PCI-DSS
- [ ] GDPR
- [ ] HIPAA
- [ ] Other: ____________
```

---

## Phase 2: Golden Path Selection

### Available Golden Paths

| Golden Path | Use Case | Includes |
|-------------|----------|----------|
| **api-service** | REST/GraphQL APIs | ALB, EKS, RDS, ElastiCache |
| **worker-service** | Background processing | SQS, EKS, auto-scaling |
| **batch-job** | Scheduled jobs | EventBridge, Lambda/Fargate |
| **frontend-app** | Static sites, SPAs | CloudFront, S3, API Gateway |
| **data-pipeline** | ETL, streaming | Kinesis, Glue, S3 |

### Golden Path Selection Matrix

| Requirement | api-service | worker-service | batch-job |
|-------------|-------------|----------------|-----------|
| HTTP traffic | Yes | No | No |
| Queue processing | Optional | Yes | Optional |
| Scheduled runs | No | No | Yes |
| Real-time | Yes | Near-real-time | No |
| Auto-scaling | Yes | Yes | N/A |
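
The matrix above can be read as a small decision rule. The following is a minimal sketch, not part of the platform tooling: `select_golden_path` is a hypothetical helper that takes three yes/no answers (HTTP traffic, scheduled runs, queue processing) and covers only the three paths shown in the matrix. Combinations the matrix does not describe return `no-match`, which would need a platform team review.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map yes/no answers onto the selection matrix.
# Arguments: <http_traffic> <scheduled_runs> <queue_processing>
# frontend-app and data-pipeline are not in the matrix, so they are
# not handled here.
select_golden_path() {
  local http_traffic="$1" scheduled_runs="$2" queue_processing="$3"

  case "$http_traffic:$scheduled_runs:$queue_processing" in
    yes:no:*)  echo "api-service" ;;     # HTTP traffic; queue processing optional
    no:yes:*)  echo "batch-job" ;;       # scheduled runs; queue processing optional
    no:no:yes) echo "worker-service" ;;  # queue processing, no HTTP, no schedule
    *)         echo "no-match"; return 1 ;;  # not covered by the matrix
  esac
}

# An HTTP service that also consumes a queue:
select_golden_path yes no yes   # prints: api-service
```

Encoding the rule as a single `case` on the combined answers keeps every matrix row visible as one line, so a new row or path can be added without restructuring the function.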

### Selection Template

```markdown
## Golden Path Selection

**Service:** [name]
**Selected Path:** [api-service / worker-service / etc.]

### Rationale

1. Service type [X] matches [golden path] pattern
2. Traffic requirements of [X] supported by [features]
3. Compliance requirements met by built-in [controls]

### Customizations Required

| Standard Component | Customization | Reason |
|--------------------|---------------|--------|
| [component] | [change] | [why] |

### Approval

- [ ] Platform team reviewed
- [ ] Security team reviewed (if customizations)
- [ ] Architecture team reviewed (if non-standard)
```

---

## Phase 3: Infrastructure Provisioning

### Provisioning Checklist

- [ ] Namespace/project created
- [ ] Compute resources provisioned
- [ ] Database provisioned (if required)
- [ ] Cache provisioned (if required)
- [ ] Load balancer configured
- [ ] DNS entries created
- [ ] SSL certificates provisioned
- [ ] Secrets stored in vault
- [ ] IAM roles/service accounts created

### Terraform/IaC Template

```hcl
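# Example service provisioning.
# NOTE: the module source below is a placeholder. Point it at your
# platform's module: local paths must begin with ./ or ../, and registry
# addresses take the form <namespace>/<name>/<provider>.
module "service" {
  source = "platform/service-template"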

  service_name    = var.service_name
  team            = var.team
  environment     = var.environment
  golden_path     = "api-service"

  # Compute
  cpu_request     = "500m"
  memory_request  = "512Mi"
  replicas_min    = 2
  replicas_max    = 10

  # Database
  database_enabled = true
  database_class   = "db.t3.medium"

  # Tags
  tags = {
    Team        = var.team
    Environment = var.environment
    CostCenter  = var.cost_center
  }
}
```

### Provisioning Verification

```bash
# Verify namespace
kubectl get namespace [service-name]

# Verify compute
kubectl get deployment -n [service-name]

# Verify database
aws rds describe-db-instances --db-instance-identifier [service-db]

# Verify DNS
dig [service-name].internal.example.com
```

---

## Phase 4: Observability Setup

### Observability Checklist

- [ ] Structured logging configured
- [ ] Tracing instrumentation added
- [ ] Metrics endpoints exposed
- [ ] Service dashboard created
- [ ] SLI/SLO defined
- [ ] Alerts configured
- [ ] On-call integration set up

### Dashboard Template

Standard service dashboard includes:

| Panel | Metrics |
|-------|---------|
| Request rate | requests/sec, by status code |
| Error rate | 5xx rate, 4xx rate |
| Latency | p50, p95, p99 |
| Saturation | CPU, memory utilization |
| Dependencies | Upstream/downstream health |

### Alert Configuration

| Alert | Condition | Severity | Response |
|-------|-----------|----------|----------|
| High error rate | 5xx > 1% for 5m | Critical | Page on-call |
| High latency | p99 > 1s for 5m | Warning | Alert team |
| Low availability | uptime < 99.9% | Critical | Page on-call |
| Resource saturation | CPU > 85% for 10m | Warning | Alert team |

### SLI/SLO Definition

```markdown
## Service Level Objectives

**Service:** [name]
**SLO Version:** 1.0

| SLI | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% | Successful requests / total requests |
| Latency | p99 < 500ms | Request duration percentile |
| Error rate | < 0.1% | 5xx responses / total responses |

### Error Budget

- Monthly budget: 43.2 minutes of downtime (0.1% of a 30-day month)
- Current consumption: [X]%
- Actions if budget exceeded: [escalation process]
```
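
The 43.2-minute monthly budget in the template follows from the 99.9% availability target applied to a 30-day month. The sketch below shows that arithmetic, assuming a time-based budget and a 30-day window. Both function names are hypothetical, and a request-based budget would be computed from request counts instead.

```bash
#!/usr/bin/env bash
# Hypothetical helper: allowed downtime in minutes for an availability
# target (in percent) over a window of N days.
error_budget_minutes() {
  local slo_percent="$1" window_days="$2"
  awk -v slo="$slo_percent" -v days="$window_days" \
    'BEGIN { printf "%.1f\n", days * 24 * 60 * (100 - slo) / 100 }'
}

# Hypothetical helper: share of the budget already used, in percent,
# given the downtime observed so far in the window.
budget_consumed_percent() {
  local downtime_minutes="$1" slo_percent="$2" window_days="$3"
  awk -v used="$downtime_minutes" -v slo="$slo_percent" -v days="$window_days" \
    'BEGIN { budget = days * 24 * 60 * (100 - slo) / 100
             printf "%.1f\n", used / budget * 100 }'
}

error_budget_minutes 99.9 30           # prints: 43.2
budget_consumed_percent 10.8 99.9 30   # prints: 25.0
```

The second value is what would go into the "Current consumption" line of the template.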

---

## Phase 5: Security Configuration

### Security Checklist

- [ ] Network policies applied
- [ ] Service mesh mTLS configured
- [ ] Secrets management configured
- [ ] IAM permissions follow least privilege
- [ ] Security scanning in CI/CD
- [ ] Dependency scanning enabled
- [ ] WAF rules applied (if external)

### Network Policy Template

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: service-policy
  namespace: [service-name]
spec:
  podSelector:
    matchLabels:
      app: [service-name]
  policyTypes:
    - Ingress
    - Egress
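  ingress:
    # Allow inbound traffic only from the service mesh namespace.
    # The selectors in this policy assume namespaces carry a `name` label;
    # on Kubernetes 1.22+ the automatic `kubernetes.io/metadata.name`
    # label can be matched instead.
    - from:
        - namespaceSelector:
            matchLabels:
              name: istio-system
      ports:
        - port: 8080
  egress:
    # Allow outbound traffic only to the database namespace.
    # DNS (port 53) is not covered by this rule; add an egress rule for it
    # if the service resolves hostnames.
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - port: 5432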
```

### Security Review

```markdown
## Security Configuration Review

**Service:** [name]
**Reviewer:** @security-team

| Control | Status | Notes |
|---------|--------|-------|
| mTLS enabled | PASS | Istio strict mode |
| Network policies | PASS | Ingress/egress restricted |
| Secrets management | PASS | Using Vault |
| Least privilege IAM | PASS | Scoped to required resources |
| Vulnerability scanning | PASS | Trivy in CI/CD |
```

---

## Phase 6: Documentation

### Required Documentation

| Document | Purpose | Template |
|----------|---------|----------|
| **Service Overview** | What the service does | README.md |
| **Runbook** | Operational procedures | runbook.md |
| **Architecture** | Design decisions | architecture.md |
| **API Docs** | Interface documentation | OpenAPI spec |
| **On-call Guide** | Incident handling | oncall.md |

### Runbook Template

````markdown
## [Service Name] Runbook

### Service Overview

[Brief description of what the service does]

### Quick Reference

| Item | Value |
|------|-------|
| Repository | [link] |
| Dashboard | [link] |
| Logs | [query link] |
| On-call | [PagerDuty service] |

### Common Operations

#### Restart Service

```bash
kubectl rollout restart deployment/[service] -n [namespace]
```

#### Scale Service

```bash
kubectl scale deployment/[service] -n [namespace] --replicas=X
```

#### Check Logs

```bash
kubectl logs -l app=[service] -n [namespace] --tail=100
```

### Troubleshooting

| Symptom | Possible Cause | Resolution |
|---------|----------------|------------|
| High latency | DB connection pool | Scale DB or optimize queries |
| 5xx errors | Dependency down | Check upstream services |
| OOM kills | Memory leak | Investigate heap, restart |

### Escalation

| Level | Contact | When |
|-------|---------|------|
| L1 | [team Slack channel] | First response |
| L2 | [on-call engineer] | Cannot resolve in 15m |
| L3 | [service owner] | Critical/extended outage |
````

---

## Phase 7: Handoff

### Handoff Checklist

- [ ] Service owner identified and trained
- [ ] On-call rotation set up
- [ ] Access provisioned to team
- [ ] Documentation reviewed by team
- [ ] Shadowing session completed
- [ ] Ownership officially transferred
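
Unchecked items in this checklist are what ends up in the "Outstanding Items" table of the handoff confirmation. The sketch below assumes the checklist is kept in a Markdown file. `list_open_items` is a hypothetical helper, and the sample checklist contents are only illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the text of every unchecked "- [ ]" item
# in a Markdown checklist file, one item per line.
list_open_items() {
  local checklist_file="$1"
  # With -n, only lines where the substitution matched are printed (p flag).
  sed -n 's/^- \[ ] //p' "$checklist_file"
}

# Example with a temporary checklist file.
checklist="$(mktemp)"
printf '%s\n' \
  '- [x] Service owner identified and trained' \
  '- [ ] On-call rotation set up' \
  '- [ ] Shadowing session completed' > "$checklist"

list_open_items "$checklist"
rm -f "$checklist"
```

With the sample file above, the helper prints the two unchecked items and skips the completed one.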

### Handoff Template

```markdown
## Service Handoff Confirmation

**Service:** [name]
**Date:** YYYY-MM-DD
**Platform Team:** @[name]
**Service Owner:** @[name]

### Completed Items

- [x] Infrastructure provisioned and documented
- [x] Observability configured
- [x] Security controls applied
- [x] Runbook created and reviewed
- [x] On-call rotation configured
- [x] Training session completed

### Outstanding Items

| Item | Owner | Due Date |
|------|-------|----------|
| [item] | [owner] | YYYY-MM-DD |

### Acknowledgment

By signing below, the service owner confirms:
1. Receipt of all documentation
2. Understanding of operational procedures
3. Acceptance of on-call responsibilities

**Service Owner:** _________________ Date: _______
**Platform Team:** _________________ Date: _______
```

---

## Anti-Rationalization Table

| Rationalization | Why It's WRONG | Required Action |
|-----------------|----------------|-----------------|
| "Skip documentation, code is self-explanatory" | On-call != developers | **Complete runbook** |
| "We'll add observability later" | Blind deployments = incidents | **Observability on day 1** |
| "Golden path doesn't fit exactly" | Customizations add complexity | **Justify every deviation** |
| "Security can come later" | Later = never for security | **Security from start** |
| "Team can figure it out" | Assumptions cause outages | **Complete handoff process** |

---

## Dispatch Specialist

For platform onboarding tasks, dispatch:

```
Task tool:
  subagent_type: "ring:platform-engineer"
  model: "opus"
  prompt: |
    SERVICE ONBOARDING REQUEST
    Service: [name]
    Team: [team]
    Type: [API/Worker/Batch]
    Requirements: [summary]
    Golden Path: [if known]
```

## Overview

This skill provides a structured workflow for onboarding new services to the internal developer platform. It enforces mandatory phases: requirements, golden path selection, infrastructure provisioning, observability, security, documentation, and handoff. The goal is consistent, compliant deployments with operational readiness and clear ownership.

## How this skill works

The workflow starts with a service requirements form that captures service type, traffic, resource, dependency, and compliance needs. It then guides selection of a golden path template (api-service, worker-service, batch-job, frontend-app, or data-pipeline) and applies Terraform/IaC modules to provision the namespace, compute, storage, networking, and secrets. Observability setup, security configuration, documentation, and a formal handoff complete the process; each phase comes with checklists, templates, or verification commands.

## When to use it

- Onboarding a new internal service that will run on the platform
- Migrating an existing service onto the platform standard
- Meeting compliance (SOC2, PCI-DSS, GDPR, HIPAA) or security control requirements
- Preparing a service before production traffic or external exposure is enabled
- Giving a team a repeatable, audited provisioning and handoff process

## Best practices

- Complete the Requirements checklist before choosing a golden path
- Prefer standard golden path templates; justify and document deviations
- Provision observability and SLI/SLOs on day one, not later
- Apply least-privilege IAM, network policies, and automated security scans in CI/CD
- Create a concise runbook and run a shadowing session during handoff

## Example use cases

- Onboard a new REST API service using the api-service golden path with ALB, EKS, and RDS
- Provision a background worker that processes SQS messages and needs autoscaling
- Set up a scheduled batch job using EventBridge and Lambda/Fargate
- Migrate a frontend single-page app to CloudFront + S3 with API Gateway
- Bring a data pipeline online with Kinesis and S3 storage following platform templates

## FAQ

### What if the golden path doesn't exactly fit our needs?

You may customize a template, but document every deviation, obtain platform and security reviews, and justify the added operational complexity.

### When must observability be configured?

Observability (logging, tracing, metrics, SLOs) must be enabled during provisioning so the service is measurable from day one.