
domain-cloud-native skill


This skill helps you design and run cloud-native Rust services with proper observability, health checks, graceful shutdown, and 12-factor config.

npx playbooks add skill zhanghandong/rust-skills --skill domain-cloud-native

Copy the command above to add this skill to your agents.
---
name: domain-cloud-native
description: "Use when building cloud-native apps. Keywords: kubernetes, k8s, docker, container, grpc, tonic, microservice, service mesh, observability, tracing, metrics, health check, cloud, deployment, 云原生, 微服务, 容器"
user-invocable: false
---

# Cloud-Native Domain

> **Layer 3: Domain Constraints**

## Domain Constraints → Design Implications

| Domain Rule | Design Constraint | Rust Implication |
|-------------|-------------------|------------------|
| 12-Factor | Config from env | Read env at startup (config, figment) |
| Observability | Metrics + traces | tracing + opentelemetry |
| Health checks | Liveness/readiness | Dedicated endpoints |
| Graceful shutdown | Clean termination | Signal handling |
| Horizontal scale | Stateless design | No local state |
| Container-friendly | Small binaries | Release optimization |

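
The container-friendly row above can be made concrete with a release profile. This is a minimal sketch using standard Cargo profile settings; the exact size savings vary by project, and `panic = "abort"` is a trade-off (no unwinding, so no catching panics).

```toml
# Cargo.toml — release profile tuned for small container binaries
[profile.release]
opt-level = "z"     # optimize for size rather than speed
lto = true          # link-time optimization across crates
codegen-units = 1   # better optimization, slower builds
strip = true        # strip debug symbols from the binary
panic = "abort"     # smaller binary; disables unwinding
```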
---

## Critical Constraints

### Stateless Design

```
RULE: No local persistent state
WHY: Pods can be killed/rescheduled anytime
RUST: External state (Redis, DB), no static mut
```

### Graceful Shutdown

```
RULE: Handle SIGTERM, drain connections
WHY: Zero-downtime deployments
RUST: tokio::signal + graceful shutdown
```

### Observability

```
RULE: Every request must be traceable
WHY: Debugging distributed systems
RUST: tracing spans, opentelemetry export
```

---

## Trace Down ↓

From constraints to design (Layer 2):

```
"Need distributed tracing"
    ↓ m12-lifecycle: Span lifecycle
    ↓ tracing + opentelemetry

"Need graceful shutdown"
    ↓ m07-concurrency: Signal handling
    ↓ m12-lifecycle: Connection draining

"Need health checks"
    ↓ domain-web: HTTP endpoints
    ↓ m06-error-handling: Health status
```

---

## Key Crates

| Purpose | Crate |
|---------|-------|
| gRPC | tonic |
| Kubernetes | kube, kube-runtime |
| Docker | bollard |
| Tracing | tracing, opentelemetry |
| Metrics | prometheus, metrics |
| Config | config, figment |
| Health | (no crate — plain HTTP endpoints) |

## Design Patterns

| Pattern | Purpose | Implementation |
|---------|---------|----------------|
| gRPC services | Service mesh | tonic + tower |
| K8s operators | Custom resources | kube-runtime Controller |
| Observability | Debugging | tracing + OTEL |
| Health checks | Orchestration | `/health`, `/ready` |
| Config | 12-factor | Env vars + secrets |

## Code Pattern: Graceful Shutdown

```rust
use std::net::SocketAddr;

use axum::{routing::get, Router};
use tokio::signal;

async fn run_server() -> anyhow::Result<()> {
    let app = Router::new()
        .route("/health", get(health))
        .route("/ready", get(ready));
        // .with_state(...) attaches the shared state `ready` depends on

    let addr = SocketAddr::from(([0, 0, 0, 0], 8080));

    // axum 0.7+: bind a tokio listener and serve with graceful shutdown.
    // In-flight requests are drained before the server exits.
    let listener = tokio::net::TcpListener::bind(addr).await?;
    axum::serve(listener, app)
        .with_graceful_shutdown(shutdown_signal())
        .await?;

    Ok(())
}

async fn shutdown_signal() {
    // SIGTERM is what Kubernetes sends on pod termination;
    // Ctrl+C (SIGINT) covers local runs.
    let ctrl_c = async {
        signal::ctrl_c().await.expect("failed to listen for ctrl+c");
    };

    #[cfg(unix)]
    let terminate = async {
        signal::unix::signal(signal::unix::SignalKind::terminate())
            .expect("failed to install SIGTERM handler")
            .recv()
            .await;
    };
    #[cfg(not(unix))]
    let terminate = std::future::pending::<()>();

    tokio::select! {
        _ = ctrl_c => {},
        _ = terminate => {},
    }
    tracing::info!("shutdown signal received");
}
```

## Health Check Pattern

```rust
use std::sync::Arc;

use axum::{extract::State, http::StatusCode};

// Liveness: the process is up and able to answer at all.
async fn health() -> StatusCode {
    StatusCode::OK
}

// Readiness: only report ready when critical dependencies respond.
// `DbPool::ping` is illustrative — use your pool's own check
// (e.g. a lightweight `SELECT 1`).
async fn ready(State(db): State<Arc<DbPool>>) -> StatusCode {
    match db.ping().await {
        Ok(_) => StatusCode::OK,
        Err(_) => StatusCode::SERVICE_UNAVAILABLE,
    }
}
```

---

## Common Mistakes

| Mistake | Domain Violation | Fix |
|---------|-----------------|-----|
| Local file state | Not stateless | External storage |
| No SIGTERM handling | Hard kills | Graceful shutdown |
| No tracing | Can't debug | tracing spans |
| Static config | Not 12-factor | Env vars |

---

## Trace to Layer 1

| Constraint | Layer 2 Pattern | Layer 1 Implementation |
|------------|-----------------|------------------------|
| Stateless | External state | `Arc<Client>` for external stores |
| Graceful shutdown | Signal handling | tokio::signal |
| Tracing | Span lifecycle | tracing + OTEL |
| Health checks | HTTP endpoints | Dedicated routes |

---

## Related Skills

| When | See |
|------|-----|
| Async patterns | m07-concurrency |
| HTTP endpoints | domain-web |
| Error handling | m13-domain-error |
| Resource lifecycle | m12-lifecycle |

Overview

This skill captures cloud-native domain constraints and concrete Rust design implications for building containerized microservices. It focuses on stateless design, graceful shutdown, observability, health checks, and 12-factor configuration. Use it to align architecture and implementation choices when targeting Kubernetes, service meshes, and container platforms.

How this skill works

The skill translates high-level domain rules into actionable design constraints and Rust patterns. It maps requirements (tracing, shutdown, health, config, scaling) to crates (tonic, kube, tracing, opentelemetry, prometheus) and code patterns (signal handling, dedicated health endpoints, env-based config). It highlights common mistakes and prescribes fixes to keep services cloud-native.

When to use it

  • Designing microservices intended to run in Kubernetes or other orchestrators
  • Implementing gRPC services, service mesh integrations, or operators in Rust
  • Adding observability: tracing, metrics, and distributed context propagation
  • Hardening services for production: health checks and graceful shutdown
  • Converting a stateful app into a horizontally scalable, stateless service

Best practices

  • Follow 12-factor: read config from env and secrets, avoid static config files
  • Keep services stateless; use external stores (DB, Redis) for persistence
  • Implement graceful shutdown using tokio::signal and connection draining
  • Instrument requests with tracing spans and export via OpenTelemetry
  • Expose /health and /ready endpoints; make readiness depend on critical deps
  • Optimize binaries for containers and minimize image size

Example use cases

  • A tonic-based gRPC microservice with tracing and Prometheus metrics for a service mesh
  • A Kubernetes operator using kube-runtime Controller to manage custom resources
  • An HTTP API built with axum exposing /health and /ready and handling SIGTERM for zero-downtime deploys
  • A Rust worker connecting to external Redis and DB, keeping no local persistent state
  • Adding OpenTelemetry spans and exporters to trace requests across microservices

FAQ

How should I handle local caches or temp files?

Avoid relying on local persistent files. Use ephemeral in-memory caches only for performance and ensure they can be rebuilt; persist important state to external services.

What crates should I use for observability?

Use tracing for structured spans, opentelemetry for exporters, and prometheus or metrics for metrics collection and scraping.