home / skills / williamzujkowski / standards / service-mesh

service-mesh skill

/skills/cloud-native/service-mesh

This skill helps you understand and implement service mesh concepts for secure, observable, and resilient inter-service communication across Kubernetes

npx playbooks add skill williamzujkowski/standards --skill service-mesh

Review the files below or copy the command above to add this skill to your agents.

Files (8)
SKILL.md
17.0 KB
---
name: service-mesh
category: cloud-native
difficulty: advanced
tags:
- istio
- linkerd
- envoy
- mtls
- traffic-management
- observability
prerequisites:
- kubernetes
- networking
- security
estimated_time: 8-12 hours
description: A service mesh is an infrastructure layer that provides transparent service-to-service
  communication with built-in observability, traffic management, and security features
  without requiring application code changes.
---


# Service Mesh

## Level 1: Quick Reference

### What is a Service Mesh?

A service mesh is an infrastructure layer that provides transparent service-to-service communication with built-in observability, traffic management, and security features without requiring application code changes.

**Core Components:**

- **Control Plane**: Configuration and policy management (Istiod)
- **Data Plane**: Sidecar proxies handling traffic (Envoy)
- **Service Identity**: Certificate-based authentication (mTLS)

### Key Benefits

**1. Observability**

- Automatic metrics collection (latency, throughput, errors)
- Distributed tracing (request flow visualization)
- Traffic topology and service dependencies
- Real-time dashboards (Kiali, Grafana)

**2. Traffic Management**

- Intelligent routing (canary, blue-green, A/B)
- Load balancing (round-robin, least-request, consistent hash)
- Traffic splitting and mirroring
- Request retries and timeouts

**3. Security**

- Automatic mutual TLS (mTLS) encryption
- Service-to-service authentication
- Fine-grained authorization policies
- Certificate rotation and management

**4. Resilience**

- Circuit breaking and outlier detection
- Rate limiting and quota management
- Fault injection for chaos testing
- Connection pooling

### Istio vs Linkerd Comparison

| Feature | Istio | Linkerd |
|---------|-------|---------|
| **Proxy** | Envoy (C++) | Linkerd2-proxy (Rust) |
| **Resource Usage** | Higher (100-200MB/pod) | Lower (10-20MB/pod) |
| **Features** | Comprehensive (100+ CRDs) | Focused (essential features) |
| **Complexity** | More complex setup | Simpler, faster setup |
| **Multi-cluster** | Full support | Basic support |
| **Traffic Management** | Advanced (mirroring, A/B) | Basic (canary, split) |
| **Observability** | Full stack (Kiali/Jaeger) | Built-in (Linkerd Viz) |
| **Maturity** | Production-ready (CNCF) | Production-ready (CNCF) |

**Choose Istio when:**

- Need advanced traffic management (mirroring, A/B testing)
- Multi-cluster or multi-cloud deployments
- Complex authorization requirements
- Established operations team

**Choose Linkerd when:**

- Resource efficiency is critical
- Prefer simplicity over features
- Kubernetes-only environment
- Faster onboarding required

### Essential Service Mesh Checklist

**Pre-Installation:**

```yaml
□ Kubernetes cluster v1.22+ running
□ CNI plugin supporting network policies
□ Prometheus and Grafana available
□ Resource capacity planned (2-4 CPU, 4-8GB RAM)
□ DNS resolution working correctly
```

**Post-Installation:**

```yaml
□ mTLS enabled cluster-wide (STRICT mode)
□ Sidecar injection configured (automatic/manual)
□ Monitoring dashboards accessible
□ Health checks and readiness probes adjusted
□ Certificate rotation configured
□ Resource limits set on proxies
```

**Traffic Management:**

```yaml
□ VirtualService defines routing rules
□ DestinationRule configures load balancing
□ Circuit breaking thresholds set
□ Retry policies configured
□ Timeouts appropriate for services
```

**Security:**

```yaml
□ Authorization policies applied (default-deny)
□ PeerAuthentication enforces mTLS
□ Service accounts properly scoped
□ Ingress/egress gateways secured
□ Certificate validity monitored
```

### Quick Start Commands

```bash
# Install Istio (demo profile)
istioctl install --set profile=demo -y

# Enable sidecar injection
kubectl label namespace default istio-injection=enabled

# Deploy sample app
kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml

# Create ingress gateway
kubectl apply -f samples/bookinfo/networking/bookinfo-gateway.yaml

# Check mesh status
istioctl proxy-status

# View metrics
kubectl -n istio-system port-forward svc/kiali 20001:20001
```

### Common Patterns

**Canary Deployment:**

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-canary
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10
```

**Circuit Breaking:**

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-circuit-breaker
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 10
        maxRequestsPerConnection: 2
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```

**Zero Trust Security:**

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: default
spec:
  {} # Empty spec denies all traffic
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  selector:
    matchLabels:
      app: backend
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/default/sa/frontend"]
    to:
    - operation:
        methods: ["GET", "POST"]
```

---

## Level 2:
>
> **📚 Full Examples**: See [REFERENCE.md](./REFERENCE.md) for complete code samples, detailed configurations, and production-ready implementations.

 Implementation Guide

### Service Mesh Architecture

#### Control Plane (Istiod)

The control plane manages and configures the data plane proxies:

**Core Functions:**

1. **Service Discovery**: Abstracts platform-specific discovery mechanisms
2. **Configuration Distribution**: Pushes routing rules to proxies
3. **Certificate Authority**: Issues and rotates mTLS certificates
4. **Policy Enforcement**: Applies security and traffic policies

**Istiod Components:**


*See [REFERENCE.md](./REFERENCE.md#example-0) for complete implementation.*


#### Data Plane (Envoy Proxies)

**Sidecar Pattern:**


*See [REFERENCE.md](./REFERENCE.md#example-1) for complete implementation.*


**Envoy Capabilities:**

- Layer 7 (HTTP/gRPC) and Layer 4 (TCP) proxying
- Dynamic configuration via xDS APIs
- Advanced load balancing algorithms
- Health checking and circuit breaking
- Rich metrics and distributed tracing

#### Traffic Flow


*See [REFERENCE.md](./REFERENCE.md#example-2) for complete implementation.*


### Istio Installation and Configuration

#### Installation Options

**1. IstioOperator (Recommended):**


*See [REFERENCE.md](./REFERENCE.md#example-3) for complete implementation.*


**2. Helm Installation:**


*See [REFERENCE.md](./REFERENCE.md#example-4) for complete implementation.*


**3. Production Profile:**


*See [REFERENCE.md](./REFERENCE.md#example-5) for complete implementation.*


#### Sidecar Injection Strategies

**Automatic Injection (Namespace-level):**

```bash
# Label namespace for injection
kubectl label namespace production istio-injection=enabled

# Verify
kubectl get namespace -L istio-injection
```

**Manual Injection (Pod-level):**

```bash
# Inject sidecar into existing deployment
kubectl get deployment myapp -o yaml | \
  istioctl kube-inject -f - | \
  kubectl apply -f -
```

**Selective Injection (Pod annotation):**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  annotations:
    sidecar.istio.io/inject: "true"  # Explicit injection
spec:
  containers:
  - name: app
    image: myapp:v1
```

#### Configuration Validation


*See [REFERENCE.md](./REFERENCE.md#example-9) for complete implementation.*


### Advanced Traffic Management

#### 1. Canary Deployments

**Gradual Rollout Strategy:**


*See [REFERENCE.md](./REFERENCE.md#example-10) for complete implementation.*


**Progressive Rollout Script:**


*See [REFERENCE.md](./REFERENCE.md#example-11) for complete implementation.*


#### 2. Blue-Green Deployments

**Zero-Downtime Cutover:**


*See [REFERENCE.md](./REFERENCE.md#example-12) for complete implementation.*


**Cutover Process:**


*See [REFERENCE.md](./REFERENCE.md#example-13) for complete implementation.*


#### 3. A/B Testing

**User Cohort Routing:**


*See [REFERENCE.md](./REFERENCE.md#example-14) for complete implementation.*


#### 4. Traffic Mirroring

**Shadow Testing:**


*See [REFERENCE.md](./REFERENCE.md#example-15) for complete implementation.*


**Use Cases:**

- Test new version without impacting users
- Load testing with real traffic patterns
- Validate refactored services
- Debug production issues safely

### Observability Stack

#### Kiali (Service Mesh Dashboard)

**Installation:**

```bash
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/kiali.yaml

# Access dashboard
istioctl dashboard kiali
```

**Key Features:**

- Service topology graph
- Traffic flow visualization
- Configuration validation
- Distributed tracing integration

**Kiali Configuration:**


*See [REFERENCE.md](./REFERENCE.md#example-17) for complete implementation.*


#### Jaeger (Distributed Tracing)

**Installation:**

```bash
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/jaeger.yaml

# Access UI
istioctl dashboard jaeger
```

**Enable Tracing in Mesh:**


*See [REFERENCE.md](./REFERENCE.md#example-19) for complete implementation.*


**Application Instrumentation:**


*See [REFERENCE.md](./REFERENCE.md#example-20) for complete implementation.*


#### Prometheus & Grafana

**Prometheus Installation:**

```bash
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/prometheus.yaml
```

**Key Metrics:**


*See [REFERENCE.md](./REFERENCE.md#example-22) for complete implementation.*


**Grafana Dashboards:**

```bash
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/grafana.yaml

istioctl dashboard grafana
```

**Custom Dashboard (JSON):**


*See [REFERENCE.md](./REFERENCE.md#example-24) for complete implementation.*


### Security Configuration

#### Mutual TLS (mTLS)

**Enable Strict mTLS (Cluster-wide):**


*See [REFERENCE.md](./REFERENCE.md#example-25) for complete implementation.*


**Per-Namespace mTLS:**


*See [REFERENCE.md](./REFERENCE.md#example-26) for complete implementation.*


**Per-Port mTLS:**


*See [REFERENCE.md](./REFERENCE.md#example-27) for complete implementation.*


#### Authorization Policies

**Default Deny:**


*See [REFERENCE.md](./REFERENCE.md#example-28) for complete implementation.*


**Allow Specific Services:**


*See [REFERENCE.md](./REFERENCE.md#example-29) for complete implementation.*


**JWT Authentication:**


*See [REFERENCE.md](./REFERENCE.md#example-30) for complete implementation.*


**RBAC with Custom Claims:**


*See [REFERENCE.md](./REFERENCE.md#example-31) for complete implementation.*


### Resilience Patterns

#### Circuit Breaking

**Configuration:**


*See [REFERENCE.md](./REFERENCE.md#example-32) for complete implementation.*


**Testing Circuit Breaker:**


*See [REFERENCE.md](./REFERENCE.md#example-33) for complete implementation.*


#### Retries and Timeouts

**Retry Configuration:**


*See [REFERENCE.md](./REFERENCE.md#example-34) for complete implementation.*


**Timeout Configuration:**


*See [REFERENCE.md](./REFERENCE.md#example-35) for complete implementation.*


**Fault Injection (Chaos Testing):**


*See [REFERENCE.md](./REFERENCE.md#example-36) for complete implementation.*


#### Rate Limiting

**Local Rate Limiting (Envoy):**


*See [REFERENCE.md](./REFERENCE.md#example-37) for complete implementation.*


### Multi-Cluster Service Mesh

#### Cluster Federation

**Primary-Remote Model:**


*See [REFERENCE.md](./REFERENCE.md#example-38) for complete implementation.*


*See [REFERENCE.md](./REFERENCE.md#example-39) for complete implementation.*


**Cross-Cluster Service Discovery:**


*See [REFERENCE.md](./REFERENCE.md#example-40) for complete implementation.*


### Performance Tuning

#### Resource Optimization

**Sidecar Resource Limits:**


*See [REFERENCE.md](./REFERENCE.md#example-41) for complete implementation.*


**Sidecar Scoping:**


*See [REFERENCE.md](./REFERENCE.md#example-42) for complete implementation.*


#### Telemetry Optimization

**Reduce Metrics Cardinality:**


*See [REFERENCE.md](./REFERENCE.md#example-43) for complete implementation.*


**Sampling Configuration:**


*See [REFERENCE.md](./REFERENCE.md#example-44) for complete implementation.*


## Examples

### Basic Usage


*See [REFERENCE.md](./REFERENCE.md#example-45) for complete implementation.*


### Advanced Usage

```python
// TODO: Add advanced example for service-mesh
// This example shows production-ready patterns
```

### Integration Example

```python
// TODO: Add integration example showing how service-mesh
// works with other systems and services
```

See `examples/service-mesh/` for complete working examples.

## Integration Points

This skill integrates with:

### Upstream Dependencies

- **Tools**: Common development tools and frameworks
- **Prerequisites**: Basic understanding of general concepts

### Downstream Consumers

- **Applications**: Production systems requiring service-mesh functionality
- **CI/CD Pipelines**: Automated testing and deployment workflows
- **Monitoring Systems**: Observability and logging platforms

### Related Skills

- See other skills in this category

### Common Integration Patterns

1. **Development Workflow**: How this skill fits into daily development
2. **Production Deployment**: Integration with production systems
3. **Monitoring & Alerting**: Observability integration points

## Common Pitfalls

### Pitfall 1: Insufficient Testing

**Problem:** Not testing edge cases and error conditions leads to production bugs

**Solution:** Implement comprehensive test coverage including:

- Happy path scenarios
- Error handling and edge cases
- Integration points with external systems

**Prevention:** Enforce minimum code coverage (80%+) in CI/CD pipeline

### Pitfall 2: Hardcoded Configuration

**Problem:** Hardcoding values makes applications inflexible and environment-dependent

**Solution:** Use environment variables and configuration management:

- Separate config from code
- Use environment-specific configuration files
- Never commit secrets to version control

**Prevention:** Use tools like dotenv, config validators, and secret scanners

### Pitfall 3: Ignoring Security Best Practices

**Problem:** Security vulnerabilities from not following established security patterns

**Solution:** Follow security guidelines:

- Input validation and sanitization
- Proper authentication and authorization
- Encrypted data transmission (TLS/SSL)
- Regular security audits and updates

**Prevention:** Use security linters, SAST tools, and regular dependency updates

**Best Practices:**

- Follow established patterns and conventions for service-mesh
- Keep dependencies up to date and scan for vulnerabilities
- Write comprehensive documentation and inline comments
- Use linting and formatting tools consistently
- Implement proper error handling and logging
- Regular code reviews and pair programming
- Monitor production metrics and set up alerts

---

## Level 3: Deep Dive Resources

### Official Documentation

- [Istio Documentation](https://istio.io/latest/docs/)
- [Linkerd Documentation](https://linkerd.io/2/overview/)
- [Envoy Proxy Documentation](https://www.envoyproxy.io/docs)
- [CNCF Service Mesh Landscape](https://landscape.cncf.io/guide#orchestration-management--service-mesh)

### Books

- **"Istio in Action" by Christian Posta and Rinor Maloku** - Comprehensive guide
- **"Service Mesh Patterns" by Alex Soto Bueno** - Design patterns
- **"Mastering Service Mesh" by Anjul Sahu** - Advanced concepts

### Online Courses

- [Istio Fundamentals (Tetrate Academy)](https://academy.tetrate.io/)
- [Service Mesh with Istio (Pluralsight)](https://www.pluralsight.com/)
- [Linkerd Fundamentals (Linux Foundation)](https://training.linuxfoundation.org/)

### Hands-On Labs

- [Istio Workshop](https://github.com/solo-io/istio-workshops)
- [Service Mesh Patterns](https://servicemeshpatterns.github.io/servicemeshbook/)
- [Envoy Proxy Sandbox](https://www.envoyproxy.io/docs/envoy/latest/start/sandboxes/)

### Community Resources

- [IstioCon Talks](https://events.istio.io/)
- [CNCF Webinars](https://www.cncf.io/webinars/)
- [Service Mesh Summit](https://servicemeshsummit.io/)

### Tools & Extensions

- **Kiali** - Service mesh observability
- **Flagger** - Progressive delivery operator
- **Meshery** - Multi-mesh management
- **SMI (Service Mesh Interface)** - Standard specification

### Best Practices

- [Istio Production Best Practices](https://istio.io/latest/docs/ops/best-practices/)
- [Service Mesh Security Patterns](https://www.nist.gov/publications/zero-trust-architecture)
- [Performance Tuning Guide](https://istio.io/latest/docs/ops/deployment/performance-and-scalability/)

Overview

This skill explains service mesh fundamentals and practical patterns for adding transparent service-to-service communication, observability, traffic management, and security without changing application code. It summarizes core components, installation basics, and production-focused checklists to get a mesh running reliably on Kubernetes. Content emphasizes Istio and Linkerd trade-offs and operational tasks for production readiness.

How this skill works

The skill inspects the control plane (configuration, certificate authority, policy distribution) and the data plane (sidecar proxies handling L4/L7 traffic and telemetry). It describes how mTLS, routing rules (VirtualService/DestinationRule), and observability tools (Prometheus, Jaeger, Kiali) integrate to provide secure, observable, and resilient service-to-service communication. It also covers common traffic management and resilience primitives like canary routing, circuit breaking, retries, and rate limiting.

When to use it

  • You need uniform service identity, encryption, and fine-grained authorization without changing app code.
  • You want advanced traffic control (canary, A/B, mirroring) for progressive delivery.
  • You require out-of-the-box observability: distributed tracing, metrics, and topology views.
  • You operate microservices at scale and need centralized policy, resilience, and telemetry.
  • You must implement zero-trust service-to-service communication (mTLS + authorization).

Best practices

  • Start small: enable a mesh in a non-production namespace and validate sidecar behavior before cluster-wide rollout.
  • Enable strict mTLS cluster-wide and enforce default-deny authorization policies, then open specific traffic paths.
  • Plan resources for proxies (CPU/RAM) and set resource limits for sidecars to avoid noisy neighbors.
  • Integrate Prometheus/Grafana/Jaeger/Kiali early to validate routing, latency, and service topology.
  • Use progressive traffic management (canary/weighting) and automated health checks to validate rollouts.

Example use cases

  • Canary deployments where a small percentage of users hit a new service version for validation.
  • Blue-green cutovers using gateway routing for zero-downtime swaps between environments.
  • Traffic mirroring to run load tests on a new service version without impacting production users.
  • Enforcing cluster-wide mTLS and fine-grained AuthorizationPolicy for zero-trust posture.
  • Applying circuit breaking and retries to protect downstream services from cascading failures.

FAQ

When should I choose Istio vs Linkerd?

Choose Istio for advanced traffic management, multi-cluster needs, and extensive policy control; choose Linkerd when resource efficiency, simplicity, and faster onboarding matter.

Do I need to change application code to use a service mesh?

No. Sidecar proxies handle traffic, observability, and security transparently, so most applications require no code changes beyond optional tracing headers or metrics instrumentation for richer data.