home / skills / williamzujkowski / standards / service-mesh
This skill helps you understand and implement service mesh concepts for secure, observable, and resilient inter-service communication across Kubernetes
npx playbooks add skill williamzujkowski/standards --skill service-meshReview the files below or copy the command above to add this skill to your agents.
---
name: service-mesh
category: cloud-native
difficulty: advanced
tags:
- istio
- linkerd
- envoy
- mtls
- traffic-management
- observability
prerequisites:
- kubernetes
- networking
- security
estimated_time: 8-12 hours
description: A service mesh is an infrastructure layer that provides transparent service-to-service
communication with built-in observability, traffic management, and security features
without requiring application code changes.
---
# Service Mesh
## Level 1: Quick Reference
### What is a Service Mesh?
A service mesh is an infrastructure layer that provides transparent service-to-service communication with built-in observability, traffic management, and security features without requiring application code changes.
**Core Components:**
- **Control Plane**: Configuration and policy management (Istiod)
- **Data Plane**: Sidecar proxies handling traffic (Envoy)
- **Service Identity**: Certificate-based authentication (mTLS)
### Key Benefits
**1. Observability**
- Automatic metrics collection (latency, throughput, errors)
- Distributed tracing (request flow visualization)
- Traffic topology and service dependencies
- Real-time dashboards (Kiali, Grafana)
**2. Traffic Management**
- Intelligent routing (canary, blue-green, A/B)
- Load balancing (round-robin, least-request, consistent hash)
- Traffic splitting and mirroring
- Request retries and timeouts
**3. Security**
- Automatic mutual TLS (mTLS) encryption
- Service-to-service authentication
- Fine-grained authorization policies
- Certificate rotation and management
**4. Resilience**
- Circuit breaking and outlier detection
- Rate limiting and quota management
- Fault injection for chaos testing
- Connection pooling
### Istio vs Linkerd Comparison
| Feature | Istio | Linkerd |
|---------|-------|---------|
| **Proxy** | Envoy (C++) | Linkerd2-proxy (Rust) |
| **Resource Usage** | Higher (100-200MB/pod) | Lower (10-20MB/pod) |
| **Features** | Comprehensive (100+ CRDs) | Focused (essential features) |
| **Complexity** | More complex setup | Simpler, faster setup |
| **Multi-cluster** | Full support | Basic support |
| **Traffic Management** | Advanced (mirroring, A/B) | Basic (canary, split) |
| **Observability** | Full stack (Kiali/Jaeger) | Built-in (Linkerd Viz) |
| **Maturity** | Production-ready (CNCF) | Production-ready (CNCF) |
**Choose Istio when:**
- Need advanced traffic management (mirroring, A/B testing)
- Multi-cluster or multi-cloud deployments
- Complex authorization requirements
- Established operations team
**Choose Linkerd when:**
- Resource efficiency is critical
- Prefer simplicity over features
- Kubernetes-only environment
- Faster onboarding required
### Essential Service Mesh Checklist
**Pre-Installation:**
```yaml
□ Kubernetes cluster v1.22+ running
□ CNI plugin supporting network policies
□ Prometheus and Grafana available
□ Resource capacity planned (2-4 CPU, 4-8GB RAM)
□ DNS resolution working correctly
```
**Post-Installation:**
```yaml
□ mTLS enabled cluster-wide (STRICT mode)
□ Sidecar injection configured (automatic/manual)
□ Monitoring dashboards accessible
□ Health checks and readiness probes adjusted
□ Certificate rotation configured
□ Resource limits set on proxies
```
**Traffic Management:**
```yaml
□ VirtualService defines routing rules
□ DestinationRule configures load balancing
□ Circuit breaking thresholds set
□ Retry policies configured
□ Timeouts appropriate for services
```
**Security:**
```yaml
□ Authorization policies applied (default-deny)
□ PeerAuthentication enforces mTLS
□ Service accounts properly scoped
□ Ingress/egress gateways secured
□ Certificate validity monitored
```
### Quick Start Commands
```bash
# Install Istio (demo profile)
istioctl install --set profile=demo -y
# Enable sidecar injection
kubectl label namespace default istio-injection=enabled
# Deploy sample app
kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml
# Create ingress gateway
kubectl apply -f samples/bookinfo/networking/bookinfo-gateway.yaml
# Check mesh status
istioctl proxy-status
# View metrics
kubectl -n istio-system port-forward svc/kiali 20001:20001
```
### Common Patterns
**Canary Deployment:**
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-canary
spec:
hosts:
- reviews
http:
- match:
- headers:
end-user:
exact: jason
route:
- destination:
host: reviews
subset: v2
- route:
- destination:
host: reviews
subset: v1
weight: 90
- destination:
host: reviews
subset: v2
weight: 10
```
**Circuit Breaking:**
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews-circuit-breaker
spec:
host: reviews
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 10
maxRequestsPerConnection: 2
outlierDetection:
consecutiveErrors: 5
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50
```
**Zero Trust Security:**
```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: deny-all
namespace: default
spec:
{} # Empty spec denies all traffic
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: allow-frontend-to-backend
spec:
selector:
matchLabels:
app: backend
action: ALLOW
rules:
- from:
- source:
principals: ["cluster.local/ns/default/sa/frontend"]
to:
- operation:
methods: ["GET", "POST"]
```
---
## Level 2:
>
> **📚 Full Examples**: See [REFERENCE.md](./REFERENCE.md) for complete code samples, detailed configurations, and production-ready implementations.
Implementation Guide
### Service Mesh Architecture
#### Control Plane (Istiod)
The control plane manages and configures the data plane proxies:
**Core Functions:**
1. **Service Discovery**: Abstracts platform-specific discovery mechanisms
2. **Configuration Distribution**: Pushes routing rules to proxies
3. **Certificate Authority**: Issues and rotates mTLS certificates
4. **Policy Enforcement**: Applies security and traffic policies
**Istiod Components:**
*See [REFERENCE.md](./REFERENCE.md#example-0) for complete implementation.*
#### Data Plane (Envoy Proxies)
**Sidecar Pattern:**
*See [REFERENCE.md](./REFERENCE.md#example-1) for complete implementation.*
**Envoy Capabilities:**
- Layer 7 (HTTP/gRPC) and Layer 4 (TCP) proxying
- Dynamic configuration via xDS APIs
- Advanced load balancing algorithms
- Health checking and circuit breaking
- Rich metrics and distributed tracing
#### Traffic Flow
*See [REFERENCE.md](./REFERENCE.md#example-2) for complete implementation.*
### Istio Installation and Configuration
#### Installation Options
**1. IstioOperator (Recommended):**
*See [REFERENCE.md](./REFERENCE.md#example-3) for complete implementation.*
**2. Helm Installation:**
*See [REFERENCE.md](./REFERENCE.md#example-4) for complete implementation.*
**3. Production Profile:**
*See [REFERENCE.md](./REFERENCE.md#example-5) for complete implementation.*
#### Sidecar Injection Strategies
**Automatic Injection (Namespace-level):**
```bash
# Label namespace for injection
kubectl label namespace production istio-injection=enabled
# Verify
kubectl get namespace -L istio-injection
```
**Manual Injection (Pod-level):**
```bash
# Inject sidecar into existing deployment
kubectl get deployment myapp -o yaml | \
istioctl kube-inject -f - | \
kubectl apply -f -
```
**Selective Injection (Pod annotation):**
```yaml
apiVersion: v1
kind: Pod
metadata:
name: myapp
annotations:
sidecar.istio.io/inject: "true" # Explicit injection
spec:
containers:
- name: app
image: myapp:v1
```
#### Configuration Validation
*See [REFERENCE.md](./REFERENCE.md#example-9) for complete implementation.*
### Advanced Traffic Management
#### 1. Canary Deployments
**Gradual Rollout Strategy:**
*See [REFERENCE.md](./REFERENCE.md#example-10) for complete implementation.*
**Progressive Rollout Script:**
*See [REFERENCE.md](./REFERENCE.md#example-11) for complete implementation.*
#### 2. Blue-Green Deployments
**Zero-Downtime Cutover:**
*See [REFERENCE.md](./REFERENCE.md#example-12) for complete implementation.*
**Cutover Process:**
*See [REFERENCE.md](./REFERENCE.md#example-13) for complete implementation.*
#### 3. A/B Testing
**User Cohort Routing:**
*See [REFERENCE.md](./REFERENCE.md#example-14) for complete implementation.*
#### 4. Traffic Mirroring
**Shadow Testing:**
*See [REFERENCE.md](./REFERENCE.md#example-15) for complete implementation.*
**Use Cases:**
- Test new version without impacting users
- Load testing with real traffic patterns
- Validate refactored services
- Debug production issues safely
### Observability Stack
#### Kiali (Service Mesh Dashboard)
**Installation:**
```bash
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/kiali.yaml
# Access dashboard
istioctl dashboard kiali
```
**Key Features:**
- Service topology graph
- Traffic flow visualization
- Configuration validation
- Distributed tracing integration
**Kiali Configuration:**
*See [REFERENCE.md](./REFERENCE.md#example-17) for complete implementation.*
#### Jaeger (Distributed Tracing)
**Installation:**
```bash
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/jaeger.yaml
# Access UI
istioctl dashboard jaeger
```
**Enable Tracing in Mesh:**
*See [REFERENCE.md](./REFERENCE.md#example-19) for complete implementation.*
**Application Instrumentation:**
*See [REFERENCE.md](./REFERENCE.md#example-20) for complete implementation.*
#### Prometheus & Grafana
**Prometheus Installation:**
```bash
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/prometheus.yaml
```
**Key Metrics:**
*See [REFERENCE.md](./REFERENCE.md#example-22) for complete implementation.*
**Grafana Dashboards:**
```bash
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/grafana.yaml
istioctl dashboard grafana
```
**Custom Dashboard (JSON):**
*See [REFERENCE.md](./REFERENCE.md#example-24) for complete implementation.*
### Security Configuration
#### Mutual TLS (mTLS)
**Enable Strict mTLS (Cluster-wide):**
*See [REFERENCE.md](./REFERENCE.md#example-25) for complete implementation.*
**Per-Namespace mTLS:**
*See [REFERENCE.md](./REFERENCE.md#example-26) for complete implementation.*
**Per-Port mTLS:**
*See [REFERENCE.md](./REFERENCE.md#example-27) for complete implementation.*
#### Authorization Policies
**Default Deny:**
*See [REFERENCE.md](./REFERENCE.md#example-28) for complete implementation.*
**Allow Specific Services:**
*See [REFERENCE.md](./REFERENCE.md#example-29) for complete implementation.*
**JWT Authentication:**
*See [REFERENCE.md](./REFERENCE.md#example-30) for complete implementation.*
**RBAC with Custom Claims:**
*See [REFERENCE.md](./REFERENCE.md#example-31) for complete implementation.*
### Resilience Patterns
#### Circuit Breaking
**Configuration:**
*See [REFERENCE.md](./REFERENCE.md#example-32) for complete implementation.*
**Testing Circuit Breaker:**
*See [REFERENCE.md](./REFERENCE.md#example-33) for complete implementation.*
#### Retries and Timeouts
**Retry Configuration:**
*See [REFERENCE.md](./REFERENCE.md#example-34) for complete implementation.*
**Timeout Configuration:**
*See [REFERENCE.md](./REFERENCE.md#example-35) for complete implementation.*
**Fault Injection (Chaos Testing):**
*See [REFERENCE.md](./REFERENCE.md#example-36) for complete implementation.*
#### Rate Limiting
**Local Rate Limiting (Envoy):**
*See [REFERENCE.md](./REFERENCE.md#example-37) for complete implementation.*
### Multi-Cluster Service Mesh
#### Cluster Federation
**Primary-Remote Model:**
*See [REFERENCE.md](./REFERENCE.md#example-38) for complete implementation.*
*See [REFERENCE.md](./REFERENCE.md#example-39) for complete implementation.*
**Cross-Cluster Service Discovery:**
*See [REFERENCE.md](./REFERENCE.md#example-40) for complete implementation.*
### Performance Tuning
#### Resource Optimization
**Sidecar Resource Limits:**
*See [REFERENCE.md](./REFERENCE.md#example-41) for complete implementation.*
**Sidecar Scoping:**
*See [REFERENCE.md](./REFERENCE.md#example-42) for complete implementation.*
#### Telemetry Optimization
**Reduce Metrics Cardinality:**
*See [REFERENCE.md](./REFERENCE.md#example-43) for complete implementation.*
**Sampling Configuration:**
*See [REFERENCE.md](./REFERENCE.md#example-44) for complete implementation.*
## Examples
### Basic Usage
*See [REFERENCE.md](./REFERENCE.md#example-45) for complete implementation.*
### Advanced Usage
```python
// TODO: Add advanced example for service-mesh
// This example shows production-ready patterns
```
### Integration Example
```python
// TODO: Add integration example showing how service-mesh
// works with other systems and services
```
See `examples/service-mesh/` for complete working examples.
## Integration Points
This skill integrates with:
### Upstream Dependencies
- **Tools**: Common development tools and frameworks
- **Prerequisites**: Basic understanding of general concepts
### Downstream Consumers
- **Applications**: Production systems requiring service-mesh functionality
- **CI/CD Pipelines**: Automated testing and deployment workflows
- **Monitoring Systems**: Observability and logging platforms
### Related Skills
- See other skills in this category
### Common Integration Patterns
1. **Development Workflow**: How this skill fits into daily development
2. **Production Deployment**: Integration with production systems
3. **Monitoring & Alerting**: Observability integration points
## Common Pitfalls
### Pitfall 1: Insufficient Testing
**Problem:** Not testing edge cases and error conditions leads to production bugs
**Solution:** Implement comprehensive test coverage including:
- Happy path scenarios
- Error handling and edge cases
- Integration points with external systems
**Prevention:** Enforce minimum code coverage (80%+) in CI/CD pipeline
### Pitfall 2: Hardcoded Configuration
**Problem:** Hardcoding values makes applications inflexible and environment-dependent
**Solution:** Use environment variables and configuration management:
- Separate config from code
- Use environment-specific configuration files
- Never commit secrets to version control
**Prevention:** Use tools like dotenv, config validators, and secret scanners
### Pitfall 3: Ignoring Security Best Practices
**Problem:** Security vulnerabilities from not following established security patterns
**Solution:** Follow security guidelines:
- Input validation and sanitization
- Proper authentication and authorization
- Encrypted data transmission (TLS/SSL)
- Regular security audits and updates
**Prevention:** Use security linters, SAST tools, and regular dependency updates
**Best Practices:**
- Follow established patterns and conventions for service-mesh
- Keep dependencies up to date and scan for vulnerabilities
- Write comprehensive documentation and inline comments
- Use linting and formatting tools consistently
- Implement proper error handling and logging
- Regular code reviews and pair programming
- Monitor production metrics and set up alerts
---
## Level 3: Deep Dive Resources
### Official Documentation
- [Istio Documentation](https://istio.io/latest/docs/)
- [Linkerd Documentation](https://linkerd.io/2/overview/)
- [Envoy Proxy Documentation](https://www.envoyproxy.io/docs)
- [CNCF Service Mesh Landscape](https://landscape.cncf.io/guide#orchestration-management--service-mesh)
### Books
- **"Istio in Action" by Christian Posta and Rinor Maloku** - Comprehensive guide
- **"Service Mesh Patterns" by Alex Soto Bueno** - Design patterns
- **"Mastering Service Mesh" by Anjul Sahu** - Advanced concepts
### Online Courses
- [Istio Fundamentals (Tetrate Academy)](https://academy.tetrate.io/)
- [Service Mesh with Istio (Pluralsight)](https://www.pluralsight.com/)
- [Linkerd Fundamentals (Linux Foundation)](https://training.linuxfoundation.org/)
### Hands-On Labs
- [Istio Workshop](https://github.com/solo-io/istio-workshops)
- [Service Mesh Patterns](https://servicemeshpatterns.github.io/servicemeshbook/)
- [Envoy Proxy Sandbox](https://www.envoyproxy.io/docs/envoy/latest/start/sandboxes/)
### Community Resources
- [IstioCon Talks](https://events.istio.io/)
- [CNCF Webinars](https://www.cncf.io/webinars/)
- [Service Mesh Summit](https://servicemeshsummit.io/)
### Tools & Extensions
- **Kiali** - Service mesh observability
- **Flagger** - Progressive delivery operator
- **Meshery** - Multi-mesh management
- **SMI (Service Mesh Interface)** - Standard specification
### Best Practices
- [Istio Production Best Practices](https://istio.io/latest/docs/ops/best-practices/)
- [Service Mesh Security Patterns](https://www.nist.gov/publications/zero-trust-architecture)
- [Performance Tuning Guide](https://istio.io/latest/docs/ops/deployment/performance-and-scalability/)
This skill explains service mesh fundamentals and practical patterns for adding transparent service-to-service communication, observability, traffic management, and security without changing application code. It summarizes core components, installation basics, and production-focused checklists to get a mesh running reliably on Kubernetes. Content emphasizes Istio and Linkerd trade-offs and operational tasks for production readiness.
The skill inspects the control plane (configuration, certificate authority, policy distribution) and the data plane (sidecar proxies handling L4/L7 traffic and telemetry). It describes how mTLS, routing rules (VirtualService/DestinationRule), and observability tools (Prometheus, Jaeger, Kiali) integrate to provide secure, observable, and resilient service-to-service communication. It also covers common traffic management and resilience primitives like canary routing, circuit breaking, retries, and rate limiting.
When should I choose Istio vs Linkerd?
Choose Istio for advanced traffic management, multi-cluster needs, and extensive policy control; choose Linkerd when resource efficiency, simplicity, and faster onboarding matter.
Do I need to change application code to use a service mesh?
No. Sidecar proxies handle traffic, observability, and security transparently, so most applications require no code changes beyond optional tracing headers or metrics instrumentation for richer data.