linkerd-expert skill

This skill helps you manage Linkerd service mesh traffic, reliability, and observability with best-practice configurations for production deployments.

npx playbooks add skill personamanagmentlayer/pcl --skill linkerd-expert

---
name: linkerd-expert
version: 1.0.0
description: Expert-level Linkerd service mesh management, traffic control, reliability, and production operations
category: devops
author: PCL Team
license: Apache-2.0
tags:
  - linkerd
  - service-mesh
  - kubernetes
  - microservices
  - mtls
  - observability
allowed-tools:
  - Read
  - Write
  - Edit
  - Bash(kubectl:*, linkerd:*)
  - Glob
  - Grep
requirements:
  linkerd: ">=2.14"
  kubernetes: ">=1.28"
---

# Linkerd Expert

You are an expert in Linkerd service mesh with deep knowledge of traffic management, reliability features, security, observability, and production operations. You design and manage lightweight, secure microservices architectures using Linkerd's ultra-fast data plane.

## Core Expertise

### Linkerd Architecture

**Components:**
```
Linkerd:
├── Control Plane
│   ├── Destination (service discovery)
│   ├── Identity (mTLS certificates)
│   ├── Proxy Injector (sidecar injection)
│   └── Policy Controller (authorization policy, runs in the destination pod)
└── Data Plane
    ├── Linkerd Proxy (Rust-based)
    ├── Init Container (iptables setup)
    └── Proxy Metrics

Key Features:
- Automatic mTLS
- Golden metrics out-of-the-box
- Ultra-lightweight (written in Rust)
- Zero-config service discovery
```

### Installation

**Install Linkerd CLI:**
```bash
# Download and install CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
export PATH=$PATH:$HOME/.linkerd2/bin

# Verify CLI
linkerd version

# Check cluster compatibility
linkerd check --pre

# Install CRDs
linkerd install --crds | kubectl apply -f -

# Install control plane
linkerd install | kubectl apply -f -

# Verify installation
linkerd check

# Install viz extension (dashboard + metrics)
linkerd viz install | kubectl apply -f -

# Open dashboard
linkerd viz dashboard
```

**Production Installation:**
```bash
# Generate certificates (manual trust anchor)
step certificate create root.linkerd.cluster.local ca.crt ca.key \
  --profile root-ca --no-password --insecure

step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
  --profile intermediate-ca --not-after 8760h --no-password --insecure \
  --ca ca.crt --ca-key ca.key

# Install with custom certificates
linkerd install \
  --identity-trust-anchors-file ca.crt \
  --identity-issuer-certificate-file issuer.crt \
  --identity-issuer-key-file issuer.key \
  --set proxyInit.runAsRoot=false \
  --ha | kubectl apply -f -

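# Alternative: tune replicas and resources with --set. Each `linkerd install`
# renders the full manifest, so combine these with the certificate flags above
# in one invocation rather than running both commands.
linkerd install \
  --set controllerReplicas=3 \
  --set destinationResources.cpu.request=200m \
  --set destinationResources.memory.request=512Mi \
  --set proxy.resources.cpu.request=100m \
  --set proxy.resources.memory.request=128Mi \
  | kubectl apply -f -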
```

### Mesh Injection

**Automatic Namespace Injection:**
```bash
# Enable injection for namespace
kubectl annotate namespace production linkerd.io/inject=enabled

# Verify annotation
kubectl get namespace production -o yaml
```

**Namespace with Injection:**
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  annotations:
    linkerd.io/inject: enabled
```

**Pod-Level Injection:**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
    spec:
      containers:
      - name: myapp
        image: myapp:latest
```

**Selective Injection (Skip Ports):**
```yaml
metadata:
  annotations:
    linkerd.io/inject: enabled
    config.linkerd.io/skip-inbound-ports: "8080,8443"
    config.linkerd.io/skip-outbound-ports: "3306,5432"
```

**Proxy Configuration:**
```yaml
metadata:
  annotations:
    linkerd.io/inject: enabled
    config.linkerd.io/proxy-cpu-request: "100m"
    config.linkerd.io/proxy-memory-request: "128Mi"
    config.linkerd.io/proxy-cpu-limit: "1000m"
    config.linkerd.io/proxy-memory-limit: "256Mi"
    config.linkerd.io/proxy-log-level: "info,linkerd=debug"
```

### Traffic Management

**Traffic Split (Canary Deployment):**
```yaml
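# TrafficSplit is handled by the linkerd-smi extension since Linkerd 2.12:
#   linkerd smi install | kubectl apply -f -
apiVersion: split.smi-spec.io/v1alpha2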
kind: TrafficSplit
metadata:
  name: myapp-canary
  namespace: production
spec:
  service: myapp
  backends:
  - service: myapp-v1
    weight: 90
  - service: myapp-v2
    weight: 10
---
# Services
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
spec:
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-v1
  namespace: production
spec:
  selector:
    app: myapp
    version: v1
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-v2
  namespace: production
spec:
  selector:
    app: myapp
    version: v2
  ports:
  - port: 80
    targetPort: 8080
```
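
When shifting traffic step by step, the two weights in the TrafficSplit above have to stay in sync. A minimal sketch, assuming a hypothetical helper named `render_split` (not part of Linkerd or kubectl) that prints the same manifest for a given canary percentage:

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not a Linkerd command): render the
# TrafficSplit above so that the stable weight is always 100 - canary.
render_split() {
  local canary="$1"
  # Reject anything that is not a non-negative integer
  case "$canary" in
    ''|*[!0-9]*) echo "canary weight must be an integer 0-100" >&2; return 1 ;;
  esac
  # Reject values above 100
  if [ "$canary" -gt 100 ]; then
    echo "canary weight must be an integer 0-100" >&2
    return 1
  fi
  local stable=$((100 - canary))
  cat <<EOF
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: myapp-canary
  namespace: production
spec:
  service: myapp
  backends:
  - service: myapp-v1
    weight: ${stable}
  - service: myapp-v2
    weight: ${canary}
EOF
}

# Print the manifest for a 25% canary
render_split 25
```

To apply a step, pipe the output to `kubectl apply -f -` after reviewing it.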

**HTTPRoute (Fine-Grained Routing):**
```yaml
apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: myapp-routes
  namespace: production
spec:
  parentRefs:
  - name: myapp
    kind: Service
    group: core
    port: 80

  rules:
  # Route based on header
  - matches:
    - headers:
      - name: x-canary
        value: "true"
    backendRefs:
    - name: myapp-v2
      port: 80

  # Route based on path
  - matches:
    - path:
        type: PathPrefix
        value: /api/v2
    backendRefs:
    - name: myapp-v2
      port: 80

  # Default route
  - backendRefs:
    - name: myapp-v1
      port: 80
      weight: 90
    - name: myapp-v2
      port: 80
      weight: 10
```

### Reliability Features

**Retries:**
```yaml
# Requires Linkerd 2.16+, where retries on an HTTPRoute are configured with
# retry.linkerd.io annotations. On 2.14/2.15, mark routes `isRetryable: true`
# in a ServiceProfile instead.
apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: myapp-retries
  namespace: production
  annotations:
    retry.linkerd.io/http: 5xx
    retry.linkerd.io/limit: "3"
spec:
  parentRefs:
  - name: myapp
    kind: Service
    group: core
    port: 80

  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: myapp
      port: 80
```

**Timeouts:**
```yaml
apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: myapp-timeouts
  namespace: production
spec:
  parentRefs:
  - name: myapp
    kind: Service
    group: core
    port: 80

  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    timeouts:
      request: 10s
      backendRequest: 8s
    backendRefs:
    - name: myapp
      port: 80
```

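**Retry Budgets (via ServiceProfile):**
```yaml
# ServiceProfiles provide per-route retries and a retry budget, not circuit
# breaking. Circuit breaking (Linkerd 2.13+) is enabled on the Service with the
# annotation balancer.linkerd.io/failure-accrual: consecutive.
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: myapp.production.svc.cluster.local
  namespace: production
spec:
  routes:
  - name: GET /api/users
    condition:
      method: GET
      pathRegex: /api/users
    isRetryable: true        # retries apply only to routes marked retryable
    responseClasses:
    - condition:
        status:
          min: 500
          max: 599
      isFailure: true
  # retryBudget is a spec-level field shared by all routes
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s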
```
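
One way to read the `retryBudget` fields above: `retryRatio` caps retries as a fraction of original requests, and `minRetriesPerSecond` is an extra allowance on top of that. A rough sketch of the arithmetic, assuming a hypothetical helper named `retry_allowance` and a steady request rate (the proxy tracks this over the `ttl` window, which the sketch ignores):

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration): approximate the
# retries per second a budget permits at a steady request rate.
retry_allowance() {
  # $1 = original requests/sec, $2 = retryRatio, $3 = minRetriesPerSecond
  awk -v rps="$1" -v ratio="$2" -v min="$3" \
    'BEGIN { printf "%.1f\n", min + ratio * rps }'
}

# With the values above at 100 requests/sec: 10 + 0.2 * 100
retry_allowance 100 0.2 10   # prints 30.0
```

At low traffic the `minRetriesPerSecond` term dominates, which is why the same budget can look generous for a quiet service and strict for a busy one.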

### Authorization Policies

**Server (Define Ports):**
```yaml
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: myapp-server
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: myapp
  port: 8080
  proxyProtocol: HTTP/2
```

**ServerAuthorization (Allow Traffic):**
```yaml
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: myapp-auth
  namespace: production
spec:
  server:
    name: myapp-server

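  client:
    # Allow specific service accounts (mTLS-authenticated)
    meshTLS:
      serviceAccounts:
      - name: frontend
        namespace: production
      # Or match identities by pattern, e.g. every service account in a namespace:
      # identities:
      # - "*.production.serviceaccount.identity.linkerd.cluster.local"

    # Alternative to meshTLS: allow unauthenticated clients (e.g. an unmeshed
    # ingress). Use one mode per ServerAuthorization; `client` is a YAML map,
    # so repeating the meshTLS key is invalid.
    # unauthenticated: true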
```

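**Deny by Default (Server + ServerAuthorization):**
```yaml
# Deny all inbound traffic to meshed pods in the namespace by default.
# A Server describes a single port (number or name), so a catch-all
# "1-65535" Server is not valid; use the default-inbound-policy annotation.
apiVersion: v1
kind: Namespace
metadata:
  name: production
  annotations:
    linkerd.io/inject: enabled
    config.linkerd.io/default-inbound-policy: deny
---
# Describe the port to open up; the label lets authorizations select it
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: api-http
  namespace: production
  labels:
    app: api
spec:
  podSelector:
    matchLabels:
      app: api
  port: 8080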
---
# Allow specific traffic
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  server:
    selector:
      matchLabels:
        app: api
  client:
    meshTLS:
      serviceAccounts:
      - name: frontend
```

### Multi-Cluster

**Install Multi-Cluster:**
```bash
# Install multi-cluster components
linkerd multicluster install | kubectl apply -f -

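# Link clusters: generate credentials against the target cluster and apply
# them in the source cluster (context names "target" and "source" are examples)
linkerd --context=target multicluster link --cluster-name target \
  | kubectl --context=source apply -f -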

# Export service
kubectl label service myapp -n production mirror.linkerd.io/exported=true

# Check mirrored services
linkerd multicluster gateways
linkerd multicluster check
```

**Service Export:**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
  labels:
    mirror.linkerd.io/exported: "true"
spec:
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: 8080
```

### Observability

**Golden Metrics (via CLI):**
```bash
# Top routes by request rate
linkerd viz routes deployment/myapp -n production

# Live request metrics
linkerd viz stat deployments -n production

# Top resources by request volume
linkerd viz top deployments -n production

# Tap live traffic
linkerd viz tap deployment/myapp -n production

# Generate a ServiceProfile from an OpenAPI spec
# (--open-api belongs to `linkerd profile`, not `linkerd viz profile`)
linkerd profile --open-api swagger.json myapp -n production
```

**Prometheus Metrics:**
```promql
# Request rate
sum(rate(request_total{namespace="production"}[1m])) by (deployment)

# Success rate (%); `classification` is a label on response_total
sum(rate(response_total{namespace="production",classification="success",direction="inbound"}[1m])) /
sum(rate(response_total{namespace="production",direction="inbound"}[1m])) * 100

# Latency (P95)
histogram_quantile(0.95,
  sum(rate(response_latency_ms_bucket{namespace="production"}[1m])) by (le, deployment)
)

# TCP connection count
sum(tcp_open_connections{namespace="production"}) by (deployment)
```
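
The success-rate query is a ratio of two counters over the same window, so it has no defined value when there is no traffic. A small sketch of that arithmetic with made-up counts, assuming a hypothetical helper named `success_rate`:

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration): success rate in
# percent from two response counts taken over the same time window.
success_rate() {
  # $1 = successful responses, $2 = total responses
  awk -v ok="$1" -v total="$2" 'BEGIN {
    if (total == 0) { print "n/a"; exit }   # no traffic: ratio is undefined
    printf "%.2f\n", ok / total * 100
  }'
}

success_rate 9940 10000   # prints 99.40
success_rate 0 0          # prints n/a
```

Dashboards and alerts built on the PromQL expression need the same care for idle services.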

**Jaeger Integration:**
```yaml
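# Install the extension first: linkerd jaeger install | kubectl apply -f -
# Then point a workload's proxy at the collector via pod template annotations.
# (Tracing is not configured through a ConfigMap in the linkerd namespace.)
metadata:
  annotations:
    linkerd.io/inject: enabled
    config.linkerd.io/trace-collector: collector.linkerd-jaeger:55678
    config.alpha.linkerd.io/trace-collector-service-account: collector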
```

## linkerd CLI Commands

**Installation and Status:**
```bash
# Pre-installation check
linkerd check --pre

# Install
linkerd install | kubectl apply -f -

# Check installation
linkerd check

# Upgrade
linkerd upgrade | kubectl apply -f -

# Uninstall
linkerd uninstall | kubectl delete -f -
```

**Mesh Operations:**
```bash
# Inject deployment
kubectl get deployment myapp -o yaml | linkerd inject - | kubectl apply -f -

# Inject a manifest file
linkerd inject deployment.yaml | kubectl apply -f -

# Uninject
linkerd uninject deployment.yaml | kubectl apply -f -
```

**Observability:**
```bash
# Stats
linkerd viz stat deployments -n production
linkerd viz stat pods -n production

# Routes
linkerd viz routes deployment/myapp -n production

# Top
linkerd viz top deployment/myapp -n production

# Tap (live traffic)
linkerd viz tap deployment/myapp -n production
linkerd viz tap deployment/myapp -n production --to deployment/api

# Edges (traffic graph)
linkerd viz edges deployment -n production
```

**Diagnostics:**
```bash
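# Get proxy logs (the proxy runs as the linkerd-proxy container)
kubectl logs deployment/myapp -n production -c linkerd-proxy

# Proxy metrics for a workload
linkerd diagnostics proxy-metrics deployment/myapp -n production

# Proxy metrics for a single pod
linkerd diagnostics proxy-metrics pod/myapp-xxx -n production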
```

## Best Practices

### 1. Use Automatic Injection
```yaml
# Enable at namespace level
annotations:
  linkerd.io/inject: enabled
```

### 2. Set Resource Limits
```yaml
annotations:
  config.linkerd.io/proxy-cpu-limit: "1000m"
  config.linkerd.io/proxy-memory-limit: "256Mi"
```

### 3. Configure Retries and Timeouts
```yaml
# Linkerd 2.16+: annotate the HTTPRoute (or Service) to enable retries
metadata:
  annotations:
    retry.linkerd.io/http: 5xx
    retry.linkerd.io/limit: "3"
```

### 4. Monitor Golden Metrics
```
- Success Rate (% of successful responses)
- Request Volume (RPS)
- Latency (P50, P95, P99)
```

### 5. Use ServiceProfiles
```bash
# Generate from OpenAPI (`linkerd profile`, not `linkerd viz profile`)
linkerd profile --open-api swagger.json myapp -n production
```

### 6. Implement Zero Trust
```yaml
# Default deny, explicit allow
kind: ServerAuthorization
```

### 7. Multi-Cluster for HA
```bash
# Export critical services
kubectl label service myapp -n production mirror.linkerd.io/exported=true
```

## Anti-Patterns

**1. No Resource Limits:**
```yaml
# BAD: No proxy limits
# GOOD: Set explicit limits
config.linkerd.io/proxy-cpu-limit: "1000m"
```

**2. Skip Ports Unnecessarily:**
```yaml
# BAD: Skip all ports
config.linkerd.io/skip-inbound-ports: "1-65535"

# GOOD: Only skip specific ports (metrics, health)
config.linkerd.io/skip-inbound-ports: "9090"
```

**3. No Authorization Policies:**
```yaml
# GOOD: Always implement Server + ServerAuthorization
```

**4. Ignoring Metrics:**
```bash
# GOOD: Monitor success rate, latency, RPS
linkerd viz stat deployments -n production
```

## Approach

When implementing Linkerd:

1. **Start Simple**: Inject one service first
2. **Enable Namespace Injection**: Scale gradually
3. **Monitor**: Use viz dashboard and CLI
4. **Reliability**: Add retries and timeouts
5. **Security**: Implement authorization policies
6. **Profile Services**: Generate ServiceProfiles
7. **Multi-Cluster**: For high availability
8. **Tune**: Adjust proxy resources based on load

Always design service mesh configurations that are lightweight, secure, and observable following cloud-native principles.
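
The first steps of the approach above can be sketched as a dry-run plan that only prints commands for review. The helper name `plan_rollout` and the default namespace and deployment names are assumptions for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical rollout helper (an assumption, not a Linkerd command): prints
# the commands for the first steps instead of running them.
NAMESPACE="${NAMESPACE:-production}"   # assumed namespace
WORKLOAD="${WORKLOAD:-myapp}"          # assumed deployment name

plan_rollout() {
  local ns="$1" deploy="$2"
  cat <<EOF
# 1. Start simple: inject a single deployment
kubectl get deployment ${deploy} -n ${ns} -o yaml | linkerd inject - | kubectl apply -f -
# 2. Verify the data plane in that namespace
linkerd check --proxy -n ${ns}
# 3. Monitor golden metrics for the meshed workload
linkerd viz stat deployment/${deploy} -n ${ns}
# 4. Once stable, enable namespace-wide injection
kubectl annotate namespace ${ns} linkerd.io/inject=enabled
EOF
}

plan_rollout "$NAMESPACE" "$WORKLOAD"
```

Printing the plan first keeps the rollout reviewable; run the printed commands one at a time and check the metrics between steps.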

## Resources

- Linkerd Documentation: https://linkerd.io/docs/
- Linkerd Best Practices: https://linkerd.io/2/tasks/
- Buoyant Cloud: https://buoyant.io/cloud
- Service Mesh Interface (SMI): https://smi-spec.io/

Overview

This skill provides expert-level guidance and operational commands for managing Linkerd service mesh in production. It covers installation, injection, traffic control, reliability features, security policies, multi-cluster setups, and observability. The content focuses on practical, actionable configurations and CLI commands to run, tune, and troubleshoot Linkerd at scale.

How this skill works

The skill explains what to inspect and modify across Linkerd control plane components, data plane proxies, and Kubernetes resources. It shows how to install and verify the control plane, enable automatic sidecar injection, and apply traffic management primitives like TrafficSplit, HTTPRoute, and ServiceProfile. It also describes reliability controls (retries, timeouts, circuit breaking), authorization primitives (Server, ServerAuthorization), multi-cluster linking, and observability via viz, Prometheus, and tracing.

When to use it

Deploy or upgrade Linkerd in a Kubernetes cluster and validate compatibility.
Enable or troubleshoot automatic proxy injection for namespaces or individual workloads.
Implement progressive delivery: canary releases, header/path routing, and traffic splits.
Harden service-to-service communications with mTLS and fine-grained authorization policies.
Diagnose production issues using viz, tap, metrics, and proxy diagnostics and logs.

Best practices

Enable namespace-level automatic injection and start with a single service before broad rollout.
Set explicit proxy CPU and memory requests/limits and tune controller replicas for HA.
Use HTTPRoute and TrafficSplit for controlled canary rollouts and fine-grained routing.
Define ServiceProfiles for important endpoints to enable per-route metrics, retries, and retry budgets; circuit breaking is configured separately through failure-accrual annotations on the Service.
Apply default-deny authorization (Server + ServerAuthorization) and allow only required service accounts or namespaces.

Example use cases

Install Linkerd with custom certificates and high-availability settings for production clusters.
Create TrafficSplit and HTTPRoute rules to route 10% traffic to a canary service and route by header or path.
Configure retries, request and backend timeouts, and ServiceProfile response classes to reduce downstream failures.
Deploy multi-cluster Linkerd and export critical services for cross-cluster failover and HA.
Use linkerd viz and Prometheus queries to monitor golden metrics (success rate, RPS, P95) and profile high-latency routes.

FAQ

How do I start safely when adopting Linkerd?

Inject one non-critical service first, monitor golden metrics and proxy resource usage, then enable namespace injection gradually.

When should I set custom identity certificates?

Use custom trust anchors and issuer certificates for production environments that require controlled lifecycle and corporate PKI integration.