
kafka-engineer-skill

/kafka-engineer-skill

This skill helps design and optimize Apache Kafka real-time pipelines, enabling fault-tolerant streaming, schema management, and efficient Connect deployments.

npx playbooks add skill 404kidwiz/claude-supercode-skills --skill kafka-engineer-skill


Files (1): SKILL.md (9.2 KB)
---
name: kafka-engineer
description: Expert in Apache Kafka, Event Streaming, and Real-time Data Pipelines. Specializes in Kafka Connect, KSQL, and Schema Registry.
---

# Kafka Engineer

## Purpose

Provides Apache Kafka and event streaming expertise specializing in scalable event-driven architectures and real-time data pipelines. Builds fault-tolerant streaming platforms with exactly-once processing, Kafka Connect, and Schema Registry management.

## When to Use

- Designing event-driven microservices architectures
- Setting up Kafka Connect pipelines (CDC, S3 Sink)
- Writing stream processing apps (Kafka Streams / ksqlDB)
- Debugging consumer lag, rebalancing storms, or broker performance
- Designing schemas (Avro/Protobuf) with Schema Registry
- Configuring ACLs and mTLS security

---

## 2. Decision Framework

### Architecture Selection

```
What is the use case?
│
├─ **Data Integration (ETL)**
│  ├─ DB to DB/Data Lake? → **Kafka Connect** (Zero code)
│  └─ Complex transformations? → **Kafka Streams**
│
├─ **Real-Time Analytics**
│  ├─ SQL-like queries? → **ksqlDB** (Quick aggregation)
│  └─ Complex stateful logic? → **Kafka Streams / Flink**
│
└─ **Microservices Comm**
   ├─ Event Notification? → **Standard Producer/Consumer**
   └─ Event Sourcing? → **State Stores (RocksDB)**
```

### Config Tuning (The "Big 3")

1.  **Throughput:** `batch.size`, `linger.ms`, `compression.type=lz4`.
2.  **Latency:** `linger.ms=0`, `acks=1`.
3.  **Durability:** `acks=all`, `min.insync.replicas=2`, `replication.factor=3`.
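
A minimal producer-side sketch of these profiles (broker address and topic name are assumed placeholders); pick the profile that matches the workload rather than mixing them:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Throughput profile: batch aggressively and compress.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);        // 64 KB batches
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);            // wait up to 20 ms to fill a batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        // Durability profile: acks=all pairs with min.insync.replicas=2 on the topic.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        // A latency profile would instead set linger.ms=0 and acks=1.

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key", "value"));  // "events" is illustrative
        }
    }
}
```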

**Red Flags → Escalate to `sre-engineer`:**
- "Unclean leader election" enabled (Data loss risk)
- Zookeeper dependency in new clusters (Use KRaft mode)
- Disk usage > 80% on brokers
- Consumer lag constantly increasing (Capacity mismatch)

---

## 3. Core Workflows

### Workflow 1: Kafka Connect (CDC)

**Goal:** Stream changes from PostgreSQL to S3.

**Steps:**

1.  **Source Config (`postgres-source.json`)**
    ```json
    {
      "name": "postgres-source",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db-host",
        "database.port": "5432",
        "database.user": "kafka",
        "database.password": "<secret>",
        "database.dbname": "mydb",
        "topic.prefix": "mydb",
        "plugin.name": "pgoutput"
      }
    }
    ```
    (`topic.prefix` is required on Debezium 2.x; older versions use `database.server.name` instead.)

2.  **Sink Config (`s3-sink.json`)**
    ```json
    {
      "name": "s3-sink",
      "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "mydb.public.orders",
        "s3.bucket.name": "my-datalake",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        "flush.size": "1000"
      }
    }
    ```

3.  **Deploy**
    -   `curl -X POST -H "Content-Type: application/json" -d @postgres-source.json http://connect:8083/connectors`
    -   Repeat with `s3-sink.json` to register the sink.
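
4.  **Verify**
    -   `curl http://connect:8083/connectors/postgres-source/status` should show the connector and its tasks in the `RUNNING` state.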

---

### Workflow 3: Schema Registry Integration

**Goal:** Enforce schema compatibility.

**Steps:**

1.  **Define Schema (`user.avsc`)**
    ```json
    {
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"}
      ]
    }
    ```

2.  **Producer (Java)**
    -   Use `KafkaAvroSerializer` as the value serializer.
    -   Set `schema.registry.url` to `http://schema-registry:8081`.
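
A minimal producer sketch, assuming the Confluent `kafka-avro-serializer` dependency is on the classpath and a `users` topic exists (broker address is a placeholder):

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UserAvroProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");

        // Same schema as user.avsc; the serializer registers it under the "users-value" subject.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"int\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 1);
        user.put("name", "Alice");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "1", user));
        }
    }
}
```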

---

## 5. Anti-Patterns & Gotchas

### ❌ Anti-Pattern 1: Large Messages

**What it looks like:**
-   Sending a 10 MB image payload in a Kafka message.

**Why it fails:**
-   Kafka is optimized for small messages (< 1 MB); large messages tie up broker request threads and memory.

**Correct approach:**
-   Store the image in **S3**.
-   Send a **reference URL** in the Kafka message (claim-check pattern, sketched below).
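
A brief sketch of that claim-check flow, assuming the AWS SDK v2 and an `image-events` topic (bucket name and key layout are illustrative):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ImageEventProducer {
    public static void send(S3Client s3, KafkaProducer<String, String> producer,
                            byte[] image, String imageId) {
        // Large payload goes to object storage...
        String key = "images/" + imageId + ".jpg";
        s3.putObject(PutObjectRequest.builder().bucket("my-datalake").key(key).build(),
                     RequestBody.fromBytes(image));
        // ...and only a small reference travels through Kafka.
        String event = "{\"imageId\":\"" + imageId + "\",\"uri\":\"s3://my-datalake/" + key + "\"}";
        producer.send(new ProducerRecord<>("image-events", imageId, event));
    }
}
```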

### ❌ Anti-Pattern 2: Too Many Partitions

**What it looks like:**
-   Creating 10,000 partitions on a small cluster.

**Why it fails:**
-   Slow leader election and controller failover (ZooKeeper metadata overhead).
-   High file handle and memory usage per broker.

**Correct approach:**
-   Keep partitions per broker within limits (roughly 4,000 on ZooKeeper-based clusters; KRaft raises the ceiling). Use fewer partitions per topic or add brokers.

### ❌ Anti-Pattern 3: Blocking Consumer

**What it looks like:**
-   A consumer making a slow HTTP call (30 s) for every message inside the poll loop.

**Why it fails:**
-   Rebalance storms: the consumer exceeds `max.poll.interval.ms`, gets kicked from the group, and triggers repeated rebalances.

**Correct approach:**
-   **Async processing:** hand work to a thread pool off the poll thread.
-   **Pause/Resume:** `consumer.pause()` while the work queue is full, `consumer.resume()` once it drains (see the sketch below).
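
A hedged sketch of the pause/resume approach (broker address, topic, pool size, and watermarks are assumptions; offset commit handling is omitted for brevity):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NonBlockingConsumer {
    private static final int HIGH_WATERMARK = 500; // pause fetching above this many in-flight records
    private static final int LOW_WATERMARK = 100;  // resume once the backlog drains

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "slow-workers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        ExecutorService pool = Executors.newFixedThreadPool(8);
        AtomicInteger inFlight = new AtomicInteger();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            boolean paused = false;
            while (true) {
                // Keep polling even when paused so the consumer honours max.poll.interval.ms
                // and is not ejected from the group.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
                for (ConsumerRecord<String, String> record : records) {
                    inFlight.incrementAndGet();
                    pool.submit(() -> {
                        try {
                            handle(record);               // the slow call runs off the poll thread
                        } finally {
                            inFlight.decrementAndGet();
                        }
                    });
                }
                if (!paused && inFlight.get() > HIGH_WATERMARK) {
                    consumer.pause(consumer.assignment());  // stop fetching, keep group membership
                    paused = true;
                } else if (paused && inFlight.get() < LOW_WATERMARK) {
                    consumer.resume(consumer.assignment());
                    paused = false;
                }
            }
        }
    }

    private static void handle(ConsumerRecord<String, String> record) {
        // e.g. the 30 s HTTP call; committing offsets for processed records is out of scope here
    }
}
```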

---

## 7. Quality Checklist

**Configuration:**
-   [ ] **Replication:** Factor 3 for production.
-   [ ] **Min.ISR:** 2 (Prevents data loss).
-   [ ] **Retention:** Configured correctly (Time vs Size).

**Observability:**
-   [ ] **Lag:** Consumer lag monitored (Burrow/Prometheus).
-   [ ] **Under-replicated:** Alert on under-replicated partitions (> 0).
-   [ ] **JMX:** Metrics exported.

**Security:**
-   [ ] **Encryption:** TLS enabled for client-broker and inter-broker traffic.
-   [ ] **Auth:** SASL/SCRAM or mTLS enabled.
-   [ ] **ACLs:** Principle of least privilege (topic read/write).

## Examples

### Example 1: Real-Time Fraud Detection Pipeline

**Scenario:** A financial services company needs real-time fraud detection using Kafka streaming.

**Architecture Implementation:**
1. **Event Ingestion**: Kafka Connect CDC from PostgreSQL transaction database
2. **Stream Processing**: Kafka Streams application for real-time pattern detection
3. **Alert System**: Producer to alert topic triggering notifications
4. **Storage**: S3 sink for historical analysis and compliance

**Pipeline Configuration:**
| Component | Configuration | Purpose |
|-----------|---------------|---------|
| Topics | 3 (transactions, alerts, enriched) | Data organization |
| Partitions | 12 (3 brokers × 4) | Parallelism |
| Replication | 3 | High availability |
| Compression | LZ4 | Throughput optimization |

**Key Logic:**
- Detects velocity patterns (5+ transactions in 1 minute)
- Identifies geographic anomalies (impossible travel)
- Flags high-risk merchant categories
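
A hedged Kafka Streams sketch of the velocity rule only (topic names, JSON-as-string values, and the Kafka Streams 3.x windowing API are assumptions; the geographic and merchant rules are omitted):

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class VelocityRule {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-velocity");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Transactions are assumed to be keyed by card id, with the raw JSON as the value.
        KStream<String, String> transactions =
            builder.stream("transactions", Consumed.with(Serdes.String(), Serdes.String()));

        transactions
            .groupByKey()
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
            .count()
            .toStream()
            .filter((windowedCardId, count) -> count >= 5)      // velocity threshold
            .map((windowedCardId, count) -> KeyValue.pair(
                windowedCardId.key(),
                "{\"cardId\":\"" + windowedCardId.key() + "\",\"txCount\":" + count + "}"))
            // Note: each further transaction in the window re-emits; suppression/dedup omitted.
            .to("alerts", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```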

**Results:**
- 99.7% of fraud detected in under 100ms
- False positive rate reduced from 5% to 0.3%
- Compliance audit passed with zero findings

### Example 2: E-Commerce Order Processing System

**Scenario:** Build a resilient order processing system with Kafka for high reliability.

**System Design:**
1. **Order Events**: Topic for order lifecycle events
2. **Inventory Service**: Consumes orders, updates stock
3. **Payment Service**: Processes payments, publishes results
4. **Notification Service**: Sends confirmations via email/SMS

**Resilience Patterns:**
- Dead Letter Queue for failed processing
- Idempotent producers for exactly-once semantics
- Consumer groups with manual offset management
- Retries with exponential backoff

**Configuration:**
```yaml
# Producer Configuration
acks: all
retries: 3
enable.idempotence: true

# Consumer Configuration
auto.offset.reset: earliest
enable.auto.commit: false
max.poll.records: 500
```
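
A minimal consumer sketch combining two of the resilience patterns above, manual offset commits and a dead letter queue (broker address, topic names, and the `orders.dlq` topic are assumptions; retries with backoff are omitted for brevity):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderProcessor {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "broker:9092");
        consumerProps.put("group.id", "order-processor");
        consumerProps.put("enable.auto.commit", "false");   // manual offset management
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "broker:9092");
        producerProps.put("acks", "all");
        producerProps.put("enable.idempotence", "true");    // no duplicates on producer retry
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> dlqProducer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    try {
                        processOrder(record.value());
                    } catch (Exception e) {
                        // Poison message: park it on the DLQ instead of blocking the partition.
                        dlqProducer.send(new ProducerRecord<>("orders.dlq", record.key(), record.value()));
                    }
                }
                consumer.commitSync();   // commit only after the whole batch has been handled
            }
        }
    }

    private static void processOrder(String order) {
        // inventory + payment logic
    }
}
```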

**Results:**
- 99.99% message delivery reliability
- Zero duplicate orders in 6 months
- Peak processing: 10,000 orders/second

### Example 3: IoT Telemetry Platform

**Scenario:** Process millions of IoT device telemetry messages with Kafka.

**Platform Architecture:**
1. **Device Gateway**: MQTT to Kafka proxy
2. **Data Enrichment**: Stream processing adds device metadata
3. **Time-Series Storage**: S3 sink partitioned by device_id/date
4. **Real-Time Alerts**: Threshold-based alerting for anomalies

**Scalability Configuration:**
- 50 partitions for parallel processing
- Compression enabled for cost optimization
- Retention: 7 days hot, 1 year cold in S3
- Schema Registry for data contracts

**Performance Metrics:**
| Metric | Value |
|--------|-------|
| Throughput | 500,000 messages/sec |
| Latency (P99) | 50ms |
| Consumer lag | < 1 second |
| Storage efficiency | 60% reduction with compression |

## Best Practices

### Topic Design

- **Naming Conventions**: Use clear, hierarchical topic names (domain.entity.event)
- **Partition Strategy**: Plan for future growth (3x expected throughput)
- **Retention Policies**: Match retention to business requirements
- **Cleanup Policies**: Use delete for time-based, compact for state
- **Schema Management**: Enforce schemas via Schema Registry
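
A small sketch of codifying these choices with the Kafka `AdminClient` (topic name, partition count, and retention value are illustrative):

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // domain.entity.event naming, 12 partitions, replication factor 3
            NewTopic topic = new NewTopic("payments.transaction.created", 12, (short) 3)
                .configs(Map.of(
                    "retention.ms", "604800000",      // 7 days, time-based cleanup
                    "cleanup.policy", "delete",
                    "min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```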

### Producer Optimization

- **Batching**: Increase batch.size and linger.ms for throughput
- **Compression**: Use LZ4 for balance of speed and size
- **Acks Configuration**: Use all for reliability, 1 for latency
- **Retry Strategy**: Implement retries with backoff
- **Idempotence**: Enable for exactly-once semantics in critical paths

### Consumer Best Practices

- **Offset Management**: Use manual commit for critical processing
- **Batch Processing**: Increase max.poll.records for efficiency
- **Rebalance Handling**: Implement graceful shutdown
- **Error Handling**: Dead letter queues for poison messages
- **Monitoring**: Track consumer lag and processing time

### Security Configuration

- **Encryption**: TLS for all client-broker communication
- **Authentication**: SASL/SCRAM or mTLS for production
- **Authorization**: ACLs with least privilege principle
- **Quotas**: Implement client quotas to prevent abuse
- **Audit Logging**: Log all access and configuration changes
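
A hedged sketch of the client-side properties this typically implies for SASL/SCRAM over TLS (listener port, truststore path, and credentials are placeholders):

```java
import java.util.Properties;

public class SecureClientConfig {
    public static Properties secureProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9093");                 // assumed TLS listener
        props.put("security.protocol", "SASL_SSL");                    // encrypt and authenticate
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.scram.ScramLoginModule required "
            + "username=\"orders-app\" password=\"<secret>\";");
        props.put("ssl.truststore.location", "/etc/kafka/secrets/truststore.jks"); // placeholder path
        props.put("ssl.truststore.password", "<secret>");
        return props;
    }
}
```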

### Performance Tuning

- **Broker Configuration**: Optimize for workload type (throughput vs latency)
- **JVM Tuning**: Heap size and garbage collector selection
- **OS Tuning**: File descriptor limits, network settings
- **Monitoring**: Metrics for throughput, latency, and errors
- **Capacity Planning**: Regular review and scaling assessment

Overview

This skill provides expert guidance on Apache Kafka, event streaming, and real-time data pipelines. It helps design scalable, fault-tolerant streaming architectures, implement Kafka Connect and ksqlDB pipelines, and manage Schema Registry for schema governance. The focus is on practical configuration, operational troubleshooting, and production-grade best practices.

How this skill works

I assess use cases to recommend the right Kafka components (Connect, Streams, ksqlDB, or plain producers/consumers) and provide configuration patterns for throughput, latency, and durability. I supply connector and schema examples, anti-patterns to avoid, observability checks, and tuning steps for brokers, clients, and security. I translate requirements into concrete deployment and monitoring actions to reduce risk and improve performance.

When to use it

  • Designing event-driven microservices or event sourcing models
  • Setting up CDC or sink pipelines with Kafka Connect (databases → topics → S3)
  • Building stream processing apps with Kafka Streams or ksqlDB for real-time analytics
  • Enforcing data contracts with Schema Registry (Avro/Protobuf) and compatibility rules
  • Troubleshooting consumer lag, rebalance storms, broker disk pressure, or under-replicated partitions
  • Configuring production security: TLS, SASL/mTLS, ACLs and quotas

Best practices

  • Choose technology by use case: Connect for ETL, ksqlDB for SQL-like queries, Streams for stateful processing
  • Tune the big three: throughput (batch.size, linger.ms, compression.type=lz4), latency (linger.ms=0, acks=1), durability (acks=all, min.insync.replicas≥2)
  • Keep messages small; store large payloads in object storage and send references
  • Enforce schemas via Schema Registry and set compatibility rules before production
  • Monitor consumer lag, under-replicated partitions, and disk usage; alert when thresholds are crossed
  • Use idempotent producers, dead-letter queues, and manual offset commits for critical processing

Example use cases

  • Real-time fraud detection: CDC→Kafka Connect, Kafka Streams pattern detection, alerts topic, S3 sink for compliance
  • E-commerce order processing: event-driven order lifecycle, idempotent producers, DLQ, manual offset management for reliability
  • IoT telemetry: MQTT→Kafka gateway, enrichment streams, partitioned S3 sinks, schema registry for device contracts
  • Data lake ingestion: Debezium CDC to topics, S3 sink with Parquet format and partitioning for downstream analytics

FAQ

How do I choose between ksqlDB and Kafka Streams?

Use ksqlDB for quick SQL-style aggregations and interactive queries. Use Kafka Streams for complex, stateful processing, custom logic, and tight JVM integration.

What are the most dangerous misconfigurations?

Enabling unclean leader election, keeping ZooKeeper for new clusters instead of KRaft, and letting broker disk usage exceed ~80% are top risks; escalate to SRE in those cases.