
websocket-engineer-skill

/websocket-engineer-skill

This skill helps design, deploy, and scale real-time WebSocket systems with low latency, reliability, and multi-node resilience using Redis adapters.

npx playbooks add skill 404kidwiz/claude-supercode-skills --skill websocket-engineer-skill

---
name: websocket-engineer
description: Expert in real-time communication systems, including WebSockets, Socket.IO, SSE, and WebRTC.
---

# WebSocket & Real-Time Engineer

## Purpose

Provides real-time communication expertise specializing in WebSocket architecture, Socket.IO, and event-driven systems. Builds low-latency, bidirectional communication systems scaling to millions of concurrent connections.

## When to Use

- Building chat apps, live dashboards, or multiplayer games
- Scaling WebSocket servers horizontally (Redis Adapter)
- Implementing Server-Sent Events (SSE) for one-way updates
- Troubleshooting connection drops, heartbeat failures, or CORS issues
- Designing stateful connection architectures
- Migrating from polling to push technology

## Examples

### Example 1: Real-Time Chat Application

**Scenario:** Building a scalable chat platform for enterprise use.

**Implementation:**
1. Designed WebSocket architecture with Socket.IO
2. Implemented Redis Adapter for horizontal scaling
3. Created room-based message routing
4. Added message persistence and history
5. Implemented presence system (online/offline)

**Results:**
- Supports 100,000+ concurrent connections
- 50ms average message delivery
- 99.99% connection stability
- Seamless horizontal scaling

### Example 2: Live Dashboard System

**Scenario:** Real-time analytics dashboard with sub-second updates.

**Implementation:**
1. Implemented WebSocket server with low latency
2. Created efficient message batching strategy
3. Added Redis pub/sub for multi-server support
4. Implemented client-side update coalescing
5. Added compression for large payloads

**Results:**
- Dashboard updates in under 100ms
- Handles 10,000 concurrent dashboard views
- 80% reduction in server load vs polling
- Zero data loss during reconnections

### Example 3: Multiplayer Game Backend

**Scenario:** Low-latency multiplayer game server.

**Implementation:**
1. Implemented WebSocket server with binary protocols
2. Created authoritative server architecture
3. Added client-side prediction and reconciliation
4. Implemented lag compensation algorithms
5. Set up server-side physics and collision detection

**Results:**
- 30ms end-to-end latency
- Supports 1000 concurrent players per server
- Smooth gameplay despite network variations
- Cheat-resistant server authority

## Best Practices

### Connection Management

- **Heartbeats**: Implement ping/pong for connection health
- **Reconnection**: Automatic reconnection with backoff
- **State Cleanup**: Proper cleanup on disconnect
- **Connection Limits**: Prevent resource exhaustion
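
The heartbeat and cleanup practices above can be sketched as a small bookkeeping helper. The `ConnectionTracker` name and API are illustrative, not from any library; with `ws` you would call `markAlive()` from each socket's `pong` handler and terminate whatever `sweep()` returns:

```javascript
// Illustrative server-side heartbeat bookkeeping (not a library API).
class ConnectionTracker {
  constructor(timeoutMs = 30000) {
    this.timeoutMs = timeoutMs;
    this.lastSeen = new Map(); // connection id -> timestamp of last pong
  }
  markAlive(id, now = Date.now()) {
    this.lastSeen.set(id, now); // call from the socket's "pong" handler
  }
  remove(id) {
    this.lastSeen.delete(id); // cleanup on close prevents the map leaking
  }
  // Returns ids that have not answered a ping within the timeout window.
  sweep(now = Date.now()) {
    const stale = [];
    for (const [id, ts] of this.lastSeen) {
      if (now - ts > this.timeoutMs) stale.push(id);
    }
    return stale;
  }
}
```

A periodic `setInterval` would send pings and terminate every id returned by `sweep()`.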

### Scaling

- **Horizontal Scaling**: Use Redis Adapter for multi-server
- **Sticky Sessions**: Proper load balancer configuration
- **Message Routing**: Efficient routing for broadcast/unicast
- **Rate Limiting**: Prevent abuse and overload

### Performance

- **Message Batching**: Batch messages where appropriate
- **Compression**: Compress messages (permessage-deflate)
- **Binary Protocols**: Use binary for performance-critical data
- **Connection Pooling**: Efficient client connection reuse
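
Message batching can be sketched as a small queue that flushes on a size threshold or a short timer. The `MessageBatcher` name and its options are illustrative:

```javascript
// Minimal batching sketch: collect messages, flush when the batch is full
// or after a short delay, whichever comes first.
class MessageBatcher {
  constructor(flushFn, { maxSize = 50, maxDelayMs = 100 } = {}) {
    this.flushFn = flushFn;   // e.g. (batch) => socket.send(JSON.stringify(batch))
    this.maxSize = maxSize;
    this.maxDelayMs = maxDelayMs;
    this.queue = [];
    this.timer = null;
  }
  push(msg) {
    this.queue.push(msg);
    if (this.queue.length >= this.maxSize) {
      this.flush(); // size threshold reached: send immediately
    } else if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), this.maxDelayMs);
    }
  }
  flush() {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    this.flushFn(batch);
  }
}
```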

### Security

- **Authentication**: Validate on handshake
- **TLS**: Always use WSS
- **Input Validation**: Validate all incoming messages
- **Rate Limiting**: Limit connection/message rates

---

## 2. Decision Framework

### Protocol Selection

```
What is the communication pattern?
│
├─ **Bi-directional (Chat/Game)**
│  ├─ Low Latency needed? → **WebSockets (Raw)**
│  ├─ Fallbacks/Auto-reconnect needed? → **Socket.IO**
│  └─ P2P Video/Audio? → **WebRTC**
│
├─ **One-way (Server → Client)**
│  ├─ Stock Ticker / Notifications? → **Server-Sent Events (SSE)**
│  └─ Large File Download? → **HTTP Stream**
│
└─ **High Frequency (IoT)**
   └─ Constrained device? → **MQTT** (over TCP/WS)
```

### Scaling Strategy

| Scale | Architecture | Backend |
|-------|--------------|---------|
| **< 10k Users** | Monolith Node.js | Single Instance |
| **10k - 100k** | Clustering | Node.js Cluster + Redis Adapter |
| **100k - 1M** | Microservices | Go/Elixir/Rust + NATS/Kafka |
| **Global** | Edge | Cloudflare Workers / PubNub / Pusher |

### Load Balancer Config

*   **Sticky Sessions:** **REQUIRED** for Socket.IO when the HTTP long-polling fallback is enabled (the handshake spans multiple requests).
*   **Timeouts:** Increase idle timeouts (e.g., 60s+) so they exceed the heartbeat interval.
*   **Headers:** Forward `Upgrade: websocket` and `Connection: Upgrade`.
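
A corresponding nginx configuration might look like the following sketch (upstream addresses and paths are placeholders; `ip_hash` is one way to get sticky sessions):

```nginx
upstream ws_backend {
    ip_hash;                  # sticky sessions for the Socket.IO handshake
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
}

server {
    listen 443 ssl;
    location /socket.io/ {
        proxy_pass http://ws_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;   # required for the WS upgrade
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 75s;                   # longer than the ping interval
    }
}
```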

**Red Flags → Escalate to `security-engineer`:**
- Accepting connections from any Origin (`*`) with credentials
- No Rate Limiting on connection requests (DoS risk)
- Sending JWTs in URL query params (they are recorded in proxy logs); use a cookie or an initial authenticated message instead

---

## 3. Core Workflows

### Workflow 1: Scalable Socket.IO Server (Node.js)

**Goal:** Chat server capable of scaling across multiple cores/instances.

**Steps:**

1.  **Install Dependencies**
    ```bash
    npm install socket.io redis @socket.io/redis-adapter
    ```

2.  **Implementation (`server.js`)**
    ```javascript
    const { Server } = require("socket.io");
    const { createClient } = require("redis");
    const { createAdapter } = require("@socket.io/redis-adapter");

    const pubClient = createClient({ url: "redis://localhost:6379" });
    const subClient = pubClient.duplicate();

    Promise.all([pubClient.connect(), subClient.connect()]).then(() => {
      const io = new Server(3000, {
        adapter: createAdapter(pubClient, subClient),
        cors: {
          origin: "https://myapp.com",
          methods: ["GET", "POST"]
        }
      });

      io.on("connection", (socket) => {
        // User joins a room (e.g., "chat-123")
        socket.on("join", (room) => {
          socket.join(room);
        });

        // Send message to room (propagates via Redis to all nodes)
        socket.on("message", (data) => {
          io.to(data.room).emit("chat", data.text);
        });
      });
    });
    ```

---

### Workflow 3: Production Tuning (Linux)

**Goal:** Handle 50k concurrent connections on a single server.

**Steps:**

1.  **File Descriptors**
    -   Increase the per-process limit: `ulimit -n 65535` (applies to the current shell only).
    -   Persist it in `/etc/security/limits.conf` (`nofile` soft/hard limits).

2.  **Ephemeral Ports**
    -   Increase range: `sysctl -w net.ipv4.ip_local_port_range="1024 65535"`.

3.  **Memory Optimization**
    -   Use the lighter `ws` library instead of Socket.IO if its extra features are not needed.
    -   Disable per-message deflate (compression) if CPU usage is high.
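
The file-descriptor and port settings above can be persisted roughly as follows (values are illustrative):

```
# /etc/security/limits.conf - raise per-process file descriptor limits
*    soft  nofile  65535
*    hard  nofile  65535

# /etc/sysctl.conf - widen the ephemeral port range
net.ipv4.ip_local_port_range = 1024 65535
```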

---

## 5. Anti-Patterns & Gotchas

### ❌ Anti-Pattern 1: Stateful Monolith

**What it looks like:**
-   Storing `users = []` array in Node.js memory.

**Why it fails:**
-   When you scale to 2 servers, User A on Server 1 cannot talk to User B on Server 2.
-   Memory leaks crash the process.

**Correct approach:**
-   Use **Redis** as the state store (Adapter).
-   Stateless servers, Stateful backend (Redis).

### ❌ Anti-Pattern 2: The "Thundering Herd"

**What it looks like:**
-   Server restarts. 100,000 clients reconnect instantly.
-   Server crashes again due to CPU spike.

**Why it fails:**
-   Connection handshakes are expensive (TLS + Auth).

**Correct approach:**
-   **Randomized Jitter:** Clients wait `random(0, 10s)` before reconnecting.
-   **Exponential Backoff:** Wait 1s, then 2s, then 4s...
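
The two remedies combine into a single delay function ("full jitter": a random delay up to the capped exponential value). The `reconnectDelay` name is illustrative; Socket.IO clients expose similar built-in options (`reconnectionDelay`, `randomizationFactor`):

```javascript
// Exponential backoff with full jitter: attempt 0 waits up to 1s,
// attempt 1 up to 2s, attempt 2 up to 4s, ... capped at maxMs.
function reconnectDelay(attempt, { baseMs = 1000, maxMs = 30000 } = {}) {
  const exp = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.random() * exp; // randomness spreads the herd across the window
}
```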

### ❌ Anti-Pattern 3: Blocking the Event Loop

**What it looks like:**
-   `socket.on('message', () => { heavyCalculation(); })`

**Why it fails:**
-   Node.js is single-threaded. One heavy task blocks *all* 10,000 connections.

**Correct approach:**
-   Offload work to a **Worker Thread** or **Message Queue** (RabbitMQ/Bull).

---

## 7. Quality Checklist

**Scalability:**
-   [ ] **Adapter:** Redis/NATS adapter configured for multi-node.
-   [ ] **Load Balancer:** Sticky sessions enabled (if using polling fallback).
-   [ ] **OS Limits:** File descriptors limit increased.

**Resilience:**
-   [ ] **Reconnection:** Exponential backoff + Jitter implemented.
-   [ ] **Heartbeat:** Ping/Pong interval configured (< LB timeout).
-   [ ] **Fallback:** Socket.IO fallbacks (HTTP Long Polling) enabled/tested.

**Security:**
-   [ ] **WSS:** TLS enabled (Secure WebSockets).
-   [ ] **Auth:** Handshake validates credentials properly.
-   [ ] **Rate Limit:** Connection rate limiting active.

## Anti-Patterns

### Connection Management Anti-Patterns

- **No Heartbeats**: Not detecting dead connections - implement ping/pong
- **Memory Leaks**: Not cleaning up closed connections - implement proper cleanup
- **Infinite Reconnects**: Reconnect loops without backoff - implement exponential backoff
- **Sticky-Session State**: Relying on one server holding user state - keep servers stateless with Redis

### Scaling Anti-Patterns

- **Single Server**: Not scaling beyond one instance - use Redis adapter
- **No Load Balancing**: Direct connections to servers - use proper load balancer
- **Broadcast Storm**: Sending to all connections blindly - target specific connections
- **Connection Saturation**: Too many connections per server - scale horizontally

### Performance Anti-Patterns

- **Message Bloat**: Large unstructured messages - use efficient message formats
- **No Throttling**: Unlimited send rates - implement rate limiting
- **Blocking Operations**: Synchronous processing - use async processing
- **No Monitoring**: Operating blind - implement connection metrics

### Security Anti-Patterns

- **No TLS**: Using unencrypted connections - always use WSS
- **Weak Auth**: Simple token validation - implement proper authentication
- **No Rate Limits**: Vulnerable to abuse - implement connection/message limits
- **CORS Exposed**: Open cross-origin access - configure proper CORS

Overview

This skill provides expert guidance for designing, building, and operating real-time communication systems using WebSockets, Socket.IO, SSE, and WebRTC. It focuses on low-latency, bidirectional architectures that scale to large numbers of concurrent connections while maintaining reliability and security. Practical patterns cover connection management, horizontal scaling with adapters, production tuning, and protocol selection.

How this skill works

The skill inspects communication patterns, latency requirements, and deployment constraints to recommend the right protocol and architecture. It prescribes concrete implementations (e.g., Socket.IO + Redis adapter), production hardening steps (ulimit, port ranges), and operational safeguards like heartbeats, reconnection backoff, and rate limiting. It also highlights anti-patterns and provides a checklist for scalability, resilience, and security.

When to use it

  • Building chat platforms, live dashboards, or real-time multiplayer games
  • Scaling WebSocket servers horizontally across multiple instances
  • Migrating from polling to push or adding SSE for one-way updates
  • Troubleshooting connection drops, heartbeat failures, or CORS issues
  • Designing stateful connection models and presence systems

Best practices

  • Implement ping/pong heartbeats and exponential reconnection with jitter
  • Use Redis (or NATS/Kafka) adapters for multi-node message routing
  • Enforce TLS (WSS), validate auth on handshake, and apply rate limits
  • Batch messages and use binary protocols for high-frequency/large payloads
  • Tune OS limits (file descriptors, ephemeral ports) and avoid blocking the event loop

Example use cases

  • Scalable enterprise chat: Socket.IO with Redis adapter, room routing, presence, and message persistence
  • Live analytics dashboard: low-latency WebSocket server, batching, pub/sub, and client coalescing
  • Multiplayer game backend: binary WebSocket protocol, authoritative server, prediction and reconciliation
  • IoT telemetry: lightweight MQTT over WebSockets for constrained devices
  • Notification service: SSE for efficient one-way updates where the client only listens

FAQ

When should I choose Socket.IO over raw WebSockets?

Choose Socket.IO when you need built-in reconnection, fallbacks, and higher-level features; use raw WebSockets for minimal overhead and lower latency when you control both client and server.

How do I avoid the thundering herd on reconnect?

Implement exponential backoff combined with randomized jitter so clients stagger reconnection attempts rather than reconnecting all at once.