home / skills / darshil321 / fynd-backend-skills / fynd-backend-microservices

fynd-backend-microservices skill

/skills/fynd-backend-microservices

This skill provides expert debugging for FYND's Kubernetes, Kafka, Redis, and Node.js backend to reduce outages and latency.

npx playbooks add skill darshil321/fynd-backend-skills --skill fynd-backend-microservices

Review the files below or copy the command above to add this skill to your agents.

Files (6)
SKILL.md
1.9 KB
---
name: fynd-backend-microservices
description: Expert debugging for FYND's Kubernetes/GCP/Kafka/Node.js backend. Use for pods crashing, Kafka lag, Redis memory high, API latency, DB migration failures, LLM cost spikes, memory leaks, and service failures.
---

# FYND Backend Microservices 🛠️

**Expert debugging for FYND's Kubernetes/GCP/Kafka/Node.js backend.**

## 🎯 Use When
```
"Pods crashing" | "Kafka lag" | "Redis memory high" | "API latency"
"Database migration failed" | "LLM costs spiking" | "Memory leak" | "Service failures"
```

## 🛠️ 8 Core Skills
1. **K8s/GCP Deployment** (40% issues) - pods, scaling, graceful shutdown
2. **Kafka Resilience** (25%) - consumer lag, DLQ, rebalancing
3. **Redis Optimization** (15%) - memory, TTL, pub/sub
4. **Distributed Tracing** (10%) - correlation IDs, Langfuse
5. **Database Patterns** (8%) - Sequelize, pgvector, MongoDB
6. **LangGraph Orchestration** (5%) - multi-LLM, token counting
7. **Performance Analysis** (4%) - heap profiling, slow queries
8. **Resilience Patterns** (3%) - circuit breaker, backoff

## 🔍 Diagnostic Flow
1. Gather metrics (kubectl, kafka-consumer-groups, redis-cli)
2. Form hypotheses (OOM? poison pill? slow query?)
3. Test systematically
4. Provide fix + monitoring

## 📦 Bundled Resources (load as needed)
- `scripts/diagnose.js` - quick pod snapshot (`kubectl get pods`)
- `references/patterns.md` - fast CLI patterns for pod crash, Kafka lag, Redis memory
- `references/fynd-backend-skills.md` - full architecture + skills matrix + flowcharts + checklists
- `references/fynd-agent-integration.md` - LangGraph/LangChain tool integration guide
- `references/fynd-backend-skill-template.md` - template for creating new skills or extensions

**Search tips:** use `rg -n "Use Cases|Agent Actions|Checklist|Flowchart"` in the reference files to jump to relevant sections quickly.

## 📊 Success
94% accuracy | <5s response | 30% MTTR reduction | $740k/year savings

Overview

This skill provides expert debugging and operational guidance for FYND’s Node.js backend running on Kubernetes, GCP, Kafka, Redis, and related services. I focus on fast triage and practical fixes for pods crashing, Kafka lag, Redis memory issues, API latency, DB migration failures, memory leaks, LLM cost spikes, and general service failures. The goal is to reduce mean time to recovery and prevent recurrence with concrete remediation and monitoring guidance.

How this skill works

I start by collecting key signals: Kubernetes pod states and logs, consumer group offsets, Redis memory usage, application traces, and database errors. I form hypotheses (OOM, poison message, slow query, leaking allocation), test them with targeted commands or small experiments, and deliver step-by-step fixes plus monitoring and resilience recommendations. Bundled scripts and checklists speed common diagnostics and ensure repeatable remediation.

When to use it

  • Pod restarts, CrashLoopBackOff, or failing deployments on Kubernetes
  • Sustained Kafka consumer lag, DLQ growth, or rebalancing failures
  • Redis memory spikes, TTL misconfigurations, or pub/sub issues
  • Unexplained API latency, high error rates, or throughput drops
  • Database migration errors, schema drift, or replication problems
  • Unexpected LLM cost increases or runaway token usage

Best practices

  • Gather structured signals first: kubectl, consumer offsets, Redis INFO, APM traces
  • Reproduce minimally and test hypotheses before wide changes
  • Add graceful shutdown handlers and resource limits to pods
  • Implement circuit breakers, backoff, and DLQ strategies for resilience
  • Profile memory and CPU regularly to catch leaks early
  • Automate alerts with meaningful thresholds and postmortem checklists

Example use cases

  • Diagnose and fix a CrashLoopBackOff caused by missing graceful shutdown and OOM
  • Recover consumers from persistent Kafka lag by identifying poison messages and reprocessing via DLQ
  • Reduce Redis memory footprint by auditing keys, adjusting TTLs, and refining eviction policy
  • Trace an API latency spike to a slow DB query and deploy a short-term cache plus optimized query
  • Contain rising LLM costs by adding token limits, batching requests, and routing to cheaper models

FAQ

How quickly can I get a triage plan?

I provide an initial diagnostic checklist and prioritized hypotheses within minutes, and a tested remediation plan within the same incident session.

Do you require production access for diagnostics?

I can work from logs and metrics snapshots, but live access accelerates root-cause identification and fixes; minimal, read-only access is sufficient for most diagnostics.