ai-engineer skill

/.agent/skills/ai-engineer

This skill guides building production-grade GenAI and agentic systems with robust evaluation, advanced RAG, and scalable MLOps.

npx playbooks add skill kienhaminh/anti-chaotic --skill ai-engineer

Review the files below or copy the command above to add this skill to your agents.

Files (6)
SKILL.md
1.5 KB
---
name: ai-engineer
description: Use when building production-grade GenAI, Agentic Systems, Advanced RAG, or setting up rigorous Evaluation pipelines.
license: MIT
metadata:
  version: "2.0"
---

# AI Engineering Standards

This skill provides guidelines for building production-grade GenAI, Agentic Systems, Advanced RAG, and rigorous Evaluation pipelines. The focus is on robustness, scalability, and engineering reliability into inherently stochastic systems.

## Core Responsibilities

1.  **Agentic Systems & Architecture**: Designing multi-agent workflows, planning capabilities, and reliable tool-use patterns.
2.  **Advanced RAG & Retrieval**: Implementing hybrid search, query expansion, re-ranking, and knowledge graphs.
3.  **Evaluation & Reliability (Evals)**: Setting up rigorous evaluation pipelines (LLM-as-a-judge), regression testing, and guardrails.
4.  **Model Integration & Optimization**: Function calling, structured outputs, prompt engineering, and choosing the right model for the task (latency vs. intelligence trade-offs).
5.  **MLOps & Serving**: Observability, tracing, caching, and cost management.

## Dynamic Stack Loading

- **Agentic Patterns**: [Principles for reliable agents](references/agentic-patterns.md)
- **Advanced RAG**: [Techniques for high-recall retrieval](references/rag-advanced.md)
- **Evaluation Frameworks**: [Testing & Metrics](references/evaluation.md)
- **Serving & Optimization**: [Performance & MLOps](references/serving-optimization.md)
- **LLM Fundamentals**: [Prompting & SDKs](references/llm.md)

Overview

This skill codifies engineering standards for building production-grade generative AI, agentic systems, advanced retrieval-augmented generation (RAG), and rigorous evaluation pipelines. It focuses on robustness, scalability, and operational reliability for inherently stochastic systems. The guidance helps teams move from prototypes to repeatable, observable production services.

How this skill works

It inspects system design across five core areas: multi-agent orchestration, advanced retrieval and knowledge integration, evaluation and regression testing, model integration and optimization, and production serving with MLOps practices. The skill maps common failure modes to practical mitigations—re-ranking, hybrid search, structured outputs, traceable prompts, and observability hooks. It also provides modular patterns so teams can load only the capabilities they need for a given project.
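
To make the hybrid-search and re-ranking mitigations concrete, here is a minimal Python sketch of reciprocal rank fusion over two ranked result lists. The `dense_search` and `sparse_search` callables are hypothetical stand-ins for a vector-store query and a keyword (BM25-style) query, not any particular library's API.

```python
# Minimal sketch of hybrid retrieval fused with reciprocal rank fusion (RRF).
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, dense_search, sparse_search, top_k: int = 5) -> list[str]:
    dense_hits = dense_search(query)    # ranked doc ids from an embedding index
    sparse_hits = sparse_search(query)  # ranked doc ids from keyword matching
    return reciprocal_rank_fusion([dense_hits, sparse_hits])[:top_k]

if __name__ == "__main__":
    dense = lambda q: ["doc3", "doc1", "doc7"]   # illustrative fixed results
    sparse = lambda q: ["doc1", "doc5", "doc3"]
    print(hybrid_search("refund policy", dense, sparse))
```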

When to use it

  • Building multi-agent workflows that need reliable planning and tool use.
  • Implementing high-recall, high-precision RAG with hybrid search and re-ranking.
  • Setting up evaluation pipelines with automated metrics and LLM-as-a-judge checks (see the sketch after this list).
  • Integrating models into services where latency, cost, and accuracy must be balanced.
  • Operating GenAI systems in production with monitoring, tracing, and caching.
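
The evaluation scenario above can be approached roughly as follows. This is a hedged sketch, not a prescribed pipeline: `call_llm` is a hypothetical client function to be replaced with your provider's SDK, and the PASS/FAIL prompt and baseline threshold are illustrative.

```python
# Sketch of an LLM-as-a-judge regression check suitable for a CI job.
JUDGE_PROMPT = (
    "You are grading an answer for factual correctness.\n"
    "Question: {question}\nReference: {reference}\nAnswer: {answer}\n"
    "Reply with exactly PASS or FAIL."
)

def judge(call_llm, question: str, reference: str, answer: str) -> bool:
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    return verdict.strip().upper().startswith("PASS")

def run_eval(call_llm, cases: list[dict], baseline_pass_rate: float) -> None:
    passed = sum(judge(call_llm, c["question"], c["reference"], c["answer"])
                 for c in cases)
    pass_rate = passed / len(cases)
    print(f"pass rate: {pass_rate:.2%} (baseline {baseline_pass_rate:.2%})")
    # Fail the CI job if quality regresses below the recorded baseline.
    assert pass_rate >= baseline_pass_rate, "regression against baseline"
```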

Best practices

  • Design agents with clear capabilities, contracts, and failure-handling strategies.
  • Use hybrid retrieval (dense + sparse) plus query expansion and re-ranking for robustness.
  • Treat evaluations as CI: automated tests, regression baselines, and human-in-the-loop checks.
  • Prefer structured outputs and function calling to reduce hallucinations and parsing errors (a minimal sketch follows this list).
  • Instrument models and agents for observability: latency, token usage, drift, and error rates.
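
For the structured-outputs practice, one possible shape, assuming pydantic v2 is available, is to define a schema, ask the model for JSON that matches it, and validate before acting on it. `SupportAction` and `parse_action` are illustrative names, not part of any SDK.

```python
# Sketch of validating a structured LLM output instead of parsing free text.
from pydantic import BaseModel, Field, ValidationError

class SupportAction(BaseModel):
    intent: str = Field(description="one of: refund, escalate, answer")
    confidence: float = Field(ge=0.0, le=1.0)
    reply: str

def parse_action(raw_json: str) -> SupportAction | None:
    try:
        return SupportAction.model_validate_json(raw_json)
    except ValidationError:
        return None  # caller retries with a repair prompt or falls back

good = parse_action('{"intent": "refund", "confidence": 0.92, "reply": "Refund issued."}')
bad = parse_action('{"intent": "refund", "confidence": 2.0}')  # out of range, missing reply
print(good, bad)
```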

Example use cases

  • A customer support system combining retrieval over a knowledge base with agents that call backend APIs safely.
  • A research assistant that merges web, vector store, and knowledge-graph signals with re-ranking for precise answers.
  • An evaluation pipeline that runs nightly regression tests with LLM judges and alerts on metric regression.
  • A document ingestion pipeline that uses query expansion, chunking, and caching to speed up RAG responses (see the sketch below).
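
A rough sketch of the chunking and caching parts of such a pipeline, using only the standard library. The fixed-size chunker and `answer_query` are illustrative choices, not a recommended ingestion design.

```python
# Overlapping fixed-size chunking plus a response cache keyed by normalized query.
import hashlib

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks so content near a boundary
    still co-occurs in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def cache_key(query: str) -> str:
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

_cache: dict[str, str] = {}

def answer_query(query: str, generate) -> str:
    key = cache_key(query)
    if key not in _cache:            # only call the model on a cache miss
        _cache[key] = generate(query)
    return _cache[key]
```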

FAQ

How do I choose between latency-optimized and accuracy-optimized models?

Base the choice on task requirements: use smaller, faster models for interactive latency constraints and larger models or cascaded rerankers when higher accuracy justifies cost and delay.
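
One way to encode that trade-off is a simple router that cascades from a fast model to a stronger one. `call_fast` and `call_strong` are hypothetical client functions, and the escalation heuristics here are placeholders to be tuned per task.

```python
# Illustrative latency/accuracy routing: fast model first, escalate when needed.
def route(prompt: str, call_fast, call_strong, max_fast_chars: int = 2000) -> str:
    hard = len(prompt) > max_fast_chars or "analyze" in prompt.lower()
    if hard:
        return call_strong(prompt)           # accuracy-optimized path
    answer = call_fast(prompt)               # latency-optimized path
    if answer.strip().lower() in {"", "i don't know"}:
        return call_strong(prompt)           # cascade on a weak first answer
    return answer
```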

What are the first observability metrics to add?

Start with request latency, token consumption, success/failure rates, and semantic regression checks against baseline outputs.
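
A minimal instrumentation sketch for those first metrics, assuming you wrap the model call yourself rather than relying on a specific observability SDK. `count_tokens` is a crude heuristic, not a real tokenizer.

```python
# Wrap model calls to record latency, rough token counts, and failures.
import time

def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough approximation; swap in a real tokenizer

def observed_call(call_llm, prompt: str, metrics: list[dict]) -> str | None:
    start = time.perf_counter()
    try:
        answer = call_llm(prompt)
        error = None
    except Exception as exc:           # record failures instead of losing them
        answer, error = None, repr(exc)
    metrics.append({
        "latency_s": time.perf_counter() - start,
        "prompt_tokens": count_tokens(prompt),
        "completion_tokens": count_tokens(answer or ""),
        "error": error,
    })
    return answer
```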