
llm-integration skill

/plugins/ork/skills/llm-integration

This skill provides production-ready LLM integration patterns for function calling, streaming, local inference, and fine-tuning.

npx playbooks add skill yonatangross/orchestkit --skill llm-integration


SKILL.md
---
name: llm-integration
license: MIT
compatibility: "Claude Code 2.1.34+."
description: LLM integration patterns for function calling, streaming responses, local inference with Ollama, and fine-tuning customization. Use when implementing tool use, SSE streaming, local model deployment, LoRA/QLoRA fine-tuning, or multi-provider LLM APIs.
tags: [llm, function-calling, streaming, ollama, fine-tuning, lora, tool-use, local-inference]
context: fork
agent: llm-integrator
version: 2.0.0
author: OrchestKit
user-invocable: false
complexity: medium
metadata:
  category: mcp-enhancement
---

# LLM Integration

Patterns for integrating LLMs into production applications: tool use, streaming, local inference, and fine-tuning. Each category has individual rule files in `rules/` loaded on-demand.

## Quick Reference

| Category | Rules | Impact | When to Use |
|----------|-------|--------|-------------|
| [Function Calling](#function-calling) | 3 | CRITICAL | Tool definitions, parallel execution, input validation |
| [Streaming](#streaming) | 3 | HIGH | SSE endpoints, structured streaming, backpressure handling |
| [Local Inference](#local-inference) | 3 | HIGH | Ollama setup, model selection, GPU optimization |
| [Fine-Tuning](#fine-tuning) | 3 | HIGH | LoRA/QLoRA training, dataset preparation, evaluation |
| [Context Optimization](#context-optimization) | 2 | HIGH | Window management, compression, caching, budget scaling |
| [Evaluation](#evaluation) | 2 | HIGH | LLM-as-judge, RAGAS metrics, quality gates, benchmarks |
| [Prompt Engineering](#prompt-engineering) | 2 | HIGH | CoT, few-shot, versioning, DSPy optimization |

**Total: 18 rules across 7 categories**

## Quick Start

```python
# Function calling: strict mode tool definition
tools = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search knowledge base",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "description": "Max results"}
            },
            "required": ["query", "limit"],
            "additionalProperties": False
        }
    }
}]
```

```python
# Streaming: SSE endpoint with FastAPI (EventSourceResponse comes from sse-starlette)
from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()

@app.get("/chat/stream")
async def stream_chat(prompt: str):
    async def generate():
        # async_stream: your provider's async token generator
        async for token in async_stream(prompt):
            yield {"event": "token", "data": token}
        yield {"event": "done", "data": ""}
    return EventSourceResponse(generate())
```

```python
# Local inference: Ollama with LangChain
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",
    temperature=0.0,
    num_ctx=32768,
)
```

```python
# Fine-tuning: QLoRA with Unsloth
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)
```

## Function Calling

Enable LLMs to use external tools and return structured data. Use strict mode schemas (2026 best practice) for reliability. Limit to 5-15 tools per request, validate all inputs with Pydantic/Zod, and return errors as tool results (see the sketch after the rule list).

- `calling-tool-definition.md` -- Strict mode schemas, OpenAI/Anthropic formats, LangChain binding
- `calling-parallel.md` -- Parallel tool execution, asyncio.gather, strict mode constraints
- `calling-validation.md` -- Input validation, error handling, tool execution loops
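
A minimal sketch of the parallel-execution and errors-as-results rules, assuming OpenAI-style `tool_calls` objects and a hypothetical `tool_registry` mapping tool names to async handlers:

```python
# Sketch: run tool calls concurrently and return errors as tool results.
# Assumes OpenAI-style tool_call objects (.id, .function.name, .function.arguments)
# and a hypothetical tool_registry of async handlers.
import asyncio
import json

async def execute_tool_calls(tool_calls, tool_registry):
    async def run_one(call):
        name = call.function.name
        args = json.loads(call.function.arguments)
        try:
            result = await tool_registry[name](**args)
            return {"role": "tool", "tool_call_id": call.id,
                    "content": json.dumps(result)}
        except Exception as exc:
            # Surface the failure as a tool result so the model can recover.
            return {"role": "tool", "tool_call_id": call.id,
                    "content": json.dumps({"error": str(exc)})}

    return await asyncio.gather(*(run_one(c) for c in tool_calls))
```

Returning errors as tool results keeps the tool-execution loop alive, letting the model retry with corrected arguments instead of aborting the turn.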

## Streaming

Deliver LLM responses in real time for better UX. Use SSE for web, WebSocket for bidirectional. Handle backpressure with bounded queues (sketched after the rule list).

- `streaming-sse.md` -- FastAPI SSE endpoints, frontend consumers, async iterators
- `streaming-structured.md` -- Streaming with tool calls, partial JSON parsing, chunk accumulation
- `streaming-backpressure.md` -- Backpressure handling, bounded buffers, cancellation
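
A bounded-queue backpressure sketch, assuming a hypothetical `stream_tokens` async generator; the default buffer follows the 50-200 token guidance in the decision table:

```python
# Sketch: bounded-queue backpressure between the LLM producer and the SSE consumer.
# stream_tokens is a hypothetical callable returning an async token iterator.
import asyncio

async def stream_with_backpressure(stream_tokens, maxsize: int = 200):
    queue = asyncio.Queue(maxsize=maxsize)

    async def producer():
        try:
            async for token in stream_tokens():
                await queue.put(token)   # blocks once the buffer is full
        finally:
            await queue.put(None)        # sentinel: stream finished

    task = asyncio.create_task(producer())
    try:
        while (token := await queue.get()) is not None:
            yield token
    finally:
        task.cancel()                    # stop producing if the client disconnects
```

Because `queue.put` blocks when the buffer fills, the upstream provider is throttled to the client's consumption rate rather than buffering unbounded output in memory.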

## Local Inference

Run LLMs locally with Ollama for cost savings (93% vs cloud), privacy, and offline development. Pre-warm models and use a provider factory for cloud/local switching (sketched after the rule list).

- `local-ollama-setup.md` -- Installation, model pulling, environment configuration
- `local-model-selection.md` -- Model comparison by task, hardware profiles, quantization
- `local-gpu-optimization.md` -- Apple Silicon tuning, keep-alive, CI integration
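
A provider-factory sketch for cloud/local switching; the `LLM_PROVIDER` environment variable, the task-to-model mapping, and the cloud model id are assumptions, not part of the skill's rules:

```python
# Sketch: switch between local Ollama and a cloud provider behind one factory.
# LLM_PROVIDER, the task-to-model mapping, and the cloud model id are assumptions.
import os

from langchain_anthropic import ChatAnthropic
from langchain_ollama import ChatOllama

def make_llm(task: str = "reasoning"):
    if os.getenv("LLM_PROVIDER", "local") == "local":
        # Local picks follow the Key Decisions table below.
        model = "deepseek-r1:70b" if task == "reasoning" else "qwen2.5-coder:32b"
        return ChatOllama(model=model, base_url="http://localhost:11434",
                          temperature=0.0, num_ctx=32768)
    return ChatAnthropic(model="claude-sonnet-4-5", temperature=0.0)  # placeholder cloud model
```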

## Fine-Tuning

Customize LLMs with parameter-efficient techniques. Fine-tune ONLY after exhausting prompt engineering and RAG. Requires 1000+ quality examples (a minimal hygiene check is sketched after the rule list).

- `tuning-lora.md` -- LoRA/QLoRA configuration, Unsloth training, adapter merging
- `tuning-dataset-prep.md` -- Synthetic data generation, quality validation, deduplication
- `tuning-evaluation.md` -- DPO alignment, evaluation metrics, anti-patterns
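
A minimal dataset-hygiene sketch run before training; the field names and length threshold are assumptions, while the 1000-example floor comes from the guidance above:

```python
# Sketch: dedupe on the normalized prompt and drop trivially short outputs
# before LoRA/QLoRA training. Field names and the length cutoff are assumptions.
def prepare_dataset(examples: list[dict]) -> list[dict]:
    seen, cleaned = set(), []
    for ex in examples:
        key = " ".join(ex["instruction"].lower().split())
        if key in seen or len(ex["output"]) < 20:
            continue
        seen.add(key)
        cleaned.append(ex)
    if len(cleaned) < 1000:
        raise ValueError(f"Only {len(cleaned)} usable examples; need 1000+")
    return cleaned
```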

## Context Optimization

Manage context windows, compression, and attention-aware positioning. Optimize for tokens-per-task (a compression-trigger sketch follows the rule list).

- `context-window-management.md` -- Five-layer architecture, anchored summarization, compression triggers
- `context-caching.md` -- Just-in-time loading, budget scaling, probe evaluation, CC 2.1.32+
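
A compression-trigger sketch using the 70% trigger / 50% target from the Key Decisions table; `count_tokens` and `summarize` are hypothetical helpers, and the 10-message tail is an assumption:

```python
# Sketch: compress older turns once window utilization crosses the trigger.
# count_tokens and summarize are hypothetical helpers; the 200k window and
# 10-message tail are assumptions.
def maybe_compress(messages: list[dict], count_tokens, summarize,
                   window: int = 200_000) -> list[dict]:
    used = sum(count_tokens(m["content"]) for m in messages)
    if used / window < 0.70:                        # below the compression trigger
        return messages
    head, tail = messages[:-10], messages[-10:]     # always keep the most recent turns
    summary = summarize(head, target_tokens=int(window * 0.50))  # compress toward 50% of the window
    return [{"role": "system", "content": summary}, *tail]
```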

## Evaluation

Evaluate LLM outputs with multi-dimensional scoring, quality gates, and benchmarks (a judge-gate sketch follows the rule list).

- `evaluation-metrics.md` -- LLM-as-judge, RAGAS metrics, hallucination detection
- `evaluation-benchmarks.md` -- Quality gates, batch evaluation, pairwise comparison
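
An LLM-as-judge quality-gate sketch using the 0.7 production threshold from the decision table; the judge prompt is illustrative and `judge_llm` is assumed to be a LangChain-style chat model:

```python
# Sketch: LLM-as-judge quality gate. The 0.7 production threshold comes from
# the Key Decisions table; the prompt wording and judge_llm client are assumptions.
import json

JUDGE_PROMPT = """Score the answer from 0.0 to 1.0 for faithfulness and
relevance to the question. Reply as JSON: {{"score": <float>, "reason": "..."}}

Question: {question}
Answer: {answer}"""

def passes_quality_gate(judge_llm, question: str, answer: str,
                        threshold: float = 0.7) -> bool:
    raw = judge_llm.invoke(JUDGE_PROMPT.format(question=question, answer=answer))
    verdict = json.loads(raw.content)   # assumes the judge returns valid JSON
    return verdict["score"] >= threshold
```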

## Prompt Engineering

Design, version, and optimize prompts for production LLM applications (a few-shot CoT sketch follows the rule list).

- `prompt-design.md` -- Chain-of-Thought, few-shot learning, pattern selection guide
- `prompt-testing.md` -- Langfuse versioning, DSPy optimization, A/B testing, self-consistency
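
A few-shot Chain-of-Thought sketch following the 3-5 diverse, representative examples guidance; the example content is illustrative only:

```python
# Sketch: few-shot Chain-of-Thought prompt with three illustrative examples.
FEW_SHOT = [
    {"q": "Is 'refund not received' a billing issue?",
     "a": "It concerns money owed to the customer. Yes: billing."},
    {"q": "Is 'app crashes on login' a billing issue?",
     "a": "It describes a software fault, not payment. No: technical."},
    {"q": "Is 'charged twice this month' a billing issue?",
     "a": "A duplicate charge concerns payment. Yes: billing."},
]

def build_prompt(question: str) -> str:
    shots = "\n\n".join(f"Q: {s['q']}\nA: Let's think step by step. {s['a']}"
                        for s in FEW_SHOT)
    return f"{shots}\n\nQ: {question}\nA: Let's think step by step."
```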

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Tool schema mode | `strict: true` (2026 best practice) |
| Tool count | 5-15 max per request |
| Streaming protocol | SSE for web, WebSocket for bidirectional |
| Buffer size | 50-200 tokens |
| Local model (reasoning) | `deepseek-r1:70b` |
| Local model (coding) | `qwen2.5-coder:32b` |
| Fine-tuning approach | LoRA/QLoRA (try prompting first) |
| LoRA rank | 16-64 typical |
| Training epochs | 1-3 (more risks overfitting) |
| Context compression | Anchored iterative (60-80%) |
| Compress trigger | 70% utilization, target 50% |
| Judge model | GPT-5.2-mini or Haiku 4.5 |
| Quality threshold | 0.7 production, 0.6 drafts |
| Few-shot examples | 3-5 diverse, representative |
| Prompt versioning | Langfuse with labels |
| Auto-optimization | DSPy MIPROv2 |

## Related Skills

- `rag-retrieval` -- Embedding patterns, when RAG is better than fine-tuning
- `agent-loops` -- Multi-step tool use with reasoning
- `llm-evaluation` -- Evaluate fine-tuned and local models
- `langfuse-observability` -- Track training experiments

## Capability Details

### function-calling
**Keywords:** tool, function, define tool, tool schema, function schema, strict mode, parallel tools
**Solves:**
- Define tools with clear descriptions and strict schemas
- Execute tool calls in parallel with asyncio.gather
- Validate inputs and handle errors in tool execution loops

### streaming
**Keywords:** streaming, SSE, Server-Sent Events, real-time, backpressure, token stream
**Solves:**
- Stream LLM tokens via SSE endpoints
- Handle tool calls within streams
- Manage backpressure with bounded queues

### local-inference
**Keywords:** Ollama, local, self-hosted, model selection, GPU, Apple Silicon
**Solves:**
- Set up Ollama for local LLM inference
- Select models based on task and hardware
- Optimize GPU usage and CI integration

### fine-tuning
**Keywords:** LoRA, QLoRA, fine-tune, DPO, synthetic data, PEFT, alignment
**Solves:**
- Configure LoRA/QLoRA for parameter-efficient training
- Generate and validate synthetic training data
- Align models with DPO and evaluate results

Overview

This skill provides production-ready LLM integration patterns for function calling, streaming, local inference with Ollama, and fine-tuning customization (LoRA/QLoRA). It consolidates recommended architectures, operational rules, and concrete code patterns for building resilient, observable AI services. Use it to standardize tool use, SSE/WebSocket streaming, local model deployment, and parameter-efficient tuning workflows.

How this skill works

The skill organizes guidance into rule sets loaded on demand for categories like function-calling, streaming, local-inference, fine-tuning, context-optimization, evaluation, and prompt-engineering. Each rule codifies concrete recommendations: strict JSON schemas for tool definitions, SSE/WebSocket examples and backpressure controls, Ollama local model setup and GPU tuning, and LoRA/QLoRA training recipes with dataset hygiene and evaluation. Patterns include code snippets, buffer sizing, tool-count limits, and decision tables to speed implementation.

When to use it

  • Implementing reliable tool use and structured function calling with strict schemas
  • Building real-time LLM-driven UX with SSE or bidirectional WebSocket streams
  • Deploying local inference via Ollama for cost, privacy, or offline needs
  • Running parameter-efficient fine-tuning (LoRA/QLoRA) only after prompt engineering and RAG have been exhausted
  • Optimizing context windows, compression, and caching for long-document tasks

Best practices

  • Define tools with strict JSON schemas (validate inputs with Pydantic/Zod) and limit to 5–15 tools per request
  • Use SSE for unidirectional web streams and WebSocket for bidirectional needs; bound token buffers (50–200 tokens)
  • Pre-warm local models, use provider factories for cloud/local switching, and tune GPU/Apple Silicon settings
  • Treat fine-tuning as a last resort: require 1k+ high-quality examples, use LoRA/QLoRA ranks 16–64, and limit epochs to 1–3
  • Implement LLM-as-judge evaluation, quality gates (0.7 in production), and ongoing benchmark testing

Example use cases

  • A chat UI that streams token-by-token via FastAPI SSE with backpressure handling and token-level events
  • A retrieval-augmented assistant that calls external tools using strict function schemas and parallel execution
  • On-prem inference platform using Ollama to serve a 70B reasoning model with pre-warming and keep-alive
  • Fine-tuning a specialized support agent with QLoRA and curated synthetic examples followed by DPO-style evaluation

FAQ

When should I fine-tune instead of improving prompts or using RAG?

Fine-tune only after you exhaust prompt engineering and RAG; require 1k+ quality examples and validate with held-out benchmarks.

How many tools should I expose to the LLM?

Keep tool count between 5 and 15 per request to avoid decision noise and maintain reliable function selection.