
deployment-paradigms skill

/plugin/skills/ml-systems/deployment-paradigms

This skill helps you compare and select ML deployment paradigms, optimizing latency, cost, and scalability across batch, real-time, streaming, edge, and serverless deployments.

npx playbooks add skill doanchienthangdev/omgkit --skill deployment-paradigms

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
3.5 KB
---
name: deployment-paradigms
description: ML deployment paradigms including batch vs real-time inference, online vs offline serving, edge deployment, and serverless ML.
---

# Deployment Paradigms

Understanding ML deployment patterns and trade-offs.

## Deployment Modes

### Batch Inference
```python
import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

# Process all accumulated data at once
def batch_inference(model, data_path, output_path):
    data = pd.read_parquet(data_path)
    predictions = model.predict(data)
    # model.predict returns an array; wrap it so it can be written as Parquet
    pd.DataFrame({"prediction": predictions}).to_parquet(output_path)

# Schedule: daily/hourly via an Airflow DAG
with DAG('batch_inference', schedule_interval='@daily') as dag:
    inference_task = PythonOperator(
        task_id='run_inference',
        python_callable=batch_inference,
        # Supply the model reference and paths for your environment
        op_kwargs={"model": ..., "data_path": ..., "output_path": ...},
    )
```

### Real-time Inference
```python
from fastapi import FastAPI
from pydantic import BaseModel
import torch

app = FastAPI()
model = torch.jit.load("model.pt")
model.eval()

class PredictRequest(BaseModel):
    data: list[float]

@app.post("/predict")
async def predict(request: PredictRequest):
    # preprocess() is your feature pipeline; it must match training-time logic
    features = preprocess(request.data)
    with torch.no_grad():
        prediction = model(features)
    return {"prediction": prediction.tolist()}
```

### Streaming Inference
```python
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer('input-topic',
                         bootstrap_servers='localhost:9092')
producer = KafkaProducer(bootstrap_servers='localhost:9092')

# Score each event as it arrives and publish the result downstream
for message in consumer:
    data = deserialize(message.value)       # e.g. json.loads
    prediction = model.predict(data)
    producer.send('output-topic', serialize(prediction))  # bytes payload
```

## Serving Patterns

### Online Serving
- Sub-second latency
- Feature store for features
- Model caching
- Auto-scaling

### Offline Serving
- Batch processing
- Precomputed predictions
- Lower cost
- Higher throughput
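
The offline pattern above can be sketched as two steps: a batch job precomputes predictions for every known key, and serving becomes a plain lookup. The `dict` here is a stand-in for a real key-value store such as Redis or DynamoDB, and the toy `sum` model is purely illustrative.

```python
# Offline serving sketch: predict once in batch, serve by key lookup.
def precompute(model_fn, user_ids, features):
    """Run the model over all users once and store results by id."""
    return {uid: model_fn(features[uid]) for uid in user_ids}

def serve(store, user_id, default=None):
    """Serving is an O(1) lookup; the compute cost was paid upfront."""
    return store.get(user_id, default)

# Toy "model": the score is the sum of the feature vector.
features = {"u1": [1, 2], "u2": [5, 1]}
store = precompute(sum, ["u1", "u2"], features)
print(serve(store, "u1"))  # 3
```

Users absent from the batch run fall through to the `default`, which is where a hybrid setup would trigger a real-time computation instead.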

### Hybrid Serving
```python
import time

class HybridPredictor:
    def __init__(self, cache_ttl=3600):
        self.cache = {}          # cache_key -> (prediction, timestamp)
        self.cache_ttl = cache_ttl
        self.model = load_model()

    def predict(self, user_id, context):
        # Check precomputed/cached results first
        cache_key = f"{user_id}:{hash(context)}"
        cached = self.cache.get(cache_key)
        if cached is not None:
            prediction, ts = cached
            if time.time() - ts < self.cache_ttl:
                return prediction

        # Cache miss or expired entry: compute in real time
        prediction = self.model.predict(context)
        self.cache[cache_key] = (prediction, time.time())
        return prediction
```

## Edge Deployment

```python
# TFLite for mobile/embedded
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# ONNX for cross-platform runtimes (exported from PyTorch)
import torch
torch.onnx.export(model, dummy_input, "model.onnx")

# CoreML for iOS
import coremltools as ct
mlmodel = ct.convert(model, inputs=[ct.TensorType(shape=(1, 3, 224, 224))])
```

## Serverless ML

```python
# AWS Lambda
import json
import boto3

def lambda_handler(event, context):
    runtime = boto3.client('sagemaker-runtime')
    response = runtime.invoke_endpoint(
        EndpointName='my-model',
        ContentType='application/json',
        Body=json.dumps(event['body'])
    )
    return json.loads(response['Body'].read())
```

## Deployment Comparison

| Pattern | Latency | Cost | Complexity | Use Case |
|---------|---------|------|------------|----------|
| Batch | High | Low | Low | Reports, ETL |
| Real-time | Low | High | Medium | User-facing |
| Streaming | Medium | Medium | High | Event-driven |
| Edge | Very Low | Low | High | Offline, IoT |
| Serverless | Variable | Pay-per-use | Low | Sporadic traffic |
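
The table can be encoded as a first-pass decision helper. The thresholds and traffic labels below are illustrative assumptions, not hard rules; treat the output as a starting point for design discussion.

```python
# Rough paradigm recommendation derived from the comparison table above.
def recommend_paradigm(latency_ms, traffic, connectivity="online"):
    if connectivity == "offline":
        return "edge"                  # must run without a network
    if traffic == "sporadic":
        return "serverless"            # pay-per-use beats idle servers
    if latency_ms >= 60_000:
        return "batch"                 # minutes-scale SLAs suit scheduled jobs
    if traffic == "continuous-events":
        return "streaming"             # score events as they flow
    return "real-time"                 # sub-second, user-facing default

print(recommend_paradigm(latency_ms=200, traffic="steady"))        # real-time
print(recommend_paradigm(latency_ms=3_600_000, traffic="steady"))  # batch
```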

## Commands
- `/omgdeploy:serve` - Deploy serving
- `/omgdeploy:edge` - Edge deployment

## Best Practices

1. Match paradigm to requirements
2. Consider latency vs cost trade-offs
3. Plan for scaling
4. Test under realistic conditions
5. Monitor deployed models
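
Practices 2 and 4 can be made concrete with a minimal latency benchmark that reports percentiles rather than a single warm call. `predict` below is a stand-in for your serving endpoint; replay recorded production requests for realistic numbers.

```python
# Measure p50/p95 latency over a batch of requests (a minimal sketch).
import statistics
import time

def benchmark(predict, requests):
    latencies = []
    for req in requests:
        start = time.perf_counter()
        predict(req)
        latencies.append((time.perf_counter() - start) * 1000)  # ms
    latencies.sort()
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[int(0.95 * (len(latencies) - 1))],
    }

stats = benchmark(lambda r: sum(r), [[1, 2, 3]] * 100)
```

Tail percentiles (p95/p99) usually drive SLA decisions; means hide the slow requests that users actually notice.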

Overview

This skill explains common ML deployment paradigms and trade-offs so you can pick the right pattern for production. It covers batch, real-time, streaming, edge, hybrid, and serverless serving, with practical considerations for latency, cost, and complexity. The guidance targets engineers building production and cross-platform ML services; the examples use Python.

How this skill works

This skill describes each deployment mode and the runtime behavior to expect: batch runs large jobs on schedules; real-time exposes low-latency endpoints; streaming handles continuous event flows; edge converts models for on-device inference; serverless routes invocations to managed endpoints. For hybrid patterns, it shows how to combine precomputed results and online prediction with caching and TTL logic. The goal is to make trade-offs explicit so you can design predictable SLAs and cost profiles.

When to use it

  • Batch inference when you can tolerate high latency and want low cost or precomputed outputs (reports, ETL).
  • Real-time inference for user-facing features requiring sub-second responses (recommendations, fraud checks).
  • Streaming inference for event-driven pipelines and continuous scoring (clickstreams, telemetry).
  • Edge deployment for offline, privacy-sensitive, or ultra-low-latency use cases on mobile or IoT.
  • Serverless for sporadic traffic or pay-per-use scenarios where operational overhead must be minimized.

Best practices

  • Match the paradigm to business SLAs and throughput needs before optimizing.
  • Benchmark latency and cost with realistic traffic, not synthetic microbenchmarks.
  • Use feature stores or validated preprocessing pipelines to ensure parity between training and serving.
  • Implement monitoring, logging, and drift detection to catch production issues early.
  • Plan autoscaling, caching, and fallback strategies (e.g., precomputed predictions or degraded modes).

Example use cases

  • Nightly batch scoring to refresh features and feeds for dashboards.
  • HTTP/GRPC real-time endpoint behind a load balancer for interactive recommendations.
  • Kafka-based streaming pipeline that scores events and publishes enriched tuples.
  • TFLite/ONNX/CoreML conversion to run image models on mobile or embedded devices.
  • Lambda or serverless function that forwards requests to a managed model endpoint for sporadic inference.

FAQ

How do I choose between online and offline serving?

Base the choice on latency requirements, cost, and throughput. Use online serving for sub-second responses and offline serving for high-volume, latency-tolerant workloads.

When should I use serverless for ML?

Choose serverless for unpredictable or low-volume traffic to reduce ops overhead, but beware cold-starts and model size limits—use provisioned concurrency or an external endpoint for larger models.
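
One standard cold-start mitigation is to load the model lazily and cache it in module scope, so the expensive initialization runs once per container rather than once per invocation. This is a sketch with a stand-in loader, not a production handler.

```python
# Lambda-style handler that caches the loaded model across warm invocations.
import json
import time

_MODEL = None

def _load_model():
    # Stand-in for the slow part: loading weights from disk or S3.
    time.sleep(0.01)
    return lambda x: x * 2

def get_model():
    global _MODEL
    if _MODEL is None:          # pay the load cost only on the first call
        _MODEL = _load_model()
    return _MODEL

def handler(event, context=None):
    model = get_model()         # warm invocations reuse the cached model
    return {"statusCode": 200, "body": json.dumps(model(event["x"]))}
```

For models too large to load within the platform's limits, keep the function thin and forward requests to a managed endpoint, as in the Lambda example earlier.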