
deployment-paradigms skill

/plugin/skills/ml-systems/deployment-paradigms

This skill helps you compare and select ML deployment paradigms, optimizing latency, cost, and scalability across batch, real-time, streaming, edge, and serverless deployments.

npx playbooks add skill doanchienthangdev/omgkit --skill deployment-paradigms

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
3.5 KB
---
name: deployment-paradigms
description: ML deployment paradigms including batch vs real-time inference, online vs offline serving, edge deployment, and serverless ML.
---

# Deployment Paradigms

Understanding ML deployment patterns and trade-offs.

## Deployment Modes

### Batch Inference
```python
import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

# Process all accumulated data at once
def batch_inference(model, data_path, output_path):
    data = pd.read_parquet(data_path)
    predictions = model.predict(data)
    # model.predict returns an array; wrap it so it can be written as Parquet
    pd.DataFrame({"prediction": predictions}).to_parquet(output_path)

# Schedule: daily/hourly via an Airflow DAG
with DAG('batch_inference', schedule_interval='@daily') as dag:
    inference_task = PythonOperator(
        task_id='run_inference',
        python_callable=batch_inference,
        # Supply the model reference and paths for your environment
        op_kwargs={"model": ..., "data_path": ..., "output_path": ...},
    )
```

### Real-time Inference
```python
from fastapi import FastAPI
from pydantic import BaseModel
import torch

app = FastAPI()
model = torch.jit.load("model.pt")
model.eval()

class PredictRequest(BaseModel):
    data: list[float]

@app.post("/predict")
async def predict(request: PredictRequest):
    # preprocess() is your feature pipeline; it must match training-time logic
    features = preprocess(request.data)
    with torch.no_grad():
        prediction = model(features)
    return {"prediction": prediction.tolist()}
```

### Streaming Inference
```python
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer('input-topic',
                         bootstrap_servers='localhost:9092')
producer = KafkaProducer(bootstrap_servers='localhost:9092')

# Score each event as it arrives and publish the result downstream
for message in consumer:
    data = deserialize(message.value)       # e.g. json.loads
    prediction = model.predict(data)
    producer.send('output-topic', serialize(prediction))  # bytes payload
```

## Serving Patterns

### Online Serving
- Sub-second latency
- Feature store for features
- Model caching
- Auto-scaling

### Offline Serving
- Batch processing
- Precomputed predictions
- Lower cost
- Higher throughput
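
The offline pattern above can be sketched as two steps: a batch job precomputes predictions for every known key, and serving becomes a plain lookup. The `dict` here is a stand-in for a real key-value store such as Redis or DynamoDB, and the toy `sum` model is purely illustrative.

```python
# Offline serving sketch: predict once in batch, serve by key lookup.
def precompute(model_fn, user_ids, features):
    """Run the model over all users once and store results by id."""
    return {uid: model_fn(features[uid]) for uid in user_ids}

def serve(store, user_id, default=None):
    """Serving is an O(1) lookup; the compute cost was paid upfront."""
    return store.get(user_id, default)

# Toy "model": the score is the sum of the feature vector.
features = {"u1": [1, 2], "u2": [5, 1]}
store = precompute(sum, ["u1", "u2"], features)
print(serve(store, "u1"))  # 3
```

Users absent from the batch run fall through to the `default`, which is where a hybrid setup would trigger a real-time computation instead.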

### Hybrid Serving
```python
import time

class HybridPredictor:
    def __init__(self, cache_ttl=3600):
        self.cache = {}          # cache_key -> (prediction, timestamp)
        self.cache_ttl = cache_ttl
        self.model = load_model()

    def predict(self, user_id, context):
        # Check precomputed/cached results first
        cache_key = f"{user_id}:{hash(context)}"
        cached = self.cache.get(cache_key)
        if cached is not None:
            prediction, ts = cached
            if time.time() - ts < self.cache_ttl:
                return prediction

        # Cache miss or expired entry: compute in real time
        prediction = self.model.predict(context)
        self.cache[cache_key] = (prediction, time.time())
        return prediction
```

## Edge Deployment

```python
# TFLite for mobile/embedded
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# ONNX for cross-platform runtimes (exported from PyTorch)
import torch
torch.onnx.export(model, dummy_input, "model.onnx")

# CoreML for iOS
import coremltools as ct
mlmodel = ct.convert(model, inputs=[ct.TensorType(shape=(1, 3, 224, 224))])
```

## Serverless ML

```python
# AWS Lambda
import json
import boto3

def lambda_handler(event, context):
    runtime = boto3.client('sagemaker-runtime')
    response = runtime.invoke_endpoint(
        EndpointName='my-model',
        ContentType='application/json',
        Body=json.dumps(event['body'])
    )
    return json.loads(response['Body'].read())
```

## Deployment Comparison

| Pattern | Latency | Cost | Complexity | Use Case |
|---------|---------|------|------------|----------|
| Batch | High | Low | Low | Reports, ETL |
| Real-time | Low | High | Medium | User-facing |
| Streaming | Medium | Medium | High | Event-driven |
| Edge | Very Low | Low | High | Offline, IoT |
| Serverless | Variable | Pay-per-use | Low | Sporadic traffic |
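
The table can be encoded as a first-pass decision helper. The thresholds and traffic labels below are illustrative assumptions, not hard rules; treat the output as a starting point for design discussion.

```python
# Rough paradigm recommendation derived from the comparison table above.
def recommend_paradigm(latency_ms, traffic, connectivity="online"):
    if connectivity == "offline":
        return "edge"                  # must run without a network
    if traffic == "sporadic":
        return "serverless"            # pay-per-use beats idle servers
    if latency_ms >= 60_000:
        return "batch"                 # minutes-scale SLAs suit scheduled jobs
    if traffic == "continuous-events":
        return "streaming"             # score events as they flow
    return "real-time"                 # sub-second, user-facing default

print(recommend_paradigm(latency_ms=200, traffic="steady"))        # real-time
print(recommend_paradigm(latency_ms=3_600_000, traffic="steady"))  # batch
```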

## Commands
- `/omgdeploy:serve` - Deploy serving
- `/omgdeploy:edge` - Edge deployment

## Best Practices

1. Match paradigm to requirements
2. Consider latency vs cost trade-offs
3. Plan for scaling
4. Test under realistic conditions
5. Monitor deployed models
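
Practices 2 and 4 can be made concrete with a minimal latency benchmark that reports percentiles rather than a single warm call. `predict` below is a stand-in for your serving endpoint; replay recorded production requests for realistic numbers.

```python
# Measure p50/p95 latency over a batch of requests (a minimal sketch).
import statistics
import time

def benchmark(predict, requests):
    latencies = []
    for req in requests:
        start = time.perf_counter()
        predict(req)
        latencies.append((time.perf_counter() - start) * 1000)  # ms
    latencies.sort()
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[int(0.95 * (len(latencies) - 1))],
    }

stats = benchmark(lambda r: sum(r), [[1, 2, 3]] * 100)
```

Tail percentiles (p95/p99) usually drive SLA decisions; means hide the slow requests that users actually notice.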

Overview

This skill explains common ML deployment paradigms and trade-offs so you can pick the right pattern for production. It covers batch, real-time, streaming, edge, hybrid, and serverless serving, with practical considerations for latency, cost, and complexity. The guidance targets engineers building production and cross-platform ML services; the examples use Python.

How this skill works

This skill describes each deployment mode and the runtime behavior to expect: batch runs large jobs on schedules; real-time exposes low-latency endpoints; streaming handles continuous event flows; edge converts models for on-device inference; serverless routes invocations to managed endpoints. For hybrid patterns, it shows how to combine precomputed results and online prediction with caching and TTL logic. The goal is to make trade-offs explicit so you can design predictable SLAs and cost profiles.

When to use it

  • Batch inference when you can tolerate high latency and want low cost or precomputed outputs (reports, ETL).
  • Real-time inference for user-facing features requiring sub-second responses (recommendations, fraud checks).
  • Streaming inference for event-driven pipelines and continuous scoring (clickstreams, telemetry).
  • Edge deployment for offline, privacy-sensitive, or ultra-low-latency use cases on mobile or IoT.
  • Serverless for sporadic traffic or pay-per-use scenarios where operational overhead must be minimized.

Best practices

  • Match the paradigm to business SLAs and throughput needs before optimizing.
  • Benchmark latency and cost with realistic traffic, not synthetic microbenchmarks.
  • Use feature stores or validated preprocessing pipelines to ensure parity between training and serving.
  • Implement monitoring, logging, and drift detection to catch production issues early.
  • Plan autoscaling, caching, and fallback strategies (e.g., precomputed predictions or degraded modes).

Example use cases

  • Nightly batch scoring to refresh features and feeds for dashboards.
  • HTTP/GRPC real-time endpoint behind a load balancer for interactive recommendations.
  • Kafka-based streaming pipeline that scores events and publishes enriched tuples.
  • TFLite/ONNX/CoreML conversion to run image models on mobile or embedded devices.
  • Lambda or serverless function that forwards requests to a managed model endpoint for sporadic inference.

FAQ

How do I choose between online and offline serving?

Base the choice on latency requirements, cost, and throughput. Use online serving for sub-second responses and offline serving for high-volume, latency-tolerant workloads.

When should I use serverless for ML?

Choose serverless for unpredictable or low-volume traffic to reduce ops overhead, but beware cold-starts and model size limits—use provisioned concurrency or an external endpoint for larger models.
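
One standard cold-start mitigation is to load the model lazily and cache it in module scope, so the expensive initialization runs once per container rather than once per invocation. This is a sketch with a stand-in loader, not a production handler.

```python
# Lambda-style handler that caches the loaded model across warm invocations.
import json
import time

_MODEL = None

def _load_model():
    # Stand-in for the slow part: loading weights from disk or S3.
    time.sleep(0.01)
    return lambda x: x * 2

def get_model():
    global _MODEL
    if _MODEL is None:          # pay the load cost only on the first call
        _MODEL = _load_model()
    return _MODEL

def handler(event, context=None):
    model = get_model()         # warm invocations reuse the cached model
    return {"statusCode": 200, "body": json.dumps(model(event["x"]))}
```

For models too large to load within the platform's limits, keep the function thin and forward requests to a managed endpoint, as in the Lambda example earlier.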