Deploying ML Models in Production: From Notebook to Kubernetes
End-to-end guide to deploying ML models -- from ONNX export and FastAPI serving to Kubernetes GPU workloads, canary deployments, and Prometheus monitoring.

Your Jupyter Notebook Is Not a Production System
You've trained a model. It works in your notebook. Metrics look great. Now what? Deploying ML models in production is where most data science projects go to die. The gap between a working prototype and a reliable, scalable inference service is enormous -- and it's mostly an engineering problem, not a data science one.
I've deployed models ranging from lightweight scikit-learn classifiers to multi-GPU transformer stacks on Kubernetes. The tooling has gotten dramatically better in the last two years, but the fundamental challenges remain: model serialization, inference serving, containerization, GPU scheduling, versioning, and monitoring. Here's the end-to-end path from notebook to production.
What Does ML Model Deployment Mean?
Definition: ML model deployment is the process of making a trained machine learning model available for real-time or batch inference in a production environment. This involves exporting the model to a portable format, wrapping it in an API or inference server, containerizing it, and orchestrating it with monitoring, versioning, and scaling infrastructure.
Step 1: Export Your Model to a Portable Format
Your training framework's native format is rarely what you want to serve in production. Export to a format optimized for inference.
| Format | Framework | Best For | Key Advantage |
|---|---|---|---|
| ONNX | Any (PyTorch, TF, sklearn) | Cross-platform deployment | Framework-agnostic, hardware-optimized runtimes |
| TorchScript | PyTorch | PyTorch-native serving | Preserves dynamic computation graphs |
| SavedModel | TensorFlow | TF Serving, TFLite | TensorFlow ecosystem integration |
| GGUF | llama.cpp | LLM inference on CPU/consumer GPU | Quantized, runs on commodity hardware |
import torch
import onnx

# PyTorch to ONNX export
model = MyModel()
model.load_state_dict(torch.load("model.pt"))
model.eval()  # switch to inference mode (disables dropout, freezes batch norm)
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["image"],
    output_names=["prediction"],
    dynamic_axes={"image": {0: "batch_size"}, "prediction": {0: "batch_size"}},
)

# Verify the exported model
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)
Pro tip: ONNX Runtime (ORT) provides significant inference speedups over native PyTorch -- typically 2-4x for transformer models on CPU. Even if you're staying on PyTorch for training, exporting to ONNX for inference is almost always worth it. Test accuracy on a validation set after export to catch conversion issues.
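That post-export accuracy check can be made concrete. A minimal sketch, assuming you already have the PyTorch logits and the ONNX Runtime logits for the same validation batch as numpy arrays (the `export_parity` helper is hypothetical, not part of any library):

```python
import numpy as np

def export_parity(ref_logits: np.ndarray, onnx_logits: np.ndarray):
    """Quantify drift introduced by the export.

    ref_logits:  outputs from the original PyTorch model, shape (N, classes)
    onnx_logits: outputs from ONNX Runtime on the same batch
    """
    max_diff = float(np.max(np.abs(ref_logits - onnx_logits)))
    top1_agreement = float(np.mean(
        ref_logits.argmax(axis=1) == onnx_logits.argmax(axis=1)
    ))
    return max_diff, top1_agreement

# Typical usage after export (names refer to the code above):
# ref = model(batch).detach().numpy()
# out = session.run(None, {"image": batch.numpy()})[0]
# max_diff, agreement = export_parity(ref, out)
```

A max absolute difference in the 1e-5 range and 100% top-1 agreement is normal; anything larger usually points at an unexported preprocessing step or an unsupported operator silently approximated during conversion.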
Step 2: Build an Inference API
Wrap your model in an HTTP API. You have two main paths: build your own with FastAPI, or use an inference server like Triton.
FastAPI Approach (Simple, Flexible)
from fastapi import FastAPI
from pydantic import BaseModel
import onnxruntime as ort
import numpy as np

app = FastAPI()
session = ort.InferenceSession("model.onnx")

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: float
    confidence: float

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    input_array = np.array([request.features], dtype=np.float32)
    # the key must match the input name baked into the exported graph
    outputs = session.run(None, {"input": input_array})
    return PredictionResponse(
        prediction=float(outputs[0][0]),
        confidence=float(outputs[1][0]),
    )

@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": session is not None}
NVIDIA Triton Inference Server (High-Performance)
Triton handles model loading, batching, GPU scheduling, and multi-model serving out of the box. It's more complex to set up but essential for high-throughput production workloads.
# Model repository structure for Triton
models/
  my_model/
    config.pbtxt
    1/
      model.onnx

# config.pbtxt
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  { name: "input" dims: [3, 224, 224] data_type: TYPE_FP32 }
]
output [
  { name: "prediction" dims: [1000] data_type: TYPE_FP32 }
]
dynamic_batching {
  preferred_batch_size: [8, 16]
  max_queue_delay_microseconds: 100
}
Direct answer: Use FastAPI for simple models, low traffic (under 100 requests/second), or when you need custom pre/post-processing logic. Use Triton for high-throughput GPU inference, multi-model serving, or when you need dynamic batching to maximize GPU utilization. Many teams start with FastAPI and migrate to Triton as traffic grows.
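If you stay on FastAPI but want some of Triton's batching benefit, the core idea can be hand-rolled: hold requests for a few milliseconds, then make one model call for the whole batch. A rough sketch in plain asyncio (the `MicroBatcher` class is illustrative, not Triton's actual implementation or a library API):

```python
import asyncio

class MicroBatcher:
    """Collect concurrent requests and run inference once per batch."""

    def __init__(self, infer_fn, max_batch=8, max_wait_ms=5):
        self.infer_fn = infer_fn        # takes a list of inputs, returns a list of outputs
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.pending = []               # list of (payload, future) tuples
        self.lock = asyncio.Lock()
        self.timer = None

    async def predict(self, payload):
        fut = asyncio.get_running_loop().create_future()
        async with self.lock:
            self.pending.append((payload, fut))
            if len(self.pending) >= self.max_batch:
                self._flush()           # batch is full: run immediately
            elif self.timer is None:
                self.timer = asyncio.create_task(self._flush_later())
        return await fut

    async def _flush_later(self):
        await asyncio.sleep(self.max_wait)  # give more requests time to arrive
        async with self.lock:
            self.timer = None
            self._flush()

    def _flush(self):
        # Called with self.lock held.
        if self.timer is not None:
            self.timer.cancel()
            self.timer = None
        batch, self.pending = self.pending, []
        if not batch:
            return
        results = self.infer_fn([p for p, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)
```

In production you would run `infer_fn` in an executor so it doesn't block the event loop; Triton gives you this, plus priority queues and per-model tuning, out of the box.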
Step 3: Containerize with Docker
A Docker container packages your model, dependencies, and inference code into a reproducible, deployable unit.
# Multi-stage build for smaller image
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY model.onnx .
COPY app.py .
# Non-root user for security
RUN useradd -m appuser
USER appuser
EXPOSE 8000
# slim images don't ship curl, so probe with Python's stdlib instead
HEALTHCHECK --interval=30s CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
Watch out: Don't bake large model files into Docker images. A 5GB model file means a 5GB+ image that takes forever to pull. Instead, store models in object storage (S3, GCS) and download them at container startup, or use model registries like MLflow. Cache the model on persistent volumes in Kubernetes so restarts are fast.
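A minimal sketch of the download-at-startup pattern, using only the standard library (the URL and cache path are placeholders; in practice you'd point this at S3 or GCS via boto3, a signed URL, or an MLflow client):

```python
import os
import urllib.request

def ensure_model(url: str, cache_path: str) -> str:
    """Fetch the model once; reuse the cached copy on later restarts."""
    if not os.path.exists(cache_path):
        os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
        tmp = cache_path + ".tmp"
        urllib.request.urlretrieve(url, tmp)
        os.replace(tmp, cache_path)  # rename so readers never see a partial file
    return cache_path

# At container startup, before creating the inference session:
# model_path = ensure_model(os.environ["MODEL_URL"], "/models/model.onnx")
# session = ort.InferenceSession(model_path)
```

Mount a persistent volume at the cache directory and a restarted pod skips the download entirely, which is the behavior the "Watch out" above is aiming for.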
Step 4: Deploy on Kubernetes with GPU Support
Kubernetes is the standard orchestration platform for ML workloads, but GPU scheduling adds complexity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
        - name: inference
          image: your-registry/ml-model:v1.2.0
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "4"
              memory: "8Gi"
              nvidia.com/gpu: "1"
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          env:
            - name: MODEL_PATH
              value: "s3://models/my-model/v1.2.0/model.onnx"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-a100
---
apiVersion: v1
kind: Service
metadata:
  name: ml-inference
spec:
  selector:
    app: ml-inference
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
GPU Scheduling Essentials
- Install the NVIDIA device plugin -- this makes GPUs visible to the Kubernetes scheduler as nvidia.com/gpu resources
- Use node selectors or taints -- isolate GPU nodes from general workloads to prevent CPU pods from landing on expensive GPU instances
- Set resource limits precisely -- GPUs can't be shared between pods by default (GPU time-slicing and MIG change this, but add complexity)
- Plan for cold starts -- model loading takes time, especially for large models. Set generous initialDelaySeconds on readiness probes
- Use the Horizontal Pod Autoscaler -- scale based on GPU utilization or request queue depth, not just CPU
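The cold-start point is worth wiring into the readiness probe itself: load the model on a background thread and report not-ready until it finishes, so Kubernetes holds traffic back instead of timing out live requests. A sketch (the `ModelHolder` class is illustrative, not a framework feature):

```python
import threading

class ModelHolder:
    """Load a model in the background; the readiness probe checks ready()."""

    def __init__(self, load_fn):
        self.model = None
        self._thread = threading.Thread(
            target=self._load, args=(load_fn,), daemon=True
        )
        self._thread.start()

    def _load(self, load_fn):
        self.model = load_fn()  # e.g. ort.InferenceSession(model_path)

    def ready(self) -> bool:
        return self.model is not None

# Wiring into the FastAPI app from Step 2 (sketch):
# holder = ModelHolder(lambda: ort.InferenceSession("model.onnx"))
# @app.get("/health")
# async def health():
#     code = 200 if holder.ready() else 503
#     return JSONResponse({"ready": holder.ready()}, status_code=code)
```

With this in place, initialDelaySeconds becomes a safety margin rather than a guess: the probe turns green exactly when the model is usable.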
Step 5: Model Versioning and Deployment Strategies
| Strategy | How It Works | Risk Level | Best For |
|---|---|---|---|
| Blue/Green | Deploy new version alongside old, switch traffic all at once | Medium | When you can validate fully before switching |
| Canary | Route small % of traffic to new version, gradually increase | Low | Production changes where regression is costly |
| A/B Testing | Route different user segments to different models | Low | Comparing model performance on real traffic |
| Shadow | Run new model in parallel, compare outputs without serving them | Very low | Validating before any production exposure |
# Istio VirtualService for canary deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ml-inference
spec:
  hosts:
    - ml-inference
  http:
    - route:
        - destination:
            host: ml-inference-v1
          weight: 90
        - destination:
            host: ml-inference-v2
          weight: 10
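Without a service mesh, the same 90/10 split can be approximated in application code. A sketch of hash-based bucketing (the `route` helper is hypothetical); hashing a stable key such as a user ID pins each caller to one version, which is also what you want for the A/B testing strategy in the table above:

```python
import hashlib

def route(key: str, canary_weight: int = 10) -> str:
    """Map a request key to a model version; canary_weight is a percentage."""
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < canary_weight else "v1"
```

Ramping the canary is then a config change to `canary_weight`; because the hash is stable, users already on v2 stay on v2 as the weight increases.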
Step 6: Monitoring with Prometheus
Standard application metrics aren't enough for ML systems. You need model-specific metrics.
from prometheus_client import Histogram, Counter, Gauge

PREDICTION_LATENCY = Histogram(
    "model_prediction_seconds",
    "Time to generate a prediction",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)
PREDICTION_COUNT = Counter(
    "model_predictions_total",
    "Total predictions",
    ["model_version", "status"],
)
GPU_MEMORY = Gauge(
    "model_gpu_memory_bytes",
    "GPU memory usage",
)
PREDICTION_DISTRIBUTION = Histogram(
    "model_output_value",
    "Distribution of model output values",
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
)

@app.post("/predict")
async def predict(request: PredictionRequest):
    with PREDICTION_LATENCY.time():
        result = run_inference(request)
    PREDICTION_COUNT.labels(model_version="v1.2.0", status="success").inc()
    PREDICTION_DISTRIBUTION.observe(result.confidence)
    return result
Pro tip: Track prediction output distributions over time. A sudden shift in the distribution of model outputs -- even if no errors are thrown -- is a strong signal of data drift. Set alerts on distribution divergence metrics like KL divergence or Population Stability Index (PSI) to catch model degradation before users notice.
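PSI is simple enough to compute inline. A sketch with numpy: bin the baseline (training-time) outputs into deciles, then measure how far the live distribution has moved (the `psi` helper is illustrative; the 0.1 and 0.25 thresholds are the commonly cited rule of thumb, not a hard standard):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live sample."""
    # Decile edges from the baseline; open the ends to catch out-of-range values
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) on empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

Run it on a sliding window of recent prediction outputs against a frozen baseline sample; by the usual convention, PSI below 0.1 means no meaningful shift, 0.1-0.25 is worth investigating, and above 0.25 is a strong drift signal to alert on.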
Managed Alternatives: When to Skip the Infrastructure
| Service | Best For | Starting Price | Key Advantage |
|---|---|---|---|
| AWS SageMaker | Full MLOps lifecycle | ~$0.05/hour (ml.t3.medium) | Deep AWS integration, built-in A/B testing |
| Google Vertex AI | GCP-native teams | ~$0.05/hour (n1-standard-2) | AutoML, pipeline orchestration, Gemini integration |
| Azure ML | Enterprise / Microsoft shops | ~$0.05/hour (Standard_DS2_v2) | Azure DevOps integration, managed endpoints |
| Modal | Fast iteration, serverless GPU | Pay per second of compute | Deploy from a Python decorator, zero infra config |
| Replicate | Open-source model hosting | Pay per second of compute | One-line deployment of popular models |
Frequently Asked Questions
Do I need Kubernetes to deploy ML models?
No. For simple models with predictable traffic, a single container on ECS, Cloud Run, or even a VM is fine. Kubernetes adds value when you need GPU scheduling, multi-model serving, canary deployments, or auto-scaling based on custom metrics. Don't adopt Kubernetes just for ML -- adopt it when orchestration complexity justifies the operational overhead.
How do I handle models that are too large for a single GPU?
Use model parallelism to split the model across multiple GPUs. Frameworks like vLLM, TensorRT-LLM, and DeepSpeed Inference handle this automatically for transformer models. For custom architectures, you'll need to implement tensor or pipeline parallelism manually. Consider quantization first -- a 4-bit quantized model often fits on a single GPU with minimal quality loss.
What's the difference between ONNX and TorchScript?
ONNX is framework-agnostic and has optimized runtimes for various hardware (CPU, GPU, edge devices). TorchScript is PyTorch-specific but preserves dynamic control flow that ONNX may not support. Use ONNX when deploying to non-PyTorch environments or when you need ONNX Runtime's optimizations. Use TorchScript when your model uses complex dynamic logic.
How do I version ML models in production?
Use a model registry (MLflow, Weights and Biases, SageMaker Model Registry) that tracks model artifacts, metrics, lineage, and deployment status. Tag each model with a semantic version. Never overwrite a model artifact -- always create a new version. Store models in object storage with versioned paths like s3://models/my-model/v1.2.0/.
What is data drift and how do I detect it?
Data drift is when the distribution of production input data diverges from the training data distribution. This causes model accuracy to degrade silently. Detect it by monitoring input feature distributions with statistical tests (KS test, PSI) and by tracking prediction output distributions. Retrain when drift exceeds your defined thresholds.
Should I use serverless GPU for ML inference?
Serverless GPU (Modal, Banana, Replicate) is excellent for bursty workloads with periods of zero traffic. Cold starts are the tradeoff -- spinning up a GPU container takes 10-60 seconds. For consistent traffic above a few requests per second, dedicated GPU instances are more cost-effective. Serverless shines for batch processing and development/staging environments.
How do I reduce inference latency for transformer models?
Stack multiple techniques: quantize to INT8 or INT4 (2-4x speedup), use KV-cache for autoregressive generation, enable Flash Attention, batch requests with dynamic batching, use TensorRT or ONNX Runtime for graph optimization. For LLMs specifically, vLLM's PagedAttention provides the best throughput-to-latency ratio on NVIDIA GPUs.
Ship It, Then Improve It
The biggest mistake I see teams make is trying to build the perfect ML infrastructure before they've deployed a single model. Start with the simplest thing that works: a FastAPI container, a health check, basic latency and error rate monitoring. Deploy to a single node. Get real traffic flowing and real feedback loops established. Then add versioning, canary deployments, drift detection, and auto-scaling as the system matures. Every piece of infrastructure you add should solve a problem you've actually experienced, not one you've imagined.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
Related Articles
How LLM Inference Works: Tokens, Context Windows, and KV Cache
Language models process tokens, not words. Learn how BPE tokenization works, what the context window really is, and how the KV cache speeds up generation — with real pricing comparisons across OpenAI, Anthropic, and Google.
12 min read
Security: Certificate Management at Scale: Let's Encrypt, ACME, and cert-manager
Automate TLS certificates with Let's Encrypt, ACME protocol, and cert-manager in Kubernetes. Covers HTTP-01, DNS-01, wildcards, private CAs, and expiry monitoring.
9 min read
Security: Secret Management: HashiCorp Vault vs AWS Secrets Manager vs Kubernetes Secrets
Compare Vault, AWS Secrets Manager, and Kubernetes Secrets. Learn about dynamic secrets, rotation, injection patterns, and when to use each tool.
9 min read