Deploying ML Models in Production: From Notebook to Kubernetes
End-to-end guide to deploying ML models -- from ONNX export and FastAPI serving to Kubernetes GPU workloads, canary deployments, and Prometheus monitoring.

Your Jupyter Notebook Is Not a Production System
You've trained a model. It works in your notebook. Metrics look great. Now what? Deploying ML models in production is where most data science projects go to die. The gap between a working prototype and a reliable, scalable inference service is enormous -- and it's mostly an engineering problem, not a data science one.
I've deployed models ranging from lightweight scikit-learn classifiers to multi-GPU transformer stacks on Kubernetes. The tooling has gotten dramatically better in the last two years, but the fundamental challenges remain: model serialization, inference serving, containerization, GPU scheduling, versioning, and monitoring. Here's the end-to-end path from notebook to production.
What Does ML Model Deployment Mean?
Definition: ML model deployment is the process of making a trained machine learning model available for real-time or batch inference in a production environment. This involves exporting the model to a portable format, wrapping it in an API or inference server, containerizing it, and orchestrating it with monitoring, versioning, and scaling infrastructure.
Step 1: Export Your Model to a Portable Format
Your training framework's native format is rarely what you want to serve in production. Export to a format optimized for inference.
| Format | Framework | Best For | Key Advantage |
|---|---|---|---|
| ONNX | Any (PyTorch, TF, sklearn) | Cross-platform deployment | Framework-agnostic, hardware-optimized runtimes |
| TorchScript | PyTorch | PyTorch-native serving | Preserves dynamic computation graphs |
| SavedModel | TensorFlow | TF Serving, TFLite | TensorFlow ecosystem integration |
| GGUF | llama.cpp | LLM inference on CPU/consumer GPU | Quantized, runs on commodity hardware |
import torch
import onnx

# PyTorch to ONNX export
model = MyModel()
model.load_state_dict(torch.load("model.pt"))
model.eval()  # switch to inference mode (disables dropout, freezes batch norm)
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["image"],
    output_names=["prediction"],
    dynamic_axes={"image": {0: "batch_size"}, "prediction": {0: "batch_size"}},
)

# Verify the exported model
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)
Pro tip: ONNX Runtime (ORT) provides significant inference speedups over native PyTorch -- typically 2-4x for transformer models on CPU. Even if you're staying on PyTorch for training, exporting to ONNX for inference is almost always worth it. Test accuracy on a validation set after export to catch conversion issues.
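That post-export accuracy check can be made concrete. A minimal sketch, assuming you already have the PyTorch logits and the ONNX Runtime logits for the same validation batch as numpy arrays (the `export_parity` helper is hypothetical, not part of any library):

```python
import numpy as np

def export_parity(ref_logits: np.ndarray, onnx_logits: np.ndarray):
    """Quantify drift introduced by the export.

    ref_logits:  outputs from the original PyTorch model, shape (N, classes)
    onnx_logits: outputs from ONNX Runtime on the same batch
    """
    max_diff = float(np.max(np.abs(ref_logits - onnx_logits)))
    top1_agreement = float(np.mean(
        ref_logits.argmax(axis=1) == onnx_logits.argmax(axis=1)
    ))
    return max_diff, top1_agreement

# Typical usage after export (names refer to the code above):
# ref = model(batch).detach().numpy()
# out = session.run(None, {"image": batch.numpy()})[0]
# max_diff, agreement = export_parity(ref, out)
```

A max absolute difference in the 1e-5 range and 100% top-1 agreement is normal; anything larger usually points at an unexported preprocessing step or an unsupported operator silently approximated during conversion.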
Step 2: Build an Inference API
Wrap your model in an HTTP API. You have two main paths: build your own with FastAPI, or use an inference server like Triton.
FastAPI Approach (Simple, Flexible)
from fastapi import FastAPI
from pydantic import BaseModel
import onnxruntime as ort
import numpy as np

app = FastAPI()
session = ort.InferenceSession("model.onnx")

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: float
    confidence: float

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    input_array = np.array([request.features], dtype=np.float32)
    # the key must match the input name baked into the exported graph
    outputs = session.run(None, {"input": input_array})
    return PredictionResponse(
        prediction=float(outputs[0][0]),
        confidence=float(outputs[1][0]),
    )

@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": session is not None}
NVIDIA Triton Inference Server (High-Performance)
Triton handles model loading, batching, GPU scheduling, and multi-model serving out of the box. It's more complex to set up but essential for high-throughput production workloads.
# Model repository structure for Triton
models/
  my_model/
    config.pbtxt
    1/
      model.onnx

# config.pbtxt
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  { name: "input" dims: [3, 224, 224] data_type: TYPE_FP32 }
]
output [
  { name: "prediction" dims: [1000] data_type: TYPE_FP32 }
]
dynamic_batching {
  preferred_batch_size: [8, 16]
  max_queue_delay_microseconds: 100
}
Direct answer: Use FastAPI for simple models, low traffic (under 100 requests/second), or when you need custom pre/post-processing logic. Use Triton for high-throughput GPU inference, multi-model serving, or when you need dynamic batching to maximize GPU utilization. Many teams start with FastAPI and migrate to Triton as traffic grows.
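If you stay on FastAPI but want some of Triton's batching benefit, the core idea can be hand-rolled: hold requests for a few milliseconds, then make one model call for the whole batch. A rough sketch in plain asyncio (the `MicroBatcher` class is illustrative, not Triton's actual implementation or a library API):

```python
import asyncio

class MicroBatcher:
    """Collect concurrent requests and run inference once per batch."""

    def __init__(self, infer_fn, max_batch=8, max_wait_ms=5):
        self.infer_fn = infer_fn        # takes a list of inputs, returns a list of outputs
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.pending = []               # list of (payload, future) tuples
        self.lock = asyncio.Lock()
        self.timer = None

    async def predict(self, payload):
        fut = asyncio.get_running_loop().create_future()
        async with self.lock:
            self.pending.append((payload, fut))
            if len(self.pending) >= self.max_batch:
                self._flush()           # batch is full: run immediately
            elif self.timer is None:
                self.timer = asyncio.create_task(self._flush_later())
        return await fut

    async def _flush_later(self):
        await asyncio.sleep(self.max_wait)  # give more requests time to arrive
        async with self.lock:
            self.timer = None
            self._flush()

    def _flush(self):
        # Called with self.lock held.
        if self.timer is not None:
            self.timer.cancel()
            self.timer = None
        batch, self.pending = self.pending, []
        if not batch:
            return
        results = self.infer_fn([p for p, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)
```

In production you would run `infer_fn` in an executor so it doesn't block the event loop; Triton gives you this, plus priority queues and per-model tuning, out of the box.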
Step 3: Containerize with Docker
A Docker container packages your model, dependencies, and inference code into a reproducible, deployable unit.
# Multi-stage build for smaller image
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY model.onnx .
COPY app.py .
# Non-root user for security
RUN useradd -m appuser
USER appuser
EXPOSE 8000
# slim images don't ship curl, so probe with Python's stdlib instead
HEALTHCHECK --interval=30s CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
Watch out: Don't bake large model files into Docker images. A 5GB model file means a 5GB+ image that takes forever to pull. Instead, store models in object storage (S3, GCS) and download them at container startup, or use model registries like MLflow. Cache the model on persistent volumes in Kubernetes so restarts are fast.
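A minimal sketch of the download-at-startup pattern, using only the standard library (the URL and cache path are placeholders; in practice you'd point this at S3 or GCS via boto3, a signed URL, or an MLflow client):

```python
import os
import urllib.request

def ensure_model(url: str, cache_path: str) -> str:
    """Fetch the model once; reuse the cached copy on later restarts."""
    if not os.path.exists(cache_path):
        os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
        tmp = cache_path + ".tmp"
        urllib.request.urlretrieve(url, tmp)
        os.replace(tmp, cache_path)  # rename so readers never see a partial file
    return cache_path

# At container startup, before creating the inference session:
# model_path = ensure_model(os.environ["MODEL_URL"], "/models/model.onnx")
# session = ort.InferenceSession(model_path)
```

Mount a persistent volume at the cache directory and a restarted pod skips the download entirely, which is the behavior the "Watch out" above is aiming for.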
Step 4: Deploy on Kubernetes with GPU Support
Kubernetes is the standard orchestration platform for ML workloads, but GPU scheduling adds complexity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
        - name: inference
          image: your-registry/ml-model:v1.2.0
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "4"
              memory: "8Gi"
              nvidia.com/gpu: "1"
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          env:
            - name: MODEL_PATH
              value: "s3://models/my-model/v1.2.0/model.onnx"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-a100
---
apiVersion: v1
kind: Service
metadata:
  name: ml-inference
spec:
  selector:
    app: ml-inference
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
GPU Scheduling Essentials
- Install the NVIDIA device plugin -- this makes GPUs visible to the Kubernetes scheduler as nvidia.com/gpu resources
- Use node selectors or taints -- isolate GPU nodes from general workloads to prevent CPU pods from landing on expensive GPU instances
- Set resource limits precisely -- GPUs can't be shared between pods by default (GPU time-slicing and MIG change this, but add complexity)
- Plan for cold starts -- model loading takes time, especially for large models. Set generous initialDelaySeconds on readiness probes
- Use the Horizontal Pod Autoscaler -- scale based on GPU utilization or request queue depth, not just CPU
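The cold-start point is worth wiring into the readiness probe itself: load the model on a background thread and report not-ready until it finishes, so Kubernetes holds traffic back instead of timing out live requests. A sketch (the `ModelHolder` class is illustrative, not a framework feature):

```python
import threading

class ModelHolder:
    """Load a model in the background; the readiness probe checks ready()."""

    def __init__(self, load_fn):
        self.model = None
        self._thread = threading.Thread(
            target=self._load, args=(load_fn,), daemon=True
        )
        self._thread.start()

    def _load(self, load_fn):
        self.model = load_fn()  # e.g. ort.InferenceSession(model_path)

    def ready(self) -> bool:
        return self.model is not None

# Wiring into the FastAPI app from Step 2 (sketch):
# holder = ModelHolder(lambda: ort.InferenceSession("model.onnx"))
# @app.get("/health")
# async def health():
#     code = 200 if holder.ready() else 503
#     return JSONResponse({"ready": holder.ready()}, status_code=code)
```

With this in place, initialDelaySeconds becomes a safety margin rather than a guess: the probe turns green exactly when the model is usable.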
Step 5: Model Versioning and Deployment Strategies
| Strategy | How It Works | Risk Level | Best For |
|---|---|---|---|
| Blue/Green | Deploy new version alongside old, switch traffic all at once | Medium | When you can validate fully before switching |
| Canary | Route small % of traffic to new version, gradually increase | Low | Production changes where regression is costly |
| A/B Testing | Route different user segments to different models | Low | Comparing model performance on real traffic |
| Shadow | Run new model in parallel, compare outputs without serving them | Very low | Validating before any production exposure |
# Istio VirtualService for canary deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ml-inference
spec:
  hosts:
    - ml-inference
  http:
    - route:
        - destination:
            host: ml-inference-v1
          weight: 90
        - destination:
            host: ml-inference-v2
          weight: 10
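Without a service mesh, the same 90/10 split can be approximated in application code. A sketch of hash-based bucketing (the `route` helper is hypothetical); hashing a stable key such as a user ID pins each caller to one version, which is also what you want for the A/B testing strategy in the table above:

```python
import hashlib

def route(key: str, canary_weight: int = 10) -> str:
    """Map a request key to a model version; canary_weight is a percentage."""
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < canary_weight else "v1"
```

Ramping the canary is then a config change to `canary_weight`; because the hash is stable, users already on v2 stay on v2 as the weight increases.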
Step 6: Monitoring with Prometheus
Standard application metrics aren't enough for ML systems. You need model-specific metrics.
from prometheus_client import Histogram, Counter, Gauge

PREDICTION_LATENCY = Histogram(
    "model_prediction_seconds",
    "Time to generate a prediction",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)
PREDICTION_COUNT = Counter(
    "model_predictions_total",
    "Total predictions",
    ["model_version", "status"],
)
GPU_MEMORY = Gauge(
    "model_gpu_memory_bytes",
    "GPU memory usage",
)
PREDICTION_DISTRIBUTION = Histogram(
    "model_output_value",
    "Distribution of model output values",
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
)

@app.post("/predict")
async def predict(request: PredictionRequest):
    with PREDICTION_LATENCY.time():
        result = run_inference(request)
    PREDICTION_COUNT.labels(model_version="v1.2.0", status="success").inc()
    PREDICTION_DISTRIBUTION.observe(result.confidence)
    return result
Pro tip: Track prediction output distributions over time. A sudden shift in the distribution of model outputs -- even if no errors are thrown -- is a strong signal of data drift. Set alerts on distribution divergence metrics like KL divergence or Population Stability Index (PSI) to catch model degradation before users notice.
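PSI is simple enough to compute inline. A sketch with numpy: bin the baseline (training-time) outputs into deciles, then measure how far the live distribution has moved (the `psi` helper is illustrative; the 0.1 and 0.25 thresholds are the commonly cited rule of thumb, not a hard standard):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live sample."""
    # Decile edges from the baseline; open the ends to catch out-of-range values
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) on empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

Run it on a sliding window of recent prediction outputs against a frozen baseline sample; by the usual convention, PSI below 0.1 means no meaningful shift, 0.1-0.25 is worth investigating, and above 0.25 is a strong drift signal to alert on.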
Managed Alternatives: When to Skip the Infrastructure
| Service | Best For | Starting Price | Key Advantage |
|---|---|---|---|
| AWS SageMaker | Full MLOps lifecycle | ~$0.05/hour (ml.t3.medium) | Deep AWS integration, built-in A/B testing |
| Google Vertex AI | GCP-native teams | ~$0.05/hour (n1-standard-2) | AutoML, pipeline orchestration, Gemini integration |
| Azure ML | Enterprise / Microsoft shops | ~$0.05/hour (Standard_DS2_v2) | Azure DevOps integration, managed endpoints |
| Modal | Fast iteration, serverless GPU | Pay per second of compute | Deploy from a Python decorator, zero infra config |
| Replicate | Open-source model hosting | Pay per second of compute | One-line deployment of popular models |
Frequently Asked Questions
Do I need Kubernetes to deploy ML models?
No. For simple models with predictable traffic, a single container on ECS, Cloud Run, or even a VM is fine. Kubernetes adds value when you need GPU scheduling, multi-model serving, canary deployments, or auto-scaling based on custom metrics. Don't adopt Kubernetes just for ML -- adopt it when orchestration complexity justifies the operational overhead.
How do I handle models that are too large for a single GPU?
Use model parallelism to split the model across multiple GPUs. Frameworks like vLLM, TensorRT-LLM, and DeepSpeed Inference handle this automatically for transformer models. For custom architectures, you'll need to implement tensor or pipeline parallelism manually. Consider quantization first -- a 4-bit quantized model often fits on a single GPU with minimal quality loss.
What's the difference between ONNX and TorchScript?
ONNX is framework-agnostic and has optimized runtimes for various hardware (CPU, GPU, edge devices). TorchScript is PyTorch-specific but preserves dynamic control flow that ONNX may not support. Use ONNX when deploying to non-PyTorch environments or when you need ONNX Runtime's optimizations. Use TorchScript when your model uses complex dynamic logic.
How do I version ML models in production?
Use a model registry (MLflow, Weights and Biases, SageMaker Model Registry) that tracks model artifacts, metrics, lineage, and deployment status. Tag each model with a semantic version. Never overwrite a model artifact -- always create a new version. Store models in object storage with versioned paths like s3://models/my-model/v1.2.0/.
What is data drift and how do I detect it?
Data drift is when the distribution of production input data diverges from the training data distribution. This causes model accuracy to degrade silently. Detect it by monitoring input feature distributions with statistical tests (KS test, PSI) and by tracking prediction output distributions. Retrain when drift exceeds your defined thresholds.
Should I use serverless GPU for ML inference?
Serverless GPU (Modal, Banana, Replicate) is excellent for bursty workloads with periods of zero traffic. Cold starts are the tradeoff -- spinning up a GPU container takes 10-60 seconds. For consistent traffic above a few requests per second, dedicated GPU instances are more cost-effective. Serverless shines for batch processing and development/staging environments.
How do I reduce inference latency for transformer models?
Stack multiple techniques: quantize to INT8 or INT4 (2-4x speedup), use KV-cache for autoregressive generation, enable Flash Attention, batch requests with dynamic batching, use TensorRT or ONNX Runtime for graph optimization. For LLMs specifically, vLLM's PagedAttention provides the best throughput-to-latency ratio on NVIDIA GPUs.
Ship It, Then Improve It
The biggest mistake I see teams make is trying to build the perfect ML infrastructure before they've deployed a single model. Start with the simplest thing that works: a FastAPI container, a health check, basic latency and error rate monitoring. Deploy to a single node. Get real traffic flowing and real feedback loops established. Then add versioning, canary deployments, drift detection, and auto-scaling as the system matures. Every piece of infrastructure you add should solve a problem you've actually experienced, not one you've imagined.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
Related Articles
How LLM Inference Works: Tokens, Context Windows, and KV Cache
Language models process tokens, not words. Learn how BPE tokenization works, what the context window really is, and how the KV cache speeds up generation — with real pricing comparisons across OpenAI, Anthropic, and Google.
12 min read
Security: Certificate Management at Scale: Let's Encrypt, ACME, and cert-manager
Automate TLS certificates with Let's Encrypt, ACME protocol, and cert-manager in Kubernetes. Covers HTTP-01, DNS-01, wildcards, private CAs, and expiry monitoring.
9 min read
Security: Secret Management: HashiCorp Vault vs AWS Secrets Manager vs Kubernetes Secrets
Compare Vault, AWS Secrets Manager, and Kubernetes Secrets. Learn about dynamic secrets, rotation, injection patterns, and when to use each tool.
9 min read