Getting a model to 90% accuracy in a notebook is the easy part. Getting it to serve 50,000 requests per second with p99 latency under 100ms is where most ML projects stall. This guide covers the architecture, tooling, and trade-offs you need to deploy ML at real scale.

Choosing a Serving Stack

The serving stack depends on your model type, traffic pattern, and latency requirements:

  • FastAPI + Uvicorn: Simple, CPU-friendly models at under roughly 1,000 req/s. Great for internal tools.
  • TorchServe / BentoML: PyTorch models with basic batching. Medium traffic.
  • Triton Inference Server: GPU models, mixed frameworks (TF/PyTorch/ONNX), dynamic batching, model ensembles. Designed for high throughput.
  • vLLM / TGI: LLM-specific serving with PagedAttention and continuous batching. For transformer models only.
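
The rules of thumb above can be written down as a small decision helper. This is only a sketch: the `Workload` type, the `suggest_stack` function and the 10,000 req/s cutoff for "medium traffic" are hypothetical, since the list gives no number for that tier.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    framework: str   # e.g. "pytorch", "tensorflow", "onnx", "sklearn"
    is_llm: bool     # transformer LLM served for text generation
    needs_gpu: bool  # model is too slow or too large for CPU serving
    peak_rps: int    # expected peak requests per second


# Assumed cutoff for "medium traffic"; not a figure from the list above.
MEDIUM_TRAFFIC_RPS = 10_000


def suggest_stack(w: Workload) -> str:
    """Return the serving stack the rules of thumb point to."""
    if w.is_llm:
        return "vLLM / TGI"
    if not w.needs_gpu and w.peak_rps < 1000:
        return "FastAPI + Uvicorn"
    if w.framework == "pytorch" and w.peak_rps < MEDIUM_TRAFFIC_RPS:
        return "TorchServe / BentoML"
    # GPU models, mixed frameworks, or high throughput
    return "Triton Inference Server"


print(suggest_stack(Workload("sklearn", False, False, 200)))
print(suggest_stack(Workload("onnx", False, True, 50_000)))
```

Treat the output as a starting point for benchmarking rather than a verdict; real choices also depend on team experience and existing infrastructure.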

Dynamic Batching: The Key to GPU Utilisation

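GPUs are parallelism machines. Serving one request at a time can waste as much as 95% of GPU capacity, because a single input rarely fills the available compute. Dynamic batching queues incoming requests over a short time window and runs them through the model together in one forward pass: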

Triton config.pbtxt
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 5000  # wait up to 5 ms to fill a batch
  priority_levels: 3                  # priorities are integers; 1 is the highest
  default_priority_level: 2           # medium priority for requests that set none
}
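
To see how the queue delay and batch size interact, here is a simplified simulation in plain Python. It is not Triton's scheduler: the hypothetical `form_batches` function assumes a batch is dispatched as soon as it reaches the maximum size, or once its oldest request has waited `max_queue_delay_us`, and it ignores preferred sizes and model instance availability.

```python
from typing import List


def form_batches(arrivals_us: List[int],
                 max_batch_size: int = 32,
                 max_queue_delay_us: int = 5000) -> List[List[int]]:
    """Group request arrival times (in microseconds) into batches."""
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrivals_us):
        # A request arriving after the oldest queued request's deadline
        # cannot join that batch, so the open batch is dispatched first.
        if current and t - current[0] > max_queue_delay_us:
            batches.append(current)
            current = []
        current.append(t)
        # A full batch is dispatched immediately, without waiting.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Three requests arrive close together, two more arrive 9 ms in.
print(form_batches([0, 1000, 2000, 9000, 9500], max_batch_size=4))
```

A longer delay produces fuller batches and better throughput, but every request in the batch pays up to that delay in added latency, so the window has to fit inside the latency budget.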

Kubernetes for ML Serving

K8s handles auto-scaling, rolling updates, and health checks. Key configurations for ML workloads:

YAML
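# Container settings (these belong in the Deployment's container spec)
resources:
  requests:
    memory: "4Gi"
    cpu: "2"
    nvidia.com/gpu: "1"
  limits:
    memory: "8Gi"
    nvidia.com/gpu: "1"
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 5
# Schematic only: in a real cluster these fields live in the spec of a
# separate HorizontalPodAutoscaler (autoscaling/v2) object.
horizontalPodAutoscaler:
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "10"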

Scale on queue depth, not CPU

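CPU utilisation is a lagging indicator for ML workloads, especially GPU-bound ones, where the CPU can sit mostly idle while requests pile up. Scaling inference pods on request queue depth reacts to load as soon as a backlog forms, which makes auto-scaling far more responsive.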
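
As a rough illustration of what the autoscaler does with a queue-depth metric, the sketch below computes a replica count from a total queue depth. It assumes an `AverageValue` target, where the desired count is the total metric value divided by the per-pod target, rounded up and clamped to the replica bounds. The `desired_replicas` function is hypothetical, and it leaves out the autoscaler's tolerance band and stabilisation windows.

```python
import math


def desired_replicas(total_queue_depth: float,
                     target_per_pod: float = 10,
                     min_replicas: int = 2,
                     max_replicas: int = 20) -> int:
    """Replica count for an AverageValue target on queue depth."""
    raw = math.ceil(total_queue_depth / target_per_pod)
    # Never scale below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, raw))


for depth in (5, 75, 1000):
    print(depth, "queued ->", desired_replicas(depth), "replicas")
```

With the settings shown earlier (target of 10 per pod, 2 to 20 replicas), a backlog of 75 queued requests asks for 8 pods, while a very large backlog is capped at 20.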

Model Optimisation for Serving

| Technique | Speedup | Accuracy Loss | Effort |
|---|---|---|---|
| ONNX Export | 1.2–2x | None | Low |
| TensorRT (FP16) | 2–4x | <0.1% | Medium |
| INT8 Quantisation | 3–6x | 0.5–2% | Medium |
| Knowledge Distillation | 5–10x | 1–5% | High |
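
To make the accuracy cost of INT8 quantisation concrete, here is a minimal sketch of symmetric per-tensor quantisation in plain Python. It is an illustration under simple assumptions (one scale for the whole tensor, no calibration data), not what TensorRT or any other toolkit does internally; the function names and sample weights are made up.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 0.9]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("quantised:", q)
print("largest rounding error:", max_error)
```

Each weight moves by at most half a quantisation step, so small values near zero can vanish entirely (0.003 becomes 0 here). That rounding is the source of the accuracy loss listed in the table.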