Deploying ML at Scale — Complete Guide (7)

Getting a model to 90% accuracy in a notebook is the easy part. Getting it to serve 50,000 requests per second with p99 latency under 100ms is where most ML projects stall. This guide covers the architecture, tooling, and trade-offs you need to deploy ML at real scale.

Choosing a Serving Stack

The serving stack depends on your model type, traffic pattern, and latency requirements:

FastAPI + Uvicorn: Simple, CPU-friendly models at under roughly 1,000 req/s. Great for internal tools.
TorchServe / BentoML: Packaged model servers with basic batching (TorchServe targets PyTorch; BentoML is framework-agnostic). Medium traffic.
Triton Inference Server: GPU models, mixed frameworks (TF/PyTorch/ONNX), dynamic batching, model ensembles. Designed for high throughput.
vLLM / TGI: LLM-specific serving with PagedAttention and continuous batching. For transformer-based language models only.

Dynamic Batching: The Key to GPU Utilisation

GPUs are parallelism machines. Serving one request at a time can leave most of a GPU's capacity idle, because a single input rarely fills the available compute. Dynamic batching queues incoming requests for a short time window and runs them through the model together:

Triton config.pbtxt
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 5000   # wait up to 5ms to fill a batch
  priority_levels: 3                   # enable three priority levels (1 is the highest)
  default_priority_level: 2            # used by requests that do not set a priority
}
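
To make the time-window idea concrete, here is a minimal Python sketch of the queueing logic. It is not Triton's scheduler, which also handles preferred batch sizes and priorities; `collect_batch` and its parameters are hypothetical names chosen for illustration. It blocks until one request arrives, then waits a bounded time for more before handing the batch to the model.

```python
import queue
import time


def collect_batch(request_queue, max_batch_size=32, max_delay_s=0.005):
    """Return a list of queued requests, waiting at most max_delay_s to fill it.

    Hypothetical helper: blocks for the first request, then keeps pulling
    requests until the batch is full or the delay window has elapsed.
    """
    batch = [request_queue.get()]              # block until at least one request arrives
    deadline = time.monotonic() + max_delay_s  # window starts with the first request
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                              # window expired: run a partial batch
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break                              # no more requests arrived in time
    return batch
```

A serving loop would call `collect_batch` repeatedly and run one forward pass per returned batch. A larger `max_delay_s` tends to produce fuller batches at the cost of added latency for the first request in each batch.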

Kubernetes for ML Serving

K8s handles auto-scaling, rolling updates, and health checks. Key configurations for ML workloads:

YAML
resources:
  requests:
    memory: "4Gi"
    cpu: "2"
    nvidia.com/gpu: "1"
  limits:
    memory: "8Gi"
    nvidia.com/gpu: "1"

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 5

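# The HorizontalPodAutoscaler is a separate resource, not part of the pod spec.
# External metrics require a metrics adapter (e.g. Prometheus Adapter).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server       # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "10"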
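
The readiness probe above only helps if the server reports "not ready" while the model is still loading. The following is a minimal standard-library sketch of such an endpoint, under the assumption that a flag is set once loading finishes; `model_ready`, `HealthHandler`, and `start_health_server` are hypothetical names, and a real serving framework would normally provide its own health routes.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set by the (hypothetical) model-loading code once the model is in memory.
model_ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health/ready":
            # 503 tells Kubernetes to keep the pod out of the Service endpoints.
            status = 200 if model_ready.is_set() else 503
        else:
            status = 404
        self.send_response(status)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence per-request logging from frequent probes


def start_health_server(host="0.0.0.0", port=8000):
    """Serve the health endpoint on a background thread and return the server."""
    server = HTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The loading code would call `model_ready.set()` after the weights are loaded and a warm-up inference has succeeded, so traffic is only routed to pods that can actually answer.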

Scale on queue depth, not CPU

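CPU utilisation is a lagging indicator for ML workloads: with GPU-backed inference, the CPU can sit mostly idle while requests pile up waiting for the GPU. Scale your inference pods on request queue depth instead, which rises as soon as demand exceeds capacity and so gives much more responsive auto-scaling.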
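
As a rough illustration of how a queue-depth target turns into a replica count, the sketch below mirrors the basic autoscaler arithmetic for an AverageValue target: total metric value divided by the per-pod target, rounded up and clamped to the replica bounds. It deliberately ignores the autoscaler's tolerance band and stabilisation windows, and `desired_replicas` is a hypothetical helper, not a Kubernetes API.

```python
import math


def desired_replicas(total_queue_depth, target_per_pod=10,
                     min_replicas=2, max_replicas=20):
    """Approximate replica count for an AverageValue target on queue depth."""
    raw = math.ceil(total_queue_depth / target_per_pod)
    # Clamp to the configured minReplicas / maxReplicas bounds.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(95))    # 95 queued requests, target 10 per pod
print(desired_replicas(5))     # light load falls back to the minimum
print(desired_replicas(1000))  # heavy load is capped at the maximum
```

With the defaults taken from the configuration above, 95 queued requests ask for 10 pods, while very light or very heavy load is held at the 2 and 20 replica bounds.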

Model Optimisation for Serving

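The figures below are rough, indicative ranges; actual results depend on the model, hardware, and batch size.

Technique	Speedup	Accuracy Loss	Effort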
ONNX Export	1.2–2x	None	Low
TensorRT (FP16)	2–4x	<0.1%	Medium
INT8 Quantisation	3–6x	0.5–2%	Medium
Knowledge Distillation	5–10x	1–5%	High
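
To show where the accuracy loss in the INT8 row comes from, here is a minimal pure-Python sketch of symmetric per-tensor quantisation. It is an illustration only: production toolchains use calibration data, per-channel scales, and fused kernels, and the function names and sample weights here are made up for the example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the quantised integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]   # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantisation step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Small weights such as 0.003 collapse to zero because they fall below half a quantisation step, which is one reason quantised models can lose a little accuracy.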
