MLOps is not DevOps with a model bolted on. It has unique challenges: training data changes, model accuracy degrades silently, experiments need reproducibility, and serving requires low-latency infrastructure. Here's the playbook for building ML systems that last.
The ML Lifecycle (and Where It Breaks)
Most ML projects fail not because the model is bad but because the pipeline around it is fragile. There are five stages where things go wrong:
- Data Ingestion: Silent schema changes upstream break feature pipelines.
- Feature Engineering: Training/serving skew, where the transformations applied at training time differ from those applied at inference time.
- Training: Non-reproducible experiments, forgotten hyperparameters.
- Evaluation: Metrics on stale test sets miss real-world distribution shifts.
- Serving: Cold start latency, memory limits, no rollback plan.
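The data-ingestion failure in the list above is the cheapest one to catch, because a schema check can run before any feature code does. Here is a minimal sketch, assuming rows arrive as plain dictionaries. The `EXPECTED_SCHEMA` columns and the `validate_row` helper are hypothetical names chosen for illustration, not part of any library.

```python
from typing import Any

# Hypothetical schema for illustration: column name -> expected Python type.
EXPECTED_SCHEMA: dict[str, type] = {
    "customer_id": str,
    "tenure_months": int,
    "monthly_charges": float,
}


def validate_row(row: dict[str, Any], schema: dict[str, type] = EXPECTED_SCHEMA) -> list[str]:
    """Return a list of schema violations; an empty list means the row is valid."""
    errors = []
    for column, expected_type in schema.items():
        if column not in row:
            errors.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            errors.append(f"{column}: expected {expected_type.__name__}, got {actual}")
    # Columns the schema does not know about usually signal an upstream change.
    for column in sorted(row.keys() - schema.keys()):
        errors.append(f"unexpected column: {column}")
    return errors
```

Failing the pipeline loudly when `validate_row` returns anything is usually preferable to letting a renamed or retyped column flow silently into feature engineering.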
Experiment Tracking with MLflow
Every training run should be logged. MLflow makes this frictionless:
```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score

# X_train, y_train, X_test, y_test are assumed to be prepared upstream.
mlflow.set_experiment("churn-prediction-v3")

with mlflow.start_run(run_name="GBT-depth6"):
    params = {"n_estimators": 300, "max_depth": 6, "learning_rate": 0.05}
    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    # Log hyperparameters and evaluation metrics for this run.
    mlflow.log_params(params)
    mlflow.log_metrics({
        "f1": f1_score(y_test, preds),
        "roc_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
    })

    # Store the model artifact and register it under a named model.
    mlflow.sklearn.log_model(model, "model", registered_model_name="ChurnModel")

    # active_run() is only set inside the `with` block.
    print(f"Run ID: {mlflow.active_run().info.run_id}")
```
CI/CD for ML Models
Training pipelines need automated quality gates before any model reaches production:
- Data validation: Great Expectations or Deepchecks on every new dataset batch.
- Model validation: New model must beat the current champion on a held-out evaluation set.
- Shadow deployment: Run new model in parallel, log predictions, compare distributions before switching traffic.
- Automated rollback: Monitor p95 latency and error rate; roll back automatically if either one breaches its threshold.
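The model-validation and rollback gates above reduce to two small predicates that a pipeline can call. The sketch below is one possible shape. The `GateThresholds` defaults are placeholder values rather than recommendations, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    # Placeholder values for illustration; tune them per service.
    min_metric_gain: float = 0.0       # required margin over the champion
    max_p95_latency_ms: float = 200.0  # rollback trigger for latency
    max_error_rate: float = 0.01       # rollback trigger for errors (fraction)


def should_promote(champion_metric: float, challenger_metric: float,
                   thresholds: GateThresholds = GateThresholds()) -> bool:
    """Promote only if the challenger beats the champion by the required margin.

    Both metrics are assumed to come from the same held-out evaluation set
    and to be higher-is-better (for example F1 or ROC AUC).
    """
    return challenger_metric > champion_metric + thresholds.min_metric_gain


def should_rollback(p95_latency_ms: float, error_rate: float,
                    thresholds: GateThresholds = GateThresholds()) -> bool:
    """Roll back if either live signal breaches its threshold."""
    return (p95_latency_ms > thresholds.max_p95_latency_ms
            or error_rate > thresholds.max_error_rate)
```

Keeping the thresholds in one immutable object makes them easy to version alongside the model, so a gate decision can be reproduced later.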
Key Monitoring Metrics
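Track prediction distribution drift (Population Stability Index, PSI), input feature drift (KL divergence), label drift (if ground truth is available), and business KPIs. Alert when PSI exceeds 0.2, a commonly used rule-of-thumb threshold for a significant shift.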
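PSI compares the binned distribution of a baseline sample with a recent sample: PSI = Σ (a_i − e_i) · ln(a_i / e_i), where e_i and a_i are the fractions of baseline and recent values in bin i. The following sketch uses only the standard library. Equal-width bins taken from the baseline range, the default of 10 bins, and the `eps` floor for empty bins are all assumptions of this example, and other binning schemes such as quantile bins are equally common.

```python
import math
from bisect import bisect_right
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline and a recent sample."""
    lo, hi = min(expected), max(expected)
    if lo == hi:
        raise ValueError("baseline sample has no spread; cannot build bins")

    # Interior edges of equal-width bins over the baseline range.
    # Values outside that range fall into the first or last bin.
    edges = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]

    def fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A typical use is to keep the model's validation-time scores as `expected`, compute `psi` against each day's production scores, and raise an alert when the result crosses the chosen threshold.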
Serving Architecture
| Pattern | Use Case | Latency Target |
|---|---|---|
| REST API (FastAPI) | General purpose, < 100 req/s | < 200 ms p99 |
| Triton Inference Server | GPU models, high throughput | < 20 ms p99 |
| Batch Scoring | Nightly predictions at scale | Hours are acceptable |
| Streaming (Kafka) | Real-time event scoring | < 50 ms p99 |
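The p99 targets in the table are only useful if they are measured the same way each time. As a rough sketch, the helpers below time a prediction callable and check the 99th percentile against a target. The function names are hypothetical, and `predict` stands in for whatever client call reaches the serving endpoint.

```python
import statistics
import time
from typing import Any, Callable, Iterable, Sequence


def measure_latencies_ms(predict: Callable[[Any], Any],
                         payloads: Iterable[Any]) -> list[float]:
    """Call predict once per payload and return wall-clock latencies in ms."""
    latencies = []
    for payload in payloads:
        start = time.perf_counter()
        predict(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def p99_ms(latencies_ms: Sequence[float]) -> float:
    """99th percentile: the last of the 99 cut points for n=100."""
    return statistics.quantiles(latencies_ms, n=100)[98]


def meets_target(latencies_ms: Sequence[float], target_p99_ms: float) -> bool:
    """True if the measured p99 is strictly below the target."""
    return p99_ms(latencies_ms) < target_p99_ms
```

Percentile estimates from a few dozen requests are noisy, so a check like this is more trustworthy when it runs over several hundred requests with realistic payloads.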