Lifecycle // 08 Stages

The Production ML Lifecycle

From a raw partition in object storage to a continuously retrained model serving millions of inferences per second. Each stage has its own contracts, artifacts, and failure modes. The stages are decomposed below by objective, key activities, inputs and outputs, metrics, and the pitfalls teams actually hit.

Stage Map

Linear · Cyclical · Always-on

Phase_01 · Lane // DATA

Data Ingestion & Versioning

Pull from streaming and batch sources. Snapshot, hash, and version every dataset for full reproducibility.

Objective

Produce an immutable, auditable record of every byte that enters the ML system, so any model can be re-trained from its exact source data months or years later.

Key Activities

→Schedule batch pulls (S3, warehouses) and tail streams (Kafka, Kinesis)
→Enforce schema contracts at the ingestion boundary
→Snapshot raw partitions to immutable object storage by date / event-time
→Hash and version each dataset; record lineage in a catalog
→Quarantine bad rows; emit ingestion telemetry
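
The boundary checks above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed implementation: the `CONTRACT` mapping, the `conforms` and `ingest` functions, and the column names are hypothetical, and a real pipeline would typically delegate to a schema registry and a versioning tool such as DVC or LakeFS.

```python
import hashlib
import json

# Hypothetical schema contract: column name -> required Python type.
CONTRACT = {"user_id": int, "event": str, "ts": str}


def conforms(row: dict) -> bool:
    """True when the row has exactly the contracted columns and types."""
    return row.keys() == CONTRACT.keys() and all(
        isinstance(row[col], typ) for col, typ in CONTRACT.items()
    )


def ingest(rows: list[dict]) -> dict:
    """Split rows into accepted/quarantined and fingerprint the accepted set."""
    accepted = [r for r in rows if conforms(r)]
    quarantined = [r for r in rows if not conforms(r)]
    # Canonical JSON (sorted keys) makes the hash independent of key order.
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in accepted)
    return {
        "accepted": accepted,
        "quarantined": quarantined,
        "dataset_hash": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "schema_violation_rate": len(quarantined) / len(rows) if rows else 0.0,
    }
```

The returned hash changes whenever any accepted row changes, so it can serve as the dataset version recorded in the catalog; the violation rate feeds the ingestion telemetry.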

Common Pitfalls

!Silent schema drift from upstream producers
!Mutating raw data in place (loses reproducibility)
!No PII classification at the boundary

Inputs

· Operational DBs (CDC)
· Event streams
· 3rd-party APIs
· Manual uploads

Outputs

· Bronze tables
· Versioned snapshots
· Lineage edges
· Data quality reports

Metrics / SLOs

· Freshness lag
· Row-count delta
· Schema-violation rate
· Ingest SLA attainment

Artifacts

> Raw partitions
> Schema contracts
> DVC manifests

Tools

Airflow · dbt · DVC · Kafka · LakeFS

Phase_02 · Lane // DATA

Feature Engineering

Transform raw signals into model-ready features. Centralize in a feature store to eliminate train/serve skew.

Objective

Compute, store, and serve features through a single source of truth so training and online inference always see the same logic.

Key Activities

→Author feature definitions as code, versioned in Git
→Backfill historical feature values for training
→Materialize features online for low-latency serving
→Compute point-in-time joins to prevent label leakage
→Document ownership, freshness, and SLAs per feature view
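
A point-in-time join can be sketched with the standard library alone. The function name and the tuple layouts are assumptions made for illustration; feature stores such as Feast or Tecton provide their own implementations. The sketch treats a feature stamped exactly at label time as already known, which some setups would instead exclude.

```python
from bisect import bisect_right


def point_in_time_join(labels, feature_history):
    """Attach to each label the latest feature value known at label time.

    labels: list of (entity_id, label_ts, label)
    feature_history: dict entity_id -> list of (feature_ts, value),
        sorted by feature_ts in ascending order
    """
    rows = []
    for entity_id, label_ts, label in labels:
        history = feature_history.get(entity_id, [])
        timestamps = [ts for ts, _ in history]
        # Position of the first feature timestamp strictly after label_ts.
        idx = bisect_right(timestamps, label_ts)
        # idx == 0 means no feature existed yet: emit None, never peek ahead.
        value = history[idx - 1][1] if idx > 0 else None
        rows.append((entity_id, label_ts, value, label))
    return rows
```

A naive "latest value per entity" join would hand the value stamped at time 30 to a label observed at time 25, which is exactly the future leakage this stage is meant to prevent.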

Common Pitfalls

!Different code paths for train vs. serve
!Future leakage from non-point-in-time joins
!Unbounded cardinality blowing up the online store

Inputs

· Bronze / silver tables
· Streaming events
· Embeddings
· External enrichments

Outputs

· Offline training tables
· Online key-value features
· Feature documentation

Metrics / SLOs

· Train/serve skew %
· Feature freshness
· Online lookup p99
· Coverage / null rate

Artifacts

> Feature views
> Backfills
> Online materializations

Tools

Feast · Tecton · Spark · Pandas · Flink

Phase_03 · Lane // MODEL

Training & Experimentation

Distributed training with hyperparameter sweeps. Every run logged with metrics, parameters, and artifacts.

Objective

Run reproducible experiments at scale, with full provenance from a Git SHA + data version to a candidate model.

Key Activities

→Define experiments declaratively (config + code SHA + data hash)
→Launch distributed training on GPUs/TPUs
→Run hyperparameter sweeps (Bayesian / population-based)
→Log metrics, params, system stats, and checkpoints to a tracking server
→Compare runs and tag promotion candidates
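
One way to tie a run to its config, code SHA, and data hash is a deterministic fingerprint. The sketch below is an assumption-laden illustration: `run_fingerprint` and `seeded_rng` are hypothetical helpers, not part of MLflow or any other tracking server, and the 16-character truncation is an arbitrary choice.

```python
import hashlib
import json
import random


def run_fingerprint(config: dict, code_sha: str, data_hash: str) -> str:
    """Deterministic ID for an experiment: same config + code + data -> same ID."""
    canonical = json.dumps(
        {"config": config, "code_sha": code_sha, "data_hash": data_hash},
        sort_keys=True,            # key order must not change the ID
        separators=(",", ":"),     # nor should whitespace
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def seeded_rng(config: dict) -> random.Random:
    """Isolated RNG seeded from the config, so sampling is repeatable."""
    return random.Random(config["seed"])
```

Logging the fingerprint with every run makes hidden config drift visible: a local run and a cluster run that claim to be the same experiment but produce different IDs differ somewhere in config, code, or data.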

Common Pitfalls

!Non-deterministic training (unset random seeds, unpinned CUDA versions)
!Hidden config drift between local and cluster runs
!Tracking only the winning runs, so negative results are lost

Inputs

· Feature snapshots
· Labels
· Base models / weights
· Compute quotas

Outputs

· Trained checkpoints
· Run metadata
· Leaderboards

Metrics / SLOs

· Eval loss / accuracy
· Time-to-train
· Cost / run
· GPU utilization

Artifacts

> Run logs
> Checkpoints
> Hyperparameter trials

Tools

MLflow · Weights & Biases · Ray Train · Optuna · PyTorch Lightning

Phase_04 · Lane // MODEL

Validation & Evaluation

Offline metrics, fairness audits, slice analysis, and shadow scoring against the current production model.

Objective

Decide whether a candidate is materially better than the incumbent — globally and on every business-critical slice — before any user sees it.

Key Activities

→Compute headline metrics on holdout + temporal splits
→Slice metrics by cohort (geo, device, tenant, sensitive attribute)
→Run fairness, calibration, and robustness checks
→Shadow-score against the live production model on a real traffic sample
→Generate a model card and sign off via review gate
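
The review gate can be expressed as a small, pre-registered check. Everything named here is hypothetical (`GateThresholds`, `evaluate_gate`, the slice names, and the threshold values); the sketch assumes a single higher-is-better metric reported per slice, with the headline number stored under the key `overall`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    """Fixed before the experiment runs (values here are illustrative)."""
    min_overall_lift: float = 0.005      # candidate must beat champion by this much
    max_slice_regression: float = 0.01   # no slice may drop by more than this


def evaluate_gate(champion: dict, candidate: dict, thresholds: GateThresholds) -> dict:
    """Compare per-slice metrics (higher is better) and return a gate decision."""
    deltas = {name: candidate[name] - champion[name] for name in champion}
    slice_deltas = {k: v for k, v in deltas.items() if k != "overall"}
    # The slice with the smallest delta is the one that regressed the most.
    worst_slice = min(slice_deltas, key=slice_deltas.get)
    passed = (
        deltas["overall"] >= thresholds.min_overall_lift
        and slice_deltas[worst_slice] >= -thresholds.max_slice_regression
    )
    return {
        "passed": passed,
        "lift": deltas["overall"],
        "worst_slice": worst_slice,
        "worst_slice_delta": slice_deltas[worst_slice],
    }
```

Because the decision depends on the worst slice as well as the headline lift, a candidate that improves the aggregate while one cohort regresses fails the gate.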

Common Pitfalls

!Optimizing aggregate metrics while a key slice regresses
!Eval set leaking into training
!No threshold defined before the experiment runs

Inputs

· Candidate model
· Champion model
· Eval datasets
· Slice definitions

Outputs

· Eval reports
· Pass/fail gate decision
· Model card

Metrics / SLOs

· Lift vs. champion
· Worst-slice delta
· Calibration error
· Robustness score
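
Calibration error is often reported as expected calibration error (ECE). The sketch below assumes binary labels, predicted positive-class probabilities, and equal-width bins; the function name and the default of 10 bins are illustrative choices, not something the stage description prescribes.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary predictions.

    probs: predicted probability of the positive class, each in [0, 1]
    labels: 0/1 ground truth
    ECE = sum over bins of (n_b / N) * |mean_label_b - mean_prob_b|
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # min() keeps p == 1.0 inside the last bin.
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        mean_label = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_label - mean_prob)
    return ece
```

A model that predicts 0.9 for a group in which only a quarter of cases are positive contributes a gap of 0.65 for that bin, weighted by the bin's share of all predictions.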

Artifacts

> Eval reports
> Slice metrics
> Bias scorecards

Tools

Great Expectations · Deepchecks · TFX Evaluator · Fairlearn

Phase_05 · Lane // MODEL

Packaging & Registry

Containerize the model with its inference contract. Promote through staging tiers in a model registry.

Objective

Turn a checkpoint into a portable, signed, version-pinned artifact with a stable inference API and clear promotion lineage.

Key Activities

→Freeze dependencies and build a reproducible container
→Wrap weights with a typed inference handler (predict / explain / health)
→Generate SBOM and scan for CVEs
→Sign the image; push it to an OCI registry and the model registry with stage tags
→Attach model card, eval report, and approval metadata

Common Pitfalls

!Mutable latest tags instead of immutable digests
!Packaging weights separately from the handler that expects them
!Unsigned images promoted to prod
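
The first pitfall is easy to check mechanically before promotion. This sketch assumes sha256 digests only and uses hypothetical helper names; it checks that a reference is pinned, not that the image is signed, which would be a separate verification step with a tool such as Cosign.

```python
import re

# A reference pinned by digest ends with "@sha256:" plus 64 lowercase hex characters.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned(image_ref: str) -> bool:
    """True when the reference names an immutable digest rather than a mutable tag."""
    return bool(_DIGEST.search(image_ref))


def unpinned_images(image_refs: list[str]) -> list[str]:
    """Return the references a promotion gate should reject."""
    return [ref for ref in image_refs if not is_pinned(ref)]
```

A version tag such as `:1.4.2` is rejected as well, since any tag can be re-pointed at a different image after review.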

Inputs

· Approved checkpoint
· Inference handler code
· Base image

Outputs

· Signed OCI image
· Registry entry (staging→prod)
· SBOM

Metrics / SLOs

· Image size
· Cold-start time
· Vulnerability count
· Promotion latency

Artifacts

> OCI images
> Model cards
> Signed manifests

Tools

Docker · BentoML · MLflow Registry · Cosign · Syft

Phase_06 · Lane // OPS

Deployment & Serving

Roll out via canary, shadow, or blue/green. Auto-scale serving replicas based on traffic and latency SLOs.

Objective

Move the signed artifact to production safely, with traffic-shifting strategies that catch regressions before they hit 100% of users.

Key Activities

→Render Kubernetes / serverless manifests from the registry entry
→Roll out progressively: shadow → canary → 50/50 → full
→Auto-scale replicas on QPS, latency, and GPU memory
→Wire request/response logging for downstream monitoring
→Define and test rollback automation
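
Progressive rollout and automated rollback reduce to two small decisions: which variant serves a request, and whether to advance or retreat. The sketch below is hypothetical throughout (function names, the 5 / 50 / 100 schedule, the error-rate check) and leaves out the shadow stage, which mirrors traffic rather than splitting it. In practice a service mesh or a controller such as Argo Rollouts would own this logic.

```python
import hashlib


def route(request_id: str, canary_percent: float) -> str:
    """Deterministically assign a request to 'canary' or 'stable'.

    Hashing the ID (rather than drawing a random number) keeps a given
    ID on the same variant for as long as the percentage is unchanged.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < canary_percent * 100 else "stable"


def next_step(current_percent: float, error_rate: float, max_error_rate: float) -> float:
    """Advance canary -> 50/50 -> full, or drop to 0 (rollback) on a breach."""
    if error_rate > max_error_rate:
        return 0.0
    steps = [5.0, 50.0, 100.0]  # illustrative schedule
    return next((s for s in steps if s > current_percent), 100.0)
```

Calling `next_step` from a scheduled job with the canary's observed error rate exercises the rollback path on every evaluation, so it does not stay manual and untested.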

Common Pitfalls

!No shadow stage — first signal of breakage is user impact
!Autoscaler tuned for CPU on a GPU workload
!Rollback is manual and untested

Inputs

· Signed image
· Traffic policy
· Compute budget

Outputs

· Live endpoint
· Routing rules
· Rollout report

Metrics / SLOs

· p50 / p99 latency
· Error rate
· Cost / 1K predictions
· Rollback MTTR

Artifacts

> Endpoints
> Routing rules
> Autoscaler configs

Tools

KServe · Seldon Core · Triton · Istio · Argo Rollouts

Phase_07 · Lane // OPS

Monitoring & Drift Detection

Track latency, throughput, prediction distributions, and feature drift. Trigger alerts when SLOs slip.

Objective

Detect operational, statistical, and business regressions fast enough to act before customers or revenue are harmed.

Key Activities

→Emit RED metrics (rate, errors, duration) per model + version
→Compare live feature and prediction distributions vs. training baseline
→Detect concept drift via delayed-label evaluation
→Route alerts by severity to on-call with runbooks
→Track business KPIs alongside model metrics
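
Comparing a live distribution against the training baseline is commonly done with the Population Stability Index. The sketch below is a minimal standard-library version; the caller-supplied bin edges and the `eps` floor for empty bins are assumptions of this example, and monitoring tools such as Evidently ship their own variants.

```python
import math


def psi(baseline, live, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges.

    PSI = sum over bins of (live_b - base_b) * ln(live_b / base_b),
    where base_b and live_b are the share of each sample falling in bin b.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges at or below the value.
            counts[sum(v >= e for e in edges)] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    base, cur = shares(baseline), shares(live)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))
```

Deriving the edges once from the training data and storing them with the model gives every later comparison the same baseline, which addresses the "no baseline captured at training time" pitfall.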

Common Pitfalls

!Alerting on raw drift without business context (alarm fatigue)
!No baseline captured at training time
!Ground-truth pipeline lags so far that drift is detected too late

Inputs

· Inference logs
· Ground-truth labels (delayed)
· Training baselines

Outputs

· Dashboards
· Drift reports
· Pages / tickets

Metrics / SLOs

· PSI / KL drift
· Live accuracy
· SLO burn rate
· Alert precision
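
Two of these metrics are simple ratios. The sketch assumes request-based SLOs, where burn rate is the observed bad-event ratio divided by the error budget (1 minus the SLO target); the function names are illustrative.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (bad_events / total_events) / error_budget


def alert_precision(true_alerts: int, total_alerts: int) -> float:
    """Share of fired alerts that pointed at a real problem."""
    return true_alerts / total_alerts if total_alerts else 0.0
```

With a 99.9% target, 10 failures in 10,000 requests gives a burn rate of about 1, and 100 failures in the same window gives about 10, meaning the budget would be exhausted in a tenth of the SLO period.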

Artifacts

> Time-series metrics
> Drift reports
> Alert routes

Tools

Prometheus · Grafana · Evidently · WhyLabs · Arize

Phase_08 · Lane // OPS

Retraining & Feedback

Close the loop. Schedule retraining on fresh data, capture user feedback, and promote winners automatically.

Objective

Keep the model honest as the world changes — automate the path from a drift alert to a re-trained, re-validated, re-deployed model.

Key Activities

→Trigger retraining on schedule, drift, or performance threshold
→Curate feedback datasets from labels, clicks, and human review
→Re-run the full train → eval → package pipeline
→Run champion/challenger experiments online
→Auto-promote winners; archive losers with full provenance
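
The three retraining triggers and a kill-switch can be sketched as plain functions. The policy fields, default thresholds, and function names below are hypothetical and would be tuned per model; an orchestrator such as Argo Workflows or Flyte would call logic like this and launch the pipeline when any trigger fires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainPolicy:
    """Illustrative thresholds; tune per model."""
    max_age_days: int = 30
    max_psi: float = 0.2
    min_live_accuracy: float = 0.90


def retrain_reasons(age_days: int, psi: float, live_accuracy: float,
                    policy: RetrainPolicy) -> list[str]:
    """Return every trigger that fired; an empty list means no retraining."""
    reasons = []
    if age_days >= policy.max_age_days:
        reasons.append("schedule")
    if psi > policy.max_psi:
        reasons.append("drift")
    if live_accuracy < policy.min_live_accuracy:
        reasons.append("performance")
    return reasons


def serving_version(champion: str, challenger: str, kill_switch: bool) -> str:
    """Kill-switch: when set, traffic falls back to the previous champion."""
    return champion if kill_switch else challenger
```

Recording the returned reasons alongside each new model version keeps the provenance of why a retrain happened, not only that it did.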

Common Pitfalls

!Feedback loops that reinforce the model's own bias
!Retraining on contaminated logs (model's own predictions as labels)
!No kill-switch when an auto-promoted model misbehaves

Inputs

· Drift signals
· Fresh labels
· Feedback streams

Outputs

· New model versions
· Experiment results
· Promotion decisions

Metrics / SLOs

· Retraining cadence
· Win rate vs. champion
· Time-from-drift-to-deploy
· Label cost

Artifacts

> Retraining triggers
> Feedback datasets
> Champion/challenger logs

Tools

Argo Workflows · Kubeflow Pipelines · Flyte · Metaflow