Aegis // MLOps

Lifecycle // 08 Stages

The Production ML Lifecycle

The lifecycle runs from a raw partition in object storage to a continuously retrained model serving millions of inferences per second. Each stage has its own contracts, artifacts, and failure modes; each is broken down below by objective, key activities, inputs and outputs, metrics, and the pitfalls teams actually hit.

Stage Map

Linear · Cyclical · Always-on
Phase_01 · Lane // DATA

Data Ingestion & Versioning

Pull from streaming and batch sources. Snapshot, hash, and version every dataset for full reproducibility.

Objective

Produce an immutable, auditable record of every byte that enters the ML system, so any model can be re-trained from its exact source data months or years later.

Key Activities
  • Schedule batch pulls (S3, warehouses) and tail streams (Kafka, Kinesis)
  • Enforce schema contracts at the ingestion boundary
  • Snapshot raw partitions to immutable object storage by date / event-time
  • Hash and version each dataset; record lineage in a catalog
  • Quarantine bad rows; emit ingestion telemetry
Common Pitfalls
  • Silent schema drift from upstream producers
  • Mutating raw data in place (loses reproducibility)
  • No PII classification at the boundary
Inputs
  • Operational DBs (CDC)
  • Event streams
  • 3rd-party APIs
  • Manual uploads
Outputs
  • Bronze tables
  • Versioned snapshots
  • Lineage edges
  • Data quality reports
Metrics / SLOs
  • Freshness lag
  • Row-count delta
  • Schema-violation rate
  • Ingest SLA hits
Artifacts
  • Raw partitions
  • Schema contracts
  • DVC manifests
Tools
Airflow · dbt · DVC · Kafka · LakeFS
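
The schema-contract and hash-and-version activities in this stage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how DVC or LakeFS compute versions: the `SCHEMA` contract, its field names, and both helper functions are hypothetical, and a real pipeline would stream files from object storage instead of holding bytes in memory.

```python
import hashlib

# Hypothetical schema contract: field name -> required Python type.
SCHEMA = {"user_id": int, "amount": float, "event_time": str}


def split_by_contract(rows, schema=SCHEMA):
    """Return (accepted, quarantined); bad rows are set aside, never silently dropped."""
    accepted, quarantined = [], []
    for row in rows:
        conforms = set(row) == set(schema) and all(
            isinstance(row[name], expected) for name, expected in schema.items()
        )
        (accepted if conforms else quarantined).append(row)
    return accepted, quarantined


def dataset_version(files):
    """Derive one version ID from a mapping of relative path -> file bytes.

    Paths are visited in sorted order so the ID does not depend on listing order.
    """
    digest = hashlib.sha256()
    for path in sorted(files):
        file_hash = hashlib.sha256(files[path]).hexdigest()
        digest.update(f"{path}\0{file_hash}\n".encode("utf-8"))
    return digest.hexdigest()


# Illustrative snapshot: two partition files for one event date.
snapshot = {"2024-01-01/part-0.csv": b"1,9.5\n", "2024-01-01/part-1.csv": b"2,3.0\n"}
print(dataset_version(snapshot)[:12])
```

Because the version ID is derived only from paths and content, changing any byte of any file yields a new ID, which is what makes in-place mutation of raw data detectable.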
Phase_02 · Lane // DATA

Feature Engineering

Transform raw signals into model-ready features. Centralize in a feature store to eliminate train/serve skew.

Objective

Compute, store, and serve features through a single source of truth so training and online inference always see the same logic.

Key Activities
  • Author feature definitions as code, versioned in Git
  • Backfill historical feature values for training
  • Materialize features online for low-latency serving
  • Compute point-in-time joins to prevent label leakage
  • Document ownership, freshness, and SLAs per feature view
Common Pitfalls
  • Different code paths for train vs. serve
  • Future leakage from non-point-in-time joins
  • Unbounded cardinality blowing up the online store
Inputs
  • Bronze / silver tables
  • Streaming events
  • Embeddings
  • External enrichments
Outputs
  • Offline training tables
  • Online key-value features
  • Feature documentation
Metrics / SLOs
  • Train/serve skew %
  • Feature freshness
  • Online lookup p99
  • Coverage / null rate
Artifacts
  • Feature views
  • Backfills
  • Online materializations
Tools
Feast · Tecton · Spark · Pandas · Flink
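
A point-in-time join, the activity that guards against future leakage, can be illustrated without any feature-store library. The sketch below is plain Python with hypothetical names and integer timestamps; it is not the Feast or Tecton API. For each label it picks the most recent feature value observed at or before the label's timestamp. Treating a feature written at exactly the label time as visible is an assumption, and some systems use a strict inequality instead.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(labels, feature_rows):
    """Attach to each label the latest feature value known at label time.

    labels:       iterable of (entity_id, label_ts, label)
    feature_rows: iterable of (entity_id, feature_ts, value)
    Returns a list of (entity_id, label_ts, value_or_None, label).
    """
    history = defaultdict(list)
    for entity_id, feature_ts, value in feature_rows:
        history[entity_id].append((feature_ts, value))
    for rows in history.values():
        rows.sort(key=lambda row: row[0])  # order each entity's history by time

    joined = []
    for entity_id, label_ts, label in labels:
        rows = history.get(entity_id, [])
        timestamps = [ts for ts, _ in rows]
        idx = bisect_right(timestamps, label_ts)  # count of rows with ts <= label_ts
        value = rows[idx - 1][1] if idx > 0 else None
        joined.append((entity_id, label_ts, value, label))
    return joined


features = [("u1", 10, 1.0), ("u1", 20, 2.0), ("u1", 30, 3.0)]
labels = [("u1", 25, 1), ("u1", 5, 0)]
for row in point_in_time_join(labels, features):
    print(row)
```

The label at time 25 sees the value written at time 20 and never the one written at time 30; the label at time 5 has no prior feature and gets `None`, which feeds the coverage / null-rate metric.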
Phase_03 · Lane // MODEL

Training & Experimentation

Run distributed training with hyperparameter sweeps. Log every run with its metrics, parameters, and artifacts.

Objective

Run reproducible experiments at scale, with full provenance from a Git SHA + data version to a candidate model.

Key Activities
  • Define experiments declaratively (config + code SHA + data hash)
  • Launch distributed training on GPUs/TPUs
  • Run hyperparameter sweeps (Bayesian / population-based)
  • Log metrics, params, system stats, and checkpoints to a tracking server
  • Compare runs and tag promotion candidates
Common Pitfalls
  • Non-deterministic training (no fixed seeds, unpinned CUDA versions)
  • Hidden config drift between local and cluster runs
  • Tracking only the winners and losing negative results
Inputs
  • Feature snapshots
  • Labels
  • Base models / weights
  • Compute quotas
Outputs
  • Trained checkpoints
  • Run metadata
  • Leaderboards
Metrics / SLOs
  • Eval loss / accuracy
  • Time-to-train
  • Cost / run
  • GPU utilization
Artifacts
  • Run logs
  • Checkpoints
  • Hyperparameter trials
Tools
MLflow · Weights & Biases · Ray Train · Optuna · PyTorch Lightning
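
One way to make "config + code SHA + data hash" concrete is to derive a run ID from exactly those three inputs, so that two runs with the same ID are expected to be reproductions of each other. The following is a minimal sketch using only the standard library; `run_fingerprint` and `toy_train` are hypothetical helpers, the SHA and hash values are placeholders, and a tracking server such as MLflow would store these fields as run tags in its own way.

```python
import hashlib
import json
import random


def run_fingerprint(config, code_sha, data_hash):
    """Stable ID for an experiment; insensitive to key order in the config dict."""
    payload = json.dumps(
        {"config": config, "code_sha": code_sha, "data_hash": data_hash},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def toy_train(config):
    """Stand-in for a real training loop: all randomness flows from one seeded RNG."""
    rng = random.Random(config["seed"])
    noise = sum(rng.gauss(0.0, 1.0) for _ in range(100)) / 100
    return config["lr"] * 10 + noise  # hypothetical "eval loss"


config = {"lr": 0.01, "seed": 7}
run_id = run_fingerprint(config, code_sha="abc1234", data_hash="deadbeef")  # placeholder values
print(run_id, toy_train(config) == toy_train(config))
```

Seeding a single RNG from the config addresses only one source of non-determinism; GPU kernels and library versions need to be pinned separately.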
Phase_04 · Lane // MODEL

Validation & Evaluation

Offline metrics, fairness audits, slice analysis, and shadow scoring against the current production model.

Objective

Decide whether a candidate is materially better than the incumbent — globally and on every business-critical slice — before any user sees it.

Key Activities
  • Compute headline metrics on holdout + temporal splits
  • Slice metrics by cohort (geo, device, tenant, sensitive attribute)
  • Run fairness, calibration, and robustness checks
  • Shadow-score against the live production model on a real traffic sample
  • Generate a model card and sign off via review gate
Common Pitfalls
  • Optimizing aggregate metrics while a key slice regresses
  • Eval set leaking into training
  • No threshold defined before the experiment runs
Inputs
  • Candidate model
  • Champion model
  • Eval datasets
  • Slice definitions
Outputs
  • Eval reports
  • Pass/fail gate decision
  • Model card
Metrics / SLOs
  • Lift vs. champion
  • Worst-slice delta
  • Calibration error
  • Robustness score
Artifacts
  • Eval reports
  • Slice metrics
  • Bias scorecards
Tools
Great Expectations · Deepchecks · TFX Evaluator · Fairlearn
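
The pass/fail gate can be expressed as a small function over the two metrics this stage names, lift vs. champion and worst-slice delta. The sketch below is one possible shape, not a prescribed one: the threshold values, slice names, and the `evaluate_gate` helper are hypothetical, metrics are assumed to be "higher is better", and both models are assumed to report the same slices plus an `overall` entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    """Fixed before the experiment runs, so the bar cannot move afterwards."""
    min_overall_lift: float = 0.005      # hypothetical value
    max_slice_regression: float = 0.01   # hypothetical value


def evaluate_gate(champion, challenger, thresholds):
    """Compare per-slice metrics (dict: slice name -> score) and decide the gate."""
    lift = challenger["overall"] - champion["overall"]
    slice_deltas = {
        name: challenger[name] - champion[name]
        for name in champion
        if name != "overall"
    }
    worst_slice, worst_delta = min(slice_deltas.items(), key=lambda item: item[1])
    passed = (
        lift >= thresholds.min_overall_lift
        and worst_delta >= -thresholds.max_slice_regression
    )
    return {
        "passed": passed,
        "lift": lift,
        "worst_slice": worst_slice,
        "worst_delta": worst_delta,
    }


champion = {"overall": 0.90, "geo_eu": 0.88, "mobile": 0.91}
challenger = {"overall": 0.92, "geo_eu": 0.85, "mobile": 0.93}
print(evaluate_gate(champion, challenger, GateThresholds()))
```

In this example the challenger improves the headline metric but is rejected because the `geo_eu` slice regresses by more than the allowed margin, which is the first pitfall listed for this stage.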
Phase_05 · Lane // MODEL

Packaging & Registry

Containerize the model with its inference contract. Promote through staging tiers in a model registry.

Objective

Turn a checkpoint into a portable, signed, version-pinned artifact with a stable inference API and clear promotion lineage.

Key Activities
  • Freeze dependencies and build a reproducible container
  • Wrap weights with a typed inference handler (predict / explain / health)
  • Generate SBOM and scan for CVEs
  • Sign the image; push to OCI + model registry with stage tags
  • Attach model card, eval report, and approval metadata
Common Pitfalls
  • Mutable "latest" tags instead of immutable digests
  • Packaging weights separately from the handler that expects them
  • Unsigned images promoted to prod
Inputs
  • Approved checkpoint
  • Inference handler code
  • Base image
Outputs
  • Signed OCI image
  • Registry entry (staging→prod)
  • SBOM
Metrics / SLOs
  • Image size
  • Cold-start time
  • Vulnerability count
  • Promotion latency
Artifacts
  • OCI images
  • Model cards
  • Signed manifests
Tools
Docker · BentoML · MLflow Registry · Cosign · Syft
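
A typed inference handler with predict / explain / health methods might look like the sketch below. It is an illustration only: the linear model, the class and function names, and the digest check are hypothetical and are not the BentoML or MLflow interface. The constructor refuses to start if the weights do not match the digest recorded at packaging time, which is one way to avoid shipping a handler alongside weights it was not built for.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Sequence


def weights_digest(weights: Sequence[float], bias: float) -> str:
    """Content hash of the parameters, recorded when the artifact is packaged."""
    payload = json.dumps({"weights": list(weights), "bias": bias}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Prediction:
    score: float
    model_version: str


class LinearHandler:
    """Hypothetical handler wrapping a linear model behind a stable contract."""

    def __init__(self, weights, bias, version, expected_digest):
        if weights_digest(weights, bias) != expected_digest:
            raise ValueError("weights do not match the digest pinned at packaging time")
        self._weights = list(weights)
        self._bias = bias
        self._version = version

    def predict(self, features: Sequence[float]) -> Prediction:
        if len(features) != len(self._weights):
            raise ValueError("feature vector has the wrong length")
        score = sum(w * x for w, x in zip(self._weights, features)) + self._bias
        return Prediction(score=score, model_version=self._version)

    def explain(self, features: Sequence[float]) -> List[float]:
        """Per-feature contribution to the score (exact for a linear model)."""
        return [w * x for w, x in zip(self._weights, features)]

    def health(self) -> Dict[str, str]:
        return {"status": "ok", "model_version": self._version}


weights, bias = [0.5, -0.25], 0.1
handler = LinearHandler(weights, bias, "1.4.0", weights_digest(weights, bias))
print(handler.predict([1.0, 2.0]), handler.health())
```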
Phase_06 · Lane // OPS

Deployment & Serving

Roll out via canary, shadow, or blue/green. Auto-scale serving replicas based on traffic and latency SLOs.

Objective

Move the signed artifact to production safely, with traffic-shifting strategies that catch regressions before they hit 100% of users.

Key Activities
  • Render Kubernetes / serverless manifests from the registry entry
  • Roll out progressively: shadow → canary → 50/50 → full
  • Auto-scale replicas on QPS, latency, and GPU memory
  • Wire request/response logging for downstream monitoring
  • Define and test rollback automation
Common Pitfalls
  • No shadow stage, so the first signal of breakage is user impact
  • Autoscaler tuned for CPU on a GPU workload
  • Rollback is manual and untested
Inputs
  • Signed image
  • Traffic policy
  • Compute budget
Outputs
  • Live endpoint
  • Routing rules
  • Rollout report
Metrics / SLOs
  • p50 / p99 latency
  • Error rate
  • Cost / 1K predictions
  • Rollback MTTR
Artifacts
  • Endpoints
  • Routing rules
  • Autoscaler configs
Tools
KServe · Seldon Core · Triton · Istio · Argo Rollouts
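
The progressive rollout (shadow → canary → 50/50 → full) with automated rollback can be modelled as a loop over traffic stages that stops at the first unhealthy one. This is a simplified sketch, not an Argo Rollouts or Istio configuration: the 5% canary share, the SLO limits, and the `observe` callback (which stands in for a metrics query) are all assumptions.

```python
# Stage name and the share of live traffic answered by the new version.
# Shadow mirrors requests but serves 0% of responses; the 5% canary is an assumption.
STAGES = [("shadow", 0), ("canary", 5), ("half", 50), ("full", 100)]


def run_rollout(observe, max_error_rate=0.01, max_p99_ms=200.0):
    """Advance through STAGES while SLOs hold; roll back on the first breach.

    observe(stage_name, traffic_pct) must return a dict with
    'error_rate' and 'p99_ms' measured for the new version at that stage.
    """
    history = []
    for name, traffic_pct in STAGES:
        stats = observe(name, traffic_pct)
        healthy = (
            stats["error_rate"] <= max_error_rate
            and stats["p99_ms"] <= max_p99_ms
        )
        history.append((name, traffic_pct, healthy))
        if not healthy:
            return {
                "status": "rolled_back",
                "failed_stage": name,
                "live_traffic_pct": 0,
                "history": history,
            }
    return {
        "status": "promoted",
        "failed_stage": None,
        "live_traffic_pct": 100,
        "history": history,
    }


def fake_observe(stage_name, traffic_pct):
    """Toy metrics source: latency degrades once real traffic reaches 50%."""
    return {"error_rate": 0.001, "p99_ms": 120.0 if traffic_pct < 50 else 350.0}


print(run_rollout(fake_observe))
```

Keeping the rollback path inside the same function as the promotion path means every rollout exercises it, which helps with the "manual and untested" rollback pitfall.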
Phase_07 · Lane // OPS

Monitoring & Drift Detection

Track latency, throughput, prediction distributions, and feature drift. Trigger alerts when SLOs slip.

Objective

Detect operational, statistical, and business regressions fast enough to act before customers or revenue are harmed.

Key Activities
  • Emit RED metrics (rate, errors, duration) per model + version
  • Compare live feature and prediction distributions vs. training baseline
  • Detect concept drift via delayed-label evaluation
  • Route alerts by severity to on-call with runbooks
  • Track business KPIs alongside model metrics
Common Pitfalls
  • Alerting on raw drift without business context (alarm fatigue)
  • No baseline captured at training time
  • Ground-truth pipeline lags so far that drift is detected too late
Inputs
  • Inference logs
  • Ground-truth labels (delayed)
  • Training baselines
Outputs
  • Dashboards
  • Drift reports
  • Pages / tickets
Metrics / SLOs
  • PSI / KL drift
  • Live accuracy
  • SLO burn rate
  • Alert precision
Artifacts
  • Time-series metrics
  • Drift reports
  • Alert routes
Tools
Prometheus · Grafana · Evidently · WhyLabs · Arize
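
The PSI metric listed for this stage compares the binned distribution of a feature at training time with its live distribution: PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the baseline and live fractions in bin i. The sketch below is a plain-Python version with hypothetical names and hand-picked bin edges; tools such as Evidently compute it with their own binning rules. Empty bins are floored at a small epsilon so the logarithm stays defined, which is a common convention but still an assumption here.

```python
import math
from bisect import bisect_right


def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values per bin; len(edges) + 1 bins, empty bins floored at eps."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect_right(edges, value)] += 1
    total = len(values)
    return [max(count / total, eps) for count in counts]


def psi(baseline, live, edges):
    """Population Stability Index between a training baseline and live values."""
    expected = bin_fractions(baseline, edges)
    actual = bin_fractions(live, edges)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


edges = [0.25, 0.5, 0.75]                  # hypothetical bin edges
baseline = [i / 100 for i in range(100)]   # roughly uniform training-time feature
shifted = [0.9] * 100                      # live traffic concentrated in the top bin
print(psi(baseline, baseline, edges), psi(baseline, shifted, edges) > 0.25)
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the alert threshold is better set per feature and tied to business impact, in line with the alarm-fatigue pitfall above.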
Phase_08 · Lane // OPS

Retraining & Feedback

Close the loop. Schedule retraining on fresh data, capture user feedback, and promote winners automatically.

Objective

Keep the model honest as the world changes — automate the path from a drift alert to a re-trained, re-validated, re-deployed model.

Key Activities
  • Trigger retraining on schedule, drift, or performance threshold
  • Curate feedback datasets from labels, clicks, and human review
  • Re-run the full train → eval → package pipeline
  • Run champion/challenger experiments online
  • Auto-promote winners; archive losers with full provenance
Common Pitfalls
  • Feedback loops that reinforce the model's own bias
  • Retraining on contaminated logs (model's own predictions as labels)
  • No kill-switch when an auto-promoted model misbehaves
Inputs
  • Drift signals
  • Fresh labels
  • Feedback streams
Outputs
  • New model versions
  • Experiment results
  • Promotion decisions
Metrics / SLOs
  • Retraining cadence
  • Win rate vs. champion
  • Time-from-drift-to-deploy
  • Label cost
Artifacts
  • Retraining triggers
  • Feedback datasets
  • Champion/challenger logs
Tools
Argo Workflows · Kubeflow Pipelines · Flyte · Metaflow
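
The three retraining triggers (schedule, drift, performance threshold) and the kill-switch can be combined into a small policy object. This is a sketch under assumed names and threshold values, not the configuration format of any orchestrator listed above; in practice the inputs would come from the monitoring stage and the decision would start a pipeline run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainPolicy:
    max_age_days: int = 30            # hypothetical schedule
    max_psi: float = 0.25             # hypothetical drift threshold
    min_live_accuracy: float = 0.90   # hypothetical performance floor


def retrain_reasons(age_days, psi, live_accuracy, policy=RetrainPolicy()):
    """Return every trigger that fired; an empty list means no retraining."""
    reasons = []
    if age_days >= policy.max_age_days:
        reasons.append("schedule")
    if psi > policy.max_psi:
        reasons.append("drift")
    # live_accuracy is None while delayed ground-truth labels have not arrived.
    if live_accuracy is not None and live_accuracy < policy.min_live_accuracy:
        reasons.append("performance")
    return reasons


def promotion_decision(challenger_metric, champion_metric, kill_switch_engaged):
    """Auto-promote only when the challenger wins and no human has pulled the brake."""
    if kill_switch_engaged:
        return "hold"
    return "promote" if challenger_metric > champion_metric else "archive"


print(retrain_reasons(age_days=12, psi=0.31, live_accuracy=None))
print(promotion_decision(0.93, 0.91, kill_switch_engaged=False))
```

Returning the list of reasons, not just a boolean, keeps the provenance of each retraining run, so a later audit can tell a scheduled refresh from a drift-driven one.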