Data Ingestion & Versioning
Pull from streaming and batch sources. Snapshot, hash, and version every dataset for full reproducibility.
Produce an immutable, auditable record of every byte that enters the ML system, so any model can be re-trained from its exact source data months or years later.
- → Schedule batch pulls (S3, warehouses) and tail streams (Kafka, Kinesis)
- → Enforce schema contracts at the ingestion boundary
- → Snapshot raw partitions to immutable object storage by date / event-time
- → Hash and version each dataset; record lineage in a catalog
- → Quarantine bad rows; emit ingestion telemetry
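
The contract-enforcement and quarantine steps above could be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `CONTRACT` fields, `IngestResult`, `check_row`, and `ingest` are hypothetical names, and the exact-type check is a simplifying assumption (a real contract would likely allow nullable fields and coercions).

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Hypothetical contract: field name -> exact Python type required at the boundary.
CONTRACT: dict[str, type] = {"user_id": int, "event": str, "ts": float}


@dataclass
class IngestResult:
    accepted: list[dict[str, Any]] = field(default_factory=list)
    # Each quarantined entry keeps the original row plus the reason it failed.
    quarantined: list[tuple[dict[str, Any], str]] = field(default_factory=list)

    @property
    def violation_rate(self) -> float:
        total = len(self.accepted) + len(self.quarantined)
        return len(self.quarantined) / total if total else 0.0


def check_row(row: dict[str, Any], contract: dict[str, type]) -> Optional[str]:
    """Return None if the row satisfies the contract, else a reason string."""
    missing = contract.keys() - row.keys()
    if missing:
        return f"missing fields: {sorted(missing)}"
    extra = row.keys() - contract.keys()
    if extra:
        # Unexpected fields are treated as drift rather than silently dropped.
        return f"unexpected fields: {sorted(extra)}"
    for name, expected in contract.items():
        actual = type(row[name])
        if actual is not expected:
            return f"field {name!r}: expected {expected.__name__}, got {actual.__name__}"
    return None


def ingest(rows: list[dict[str, Any]], contract: dict[str, type] = CONTRACT) -> IngestResult:
    """Split rows into accepted and quarantined without mutating any of them."""
    result = IngestResult()
    for row in rows:
        reason = check_row(row, contract)
        if reason is None:
            result.accepted.append(row)
        else:
            result.quarantined.append((row, reason))
    return result
```

Quarantined rows are kept alongside their failure reason instead of being dropped, so the violation rate can be emitted as telemetry and the rows replayed once the upstream producer is fixed.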
- ! Silent schema drift from upstream producers
- ! Mutating raw data in place (loses reproducibility)
- ! No PII classification at the boundary
- · Operational DBs (CDC)
- · Event streams
- · 3rd-party APIs
- · Manual uploads
- · Bronze tables
- · Versioned snapshots
- · Lineage edges
- · Data quality reports
- · Freshness lag
- · Row-count delta
- · Schema-violation rate
- · Ingest SLA hits
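
One way the metrics above might be computed per ingestion run is sketched below. The definitions are assumptions, since the source names the metrics without defining them: freshness lag is taken as wall-clock time minus the newest event time, row-count delta as the relative change against the previous snapshot, and an SLA "hit" as a run whose lag stays within a hypothetical threshold.

```python
from datetime import datetime, timezone
from typing import Optional


def freshness_lag_seconds(latest_event_time: datetime, now: datetime) -> float:
    """Seconds between the newest ingested event and the current time."""
    return (now - latest_event_time).total_seconds()


def row_count_delta(current: int, previous: int) -> Optional[float]:
    """Relative change in row count versus the previous snapshot.

    Returns None when there is no previous baseline to compare against.
    """
    if previous == 0:
        return None
    return (current - previous) / previous


def schema_violation_rate(quarantined: int, total: int) -> float:
    """Fraction of rows that failed the schema contract."""
    return quarantined / total if total else 0.0


def sla_met(lag_seconds: float, sla_seconds: float) -> bool:
    """True when the run's freshness lag is within the (assumed) SLA threshold."""
    return lag_seconds <= sla_seconds


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    latest = datetime(2024, 1, 1, 11, 45, tzinfo=timezone.utc)
    lag = freshness_lag_seconds(latest, now)
    print(lag, row_count_delta(110, 100), sla_met(lag, sla_seconds=1800))
```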
- > Raw partitions
- > Schema contracts
- > DVC manifests
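
To illustrate the hash-and-version idea behind these artifacts, the sketch below builds a content-addressed manifest for a snapshot directory. It is a hand-rolled illustration, not DVC's manifest format: `file_sha256`, `build_manifest`, and the `version` / `files` keys are hypothetical names, and a local directory stands in for immutable object storage.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large partitions do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(snapshot_dir: Path) -> dict:
    """Map every file under snapshot_dir to its hash and derive a dataset version.

    The version is the SHA-256 of the canonical JSON of the file map, so it
    changes whenever any file's path or content changes.
    """
    files = {
        p.relative_to(snapshot_dir).as_posix(): file_sha256(p)
        for p in sorted(snapshot_dir.rglob("*"))
        if p.is_file()
    }
    canonical = json.dumps(files, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {"version": hashlib.sha256(canonical).hexdigest(), "files": files}
```

Because the version depends only on relative paths and file contents, re-running the function on an untouched snapshot yields the same identifier, while any in-place mutation of raw data shows up as a new version.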