Case Study — Product Data Scientist Portfolio
Data products fail when infrastructure does not scale with organizational growth. Startups, growth-stage companies, and enterprises need different architectures.
Built a simulator that dynamically reconfigures the data architecture (storage, processing, modeling, deployment, monitoring) based on organizational scale, and outputs estimated latency, monthly cost, reliability, and complexity.
Three presets (Startup, Growth, Enterprise) with layered architecture diagram and key metrics. Demonstrates evolution from Postgres + cron to Kafka + K8s.
Choose an architecture tier based on traffic volume, experiment velocity, and reliability requirements.
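The tier-selection logic can be sketched in Python. Everything here is illustrative: the thresholds, preset metric values, and function names are assumptions for demonstration, not figures from the actual simulator.

```python
from dataclasses import dataclass

@dataclass
class TierMetrics:
    stack: str             # representative storage/processing choices
    latency_ms: int        # end-to-end pipeline latency estimate
    monthly_cost_usd: int
    reliability: float     # fraction of successful runs
    complexity: int        # 1 (simple) to 5 (complex)

# Illustrative preset values; the real simulator's numbers may differ.
PRESETS = {
    "startup":    TierMetrics("Postgres + cron",     5000,   500, 0.95,  1),
    "growth":     TierMetrics("Warehouse + Airflow", 1000,  5000, 0.99,  3),
    "enterprise": TierMetrics("Kafka + K8s",          100, 50000, 0.999, 5),
}

def choose_tier(daily_events: int, experiments_per_week: int,
                target_reliability: float) -> str:
    """Pick the simplest tier that satisfies all three requirements."""
    if daily_events > 10_000_000 or target_reliability > 0.99:
        return "enterprise"
    if daily_events > 100_000 or experiments_per_week > 5:
        return "growth"
    return "startup"

print(choose_tier(50_000, 2, 0.95))      # small workload -> startup
print(choose_tier(1_000_000, 10, 0.99))  # higher scale/velocity -> growth
```

The point of the sketch is that tier choice is a pure function of a few observable inputs, which is what makes the simulator's reconfiguration deterministic and testable.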
| Time Horizon | Milestone | Success Criteria |
|---|---|---|
| 0–30 days | Current state documented | Inventory of pipelines, tables, and SLAs; pain points identified |
| 30–60 days | Target architecture designed | Tool choices justified; cost and latency projections |
| 60–90 days | Migration roadmap | Phased rollout plan; success metrics per phase |
Success: Pipeline latency within SLA; data freshness meets product needs; cost within budget; zero data loss incidents.
Tripwire (anti-success): Repeated pipeline failures; model staleness; runaway cloud costs.
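The tripwire conditions above can be made mechanical. A minimal sketch, assuming a 10% failure-rate threshold and a budget cap (both hypothetical parameters):

```python
def tripwire(recent_statuses: list[str], monthly_cost_usd: float,
             budget_usd: float, max_failure_rate: float = 0.1) -> list[str]:
    """Return the list of tripped anti-success conditions."""
    tripped = []
    failures = sum(1 for s in recent_statuses if s != "success")
    if recent_statuses and failures / len(recent_statuses) > max_failure_rate:
        tripped.append("repeated pipeline failures")
    if monthly_cost_usd > budget_usd:
        tripped.append("runaway cloud costs")
    return tripped

# 2 failures in 10 runs (20%) plus cost over budget trips both conditions.
print(tripwire(["success"] * 8 + ["failure"] * 2, 6200.0, 5000.0))
```

Encoding tripwires as code lets them run on a schedule rather than relying on someone noticing a dashboard.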
| Event Name | Required Properties | Notes |
|---|---|---|
| pipeline_run_start | pipeline_id, run_id, timestamp | Orchestration trigger |
| pipeline_run_end | pipeline_id, run_id, status, duration_sec | Success/failure |
| data_freshness | table_name, last_updated, expected_freshness_hours | SLA monitoring |
Staging: One stg_* model per raw source table (APIs, DBs, events), applying deduplication and type casting.
Marts: fct_* and dim_* for business logic. Incremental models where appropriate.
Tests: dbt unique, not_null, relationships, and accepted_values tests. Freshness checks on critical sources.
Documentation: dbt docs; lineage; column descriptions; README per model.
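The test and freshness conventions above map directly onto a dbt `schema.yml`. A minimal sketch; the model, source, and column names (`fct_orders`, `dim_customers`, `raw.orders`, `_loaded_at`) are hypothetical, but the test types match those named above:

```yaml
version: 2
models:
  - name: fct_orders
    description: "One row per order."
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
sources:
  - name: raw
    tables:
      - name: orders
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
```

`dbt test` runs the column tests and `dbt source freshness` enforces the source SLA, so both feed the same monitoring the event schema describes.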