Healthcare Analytics Platform
ELT pipeline with data quality governance on Snowflake
The Problem
Healthcare cost transparency requires trustworthy data. CMS Medicare data has quality issues - duplicates, orphaned records, business rule violations. I built a platform that processes 160K+ records with a data quality framework that tracks every issue without losing a single row.
Architecture
CSV Files ──▶ AWS S3 (Bronze) ──▶ Snowflake External Stage ──▶ Raw Schema
│
dbt Staging (views, cleaning) ◄─┘
│
dbt Intermediate (views, business logic)
│
dbt Marts (tables, star schema) ──▶ Tableau Dashboards
│
dbt Tests (58+) + Great Expectations (5 suites)
│
GitHub Actions CI/CD (dev → prod)Key Technical Decisions
ELT over ETL
Transform in Snowflake leverages its compute elasticity. Raw data lands first, transformations are version-controlled in dbt.
Data Quality Flags over Rejection
Preserve every row. Flag issues (covered charges < payments, outliers) without altering original values. Analysts can filter by flags.
Views for Silver, Tables for Gold
Staging/intermediate as views: always fresh, zero storage cost. Marts as tables: pre-aggregated for BI performance.
SCD Type 2 for Hospitals
Hospital attributes (ownership, ratings) change over time. dbt snapshots preserve full history for longitudinal analysis.
Results & Impact
- ▸Built medallion data lake (Bronze/Silver/Gold) on AWS S3 with Snowflake external stages for automated ingestion
- ▸Designed star schema with 4 dimensions and 4 fact tables, processing 166K+ rows in the final dimensional model
- ▸Achieved 97.7% test pass rate across 58+ dbt tests and 88.9% Great Expectations validation across 5 suites
- ▸Implemented SCD Type 2 via dbt snapshots for hospital historical tracking - enabling longitudinal trend analysis
- ▸Created data quality flagging system tracking 176+ business rule violations without any data loss
- ▸Automated CI/CD pipeline with GitHub Actions (dev/prod branch strategy), reducing manual testing effort by 80%+
- ▸Built Tableau dashboards with interactive KPIs, geographic visualizations, and custom SQL data sources for stakeholder self-service