Data Engineering / Analytics

Healthcare Analytics Platform

ELT pipeline with data quality governance on Snowflake

  • 160K+ Records Processed
  • 97.7% Test Pass Rate
  • 58+ Automated Tests
  • 14 dbt Models
  • 8 Star Schema Tables
  • 5 GX Validation Suites

The Problem

Healthcare cost transparency requires trustworthy data. CMS Medicare data has quality issues: duplicates, orphaned records, and business rule violations. I built a platform that processes 160K+ records with a data quality framework that tracks every issue without losing a single row.

Architecture

CSV Files ──▶ AWS S3 (Bronze) ──▶ Snowflake External Stage ──▶ Raw Schema
                                                                 │
                                  dbt Staging (views, cleaning) ◄─┘
                                              │
                           dbt Intermediate (views, business logic)
                                              │
                      dbt Marts (tables, star schema) ──▶ Tableau Dashboards
                                              │
                  dbt Tests (58+) + Great Expectations (5 suites)
                                              │
                       GitHub Actions CI/CD (dev → prod)
1. Source: 3 CMS CSV files (160K+ records)
2. Data Lake: AWS S3 (Bronze / Silver / Gold)
3. Warehouse: Snowflake (External Stage → Raw)
4. Transform: dbt (Staging → Intermediate → Marts)
5. Quality: dbt Tests (58+) + Great Expectations (5 suites)
6. BI: Tableau (Custom SQL dashboards)
7. CI/CD: GitHub Actions (dev / prod)
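The ingestion step (S3 → Snowflake external stage → raw schema) follows the standard Snowflake pattern. A minimal sketch — the bucket, stage, integration, and table names here are assumptions for illustration:

```sql
-- Illustrative only: bucket path, integration, and table names are hypothetical.
CREATE STAGE raw.cms_bronze_stage
  URL = 's3://healthcare-bronze/cms/'
  STORAGE_INTEGRATION = s3_int
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');

-- Land the CSV into the raw schema untouched; transformations happen later in dbt.
COPY INTO raw.medicare_charges
  FROM @raw.cms_bronze_stage/medicare_charges.csv
  ON_ERROR = 'CONTINUE';
```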

Key Technical Decisions

ELT over ETL

Transforming inside Snowflake leverages its compute elasticity: raw data lands first, and every transformation is version-controlled in dbt.
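A staging model in this pattern is a thin, typed view over the raw source. A sketch of what one could look like — the source, column, and model names are assumptions, not the project's actual code:

```sql
-- models/staging/stg_medicare_charges.sql (illustrative; names assumed)
with source as (
    select * from {{ source('cms_raw', 'medicare_charges') }}
)
select
    provider_id,
    trim(upper(provider_state))                        as provider_state,
    cast(average_covered_charges as number(12, 2))     as avg_covered_charges,
    cast(average_total_payments  as number(12, 2))     as avg_total_payments
from source
```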

Data Quality Flags over Rejection

Preserve every row. Flag issues (covered charges < payments, outliers) without altering original values. Analysts can filter by flags.
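The flagging pattern can be sketched as derived boolean columns alongside the untouched originals — column and model names below are hypothetical:

```sql
-- Illustrative flagging pattern: rows are marked, never dropped or altered.
select
    *,
    case when avg_covered_charges < avg_total_payments
         then 1 else 0 end                            as flag_charges_lt_payments,
    -- simple z-score-style outlier flag over the whole table
    case when avg_total_payments >
              avg(avg_total_payments) over () + 3 * stddev(avg_total_payments) over ()
         then 1 else 0 end                            as flag_payment_outlier
from {{ ref('stg_medicare_charges') }}
```

Analysts then decide per query whether to exclude flagged rows, rather than the pipeline deciding for them.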

Views for Silver, Tables for Gold

Staging/intermediate as views: always fresh, zero storage cost. Marts as tables: pre-aggregated for BI performance.
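In dbt this split is a one-line materialization config per folder. A sketch, assuming a project name of `healthcare_analytics`:

```yaml
# dbt_project.yml excerpt (project and folder names assumed)
models:
  healthcare_analytics:
    staging:
      +materialized: view    # always fresh, zero storage cost
    intermediate:
      +materialized: view
    marts:
      +materialized: table   # pre-aggregated for BI performance
```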

SCD Type 2 for Hospitals

Hospital attributes (ownership, ratings) change over time. dbt snapshots preserve full history for longitudinal analysis.
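dbt snapshots implement SCD Type 2 declaratively. A minimal sketch — the key, schema, and tracked columns are assumptions:

```sql
-- snapshots/hospitals_snapshot.sql (illustrative; column names assumed)
{% snapshot hospitals_snapshot %}
{{
    config(
        target_schema='snapshots',
        unique_key='provider_id',
        strategy='check',
        check_cols=['ownership_type', 'overall_rating']
    )
}}
select * from {{ ref('stg_hospitals') }}
{% endsnapshot %}
```

Each run, dbt closes out rows whose tracked columns changed (`dbt_valid_to`) and inserts the new version, preserving the full history for longitudinal queries.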

Results & Impact

  • Built medallion data lake (Bronze/Silver/Gold) on AWS S3 with Snowflake external stages for automated ingestion
  • Designed star schema with 4 dimensions and 4 fact tables, processing 166K+ rows in the final dimensional model
  • Achieved 97.7% test pass rate across 58+ dbt tests and 88.9% Great Expectations validation across 5 suites
  • Implemented SCD Type 2 via dbt snapshots for hospital historical tracking, enabling longitudinal trend analysis
  • Created data quality flagging system tracking 176+ business rule violations without any data loss
  • Automated CI/CD pipeline with GitHub Actions (dev/prod branch strategy), reducing manual testing effort by 80%+
  • Built Tableau dashboards with interactive KPIs, geographic visualizations, and custom SQL data sources for stakeholder self-service
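The dev/prod CI flow could be wired up roughly as follows — workflow structure, job names, and secrets are assumptions for illustration:

```yaml
# .github/workflows/ci.yml sketch (names and targets assumed)
name: dbt CI
on:
  pull_request:
    branches: [dev, prod]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-snowflake great_expectations
      # dbt build runs models, tests, and snapshots in dependency order
      - run: dbt deps && dbt build --target ci
        env:
          SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
```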

Technologies Used

Cloud Storage

AWS S3 (data lake)

Warehouse

Snowflake

ELT

dbt Core · dbt_utils

Data Quality

dbt Tests (58+) · Great Expectations

BI

Tableau · Power BI

CI/CD

GitHub Actions

Languages

SQL · Python · YAML