Data Engineering / BI

Enterprise Customer Analytics Platform

Customer lifecycle analytics with operational dashboards

  • 9 schema tables
  • 5K+ transactions processed
  • 4,268 customer records
  • 3 BI dashboards
  • 100% dashboard uptime
  • 4 transform modules

The Problem

Customer data is messy: duplicate records, multiple currencies, inconsistent timezones. I built a platform that deduplicates customers using fuzzy matching, normalizes financial data across currencies, and surfaces CLV, retention, and cohort insights through self-service dashboards.

Architecture

Raw Data ──▶ Pydantic Validation ──▶ Transformation Modules
                                     ┌──────────┼────────────┐
                                     │          │            │
                               Dedup (Fuzz)  Currency   Timezone
                                     │          │            │
                                     └──────────┴────────────┘
                                              │
                      PostgreSQL Star Schema (6 dims, 3 facts)
                                              │
                  Airflow Orchestration + GX Validation
                                              │
                  Metabase Dashboards (3) + Looker Studio
1. Source: Raw customer, transaction, session data
2. Ingestion: Python ETL + Pydantic validation
3. Transform: 4 modules (dedup, currency, timezone, quality)
4. Warehouse: PostgreSQL (6 dims + 3 facts)
5. Orchestration: Apache Airflow
6. BI: Metabase + Looker Studio
7. Infrastructure: Docker Compose (100% uptime)
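The currency and timezone modules in step 3 boil down to two small normalizations: convert every amount to a base currency and every timestamp to UTC. A minimal stdlib sketch (the rate table and function names are hypothetical; the real modules are not shown here):

```python
from datetime import datetime
from decimal import Decimal
from zoneinfo import ZoneInfo

# Hypothetical static rate table; a real module would load daily FX rates.
RATES_TO_USD = {"USD": Decimal("1.00"), "EUR": Decimal("1.08"), "GBP": Decimal("1.27")}

def to_usd(amount: Decimal, currency: str) -> Decimal:
    """Normalize an amount to USD, rounded to cents."""
    return (amount * RATES_TO_USD[currency]).quantize(Decimal("0.01"))

def to_utc(local_ts: str, tz_name: str) -> datetime:
    """Attach the source timezone to a naive timestamp and convert to UTC."""
    naive = datetime.fromisoformat(local_ts)
    return naive.replace(tzinfo=ZoneInfo(tz_name)).astimezone(ZoneInfo("UTC"))

print(to_usd(Decimal("100.00"), "EUR"))  # 108.00
print(to_utc("2024-03-01T09:30:00", "Europe/Berlin").isoformat())  # 2024-03-01T08:30:00+00:00
```

Doing money math in `Decimal` rather than `float` avoids rounding drift when amounts are later aggregated in the fact tables.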

Key Technical Decisions

PostgreSQL over Cloud DW

Right-sized for the data volume. Docker-deployable for portability. Full SQL power without cloud costs for this scale.

RapidFuzz for Deduplication

Fuzzy matching handles real-world customer name variations that exact matching misses. Configurable thresholds balance precision and recall.
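The core idea is token-sort similarity with a tunable threshold. The sketch below uses stdlib `difflib` as a stand-in for RapidFuzz's `token_sort_ratio` to show the mechanism; the scoring function and threshold default are illustrative, not the project's actual values:

```python
from difflib import SequenceMatcher

def token_sort_score(a: str, b: str) -> float:
    """Token-sort similarity (0-100), same idea as RapidFuzz's token_sort_ratio:
    sort the words so 'Smith, John' and 'John Smith' compare as equal."""
    norm = lambda s: " ".join(sorted(s.lower().replace(",", " ").split()))
    return 100 * SequenceMatcher(None, norm(a), norm(b)).ratio()

def dedupe(names: list[str], threshold: float = 90.0) -> list[str]:
    """Keep the first occurrence of each fuzzy-duplicate cluster."""
    kept: list[str] = []
    for name in names:
        if all(token_sort_score(name, k) < threshold for k in kept):
            kept.append(name)
    return kept

print(dedupe(["John Smith", "Smith, John", "Jon Smith", "Jane Doe"]))
# ['John Smith', 'Jane Doe']
```

Raising the threshold favors precision (fewer false merges); lowering it favors recall (fewer missed duplicates), which is the trade-off the configurable threshold exposes.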

Metabase over Tableau

Open-source, Docker-deployable, low barrier for stakeholder self-service. Embeddable for product analytics use cases.

Pydantic + Great Expectations

Pydantic for row-level schema validation at ingestion. Great Expectations for dataset-level quality assertions. Complementary layers.
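The two layers divide cleanly: Pydantic rejects individual malformed rows at ingestion, while dataset-level checks assert properties of the whole batch. A sketch with hypothetical field names; the dataset-level check is a stdlib stand-in written in the spirit of Great Expectations' expect_* assertions, not its actual API:

```python
from pydantic import BaseModel, ValidationError

# Row-level layer: each ingested record must parse into this schema.
class Transaction(BaseModel):
    transaction_id: str
    customer_id: str
    amount: float
    currency: str

def validate_rows(rows):
    """Split raw dicts into parsed models and per-row validation errors."""
    good, bad = [], []
    for row in rows:
        try:
            good.append(Transaction(**row))
        except ValidationError as exc:
            bad.append((row, str(exc)))
    return good, bad

# Dataset-level layer: a batch-wide assertion (GX stand-in).
def expect_unique(models, field):
    values = [getattr(m, field) for m in models]
    return len(values) == len(set(values))

good, bad = validate_rows([
    {"transaction_id": "t1", "customer_id": "c1", "amount": 9.99, "currency": "EUR"},
    {"transaction_id": "t2", "customer_id": "c1", "amount": "oops", "currency": "EUR"},
])
print(len(good), len(bad))                    # 1 1
print(expect_unique(good, "transaction_id"))  # True
```

Row-level validation catches type and shape errors one record at a time; only dataset-level checks can catch cross-row problems like duplicate keys or an empty batch.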

Results & Impact

  • Designed star schema with 6 dimension tables and 3 fact tables supporting customer behavior and revenue analysis
  • Implemented SCD Type 2 for the customer dimension, enabling historical journey analysis and cohort tracking across 4,268 records
  • Built 4 transformation modules: customer deduplication (RapidFuzz fuzzy matching), multi-currency conversion, timezone normalization, and data quality validation
  • Created aggregated data marts producing daily revenue summaries, customer lifetime value (CLV), cohort retention curves, and product performance KPIs
  • Deployed 3 Metabase dashboards with 100% availability using Docker Compose, with structured logging for pipeline health monitoring
  • Developed data quality framework with Great Expectations + Pydantic producing schema documentation and test coverage reports
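The SCD Type 2 customer dimension mentioned above preserves history by closing the current version of a row and appending a new one whenever a tracked attribute changes. A minimal in-memory sketch (column names like valid_from/valid_to/is_current are conventional SCD2 columns, assumed here rather than taken from the actual schema):

```python
from datetime import date

def scd2_upsert(dim_rows, key, new_attrs, today):
    """Close the current version of `key` if its attributes changed,
    then append a new current version effective `today`."""
    for row in dim_rows:
        if row["customer_id"] == key and row["is_current"]:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dim_rows  # nothing changed; keep the current version
            row["is_current"] = False
            row["valid_to"] = today
    dim_rows.append({"customer_id": key, **new_attrs,
                     "valid_from": today, "valid_to": None, "is_current": True})
    return dim_rows

dim = [{"customer_id": "c1", "segment": "trial",
        "valid_from": date(2024, 1, 1), "valid_to": None, "is_current": True}]
dim = scd2_upsert(dim, "c1", {"segment": "paid"}, date(2024, 3, 1))
print([(r["segment"], r["is_current"]) for r in dim])
# [('trial', False), ('paid', True)]
```

Because old versions are closed rather than overwritten, cohort and journey queries can join facts to the customer version that was current at the time of the event.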

Technologies Used

Orchestration

Apache Airflow

Database

PostgreSQL

Data Quality

Great Expectations, Pydantic

BI

Metabase, Looker Studio

Infrastructure

Docker Compose

Languages

Python, SQL

Libraries

RapidFuzz, pandas