Data Engineering / BI

Enterprise Customer Analytics Platform

Customer lifecycle analytics with operational dashboards

  • 9 schema tables
  • 5K+ transactions processed
  • 4,268 customer records
  • 3 BI dashboards
  • 100% dashboard uptime
  • 4 transform modules

The Problem

Customer data is messy: duplicate records, multiple currencies, inconsistent timezones. I built a platform that deduplicates customers using fuzzy matching, normalizes financial data across currencies, and surfaces CLV, retention, and cohort insights through self-service dashboards.

Architecture

Raw Data ──▶ Pydantic Validation ──▶ Transformation Modules
                                     ┌──────────┼────────────┐
                                     │          │            │
                               Dedup (Fuzz)  Currency   Timezone
                                     │          │            │
                                     └──────────┴────────────┘
                                              │
                      PostgreSQL Star Schema (6 dims, 3 facts)
                                              │
                  Airflow Orchestration + GX Validation
                                              │
                  Metabase Dashboards (3) + Looker Studio
1. Source: Raw customer, transaction, session data
2. Ingestion: Python ETL + Pydantic validation
3. Transform: 4 modules (dedup, currency, timezone, quality)
4. Warehouse: PostgreSQL (6 dims + 3 facts)
5. Orchestration: Apache Airflow
6. BI: Metabase + Looker Studio
7. Infrastructure: Docker Compose (100% uptime)
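The currency and timezone modules in step 3 boil down to two small normalizations: convert every amount to a base currency and every timestamp to UTC. A minimal stdlib sketch (the rate table and function names are hypothetical; the real modules are not shown here):

```python
from datetime import datetime
from decimal import Decimal
from zoneinfo import ZoneInfo

# Hypothetical static rate table; a real module would load daily FX rates.
RATES_TO_USD = {"USD": Decimal("1.00"), "EUR": Decimal("1.08"), "GBP": Decimal("1.27")}

def to_usd(amount: Decimal, currency: str) -> Decimal:
    """Normalize an amount to USD, rounded to cents."""
    return (amount * RATES_TO_USD[currency]).quantize(Decimal("0.01"))

def to_utc(local_ts: str, tz_name: str) -> datetime:
    """Attach the source timezone to a naive timestamp and convert to UTC."""
    naive = datetime.fromisoformat(local_ts)
    return naive.replace(tzinfo=ZoneInfo(tz_name)).astimezone(ZoneInfo("UTC"))

print(to_usd(Decimal("100.00"), "EUR"))  # 108.00
print(to_utc("2024-03-01T09:30:00", "Europe/Berlin").isoformat())  # 2024-03-01T08:30:00+00:00
```

Doing money math in `Decimal` rather than `float` avoids rounding drift when amounts are later aggregated in the fact tables.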

Key Technical Decisions

PostgreSQL over Cloud DW

Right-sized for the data volume. Docker-deployable for portability. Full SQL power without cloud costs for this scale.

RapidFuzz for Deduplication

Fuzzy matching handles real-world customer name variations that exact matching misses. Configurable thresholds balance precision and recall.
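The core idea is token-sort similarity with a tunable threshold. The sketch below uses stdlib `difflib` as a stand-in for RapidFuzz's `token_sort_ratio` to show the mechanism; the scoring function and threshold default are illustrative, not the project's actual values:

```python
from difflib import SequenceMatcher

def token_sort_score(a: str, b: str) -> float:
    """Token-sort similarity (0-100), same idea as RapidFuzz's token_sort_ratio:
    sort the words so 'Smith, John' and 'John Smith' compare as equal."""
    norm = lambda s: " ".join(sorted(s.lower().replace(",", " ").split()))
    return 100 * SequenceMatcher(None, norm(a), norm(b)).ratio()

def dedupe(names: list[str], threshold: float = 90.0) -> list[str]:
    """Keep the first occurrence of each fuzzy-duplicate cluster."""
    kept: list[str] = []
    for name in names:
        if all(token_sort_score(name, k) < threshold for k in kept):
            kept.append(name)
    return kept

print(dedupe(["John Smith", "Smith, John", "Jon Smith", "Jane Doe"]))
# ['John Smith', 'Jane Doe']
```

Raising the threshold favors precision (fewer false merges); lowering it favors recall (fewer missed duplicates), which is the trade-off the configurable threshold exposes.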

Metabase over Tableau

Open-source, Docker-deployable, low barrier for stakeholder self-service. Embeddable for product analytics use cases.

Pydantic + Great Expectations

Pydantic for row-level schema validation at ingestion. Great Expectations for dataset-level quality assertions. Complementary layers.
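The two layers divide cleanly: Pydantic rejects individual malformed rows at ingestion, while dataset-level checks assert properties of the whole batch. A sketch with hypothetical field names; the dataset-level check is a stdlib stand-in written in the spirit of Great Expectations' expect_* assertions, not its actual API:

```python
from pydantic import BaseModel, ValidationError

# Row-level layer: each ingested record must parse into this schema.
class Transaction(BaseModel):
    transaction_id: str
    customer_id: str
    amount: float
    currency: str

def validate_rows(rows):
    """Split raw dicts into parsed models and per-row validation errors."""
    good, bad = [], []
    for row in rows:
        try:
            good.append(Transaction(**row))
        except ValidationError as exc:
            bad.append((row, str(exc)))
    return good, bad

# Dataset-level layer: a batch-wide assertion (GX stand-in).
def expect_unique(models, field):
    values = [getattr(m, field) for m in models]
    return len(values) == len(set(values))

good, bad = validate_rows([
    {"transaction_id": "t1", "customer_id": "c1", "amount": 9.99, "currency": "EUR"},
    {"transaction_id": "t2", "customer_id": "c1", "amount": "oops", "currency": "EUR"},
])
print(len(good), len(bad))                    # 1 1
print(expect_unique(good, "transaction_id"))  # True
```

Row-level validation catches type and shape errors one record at a time; only dataset-level checks can catch cross-row problems like duplicate keys or an empty batch.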

Results & Impact

  • Designed star schema with 6 dimension tables and 3 fact tables supporting customer behavior and revenue analysis
  • Implemented SCD Type 2 for the customer dimension, enabling historical journey analysis and cohort tracking across 4,268 records
  • Built 4 transformation modules: customer deduplication (RapidFuzz fuzzy matching), multi-currency conversion, timezone normalization, and data quality validation
  • Created aggregated data marts producing daily revenue summaries, customer lifetime value (CLV), cohort retention curves, and product performance KPIs
  • Deployed 3 Metabase dashboards with 100% availability using Docker Compose, with structured logging for pipeline health monitoring
  • Developed data quality framework with Great Expectations + Pydantic producing schema documentation and test coverage reports
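The SCD Type 2 customer dimension mentioned above preserves history by closing the current version of a row and appending a new one whenever a tracked attribute changes. A minimal in-memory sketch (column names like valid_from/valid_to/is_current are conventional SCD2 columns, assumed here rather than taken from the actual schema):

```python
from datetime import date

def scd2_upsert(dim_rows, key, new_attrs, today):
    """Close the current version of `key` if its attributes changed,
    then append a new current version effective `today`."""
    for row in dim_rows:
        if row["customer_id"] == key and row["is_current"]:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dim_rows  # nothing changed; keep the current version
            row["is_current"] = False
            row["valid_to"] = today
    dim_rows.append({"customer_id": key, **new_attrs,
                     "valid_from": today, "valid_to": None, "is_current": True})
    return dim_rows

dim = [{"customer_id": "c1", "segment": "trial",
        "valid_from": date(2024, 1, 1), "valid_to": None, "is_current": True}]
dim = scd2_upsert(dim, "c1", {"segment": "paid"}, date(2024, 3, 1))
print([(r["segment"], r["is_current"]) for r in dim])
# [('trial', False), ('paid', True)]
```

Because old versions are closed rather than overwritten, cohort and journey queries can join facts to the customer version that was current at the time of the event.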

Technologies Used

Orchestration

Apache Airflow

Database

PostgreSQL

Data Quality

Great Expectations, Pydantic

BI

Metabase, Looker Studio

Infrastructure

Docker Compose

Languages

Python, SQL

Libraries

RapidFuzz, pandas