Ecommerce ELT Pipeline

Abderrahmane Ouaday | Dec 9, 2024

Introduction

In this post, we’ll explore the Ecommerce ELT Pipeline project, which provides an efficient way to handle and analyze e-commerce data. The project, hosted on GitHub, showcases a robust ELT pipeline built with Spark, Hive, Airflow, and Hadoop.

Key Highlights

  • Data Extraction: Ingest raw e-commerce transaction data from CSV files.
  • Data Transformation: Clean, aggregate, and enrich data using Apache Spark.
  • Data Warehousing: Store processed data in Hive tables for querying.
  • Workflow Orchestration: Manage pipeline tasks using Apache Airflow.
  • Scalability: Leverage Hadoop for distributed data storage and processing.
  • Containerized Deployment: Use Docker to run the entire stack seamlessly.

Project Architecture

The pipeline is designed around the core ELT (Extract, Load, Transform) principles:

  1. Extraction: Ingests raw e-commerce transaction data from CSV files.
  2. Loading: Stages the raw data in the Hadoop Distributed File System (HDFS).
  3. Transformation: Cleans and enriches the data with Apache Spark and stores the processed results in Hive (see the sketch below).
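
To make the transformation step concrete, here is a minimal PySpark sketch of what it could look like. The HDFS path, column names, and Hive table name are illustrative assumptions, not code from the repository.

    # Illustrative PySpark transformation step; paths, columns, and the
    # Hive table name below are assumptions, not the repository's code.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("ecommerce-transform")
        .enableHiveSupport()  # allows writing managed tables to Hive
        .getOrCreate()
    )

    # Read the raw CSV previously staged in HDFS
    raw = spark.read.csv(
        "hdfs://namenode:9000/data/raw/transactions.csv",
        header=True,
        inferSchema=True,
    )

    # Clean: drop duplicates and rows missing key fields
    clean = raw.dropDuplicates().dropna(subset=["invoice_no", "quantity", "unit_price"])

    # Enrich: derive a per-line revenue column
    enriched = clean.withColumn("revenue", F.col("quantity") * F.col("unit_price"))

    # Store the processed data in a Hive table for querying
    enriched.write.mode("overwrite").saveAsTable("ecommerce.transactions_clean")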

Here’s an overview of the architecture:

[Figure: Airflow pipeline graph]


Quick Start Guide

Prerequisites

  • Docker
  • Docker Compose
  • At least 16 GB of RAM (recommended)
  • Git

Installation Steps

  1. Clone the Repository

    git clone https://github.com/AbderrahmaneOd/ecommerce-elt-pipeline.git
    cd ecommerce-elt-pipeline
    
  2. Launch Infrastructure

    docker-compose up -d
    
  3. Access Services

    Service               URL                     Credentials
    HDFS Namenode         http://localhost:9870   -
    YARN ResourceManager  http://localhost:8088   -
    Spark Master          http://localhost:8080   -
    Spark Worker          http://localhost:8081   -
    Zeppelin              http://localhost:8082   -
    Airflow               http://localhost:3000   admin@gmail.com / admin
  4. Configure Spark Connection

    • Navigate to Airflow UI
    • Go to Admin > Connections
    • Edit “spark_default”
      • Host: spark://namenode
      • Port: 7077
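
If you’d rather script this step than click through the UI, the same connection can be set programmatically. This is a sketch assuming a standard Airflow 2.x installation, run inside the Airflow container; it is not part of the repository.

    # Sketch: create or update spark_default programmatically, using the
    # same host/port values as the UI instructions above.
    from airflow.models import Connection
    from airflow.utils.session import create_session

    with create_session() as session:
        conn = (
            session.query(Connection)
            .filter(Connection.conn_id == "spark_default")
            .one_or_none()
        )
        if conn is None:
            conn = Connection(conn_id="spark_default")
            session.add(conn)
        conn.conn_type = "spark"
        conn.host = "spark://namenode"
        conn.port = 7077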

Pipeline Workflow

The Airflow DAG orchestrates the workflow in four steps:

  1. Wait for data file
  2. Upload data to HDFS
  3. Load data using Spark
  4. Transform data using Spark
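
Here is a minimal sketch of how these four tasks could be wired together in an Airflow 2.x DAG; the task IDs, file paths, and Spark job scripts are illustrative assumptions rather than the repository’s actual DAG code.

    # Sketch of the four-task DAG described above; operator choices,
    # paths, and job script names are illustrative assumptions.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
    from airflow.sensors.filesystem import FileSensor

    with DAG(
        dag_id="ecommerce_elt",
        start_date=datetime(2024, 1, 1),
        schedule=None,  # trigger manually or on data arrival
        catchup=False,
    ) as dag:
        # 1. Wait for the raw CSV to land (path is illustrative)
        wait_for_file = FileSensor(
            task_id="wait_for_data_file",
            filepath="/opt/airflow/data/transactions.csv",
            poke_interval=60,
        )

        # 2. Stage the raw file in HDFS
        upload_to_hdfs = BashOperator(
            task_id="upload_to_hdfs",
            bash_command=(
                "hdfs dfs -mkdir -p /data/raw && "
                "hdfs dfs -put -f /opt/airflow/data/transactions.csv /data/raw/"
            ),
        )

        # 3. and 4. Load, then transform, with Spark jobs submitted
        # through the spark_default connection configured earlier
        load_data = SparkSubmitOperator(
            task_id="load_data",
            application="/opt/airflow/jobs/load_data.py",
            conn_id="spark_default",
        )
        transform_data = SparkSubmitOperator(
            task_id="transform_data",
            application="/opt/airflow/jobs/transform_data.py",
            conn_id="spark_default",
        )

        wait_for_file >> upload_to_hdfs >> load_data >> transform_data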

Airflow DAG (Directed Acyclic Graph)

[Figure: Airflow pipeline graph view]


Data Visualization

The project includes a Streamlit app for intuitive visualization of business metrics, such as total revenue, top-selling products, and customer trends.

The Streamlit dashboard offers comprehensive e-commerce analytics:

  • Key Performance Metrics:
    • Total Transactions
    • Total Quantity Sold
    • Total Revenue
  • Interactive Visualizations:
    • Sales by Country
    • Top Selling Products
    • Sales Trend Over Time
    • Sales by Year
  • Filtering Capabilities:
    • Country-based filtering
    • Date range selection
    • Downloadable filtered data
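
For a rough idea of how such a dashboard comes together, here is a minimal Streamlit sketch; the input file and column names (invoice_no, quantity, revenue, country, invoice_date) are assumptions, since the real app reads the pipeline’s processed output.

    # Minimal Streamlit dashboard sketch; the input file and column
    # names below are assumptions, not the repository's actual schema.
    import pandas as pd
    import streamlit as st

    df = pd.read_csv("data/processed_transactions.csv", parse_dates=["invoice_date"])

    # Sidebar filters: country and date range
    countries = st.sidebar.multiselect("Country", sorted(df["country"].unique()))
    start, end = st.sidebar.date_input(
        "Date range", [df["invoice_date"].min(), df["invoice_date"].max()]
    )

    filtered = df[df["invoice_date"].between(pd.Timestamp(start), pd.Timestamp(end))]
    if countries:
        filtered = filtered[filtered["country"].isin(countries)]

    # Key performance metrics
    col1, col2, col3 = st.columns(3)
    col1.metric("Total Transactions", filtered["invoice_no"].nunique())
    col2.metric("Total Quantity Sold", int(filtered["quantity"].sum()))
    col3.metric("Total Revenue", f"${filtered['revenue'].sum():,.2f}")

    # Interactive visualizations: sales by country, trend over time, by year
    st.bar_chart(filtered.groupby("country")["revenue"].sum())
    st.line_chart(filtered.resample("M", on="invoice_date")["revenue"].sum())
    st.bar_chart(filtered.groupby(filtered["invoice_date"].dt.year)["revenue"].sum())

    # Downloadable filtered data
    st.download_button("Download CSV", filtered.to_csv(index=False), "filtered.csv")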
