Introduction
In this post, we’ll explore the E-Commerce ELT Pipeline project, which ingests, stores, and analyzes e-commerce transaction data. The project, hosted on GitHub, implements an ELT pipeline built with Spark, Hive, Airflow, and Hadoop.
Key Highlights
- Data Extraction: Ingest raw e-commerce transaction data from CSV files.
- Data Transformation: Clean, aggregate, and enrich data using Apache Spark.
- Data Warehousing: Store processed data in Hive tables for querying.
- Workflow Orchestration: Manage pipeline tasks using Apache Airflow.
- Scalability: Leverage Hadoop for distributed data storage and processing.
- Containerized Deployment: Use Docker to run the entire stack seamlessly.
Project Links
- GitHub Repository: E-Commerce ELT Pipeline (https://github.com/AbderrahmaneOd/ecommerce-elt-pipeline)
Project Architecture
The pipeline is designed around the core ELT (Extract, Load, Transform) principles:
- Extraction: Ingests raw e-commerce transaction data from CSV files.
- Loading: Stages raw data in the Hadoop Distributed File System (HDFS).
- Transformation: Applies data cleaning and enrichment techniques using Apache Spark and stores processed data in Hive.
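To make the ordering concrete, here is a minimal in-memory sketch of the three stages in plain Python. It only illustrates the ELT idea that raw data is staged unchanged before any cleaning happens; the real project uses HDFS and Spark for these steps. The column names (`invoice_no`, `quantity`, `unit_price`), the sample rows, and the cleaning rules are assumptions made up for the example.

```python
import csv
import io

# Hypothetical sample input; the real pipeline reads CSV files from disk.
RAW_CSV = """invoice_no,quantity,unit_price
A001,2,3.50
A002,,1.25
A003,-1,4.00
A004,5,2.00
"""


def extract(text):
    """Extract: parse CSV text into raw rows, keeping every value as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def load(rows, staging):
    """Load: stage the raw rows unchanged (stands in for writing to HDFS)."""
    staging.extend(rows)
    return staging


def transform(staging):
    """Transform: clean the staged rows and add a derived revenue column."""
    cleaned = []
    for row in staging:
        if not row["quantity"]:
            continue  # drop rows with a missing quantity
        quantity = int(row["quantity"])
        if quantity <= 0:
            continue  # drop rows with a zero or negative quantity
        unit_price = float(row["unit_price"])
        cleaned.append(
            {
                "invoice_no": row["invoice_no"],
                "quantity": quantity,
                "unit_price": unit_price,
                "revenue": round(quantity * unit_price, 2),
            }
        )
    return cleaned


staging_area = load(extract(RAW_CSV), [])
processed = transform(staging_area)

print(len(staging_area))  # all raw rows are staged, including the bad ones
print(processed)          # only the cleaned rows, with revenue added
```

Because the raw rows stay in the staging area, the transformation can be changed and re-run later without extracting the source files again.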
Here’s an overview of the architecture:
[Figure: pipeline architecture diagram]
Quick Start Guide
Prerequisites
- Docker
- Docker Compose
- At least 16 GB of RAM (recommended)
- Git
Installation Steps
- Clone the Repository
  git clone https://github.com/AbderrahmaneOd/ecommerce-elt-pipeline.git
  cd ecommerce-elt-pipeline
- Launch Infrastructure
  docker-compose up -d
- Access Services
  | Service | URL | Credentials |
  | --- | --- | --- |
  | HDFS Namenode | http://localhost:9870 | - |
  | YARN ResourceManager | http://localhost:8088 | - |
  | Spark Master | http://localhost:8080 | - |
  | Spark Worker | http://localhost:8081 | - |
  | Zeppelin | http://localhost:8082 | - |
  | Airflow | http://localhost:3000 | admin@gmail.com / admin |
- Configure Spark Connection
  - Navigate to the Airflow UI
  - Go to Admin > Connections
  - Edit “spark_default”
  - Host: spark://namenode
  - Port: 7077
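Once the containers are running, it can be handy to confirm that each web UI from the Access Services step actually responds. The following is a small sketch using only the Python standard library. The URLs come from the service list above, while `SERVICES`, `is_up`, and `check_services` are illustrative helper names, not part of the project.

```python
import http.client
import urllib.error
import urllib.request

# Web UI endpoints from the Access Services step.
SERVICES = {
    "HDFS Namenode": "http://localhost:9870",
    "YARN ResourceManager": "http://localhost:8088",
    "Spark Master": "http://localhost:8080",
    "Spark Worker": "http://localhost:8081",
    "Zeppelin": "http://localhost:8082",
    "Airflow": "http://localhost:3000",
}


def is_up(url, timeout=3.0):
    """Return True if the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, even if with an error status
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False  # connection refused, timeout, malformed reply, ...


def check_services(services, probe=is_up):
    """Map each service name to whether its web UI is reachable."""
    return {name: probe(url) for name, url in services.items()}


if __name__ == "__main__":
    for name, up in check_services(SERVICES).items():
        print(f"{name:<22} {'up' if up else 'DOWN'}")
```

Some services may take a minute or two to start after `docker-compose up -d`, so a service reported as down shortly after launch is not necessarily broken.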
Pipeline Workflow
The Airflow DAG runs the following tasks in sequence:
- Wait for data file
- Upload data to HDFS
- Load data using Spark
- Transform data using Spark
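The four steps form a linear dependency chain. As a rough, Airflow-free model of that ordering, the sketch below uses Python's standard `graphlib` module (Python 3.9+). The task names are hypothetical labels for the steps above, not identifiers taken from the project's DAG.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# Task names are illustrative labels, not the project's actual task IDs.
DEPENDENCIES = {
    "wait_for_data_file": set(),
    "upload_to_hdfs": {"wait_for_data_file"},
    "spark_load": {"upload_to_hdfs"},
    "spark_transform": {"spark_load"},
}


def execution_order(dependencies):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(" -> ".join(order))
```

In an Airflow DAG file, the same chain is typically written by connecting task objects with the `>>` operator.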
Data Visualization
The project includes a Streamlit app that visualizes business metrics such as total revenue, top-selling products, and customer trends.
The dashboard provides:
- Key Performance Metrics:
  - Total Transactions
  - Total Quantity Sold
  - Total Revenue
- Interactive Visualizations:
  - Sales by Country
  - Top Selling Products
  - Sales Trend Over Time
  - Sales by Year
- Filtering Capabilities:
  - Country-based filtering
  - Date range selection
  - Downloadable filtered data
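To give a sense of what sits behind the key metrics and filters, here is a standard-library sketch of the calculations. It does not use Streamlit and is not the project's dashboard code. The column names, the sample rows, and the choice to count a transaction as a distinct `invoice_no` are all assumptions for illustration.

```python
import csv
import io
from datetime import date

# Hypothetical sample of processed data; column names are assumptions.
SAMPLE = """invoice_no,invoice_date,country,quantity,unit_price
A001,2011-01-05,France,2,3.50
A001,2011-01-05,France,1,10.00
A002,2011-02-10,Germany,4,2.25
A003,2011-03-15,France,3,1.00
"""


def read_rows(text):
    """Parse CSV text into typed rows."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                "invoice_no": raw["invoice_no"],
                "invoice_date": date.fromisoformat(raw["invoice_date"]),
                "country": raw["country"],
                "quantity": int(raw["quantity"]),
                "unit_price": float(raw["unit_price"]),
            }
        )
    return rows


def filter_rows(rows, country=None, start=None, end=None):
    """Keep rows matching the optional country and inclusive date range."""
    return [
        r for r in rows
        if (country is None or r["country"] == country)
        and (start is None or r["invoice_date"] >= start)
        and (end is None or r["invoice_date"] <= end)
    ]


def kpis(rows):
    """Compute the three headline metrics for a set of rows."""
    return {
        # Assumption: one transaction per distinct invoice number.
        "total_transactions": len({r["invoice_no"] for r in rows}),
        "total_quantity": sum(r["quantity"] for r in rows),
        "total_revenue": round(sum(r["quantity"] * r["unit_price"] for r in rows), 2),
    }


rows = read_rows(SAMPLE)
print(kpis(rows))
print(kpis(filter_rows(rows, country="France", end=date(2011, 1, 31))))
```

Applying the filter before computing the metrics mirrors how a dashboard would recompute its totals whenever the country or date range selection changes.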