Piscine Data Science
42 Abu Dhabi

End-to-End Data Science Pipeline

2 weeks
Individual Project
Python · PostgreSQL · Docker · Pandas · Matplotlib · Seaborn

A comprehensive data science project covering database creation, data warehouse development, and data visualization. Implements a complete data pipeline from raw data to actionable insights using PostgreSQL, Python, and modern visualization tools.

Key Features

Database Infrastructure

PostgreSQL database setup with Docker containerization, table creation, and basic SQL query operations.
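
A minimal sketch of this setup with psycopg2; the connection parameters and the items schema are illustrative examples, not the project's exact definitions:

python
import psycopg2

# Hypothetical connection parameters; in the project they come from the
# Docker environment (see Docker Integration below).
connection = psycopg2.connect(
    host="localhost",
    dbname="piscineds",
    user="postgres",
    password="secret",
)

with connection, connection.cursor() as cursor:
    # Table creation (the column layout is an assumed example).
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS items (
            product_id    INTEGER,
            category_id   BIGINT,
            category_code TEXT,
            brand         TEXT
        )
    """)
    # Basic SQL query operation: count the rows in the new table.
    cursor.execute("SELECT COUNT(*) FROM items")
    print(cursor.fetchone()[0])

connection.close()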

Data Warehouse

Complete data warehouse implementation with customer and item data management, ETL processes, and data fusion.
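
The fusion step can be sketched as a SQL join run from Python, assuming a customers events table and an items metadata table that share a product_id key (all names are illustrative):

python
# Enrich customer events with item metadata before analysis. Assumes an
# open psycopg2 connection and the illustrative customers/items tables.
FUSION_SQL = """
    CREATE TABLE IF NOT EXISTS customers_fused AS
    SELECT c.*, i.category_code, i.brand
    FROM customers AS c
    LEFT JOIN items AS i USING (product_id)
"""

def fuse_tables(connection):
    with connection, connection.cursor() as cursor:
        cursor.execute(FUSION_SQL)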

Data Cleaning

Comprehensive data cleaning and preprocessing techniques to ensure data quality and consistency.
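
A sketch of the kind of cleaning pass this involves, using Pandas; the column names (event_time, product_id, price, brand) are assumptions for illustration:

python
import pandas as pd

# Load a raw extract (path and columns are illustrative).
df = pd.read_csv("data/customers_raw.csv")

# Normalize types and text fields.
df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")
df["brand"] = df["brand"].str.strip().str.lower()

# Validate: drop rows missing required keys or with impossible values.
df = df.dropna(subset=["event_time", "product_id"])
df = df[df["price"] >= 0]

# Remove exact duplicates before warehouse insertion.
df = df.drop_duplicates()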

Data Visualization

Pie charts and statistical visualizations built with Matplotlib and Seaborn for clear, insightful data presentation.

Docker Integration

Containerized PostgreSQL environment for consistent development and deployment across systems.
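
In code, this means credentials are read from the container's environment rather than hardcoded. The POSTGRES_* variable names mirror the official postgres image's conventions; the defaults here are illustrative:

python
import os
import psycopg2

def get_connection():
    """Connect using the credentials injected by the Docker environment."""
    return psycopg2.connect(
        host=os.environ.get("POSTGRES_HOST", "localhost"),
        port=int(os.environ.get("POSTGRES_PORT", "5432")),
        dbname=os.environ.get("POSTGRES_DB", "piscineds"),
        user=os.environ.get("POSTGRES_USER", "postgres"),
        password=os.environ["POSTGRES_PASSWORD"],  # required; never hardcoded
    )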

Modular Architecture

Progressive module structure allowing independent execution while maintaining data dependencies.
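
One way a later module can verify its upstream dependency before running, sketched with Postgres's to_regclass helper (the table names are illustrative):

python
def table_exists(connection, table_name):
    """Return True if the given relation exists in the database."""
    with connection.cursor() as cursor:
        # to_regclass() returns NULL when the relation does not exist.
        cursor.execute("SELECT to_regclass(%s)", (table_name,))
        return cursor.fetchone()[0] is not None

def require_table(connection, table_name):
    """Fail fast when an upstream module has not run yet."""
    if not table_exists(connection, table_name):
        raise SystemExit(f"Missing table {table_name!r}; run the earlier module first.")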

Development Journey

Module 00

Database Creation

Docker setup for PostgreSQL, basic SQL queries, table operations, and automatic table creation from raw data files.
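
The automatic-creation step can be sketched like this: one table per CSV file, named after the file, with columns taken from the header row. Real type inference is omitted (everything lands as TEXT), and the helper name is an assumption:

python
import csv
import os

def create_tables_from_csvs(connection, data_dir="data"):
    """Create and populate one table per CSV file in data_dir."""
    for filename in sorted(os.listdir(data_dir)):
        if not filename.endswith(".csv"):
            continue
        table = os.path.splitext(filename)[0]
        path = os.path.join(data_dir, filename)
        with open(path, newline="") as f:
            header = next(csv.reader(f))
        columns = ", ".join(f'"{col}" TEXT' for col in header)
        with connection, connection.cursor() as cursor:
            cursor.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
            # Bulk-load the file with COPY, skipping the header row.
            with open(path, newline="") as f:
                cursor.copy_expert(
                    f'COPY "{table}" FROM STDIN WITH (FORMAT csv, HEADER true)',
                    f,
                )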

Module 01

Data Warehouse Development

Customer table creation, comprehensive data cleaning, and data fusion techniques for warehouse population.
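
Warehouse population can be sketched as a union of the per-month tables from Module 00 into a single customers table, followed by de-duplication; the monthly table names are illustrative:

python
# Fuse the monthly event tables into one customers table, then remove
# exact duplicate rows. All table names are assumed for illustration.
POPULATE_SQL = """
    CREATE TABLE IF NOT EXISTS customers AS
    SELECT * FROM data_2022_oct
    UNION ALL SELECT * FROM data_2022_nov
    UNION ALL SELECT * FROM data_2022_dec
"""

DEDUP_SQL = """
    CREATE TABLE customers_dedup AS SELECT DISTINCT * FROM customers;
    DROP TABLE customers;
    ALTER TABLE customers_dedup RENAME TO customers;
"""

def populate_warehouse(connection):
    with connection, connection.cursor() as cursor:
        cursor.execute(POPULATE_SQL)
        cursor.execute(DEDUP_SQL)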

Module 02

Data Visualization

Creation of insightful visualizations including pie charts and statistical graphs from warehouse data.

Challenges & Solutions

Database Container Management

Problem:

Setting up a consistent PostgreSQL environment that works across different development machines.

Solution:

Implemented Docker containerization with environment variables for database credentials, ensuring reproducible deployments.

Data Quality Assurance

Problem:

Handling inconsistent and messy raw data from multiple sources with varying formats.

Solution:

Developed comprehensive data cleaning pipelines using Pandas to normalize, validate, and transform data before warehouse insertion.

ETL Pipeline Design

Problem:

Creating an efficient extract-transform-load (ETL) process that handles large datasets without exhausting memory.

Solution:

Implemented modular ETL processes with batch processing and proper connection management for scalable data operations.
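
A sketch of the batching approach, using Pandas chunked reads and psycopg2's execute_values for grouped inserts (the file path, table, and chunk size are illustrative):

python
import pandas as pd
from psycopg2.extras import execute_values

def load_in_batches(connection, csv_path, table, chunk_rows=50_000):
    """Stream a large CSV into Postgres without loading it all into memory."""
    for chunk in pd.read_csv(csv_path, chunksize=chunk_rows):
        rows = list(chunk.itertuples(index=False, name=None))
        with connection, connection.cursor() as cursor:
            # One round trip per chunk; the transaction commits per batch,
            # so a failure never loses more than one chunk of work.
            execute_values(cursor, f"INSERT INTO {table} VALUES %s", rows)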

data_visualization.py
python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import psycopg2

def create_pie_chart(connection):
    """Generate pie chart from customer data"""
    query = """
        SELECT category, COUNT(*) as count
        FROM customers
        GROUP BY category
        ORDER BY count DESC
    """
    
    df = pd.read_sql(query, connection)
    
    plt.figure(figsize=(10, 8))
    colors = sns.color_palette("husl", len(df))
    
    plt.pie(df['count'], 
            labels=df['category'],
            autopct='%1.1f%%',
            colors=colors,
            startangle=90)
    
    plt.title('Customer Distribution by Category')
    plt.tight_layout()
    plt.savefig('customer_distribution.png', dpi=300)
    plt.show()