
Piscine Data Science
End-to-End Data Science Pipeline
A comprehensive data science project covering database creation, data warehouse development, and data visualization. It implements a complete pipeline from raw data to actionable insights using PostgreSQL, Python, and modern visualization tools.
Key Features
Database Infrastructure
PostgreSQL database setup with Docker containerization, table creation, and basic SQL query operations.
Data Warehouse
Complete data warehouse implementation with customer and item data management, ETL processes, and data fusion.
Data Cleaning
Comprehensive data cleaning and preprocessing techniques to ensure data quality and consistency.
Data Visualization
Interactive pie charts and visualizations using Matplotlib and Seaborn for insightful data presentation.
Docker Integration
Containerized PostgreSQL environment for consistent development and deployment across systems.
Modular Architecture
Progressive module structure that lets each stage run independently while respecting the data dependencies between stages.
Development Journey
Database Creation
Docker setup for PostgreSQL, basic SQL queries, table operations, and automatic table creation from raw data files.
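A minimal sketch of how automatic table creation from a raw file could be done with psycopg2; the CSV path, table name, and all-TEXT column typing are illustrative assumptions rather than the project's actual schema.
import csv
import psycopg2

def create_table_from_csv(connection, csv_path, table_name):
    """Create a table whose columns mirror a raw CSV file's header.

    Every column is created as TEXT for simplicity; real typing
    (timestamps, numerics) would be inferred or declared per dataset.
    """
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))

    columns = ", ".join(f'"{col}" TEXT' for col in header)
    with connection.cursor() as cur:
        cur.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({columns});')
        # COPY streams the file straight into PostgreSQL, which is much
        # faster than row-by-row INSERTs for large raw data files.
        with open(csv_path) as f:
            cur.copy_expert(f'COPY "{table_name}" FROM STDIN WITH CSV HEADER', f)
    connection.commit()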
Data Warehouse Development
Customer table creation, comprehensive data cleaning, and data fusion techniques for warehouse population.
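One way the data fusion step could be expressed is a UNION of identically structured source tables into a single customers table, deduplicated afterwards; the table names and the target name are assumptions for illustration.
import psycopg2

def fuse_customer_tables(connection, source_tables, target_table="customers"):
    """Merge several identically structured tables into one warehouse table."""
    union_sql = " UNION ALL ".join(f'SELECT * FROM "{t}"' for t in source_tables)
    with connection.cursor() as cur:
        cur.execute(f'DROP TABLE IF EXISTS "{target_table}";')
        # UNION ALL keeps every row; exact duplicates from overlapping
        # exports are removed in a second pass with SELECT DISTINCT.
        cur.execute(f'CREATE TABLE "{target_table}" AS {union_sql};')
        cur.execute(
            f'CREATE TABLE "{target_table}_dedup" AS '
            f'SELECT DISTINCT * FROM "{target_table}";'
        )
        cur.execute(f'DROP TABLE "{target_table}";')
        cur.execute(f'ALTER TABLE "{target_table}_dedup" RENAME TO "{target_table}";')
    connection.commit()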
Data Visualization
Creation of insightful visualizations including pie charts and statistical graphs from warehouse data.
Challenges & Solutions
Database Container Management
Setting up a consistent PostgreSQL environment that works across different development machines.
Implemented Docker containerization with environment variables for database credentials, ensuring reproducible deployments.
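A sketch of how the containerized database might be reached from Python, assuming the usual POSTGRES_* variables are exported to both the container and the client; the variable names and defaults below are assumptions.
import os
import psycopg2

def get_connection():
    """Open a PostgreSQL connection using environment variables.

    Keeping credentials in the environment (rather than in code) lets the
    same scripts run against any container or host without modification.
    """
    return psycopg2.connect(
        host=os.environ.get("POSTGRES_HOST", "localhost"),
        port=int(os.environ.get("POSTGRES_PORT", "5432")),
        dbname=os.environ["POSTGRES_DB"],
        user=os.environ["POSTGRES_USER"],
        password=os.environ["POSTGRES_PASSWORD"],
    )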
Data Quality Assurance
Handling inconsistent and messy raw data from multiple sources with varying formats.
Developed comprehensive data cleaning pipelines using Pandas to normalize, validate, and transform data before warehouse insertion.
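A condensed sketch of the kind of Pandas cleaning steps described above; the column names (event_time, price) are placeholders, not the project's real schema.
import pandas as pd

def clean_customer_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize, validate, and transform a raw customer DataFrame."""
    cleaned = df.copy()

    # Normalize column names so downstream SQL and code agree on casing.
    cleaned.columns = [c.strip().lower().replace(" ", "_") for c in cleaned.columns]

    # Drop exact duplicates produced by repeated exports.
    cleaned = cleaned.drop_duplicates()

    # Parse timestamps; unparseable values become NaT and are dropped
    # as invalid (column name is illustrative).
    if "event_time" in cleaned.columns:
        cleaned["event_time"] = pd.to_datetime(cleaned["event_time"], errors="coerce")
        cleaned = cleaned.dropna(subset=["event_time"])

    # Coerce numeric fields and discard rows with impossible values.
    if "price" in cleaned.columns:
        cleaned["price"] = pd.to_numeric(cleaned["price"], errors="coerce")
        cleaned = cleaned[cleaned["price"] >= 0]

    return cleaned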
ETL Pipeline Design
Creating an efficient extract, transform, load process that handles large datasets without memory issues.
Implemented modular ETL processes with batch processing and proper connection management for scalable data operations.
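The batch-oriented load step could look roughly like this, reading the source in chunks so memory use stays bounded; the chunk size, file path, and table name are illustrative assumptions.
import pandas as pd
from psycopg2.extras import execute_values

def load_csv_in_batches(connection, csv_path, table_name, chunk_size=10_000):
    """Stream a large CSV into PostgreSQL in fixed-size batches.

    Reading with `chunksize` keeps only one chunk in memory at a time,
    and execute_values groups each chunk into a single multi-row INSERT.
    """
    with connection.cursor() as cur:
        for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
            columns = ", ".join(f'"{c}"' for c in chunk.columns)
            execute_values(
                cur,
                f'INSERT INTO "{table_name}" ({columns}) VALUES %s',
                chunk.itertuples(index=False, name=None),
            )
            connection.commit()  # commit per batch so progress is not lost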
# Visualization snippet: pie chart of customer distribution by category.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import psycopg2

def create_pie_chart(connection):
    """Generate a pie chart of customer counts per category."""
    # Aggregate in PostgreSQL so only the summary rows reach Python.
    query = """
        SELECT category, COUNT(*) AS count
        FROM customers
        GROUP BY category
        ORDER BY count DESC
    """
    df = pd.read_sql(query, connection)

    plt.figure(figsize=(10, 8))
    colors = sns.color_palette("husl", len(df))
    plt.pie(df['count'],
            labels=df['category'],
            autopct='%1.1f%%',
            colors=colors,
            startangle=90)
    plt.title('Customer Distribution by Category')
    plt.tight_layout()
    plt.savefig('customer_distribution.png', dpi=300)
    plt.show()