In this case study, we will walk through a comprehensive analysis of sales data. The goal is to apply the techniques and methods learned in the previous modules to a real-world dataset, enabling you to gain practical experience in data analysis.

Objectives

  • Understand the sales dataset and its structure.
  • Perform data cleaning and preparation.
  • Conduct exploratory data analysis (EDA).
  • Build and evaluate predictive models.
  • Communicate findings effectively.

Dataset Overview

The dataset contains sales data for a retail company. The key columns include:

  • Date: The date of the sale.
  • Store: The store where the sale occurred.
  • Product: The product sold.
  • Sales: The amount of sales in dollars.
  • Quantity: The number of units sold.
  • Price: The price per unit.
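If sales_data.csv is not at hand, the steps in this case study can still be tried on synthetic data. The sketch below builds a small stand-in dataset with the six columns listed above and checks that none is missing. The helper name make_sample_sales, the store and product labels, the price points, and the rule Sales = Quantity * Price are all assumptions made for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical helper: builds a synthetic dataset with the columns described above.
def make_sample_sales(n_rows=200, seed=0):
    rng = np.random.default_rng(seed)
    quantity = rng.integers(1, 20, size=n_rows)           # units sold, 1 to 19
    price = rng.choice([4.99, 9.99, 19.99], size=n_rows)  # made-up price points
    return pd.DataFrame({
        "Date": pd.date_range("2023-01-01", periods=n_rows, freq="D"),
        "Store": rng.choice(["Store 1", "Store 2", "Store 3"], size=n_rows),
        "Product": rng.choice(["Product A", "Product B", "Product C"], size=n_rows),
        "Sales": quantity * price,  # assumption: Sales = Quantity * Price
        "Quantity": quantity,
        "Price": price,
    })

sample = make_sample_sales()

# Verify that every expected column is present before starting the analysis
expected_columns = ["Date", "Store", "Product", "Sales", "Quantity", "Price"]
missing = [col for col in expected_columns if col not in sample.columns]
print("Missing columns:", missing)
print(sample.dtypes)
```

The same column check can be run on the real data right after loading it, which catches renamed or misspelled headers early.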

Step 1: Data Collection and Preparation

1.1 Loading the Dataset

First, we need to load the dataset into our analysis environment. We will use Python and the pandas library for this task.

import pandas as pd

# Load the dataset
data = pd.read_csv('sales_data.csv')

# Display the first few rows of the dataset
print(data.head())

1.2 Data Cleaning

Identify and handle missing data, duplicates, and incorrect data types.

# Check for missing values
print(data.isnull().sum())

# Drop rows with missing values
data = data.dropna()

# Check for duplicates
print(data.duplicated().sum())

# Drop duplicates
data = data.drop_duplicates()

# Convert 'Date' column to datetime type
data['Date'] = pd.to_datetime(data['Date'])

# Summarize the cleaned dataset (info() prints directly and returns None)
data.info()
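Incorrect data types often show up as numeric columns that were read as text. One possible way to handle them, sketched here on a tiny made-up frame, is to coerce each column to its intended type so that unparseable entries become missing values, which can then be counted and dropped like any other missing data. The sample values below are invented for illustration.

```python
import pandas as pd

# Made-up raw data in which every column was read as text
raw = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "not a date"],
    "Sales": ["19.98", "n/a", "9.99"],
    "Quantity": ["2", "3", "1"],
    "Price": ["9.99", "9.99", "9.99"],
})

# errors="coerce" converts entries that cannot be parsed into NaN (or NaT for dates)
for col in ["Sales", "Quantity", "Price"]:
    raw[col] = pd.to_numeric(raw[col], errors="coerce")
raw["Date"] = pd.to_datetime(raw["Date"], errors="coerce")

# Count the values that failed to parse, then drop the affected rows
print(raw.isnull().sum())
clean = raw.dropna()
print(clean.dtypes)
```

Counting the coerced values before dropping them shows how much data is lost, which helps decide whether dropping is acceptable or another strategy is needed.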

1.3 Data Transformation

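Standardize the Sales column to z-scores (zero mean, unit standard deviation). The result is stored in a new column, Sales_scaled, so that the original dollar amounts remain available for the totals, plots, and model in the following steps.

# Standardize Sales to z-scores in a new column, keeping the dollar values intact
data['Sales_scaled'] = (data['Sales'] - data['Sales'].mean()) / data['Sales'].std()

# Display the transformed dataset
print(data.head())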
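The z-score formula used above can be checked on a small series. The sketch below, using made-up values, confirms that the standardized result has a mean of about 0 and a standard deviation of about 1, and that the original dollar amounts can be recovered by inverting the transformation.

```python
import pandas as pd

# Made-up sales amounts in dollars
sales = pd.Series([100.0, 150.0, 200.0, 250.0, 300.0], name="Sales")

# z-score: subtract the mean, divide by the sample standard deviation
# (pandas Series.std() uses ddof=1 by default)
mean, std = sales.mean(), sales.std()
z = (sales - mean) / std

print("mean of z:", round(z.mean(), 10))
print("std of z:", round(z.std(), 10))

# Invert the transformation to get the dollar amounts back
restored = z * std + mean
print(restored.tolist())
```

Keeping the mean and standard deviation around is what makes the inversion possible, so they are worth storing if scaled values are ever reported back in dollars.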

Step 2: Exploratory Data Analysis (EDA)

2.1 Descriptive Statistics

Calculate basic statistics to understand the dataset better.

# Descriptive statistics
print(data.describe())
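Overall statistics can hide differences between groups. As a possible complement to describe(), the sketch below computes per-store totals and averages with groupby and named aggregations. The small frame and the output column names total_sales, mean_sales, and units are made up for illustration.

```python
import pandas as pd

# Made-up data with the Store, Sales, and Quantity columns
df = pd.DataFrame({
    "Store": ["Store 1", "Store 1", "Store 2", "Store 2", "Store 3"],
    "Sales": [10.0, 30.0, 20.0, 40.0, 50.0],
    "Quantity": [1, 3, 2, 4, 5],
})

# Named aggregation: each output column is defined as (source column, function)
by_store = df.groupby("Store").agg(
    total_sales=("Sales", "sum"),
    mean_sales=("Sales", "mean"),
    units=("Quantity", "sum"),
)
print(by_store)
```

Grouping by Product instead of Store gives the corresponding per-product summary.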

2.2 Data Visualization

Visualize the sales trends over time and across different stores and products.

import matplotlib.pyplot as plt

# Sales trend over time (total sales per day, in date order)
daily_sales = data.groupby('Date')['Sales'].sum()
plt.figure(figsize=(10, 6))
plt.plot(daily_sales.index, daily_sales.values)
plt.title('Sales Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()

# Sales by store
plt.figure(figsize=(10, 6))
data.groupby('Store')['Sales'].sum().plot(kind='bar')
plt.title('Total Sales by Store')
plt.xlabel('Store')
plt.ylabel('Sales')
plt.show()

# Sales by product
plt.figure(figsize=(10, 6))
data.groupby('Product')['Sales'].sum().plot(kind='bar')
plt.title('Total Sales by Product')
plt.xlabel('Product')
plt.ylabel('Sales')
plt.show()
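Daily totals can be noisy, which makes longer-term patterns hard to see. One option, sketched here on made-up data, is to resample the series to monthly totals before plotting. The frequency alias "MS" labels each month by its first day.

```python
import pandas as pd

# Made-up transactions spread over three months
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-03-15", "2024-03-16"]
    ),
    "Sales": [100.0, 50.0, 80.0, 40.0, 60.0],
})

# resample() needs a datetime index; "MS" groups rows by calendar month
monthly = df.set_index("Date")["Sales"].resample("MS").sum()
print(monthly)
```

The resulting series can be passed to plt.plot(monthly.index, monthly.values) in the same way as the daily data.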

Step 3: Data Modeling

3.1 Building a Predictive Model

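We will build a linear regression model to predict Sales from Quantity and Price. Note that if Sales is simply Quantity × Price, the target is an exact, nonlinear function of the two features, so treat this model as a walkthrough of the modeling workflow rather than a realistic forecast.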

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Prepare the data for modeling
X = data[['Quantity', 'Price']]
y = data['Sales']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
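The mean squared error is expressed in squared units of the target, which makes it hard to interpret on its own. The sketch below computes two complementary measures, the root mean squared error (in the same units as the target) and R², and compares the model against a baseline that always predicts the training mean. The function name regression_report and the sample numbers are made up for illustration; in the case study, y_test, y_pred, and y_train.mean() would take their place.

```python
import numpy as np

# Hypothetical helper: summarizes regression quality against a mean-only baseline
def regression_report(y_true, y_pred, y_train_mean):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # RMSE: square root of the mean squared error
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

    # R^2: 1 minus the ratio of residual to total sum of squares
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    # Baseline: always predict the mean of the training target
    baseline_rmse = np.sqrt(np.mean((y_true - y_train_mean) ** 2))
    return {"rmse": rmse, "r2": r2, "baseline_rmse": baseline_rmse}

# Made-up actual and predicted values
report = regression_report(
    y_true=[10.0, 20.0, 30.0, 40.0],
    y_pred=[12.0, 18.0, 33.0, 39.0],
    y_train_mean=25.0,
)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

A model is only useful if its RMSE is clearly below the baseline RMSE; if the two are close, the features add little beyond the average.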

Step 4: Communication of Results

4.1 Summary of Findings

Summarize the key findings from the analysis in a few concrete statements. For example:

  • The sales trend shows a seasonal pattern with peaks during certain periods.
  • Store 3 has the highest total sales, while Store 1 has the lowest.
  • Product B is the best-selling product across all stores.
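Statements like these are easier to keep accurate when they are computed from the data rather than typed by hand. The sketch below shows one way to derive the store and product rankings with groupby, idxmax, and idxmin. The small frame is made up for illustration and does not reproduce the real dataset.

```python
import pandas as pd

# Made-up data with the Store, Product, and Sales columns
df = pd.DataFrame({
    "Store": ["Store 1", "Store 2", "Store 3", "Store 3", "Store 1"],
    "Product": ["Product A", "Product B", "Product B", "Product A", "Product B"],
    "Sales": [10.0, 25.0, 40.0, 30.0, 5.0],
})

# Total sales per store and per product
store_totals = df.groupby("Store")["Sales"].sum()
product_totals = df.groupby("Product")["Sales"].sum()

# idxmax/idxmin return the group label with the largest/smallest total
top_store = store_totals.idxmax()
bottom_store = store_totals.idxmin()
top_product = product_totals.idxmax()

print(f"Highest total sales: {top_store} ({store_totals.max():.2f})")
print(f"Lowest total sales: {bottom_store} ({store_totals.min():.2f})")
print(f"Best-selling product by revenue: {top_product}")
```

Note that this ranks products by revenue; ranking by units sold would use the Quantity column instead and may give a different answer.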

4.2 Visualization of Results

Create visualizations to support the findings.

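# Predicted vs Actual Sales
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred)
# Diagonal reference line: points on it are predicted exactly
limits = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(limits, limits, linestyle='--')
plt.title('Predicted vs Actual Sales')
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.show()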

4.3 Reporting

Prepare a report to communicate the results to stakeholders.

  • Introduction: Brief overview of the analysis objectives and dataset.
  • Data Preparation: Steps taken to clean and prepare the data.
  • Exploratory Analysis: Key insights from the EDA.
  • Modeling: Description of the predictive model and its performance.
  • Conclusion: Summary of findings and recommendations.

Conclusion

In this case study, we applied various data analysis techniques to a sales dataset. We performed data cleaning, exploratory data analysis, and built a predictive model. Finally, we communicated our findings through visualizations and a detailed report. This hands-on experience reinforces the concepts learned in the course and prepares you for real-world data analysis tasks.
