Regression analysis is a powerful statistical method that allows you to examine the relationship between two or more variables of interest. While there are many types of regression analysis, the most basic and commonly used is linear regression. This module will cover the fundamental concepts, methods, and applications of regression analysis.

Key Concepts

  1. Dependent and Independent Variables

  • Dependent Variable (Y): The outcome, or the variable you are trying to predict or explain.
  • Independent Variable (X): The predictor, or the variable you are using to predict the dependent variable.

  2. Simple Linear Regression

  • Equation: \( Y = \beta_0 + \beta_1X + \epsilon \)
    • \( Y \): Dependent variable
    • \( X \): Independent variable
    • \( \beta_0 \): Intercept
    • \( \beta_1 \): Slope
    • \( \epsilon \): Error term
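For a single predictor, the least-squares estimates of the intercept and slope have a closed form. The following is a minimal sketch of that calculation; the helper name fit_simple_ols and the four data points are made up for illustration and are not part of this module's examples.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope = covariation of x and y divided by the variation of x
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # The fitted line always passes through the point (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Illustrative data (made up): y is exactly 45 + 5x, so the error term is zero
x = [1, 2, 3, 4]
y = [50, 55, 60, 65]
intercept, slope = fit_simple_ols(x, y)
print(f"beta_0 = {intercept:.2f}, beta_1 = {slope:.2f}")
```

With real data the points will not fall exactly on a line, and the same formulas give the line that minimizes the squared vertical distances.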

  3. Multiple Linear Regression

  • Equation: \( Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \epsilon \)
    • \( X_1, X_2, \ldots, X_n \): Multiple independent variables
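With several predictors, the coefficients can be estimated by solving a least-squares problem on a design matrix whose first column is all ones (for the intercept). The sketch below uses NumPy's np.linalg.lstsq on made-up, noise-free data; the variable names are illustrative.

```python
import numpy as np

# Made-up data generated from y = 1 + 2*x1 + 3*x2 with no noise
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# Prepend a column of ones so the first coefficient plays the role of beta_0
design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of design @ beta ~ y
beta, residuals, rank, singular_values = np.linalg.lstsq(design, y, rcond=None)
print("Estimated [beta_0, beta_1, beta_2]:", beta)
```

Because the data contain no noise, the estimates recover the generating coefficients up to floating-point error.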

Steps in Regression Analysis

  1. Data Collection and Preparation

  • Gather data relevant to the variables of interest.
  • Clean the data to handle missing values, outliers, and ensure consistency.
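As a rough illustration of the cleaning step, the sketch below drops rows with missing values and filters outliers with the 1.5 × IQR rule using pandas. The data are made up, and both choices are assumptions: imputing missing values or using a different outlier rule may suit a given dataset better.

```python
import numpy as np
import pandas as pd

# Made-up raw data with one missing value and one implausible score
raw = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [50, 55, np.nan, 65, 70, 75, 400, 85],
})

# Drop rows that contain missing values
clean = raw.dropna()

# Keep only scores inside the 1.5 * IQR fences (a common heuristic)
q1, q3 = clean["score"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["score"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range].reset_index(drop=True)

print(clean)
```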

  2. Exploratory Data Analysis (EDA)

  • Use graphical and numerical methods to understand the data distribution and relationships.
  • Scatter plots, correlation matrices, and summary statistics are useful tools.
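The numerical side of EDA can be sketched with two pandas calls: describe() for summary statistics and corr() for the correlation matrix. The small dataset below is made up for illustration.

```python
import pandas as pd

# Made-up data: hours studied and exam scores for six students
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 54, 61, 63, 70, 74],
})

# Count, mean, standard deviation, min, quartiles, and max for each column
summary = df.describe()

# Pairwise Pearson correlations between the columns
corr = df.corr()

print(summary)
print(corr)
```

A correlation close to 1 or -1 suggests a strong linear relationship, which is worth confirming visually with a scatter plot before fitting a line.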

  3. Model Fitting

  • Use statistical software to fit the regression model to the data.
  • Estimate the coefficients (\( \beta_0, \beta_1, \ldots, \beta_n \)).

  4. Model Evaluation

  • Assess the model's performance using metrics such as R-squared, Adjusted R-squared, and Root Mean Squared Error (RMSE).
  • Check the assumptions of linear regression: a linear relationship, independent errors, homoscedasticity (constant error variance), and normally distributed errors.
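The metrics named above can be computed as in the following sketch, which uses scikit-learn's r2_score and mean_squared_error on made-up data. Adjusted R-squared is computed by hand as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The residuals computed at the end are the usual starting point for checking the assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Made-up data: one predictor, six observations
X = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)
y = np.array([52, 54, 61, 63, 70, 74], dtype=float)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

n, p = X.shape  # number of observations, number of predictors
r2 = r2_score(y, y_pred)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
rmse = np.sqrt(mean_squared_error(y, y_pred))

# Residuals: plot these against y_pred to look for curvature or changing spread
residuals = y - y_pred

print(f"R-squared: {r2:.4f}")
print(f"Adjusted R-squared: {adjusted_r2:.4f}")
print(f"RMSE: {rmse:.4f}")
```

These values are measured on the same data used for fitting, so they tend to be optimistic; metrics on held-out data give a fairer picture.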

  5. Interpretation

  • Interpret the coefficients to understand the relationship between the dependent and independent variables.
  • Make predictions using the regression equation.
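In multiple regression, each coefficient is read as the change in the predicted value for a one-unit increase in that variable while the other variables are held fixed. The sketch below checks this on made-up, noise-free data and also reproduces a prediction directly from the regression equation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from y = 10 + 2*x1 - 3*x2 with no noise
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = 10 + 2 * X[:, 0] - 3 * X[:, 1]

model = LinearRegression().fit(X, y)

# Increase x1 by one unit while holding x2 fixed
base = np.array([[3.0, 2.0]])
bumped = base + np.array([[1.0, 0.0]])
change = model.predict(bumped)[0] - model.predict(base)[0]

# The same prediction written out as beta_0 + beta_1*x1 + beta_2*x2
manual = model.intercept_ + model.coef_ @ base[0]

print(f"Change in prediction for +1 in x1: {change:.3f}")
print(f"Coefficient of x1: {model.coef_[0]:.3f}")
print(f"Prediction from the equation: {manual:.3f}")
```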

Practical Example

Simple Linear Regression Example

Let's consider a dataset where we want to predict a student's final exam score (Y) based on the number of hours studied (X).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Sample data
hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
exam_scores = np.array([50, 55, 60, 65, 70, 75, 80, 85, 90, 95])

# Create and fit the model
model = LinearRegression()
model.fit(hours_studied, exam_scores)

# Coefficients
intercept = model.intercept_
slope = model.coef_[0]

print(f"Intercept: {intercept}")
print(f"Slope: {slope}")

# Predicting exam scores
predicted_scores = model.predict(hours_studied)

# Plotting the results
plt.scatter(hours_studied, exam_scores, color='blue', label='Actual Scores')
plt.plot(hours_studied, predicted_scores, color='red', label='Fitted Line')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.legend()
plt.show()

Explanation

  • Data Preparation: We have two arrays, hours_studied and exam_scores.
  • Model Fitting: We use LinearRegression from sklearn to fit the model.
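  • Coefficients: The intercept and slope are printed. Because the sample data lie exactly on a line, they come out to approximately 45 and 5: each additional hour of study adds 5 points to the predicted score.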
  • Prediction: We predict the exam scores based on the hours studied.
  • Visualization: A scatter plot of actual scores and a line plot of predicted scores are displayed.

Multiple Linear Regression Example

Consider a dataset where we want to predict a house price (Y) based on its size (X1) and the number of bedrooms (X2).

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Sample data
data = {
    'Size': [1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400],
    'Bedrooms': [3, 3, 3, 4, 4, 4, 5, 5, 5, 5],
    'Price': [300000, 320000, 340000, 360000, 380000, 400000, 420000, 440000, 460000, 480000]
}
df = pd.DataFrame(data)

# Independent variables
X = df[['Size', 'Bedrooms']]
# Dependent variable
Y = df['Price']

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Create and fit the model
model = LinearRegression()
model.fit(X_train, Y_train)

# Coefficients
intercept = model.intercept_
coefficients = model.coef_

print(f"Intercept: {intercept}")
print(f"Coefficients: {coefficients}")

# Predicting house prices
predicted_prices = model.predict(X_test)

# Comparing actual and predicted prices
comparison = pd.DataFrame({'Actual': Y_test, 'Predicted': predicted_prices})
print(comparison)

Explanation

  • Data Preparation: We create a DataFrame with house size, number of bedrooms, and price.
  • Model Fitting: We split the data into training and testing sets and fit the model.
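  • Coefficients: The intercept and the coefficients for size and bedrooms are printed. In this sample every price is exactly 200 times the size, so the Size coefficient is approximately 200 while the Bedrooms coefficient and the intercept are approximately 0.
  • Prediction: We predict house prices for the two houses in the test set (20% of 10 rows) and compare them with the actual prices.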

Exercises

Exercise 1: Simple Linear Regression

Given the following data, fit a simple linear regression model and predict the dependent variable.

Hours Studied | Exam Score
------------- | ----------
2             | 51
3             | 53
5             | 60
7             | 68
8             | 72

Task: Fit a linear regression model and predict the exam score for a student who studied for 6 hours.

Solution

import numpy as np
from sklearn.linear_model import LinearRegression

# Data
hours_studied = np.array([2, 3, 5, 7, 8]).reshape(-1, 1)
exam_scores = np.array([51, 53, 60, 68, 72])

# Create and fit the model
model = LinearRegression()
model.fit(hours_studied, exam_scores)

# Predicting exam score for 6 hours of study
predicted_score = model.predict(np.array([[6]]))
print(f"Predicted Exam Score for 6 hours of study: {predicted_score[0]}")
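As a cross-check on the scikit-learn result, the same prediction can be reproduced from the closed-form least-squares formulas. This is a sketch using only NumPy and the exercise data; the variable names are illustrative.

```python
import numpy as np

# Exercise 1 data
hours = np.array([2, 3, 5, 7, 8], dtype=float)
scores = np.array([51, 53, 60, 68, 72], dtype=float)

# Slope = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2) = 93 / 26
slope = np.sum((hours - hours.mean()) * (scores - scores.mean())) / np.sum(
    (hours - hours.mean()) ** 2
)
# Intercept = y_mean - slope * x_mean = 60.8 - slope * 5
intercept = scores.mean() - slope * hours.mean()

prediction = intercept + slope * 6
print(f"Slope: {slope:.4f}, Intercept: {intercept:.4f}")
print(f"Predicted score for 6 hours: {prediction:.2f}")  # 64.38
```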

Exercise 2: Multiple Linear Regression

Given the following data, fit a multiple linear regression model and predict the house price.

Size (sq ft) | Bedrooms | Price
------------ | -------- | ------
1500         | 3        | 300000
1600         | 3        | 320000
1700         | 3        | 340000
1800         | 4        | 360000
1900         | 4        | 380000

Task: Fit a multiple linear regression model and predict the price of a house with 2000 sq ft and 4 bedrooms.

Solution

import pandas as pd
from sklearn.linear_model import LinearRegression

# Data
data = {
    'Size': [1500, 1600, 1700, 1800, 1900],
    'Bedrooms': [3, 3, 3, 4, 4],
    'Price': [300000, 320000, 340000, 360000, 380000]
}
df = pd.DataFrame(data)

# Independent variables
X = df[['Size', 'Bedrooms']]
# Dependent variable
Y = df['Price']

# Create and fit the model
model = LinearRegression()
model.fit(X, Y)

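# Predicting house price for 2000 sq ft and 4 bedrooms
# Pass a DataFrame with the same column names that were used for fitting
new_house = pd.DataFrame({'Size': [2000], 'Bedrooms': [4]})
predicted_price = model.predict(new_house)
print(f"Predicted House Price for 2000 sq ft and 4 bedrooms: {predicted_price[0]}")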

Common Mistakes and Tips

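  • Overfitting: Keep the number of predictors small relative to the number of observations, and judge the model on held-out data (for example, a test set from train_test_split) rather than only on the data it was fitted to.
  • Assumption Violations: Check linearity, independence of errors, homoscedasticity (constant error variance), and normality of errors. A plot of residuals against fitted values is a quick first check.
  • Multicollinearity: In multiple regression, highly correlated independent variables make the individual coefficient estimates unstable and hard to interpret, even when predictions remain accurate. Check the correlations between predictors before interpreting coefficients.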
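One way to screen for multicollinearity is to look at the correlation between predictors and at the variance inflation factor (VIF), defined as 1 / (1 - R²) where R² comes from regressing one predictor on the others. The sketch below applies this to the Size and Bedrooms columns from the house-price example. The helper vif_for is a hypothetical name written for this illustration, and the cutoffs often quoted for VIF (around 5 or 10) are rules of thumb rather than strict limits.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Predictor columns copied from the house-price example in this module
predictors = pd.DataFrame({
    "Size": [1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400],
    "Bedrooms": [3, 3, 3, 4, 4, 4, 5, 5, 5, 5],
})

# Pearson correlation between the two predictors
correlation = predictors["Size"].corr(predictors["Bedrooms"])

def vif_for(frame, column):
    """VIF = 1 / (1 - R^2), with R^2 from regressing `column` on the other columns."""
    others = frame.drop(columns=[column])
    target = frame[column]
    r_squared = LinearRegression().fit(others, target).score(others, target)
    return 1.0 / (1.0 - r_squared)

vifs = {name: vif_for(predictors, name) for name in predictors.columns}

print(f"Correlation between Size and Bedrooms: {correlation:.3f}")
for name, value in vifs.items():
    print(f"VIF for {name}: {value:.2f}")
```

In this sample the two predictors are strongly correlated, so their individual coefficients should be interpreted with caution.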

Conclusion

Regression analysis is a fundamental tool in statistics for understanding relationships between variables and making predictions. By mastering both simple and multiple linear regression, you can apply these techniques to a wide range of practical problems in various fields.
