Regression analysis is a statistical method for examining the relationship between a dependent variable and one or more independent variables. Of its many forms, linear regression is the most basic and the most commonly used. This module covers the fundamental concepts, methods, and applications of regression analysis.
Key Concepts
- Dependent and Independent Variables
  - Dependent Variable (Y): The outcome, or the variable you are trying to predict or explain.
  - Independent Variable (X): The predictor, or the variable you are using to predict the dependent variable.
- Simple Linear Regression
  - Equation: \( Y = \beta_0 + \beta_1X + \epsilon \)
    - \( Y \): Dependent variable
    - \( X \): Independent variable
    - \( \beta_0 \): Intercept
    - \( \beta_1 \): Slope
    - \( \epsilon \): Error term
- Multiple Linear Regression
  - Equation: \( Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \epsilon \)
    - \( X_1, X_2, \ldots, X_n \): Multiple independent variables
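To make the equations above concrete, here is a minimal sketch of how the intercept and slope can be estimated by ordinary least squares using only NumPy. The data values are made up for illustration, and the variable names are assumptions of this sketch rather than part of the module.

```python
import numpy as np

# Illustrative data (made up for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Simple linear regression: closed-form least-squares estimates
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# The same fit written in matrix form, which also covers multiple regression:
# a column of ones for the intercept, followed by one column per predictor
design = np.column_stack([np.ones_like(x), x])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)

print(f"Closed form: intercept={intercept:.3f}, slope={slope:.3f}")
print(f"Matrix form: intercept={coefficients[0]:.3f}, slope={coefficients[1]:.3f}")
```

Both approaches should give the same estimates. For multiple regression, further predictor columns would be appended to `design`.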
Steps in Regression Analysis
- Data Collection and Preparation
  - Gather data relevant to the variables of interest.
  - Clean the data: handle missing values and outliers, and ensure consistency.
- Exploratory Data Analysis (EDA)
  - Use graphical and numerical methods to understand the data distribution and relationships.
  - Scatter plots, correlation matrices, and summary statistics are useful tools.
- Model Fitting
  - Use statistical software to fit the regression model to the data.
  - Estimate the coefficients (\( \beta_0, \beta_1, \ldots, \beta_n \)).
- Model Evaluation
  - Assess the model's performance using metrics such as R-squared, Adjusted R-squared, and Root Mean Squared Error (RMSE).
  - Check the assumptions of linear regression: linearity, independence of the errors, homoscedasticity (constant error variance), and normality of the errors.
- Interpretation
  - Interpret the coefficients to understand the relationship between the dependent and independent variables.
  - Make predictions using the regression equation.
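The evaluation metrics named in the steps above can be computed along the following lines. This is a sketch under stated assumptions: the small dataset is illustrative, and the metrics are computed on the same data used for fitting, which tends to give an optimistic picture compared with held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative data: hours studied (X) and exam score (y)
X = np.array([2, 3, 5, 7, 8]).reshape(-1, 1)
y = np.array([51, 53, 60, 68, 72])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

n, p = X.shape  # n observations, p predictors

# R-squared: share of the variance in y explained by the model
r2 = r2_score(y, y_pred)
# Adjusted R-squared: penalizes R-squared for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
# RMSE: typical size of a prediction error, in the units of y
rmse = np.sqrt(mean_squared_error(y, y_pred))
# Residuals: the starting point for checking the model assumptions
residuals = y - y_pred

print(f"R-squared: {r2:.4f}")
print(f"Adjusted R-squared: {adj_r2:.4f}")
print(f"RMSE: {rmse:.4f}")
```

Plotting `residuals` against `y_pred` is a common first check for non-linearity and non-constant variance.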
Practical Example
Simple Linear Regression Example
Let's consider a dataset where we want to predict a student's final exam score (Y) based on the number of hours studied (X).

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression

    # Sample data
    hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
    exam_scores = np.array([50, 55, 60, 65, 70, 75, 80, 85, 90, 95])

    # Create and fit the model
    model = LinearRegression()
    model.fit(hours_studied, exam_scores)

    # Coefficients
    intercept = model.intercept_
    slope = model.coef_[0]
    print(f"Intercept: {intercept}")
    print(f"Slope: {slope}")

    # Predicting exam scores
    predicted_scores = model.predict(hours_studied)

    # Plotting the results
    plt.scatter(hours_studied, exam_scores, color='blue', label='Actual Scores')
    plt.plot(hours_studied, predicted_scores, color='red', label='Fitted Line')
    plt.xlabel('Hours Studied')
    plt.ylabel('Exam Score')
    plt.legend()
    plt.show()
Explanation
- Data Preparation: We have two arrays, `hours_studied` and `exam_scores`.
- Model Fitting: We use `LinearRegression` from `sklearn` to fit the model.
- Coefficients: The intercept and slope are printed.
- Prediction: We predict the exam scores based on the hours studied.
- Visualization: A scatter plot of actual scores and a line plot of predicted scores are displayed.
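To connect the fitted coefficients back to the regression equation, the sketch below predicts the score for a new value by hand and compares it with `model.predict`. The value of 7.5 hours is a hypothetical input chosen for illustration. Because the sample data lie exactly on the line 45 + 5 × hours, both predictions should come out at about 82.5.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same sample data as in the example above
hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
exam_scores = np.array([50, 55, 60, 65, 70, 75, 80, 85, 90, 95])

model = LinearRegression().fit(hours_studied, exam_scores)

# Hypothetical new observation (not part of the sample data)
new_hours = 7.5

# Prediction from the regression equation: Y = intercept + slope * X
manual_prediction = model.intercept_ + model.coef_[0] * new_hours
# The same prediction through scikit-learn
sklearn_prediction = model.predict(np.array([[new_hours]]))[0]

print(f"Manual prediction: {manual_prediction:.2f}")
print(f"model.predict: {sklearn_prediction:.2f}")
```

The slope reads as the expected change in exam score for each additional hour studied, and the intercept as the expected score at zero hours, which here lies outside the range of the observed data.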
Multiple Linear Regression Example
Consider a dataset where we want to predict a house price (Y) based on its size (X1) and the number of bedrooms (X2).

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Sample data
    data = {
        'Size': [1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400],
        'Bedrooms': [3, 3, 3, 4, 4, 4, 5, 5, 5, 5],
        'Price': [300000, 320000, 340000, 360000, 380000,
                  400000, 420000, 440000, 460000, 480000]
    }
    df = pd.DataFrame(data)

    # Independent variables
    X = df[['Size', 'Bedrooms']]
    # Dependent variable
    Y = df['Price']

    # Split the data into training and testing sets
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.2, random_state=0
    )

    # Create and fit the model
    model = LinearRegression()
    model.fit(X_train, Y_train)

    # Coefficients
    intercept = model.intercept_
    coefficients = model.coef_
    print(f"Intercept: {intercept}")
    print(f"Coefficients: {coefficients}")

    # Predicting house prices
    predicted_prices = model.predict(X_test)

    # Comparing actual and predicted prices
    comparison = pd.DataFrame({'Actual': Y_test, 'Predicted': predicted_prices})
    print(comparison)
Explanation
- Data Preparation: We create a DataFrame with house size, number of bedrooms, and price.
- Model Fitting: We split the data into training and testing sets and fit the model.
- Coefficients: The intercept and coefficients for size and bedrooms are printed.
- Prediction: We predict house prices for the test set and compare them with actual prices.
Exercises
Exercise 1: Simple Linear Regression
Given the following data, fit a simple linear regression model and predict the dependent variable.
Hours Studied | Exam Score |
---|---|
2 | 51 |
3 | 53 |
5 | 60 |
7 | 68 |
8 | 72 |
Task: Fit a linear regression model and predict the exam score for a student who studied for 6 hours.
Solution

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Data
    hours_studied = np.array([2, 3, 5, 7, 8]).reshape(-1, 1)
    exam_scores = np.array([51, 53, 60, 68, 72])

    # Create and fit the model
    model = LinearRegression()
    model.fit(hours_studied, exam_scores)

    # Predicting exam score for 6 hours of study
    predicted_score = model.predict(np.array([[6]]))
    print(f"Predicted Exam Score for 6 hours of study: {predicted_score[0]}")
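As a hand check on the solution above, the least-squares estimates for the exercise data can be worked out from the standard closed-form formulas for simple linear regression. The hat notation (\( \hat{\beta}_0, \hat{\beta}_1 \)) for estimated coefficients is introduced here for this check and is not used elsewhere in this module.

\[
\bar{X} = \frac{2 + 3 + 5 + 7 + 8}{5} = 5, \qquad \bar{Y} = \frac{51 + 53 + 60 + 68 + 72}{5} = 60.8
\]

\[
\hat{\beta}_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} = \frac{93}{26} \approx 3.577
\]

\[
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \approx 60.8 - 3.577 \times 5 \approx 42.915
\]

\[
\hat{Y}(6) = \hat{\beta}_0 + \hat{\beta}_1 \times 6 \approx 42.915 + 21.462 \approx 64.38
\]

The model's prediction for 6 hours of study should therefore be close to 64.38.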
Exercise 2: Multiple Linear Regression
Given the following data, fit a multiple linear regression model and predict the house price.
Size (sq ft) | Bedrooms | Price |
---|---|---|
1500 | 3 | 300000 |
1600 | 3 | 320000 |
1700 | 3 | 340000 |
1800 | 4 | 360000 |
1900 | 4 | 380000 |
Task: Fit a multiple linear regression model and predict the price of a house with 2000 sq ft and 4 bedrooms.
Solution

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Data
    data = {
        'Size': [1500, 1600, 1700, 1800, 1900],
        'Bedrooms': [3, 3, 3, 4, 4],
        'Price': [300000, 320000, 340000, 360000, 380000]
    }
    df = pd.DataFrame(data)

    # Independent variables
    X = df[['Size', 'Bedrooms']]
    # Dependent variable
    Y = df['Price']

    # Create and fit the model
    model = LinearRegression()
    model.fit(X, Y)

    # Predicting house price for 2000 sq ft and 4 bedrooms
    # (a DataFrame with the same column names as the training data)
    new_house = pd.DataFrame({'Size': [2000], 'Bedrooms': [4]})
    predicted_price = model.predict(new_house)
    print(f"Predicted House Price for 2000 sq ft and 4 bedrooms: {predicted_price[0]}")
Common Mistakes and Tips
- Overfitting: Ensure your model is not too complex for the amount of data you have.
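- Assumption Violations: Check the residuals for violations of linearity, independence, homoscedasticity, and normality.
- Multicollinearity: In multiple regression, check that the independent variables are not highly correlated with one another; strong correlation makes the individual coefficient estimates unstable.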
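One way to screen for multicollinearity is sketched below, using the size and bedroom columns from the house price example. The helper `variance_inflation_factors` is a hypothetical name written for this sketch, not a scikit-learn function. It computes each predictor's variance inflation factor (VIF) as 1 / (1 - R²), where R² comes from regressing that predictor on the others. A VIF above roughly 5 to 10 is often treated as a warning sign, though that threshold is a rule of thumb rather than a fixed standard.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Predictors from the house price example
features = pd.DataFrame({
    'Size': [1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400],
    'Bedrooms': [3, 3, 3, 4, 4, 4, 5, 5, 5, 5],
})

# Pairwise correlation between the two predictors
correlation = features['Size'].corr(features['Bedrooms'])


def variance_inflation_factors(predictors: pd.DataFrame) -> dict:
    """Hypothetical helper: VIF of each column, regressed on the other columns."""
    vifs = {}
    for column in predictors.columns:
        others = predictors.drop(columns=column)
        target = predictors[column]
        r_squared = LinearRegression().fit(others, target).score(others, target)
        # Assumes r_squared < 1, i.e. no predictor is an exact linear
        # combination of the others
        vifs[column] = 1.0 / (1.0 - r_squared)
    return vifs


vifs = variance_inflation_factors(features)
print(f"Correlation between Size and Bedrooms: {correlation:.3f}")
for name, vif in vifs.items():
    print(f"VIF for {name}: {vif:.2f}")
```

In this example data the two predictors are strongly correlated, so the separate coefficients for size and bedrooms should be interpreted with caution.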
Conclusion
Regression analysis is a fundamental tool in statistics for understanding relationships between variables and making predictions. By mastering both simple and multiple linear regression, you can apply these techniques to a wide range of practical problems in various fields.