Introduction
This section covers two fundamental regression techniques used in data modeling: linear regression, which predicts a continuous outcome, and logistic regression, which predicts the probability of a binary outcome. Both are used to predict outcomes and to quantify relationships between variables.
Linear Regression
Key Concepts
- Definition: Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
- Equation: The linear regression equation is typically written as:
\[
Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \epsilon
\]
where:
- \(Y\) is the dependent variable.
- \(\beta_0\) is the intercept.
- \(\beta_1, \beta_2, \ldots, \beta_n\) are the coefficients.
- \(X_1, X_2, \ldots, X_n\) are the independent variables.
- \(\epsilon\) is the error term.
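To make the equation concrete, the short sketch below evaluates its deterministic part, \(\beta_0 + \beta_1X_1 + \ldots + \beta_nX_n\), for a single observation. The function name `linear_prediction` and the coefficient values are hypothetical and chosen only for illustration; the error term \(\epsilon\) is left out because it is not observed when making a prediction.

```python
def linear_prediction(intercept, coefficients, features):
    """Return intercept + sum(coefficient_i * feature_i) for one observation."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical model with two independent variables (illustrative values only).
beta_0 = 1.0
betas = [2.0, 3.0]
observation = [4.0, 5.0]

y_hat = linear_prediction(beta_0, betas, observation)
print(y_hat)  # 1.0 + 2.0*4.0 + 3.0*5.0 = 24.0
```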
Example
Let's consider a simple example where we predict the price of a house based on its size.
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression

    # Sample data
    data = {'Size': [1500, 1600, 1700, 1800, 1900],
            'Price': [300000, 320000, 340000, 360000, 380000]}
    df = pd.DataFrame(data)

    # Independent variable (Size)
    X = df[['Size']]

    # Dependent variable (Price)
    Y = df['Price']

    # Create and fit the linear regression model
    model = LinearRegression()
    model.fit(X, Y)

    # Predict prices
    predicted_prices = model.predict(X)

    # Plot the results
    plt.scatter(X, Y, color='blue')
    plt.plot(X, predicted_prices, color='red')
    plt.xlabel('Size (sq ft)')
    plt.ylabel('Price ($)')
    plt.title('Linear Regression: House Price Prediction')
    plt.show()
Explanation
- Data Preparation: We create a DataFrame with house sizes and prices.
- Model Creation: We use `LinearRegression` from `sklearn` to create and fit the model.
- Prediction: We predict house prices based on the model.
- Visualization: We plot the actual data points and the regression line.
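For a single independent variable, an ordinary least squares fit has a simple closed form, so the steps above can be cross-checked by hand. The sketch below is an illustration using only the standard library, not the scikit-learn implementation; the helper name `fit_simple_linear` and the 2,000 sq ft query size are assumptions added here. Its slope and intercept should agree with `model.coef_` and `model.intercept_` up to floating-point rounding.

```python
# Same sample data as the house price example.
sizes = [1500, 1600, 1700, 1800, 1900]
prices = [300000, 320000, 340000, 360000, 380000]


def fit_simple_linear(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x), written as sums of deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_simple_linear(sizes, prices)
print(f"Price = {intercept:.2f} + {slope:.2f} * Size")

# Hypothetical query: a 2,000 sq ft house (outside the observed range).
predicted = intercept + slope * 2000
print(f"Predicted price: {predicted:.2f}")
```

Because the sample prices lie exactly on a line, the fit here is perfect. With real data the points scatter around the line, and predictions outside the observed size range should be treated with caution.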
Practical Exercise
Exercise: Use linear regression to predict the weight of a person based on their height. Use the following data:
| Height (inches) | Weight (pounds) |
|---|---|
| 60 | 115 |
| 62 | 120 |
| 64 | 130 |
| 66 | 140 |
| 68 | 150 |
Solution:
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression

    # Sample data
    data = {'Height': [60, 62, 64, 66, 68],
            'Weight': [115, 120, 130, 140, 150]}
    df = pd.DataFrame(data)

    # Independent variable (Height)
    X = df[['Height']]

    # Dependent variable (Weight)
    Y = df['Weight']

    # Create and fit the linear regression model
    model = LinearRegression()
    model.fit(X, Y)

    # Predict weights
    predicted_weights = model.predict(X)

    # Plot the results
    plt.scatter(X, Y, color='blue')
    plt.plot(X, predicted_weights, color='red')
    plt.xlabel('Height (inches)')
    plt.ylabel('Weight (pounds)')
    plt.title('Linear Regression: Weight Prediction')
    plt.show()
Logistic Regression
Key Concepts
- Definition: Logistic regression is used for binary classification problems. It models the probability that a given input belongs to a particular category.
- Equation: The logistic regression equation is:
\[
P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n)}}
\]
where:
- \(P(Y=1)\) is the probability that the dependent variable \(Y\) equals 1.
- \(\beta_0, \beta_1, \ldots, \beta_n\) are the coefficients.
- \(X_1, X_2, \ldots, X_n\) are the independent variables.
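The equation can be evaluated directly: compute the linear combination \(\beta_0 + \beta_1X_1 + \ldots + \beta_nX_n\) and pass it through the logistic (sigmoid) function. The sketch below is a minimal illustration; the function names and the coefficient values are hypothetical, not taken from a fitted model.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_probability(intercept, coefficients, features):
    """Return P(Y=1) for one observation under the logistic regression equation."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical coefficients for a model with one independent variable.
p = logistic_probability(-4.5, [1.0], [6.0])  # linear part: -4.5 + 1.0*6.0 = 1.5
print(round(p, 3))

# When the linear part is exactly 0, the probability is exactly 0.5.
print(sigmoid(0.0))
```

Whatever the value of the linear part, the output stays between 0 and 1, which is why it can be read as a probability.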
Example
Let's consider an example where we predict whether a student will pass or fail based on their study hours.
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression

    # Sample data
    data = {'Hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
            'Pass': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]}
    df = pd.DataFrame(data)

    # Independent variable (Hours)
    X = df[['Hours']]

    # Dependent variable (Pass)
    Y = df['Pass']

    # Create and fit the logistic regression model
    model = LogisticRegression()
    model.fit(X, Y)

    # Predict probabilities of the positive class (Pass = 1)
    predicted_probabilities = model.predict_proba(X)[:, 1]

    # Plot the results
    plt.scatter(X, Y, color='blue')
    plt.plot(X, predicted_probabilities, color='red')
    plt.xlabel('Hours Studied')
    plt.ylabel('Probability of Passing')
    plt.title('Logistic Regression: Pass/Fail Prediction')
    plt.show()
Explanation
- Data Preparation: We create a DataFrame with study hours and pass/fail outcomes.
- Model Creation: We use `LogisticRegression` from `sklearn` to create and fit the model.
- Prediction: We predict the probabilities of passing based on the model.
- Visualization: We plot the actual data points and the logistic curve.
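Predicted probabilities are usually turned into pass/fail labels by applying a threshold, commonly 0.5. With one independent variable, the probability equals 0.5 where \(\beta_0 + \beta_1X_1 = 0\), that is, at \(X_1 = -\beta_0 / \beta_1\). The sketch below illustrates this with hypothetical coefficients (`BETA_0` and `BETA_1` are assumed values, not the ones scikit-learn would fit) applied to the study-hours data.

```python
import math

# Same sample data as the pass/fail example.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
passed = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Hypothetical coefficients, chosen for illustration only (not fitted values).
BETA_0 = -4.5
BETA_1 = 1.0


def pass_probability(h):
    """P(Pass=1) for h study hours under the assumed coefficients."""
    return 1.0 / (1.0 + math.exp(-(BETA_0 + BETA_1 * h)))


# Hours at which the predicted probability crosses 0.5.
boundary = -BETA_0 / BETA_1
print(f"Decision boundary: {boundary} hours")

# Label each student as pass (1) when the probability is at least 0.5.
predicted = [1 if pass_probability(h) >= 0.5 else 0 for h in hours]
print(predicted == passed)
```

A fitted model will generally place the boundary somewhere between 4 and 5 hours for this data, since that is where the observed outcomes switch from fail to pass.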
Practical Exercise
Exercise: Use logistic regression to predict whether a person will buy a product based on their age. Use the following data:
| Age | Buy (1=Yes, 0=No) |
|---|---|
| 22 | 0 |
| 25 | 0 |
| 28 | 0 |
| 30 | 1 |
| 35 | 1 |
| 40 | 1 |
| 45 | 1 |
| 50 | 1 |
Solution:
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression

    # Sample data
    data = {'Age': [22, 25, 28, 30, 35, 40, 45, 50],
            'Buy': [0, 0, 0, 1, 1, 1, 1, 1]}
    df = pd.DataFrame(data)

    # Independent variable (Age)
    X = df[['Age']]

    # Dependent variable (Buy)
    Y = df['Buy']

    # Create and fit the logistic regression model
    model = LogisticRegression()
    model.fit(X, Y)

    # Predict probabilities of the positive class (Buy = 1)
    predicted_probabilities = model.predict_proba(X)[:, 1]

    # Plot the results
    plt.scatter(X, Y, color='blue')
    plt.plot(X, predicted_probabilities, color='red')
    plt.xlabel('Age')
    plt.ylabel('Probability of Buying')
    plt.title('Logistic Regression: Purchase Prediction')
    plt.show()
Conclusion
In this section, we covered linear regression, which models a continuous outcome, and logistic regression, which models the probability of a binary outcome. For each technique, we presented the model equation, a worked example, and an exercise with a solution. Both serve as a starting point for making predictions and informed decisions based on data.