Introduction

This section covers two fundamental regression techniques used in data modeling: linear regression, which predicts a continuous outcome, and logistic regression, which predicts the probability of a binary outcome. Both are essential for predicting outcomes and understanding relationships between variables.

Linear Regression

Key Concepts

  1. Definition: Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
  2. Equation: The linear regression equation is typically written as: \[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \epsilon \] where:
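    • \(Y\) is the dependent variable (the value being predicted).
    • \(\beta_0\) is the intercept, the expected value of \(Y\) when every \(X_i\) is zero.
    • \(\beta_1, \beta_2, \ldots, \beta_n\) are the coefficients; each \(\beta_i\) is the expected change in \(Y\) for a one-unit increase in \(X_i\), with the other variables held fixed.
    • \(X_1, X_2, \ldots, X_n\) are the independent variables.
    • \(\epsilon\) is the error term, the part of \(Y\) that the linear equation does not explain.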
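
To make the equation concrete, here is a minimal sketch that estimates \(\beta_0\) and \(\beta_1\) by ordinary least squares with NumPy. The data values are made up for illustration and follow \(Y = 1 + 2X\) exactly, so the error term is zero; real data would not fit this cleanly.

```python
import numpy as np

# Hypothetical data that follows Y = 1 + 2*X exactly (no error term)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Design matrix: a column of ones for the intercept, then the X values
A = np.column_stack([np.ones_like(x), x])

# Solve for the coefficients that minimize the sum of squared errors
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
beta_0, beta_1 = beta

print(f"intercept = {beta_0:.2f}, slope = {beta_1:.2f}")
```

The column of ones is what lets the solver estimate the intercept alongside the slope; with more independent variables, each one would become an extra column of the design matrix.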

Example

Let's consider a simple example where we predict the price of a house based on its size.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Sample data
data = {'Size': [1500, 1600, 1700, 1800, 1900],
        'Price': [300000, 320000, 340000, 360000, 380000]}
df = pd.DataFrame(data)

# Independent variable (Size)
X = df[['Size']]
# Dependent variable (Price)
Y = df['Price']

# Create linear regression model
model = LinearRegression()
model.fit(X, Y)

# Predict prices
predicted_prices = model.predict(X)

# Plot the results
plt.scatter(X, Y, color='blue')
plt.plot(X, predicted_prices, color='red')
plt.xlabel('Size (sq ft)')
plt.ylabel('Price ($)')
plt.title('Linear Regression: House Price Prediction')
plt.show()

Explanation

  • Data Preparation: We create a DataFrame with house sizes and prices.
  • Model Creation: We use LinearRegression from sklearn to create and fit the model.
  • Prediction: We call predict on the same sizes used for training, which returns the fitted prices that lie on the regression line.
  • Visualization: We plot the actual data points and the regression line.
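
Once the model is fitted, the intercept and coefficient from the linear regression equation can be read off the model and used for a new prediction. The sketch below refits the same house data; the 2,000 sq ft house is a hypothetical input that is not in the original data and lies outside the training range, so the result is an extrapolation.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same sample data as in the example above
df = pd.DataFrame({'Size': [1500, 1600, 1700, 1800, 1900],
                   'Price': [300000, 320000, 340000, 360000, 380000]})
model = LinearRegression().fit(df[['Size']], df['Price'])

# Fitted parameters of Price = intercept + slope * Size
slope = model.coef_[0]
intercept = model.intercept_
print(f"Price = {intercept:.2f} + {slope:.2f} * Size")

# Predict the price of a hypothetical 2,000 sq ft house
new_house = pd.DataFrame({'Size': [2000]})
predicted_price = model.predict(new_house)[0]
print(f"Predicted price: {predicted_price:.0f}")
```

Because the sample prices are exactly 200 times the size, the fitted slope comes out at about 200 dollars per square foot with an intercept near zero.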

Practical Exercise

Exercise: Use linear regression to predict the weight of a person based on their height. Use the following data:

Height (inches)    Weight (pounds)
60                 115
62                 120
64                 130
66                 140
68                 150

Solution:

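# Imports (repeated so the solution runs on its own)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Sample data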
data = {'Height': [60, 62, 64, 66, 68],
        'Weight': [115, 120, 130, 140, 150]}
df = pd.DataFrame(data)

# Independent variable (Height)
X = df[['Height']]
# Dependent variable (Weight)
Y = df['Weight']

# Create linear regression model
model = LinearRegression()
model.fit(X, Y)

# Predict weights
predicted_weights = model.predict(X)

# Plot the results
plt.scatter(X, Y, color='blue')
plt.plot(X, predicted_weights, color='red')
plt.xlabel('Height (inches)')
plt.ylabel('Weight (pounds)')
plt.title('Linear Regression: Weight Prediction')
plt.show()

Logistic Regression

Key Concepts

  1. Definition: Logistic regression is used for binary classification problems. It models the probability that a given input belongs to a particular category.
  2. Equation: The logistic regression equation is: \[ P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n)}} \] where:
    • \(P(Y=1)\) is the probability that the dependent variable \(Y\) equals 1.
    • \(\beta_0\) is the intercept and \(\beta_1, \ldots, \beta_n\) are the coefficients.
    • \(X_1, X_2, \ldots, X_n\) are the independent variables.
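
The logistic equation can be evaluated directly to see how it turns a linear combination into a probability. This is a minimal sketch for a single independent variable; the coefficient values \(\beta_0 = -4\) and \(\beta_1 = 1\) are hypothetical, chosen only so that the probability crosses 0.5 at \(X = 4\).

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for illustration only
beta_0, beta_1 = -4.0, 1.0

# Evaluate P(Y=1) at three values of X
x = np.array([0.0, 4.0, 8.0])
p = sigmoid(beta_0 + beta_1 * x)

for xi, pi in zip(x, p):
    print(f"X = {xi:.0f}  ->  P(Y=1) = {pi:.3f}")
```

When the linear part \(\beta_0 + \beta_1X\) is zero the probability is exactly 0.5; large negative values push it toward 0 and large positive values push it toward 1.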

Example

Let's consider an example where we predict whether a student will pass or fail based on their study hours.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Sample data
data = {'Hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Pass': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)

# Independent variable (Hours)
X = df[['Hours']]
# Dependent variable (Pass)
Y = df['Pass']

# Create logistic regression model
model = LogisticRegression()
model.fit(X, Y)

# Predict probabilities
predicted_probabilities = model.predict_proba(X)[:, 1]

# Plot the results
plt.scatter(X, Y, color='blue')
plt.plot(X, predicted_probabilities, color='red')
plt.xlabel('Hours Studied')
plt.ylabel('Probability of Passing')
plt.title('Logistic Regression: Pass/Fail Prediction')
plt.show()

Explanation

  • Data Preparation: We create a DataFrame with study hours and pass/fail outcomes.
  • Model Creation: We use LogisticRegression from sklearn to create and fit the model.
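  • Prediction: We use predict_proba to get the probability of passing for each value of study hours; column 1 holds the probability of class 1 (pass).
  • Visualization: We plot the actual outcomes (0 or 1) and the predicted probabilities, which trace out the S-shaped logistic curve.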
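
Besides probabilities, the fitted model can return class labels directly. The sketch below refits the same study-hours data and classifies two hypothetical students (2 and 9 hours of study); for a binary model, predict assigns class 1 when the predicted probability is above 0.5.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Same sample data as in the example above
df = pd.DataFrame({'Hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Pass': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
model = LogisticRegression().fit(df[['Hours']], df['Pass'])

# Two hypothetical students: one studied 2 hours, the other 9 hours
new_students = pd.DataFrame({'Hours': [2, 9]})

# Probability of passing, and the resulting class label (0 = fail, 1 = pass)
probs = model.predict_proba(new_students)[:, 1]
labels = model.predict(new_students)

for hours, prob, label in zip(new_students['Hours'], probs, labels):
    print(f"{hours} hours: P(pass) = {prob:.2f}, predicted class = {label}")
```

Note that scikit-learn's LogisticRegression applies L2 regularization by default, so the probabilities are somewhat less extreme than an unregularized fit would give.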

Practical Exercise

Exercise: Use logistic regression to predict whether a person will buy a product based on their age. Use the following data:

Age    Buy (1=Yes, 0=No)
22     0
25     0
28     0
30     1
35     1
40     1
45     1
50     1

Solution:

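# Imports (repeated so the solution runs on its own)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Sample data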
data = {'Age': [22, 25, 28, 30, 35, 40, 45, 50],
        'Buy': [0, 0, 0, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)

# Independent variable (Age)
X = df[['Age']]
# Dependent variable (Buy)
Y = df['Buy']

# Create logistic regression model
model = LogisticRegression()
model.fit(X, Y)

# Predict probabilities
predicted_probabilities = model.predict_proba(X)[:, 1]

# Plot the results
plt.scatter(X, Y, color='blue')
plt.plot(X, predicted_probabilities, color='red')
plt.xlabel('Age')
plt.ylabel('Probability of Buying')
plt.title('Logistic Regression: Purchase Prediction')
plt.show()
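
As a follow-up to the solution, the fitted coefficients can be used to find the age at which the predicted probability of buying reaches 0.5. Setting \(P(Y=1) = 0.5\) in the logistic equation gives \(\beta_0 + \beta_1X = 0\), so the boundary is \(X = -\beta_0 / \beta_1\). This is a sketch under assumptions: max_iter=1000 is a choice made here to give the solver room to converge on unscaled ages, not something the exercise specifies.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Same exercise data as above
df = pd.DataFrame({'Age': [22, 25, 28, 30, 35, 40, 45, 50],
                   'Buy': [0, 0, 0, 1, 1, 1, 1, 1]})

# max_iter raised from the default as a precaution (an assumption of this sketch)
model = LogisticRegression(max_iter=1000).fit(df[['Age']], df['Buy'])

# Fitted intercept and coefficient of the logistic equation
beta_0 = model.intercept_[0]
beta_1 = model.coef_[0][0]

# Age at which P(Buy=1) = 0.5, from beta_0 + beta_1 * Age = 0
boundary_age = -beta_0 / beta_1
print(f"Predicted probability of buying crosses 0.5 at age {boundary_age:.1f}")
```

People older than the boundary age are classified as buyers and younger people as non-buyers; with this data the boundary should fall near the gap between the last non-buyer (28) and the first buyer (30).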

Conclusion

In this section, we explored the concepts of Linear and Logistic Regression, two powerful techniques for data modeling. We covered the basic equations, provided practical examples, and included exercises to reinforce the concepts. Understanding these regression techniques is crucial for making predictions and informed decisions based on data.
