Introduction to Logistic Regression

Logistic regression is a statistical method for modeling the relationship between one or more independent (predictor) variables and a dichotomous outcome, that is, an outcome with only two possible values. It is used to predict the probability of that binary outcome from the values of the predictors.

Key Concepts

Binary Outcome: The dependent variable in logistic regression is binary, meaning it has two possible outcomes (e.g., success/failure, yes/no, 0/1).
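Logit Function: The logit (log-odds) function links the probability of the binary outcome to a linear combination of the predictors. Its inverse, the logistic (sigmoid) function, maps that linear combination back to a probability between 0 and 1.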
Odds and Odds Ratio: The odds represent the ratio of the probability of the event occurring to the probability of the event not occurring. The odds ratio compares the odds of the event occurring in different groups.
Maximum Likelihood Estimation (MLE): Logistic regression parameters are estimated using MLE, which finds the parameter values that maximize the likelihood of observing the given sample data.
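
To make the odds, odds ratio, and logit definitions concrete, here is a minimal sketch in plain Python. The two pass rates are hypothetical numbers chosen only for illustration, and the helper names odds and logit are ours, not part of any library.

```python
import math

def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds: the natural logarithm of the odds."""
    return math.log(odds(p))

# Hypothetical pass rates for two groups of students (illustrative only)
p_group_a = 0.8   # e.g. students who attended a review session
p_group_b = 0.5   # e.g. students who did not

odds_a = odds(p_group_a)        # 0.8 / 0.2, i.e. about 4 to 1
odds_b = odds(p_group_b)        # 0.5 / 0.5, i.e. 1 to 1
odds_ratio = odds_a / odds_b    # how many times larger the odds are in group A

print("Odds A:", round(odds_a, 4))
print("Odds B:", round(odds_b, 4))
print("Odds ratio (A vs B):", round(odds_ratio, 4))
print("logit(0.5):", logit(0.5))
```

An odds ratio above 1 means the event is more likely in the first group. A probability of 0.5 corresponds to odds of 1 and a logit of 0.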

Logistic Regression Equation

The logistic regression model can be represented as:

\[ \text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n \]

Where:

\( p \) is the probability of the event occurring.
\( \beta_0 \) is the intercept.
\( \beta_1, \beta_2, \ldots, \beta_n \) are the coefficients of the predictor variables \( X_1, X_2, \ldots, X_n \).
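
Solving the equation above for \( p \) gives \( p = 1 / (1 + e^{-z}) \), where \( z = \beta_0 + \beta_1X_1 + \ldots + \beta_nX_n \). The following is a rough sketch of how maximum likelihood estimation can find the coefficients, using plain gradient ascent on the log-likelihood with NumPy. The toy data, the learning rate, and the iteration count are assumptions made for illustration; libraries such as scikit-learn use more sophisticated solvers.

```python
import numpy as np

def sigmoid(z):
    """Inverse of the logit: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of labels y under coefficients beta."""
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data with overlapping classes, so the MLE is finite
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])
X = np.column_stack([np.ones_like(x), x])   # column of ones multiplies beta_0

beta = np.zeros(2)                          # start from beta_0 = beta_1 = 0
initial_ll = log_likelihood(beta, X, y)

# Gradient ascent: the gradient of the log-likelihood is X^T (y - p)
learning_rate = 0.01                        # assumed step size
for _ in range(20000):
    p = sigmoid(X @ beta)
    gradient = X.T @ (y - p)
    beta += learning_rate * gradient

final_ll = log_likelihood(beta, X, y)
final_gradient = X.T @ (y - sigmoid(X @ beta))

print("Estimated [beta_0, beta_1]:", np.round(beta, 3))
print("Log-likelihood before and after:", initial_ll, final_ll)
```

At the maximum the gradient is close to zero, which is what the loop drives toward. If the classes were perfectly separable, the unregularized coefficients would grow without bound, which is one reason regularization is commonly applied.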

Example

Let's consider a simple example where we want to predict whether a student will pass or fail an exam based on the number of hours studied.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Sample data
data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}

df = pd.DataFrame(data)

# Splitting the data into training and testing sets
X = df[['Hours_Studied']]
y = df['Passed']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Creating and training the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print("Confusion Matrix:\n", conf_matrix)
print("Accuracy:", accuracy)

# Plotting the logistic regression curve
plt.scatter(X, y, color='red')
plt.plot(X, model.predict_proba(X)[:, 1], color='blue')
plt.xlabel('Hours Studied')
plt.ylabel('Probability of Passing')
plt.title('Logistic Regression Curve')
plt.show()

Explanation

Data Preparation: We create a simple dataset with two columns: Hours_Studied and Passed.
Train-Test Split: We split the data into training and testing sets.
Model Training: We create a logistic regression model and train it using the training data.
Predictions: We use the trained model to make predictions on the test data.
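Evaluation: We evaluate the model using a confusion matrix and accuracy score. With 10 rows and test_size=0.2, the test set holds only two samples, so these numbers are illustrative rather than a reliable estimate of performance.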
Visualization: We plot the logistic regression curve to visualize the relationship between hours studied and the probability of passing.
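
As a follow-up to the steps above, this sketch shows how the fitted parameters relate back to the model equation. It refits the same toy data on all ten rows (an assumption made here only so the parameters are easy to inspect) and checks that predict_proba matches the probability computed by hand. The 4.5-hour student is a hypothetical input.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Same toy data as in the example above
df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
X = df[['Hours_Studied']]
y = df['Passed']

# Fit on all rows, purely to inspect the fitted parameters
model = LogisticRegression()
model.fit(X, y)

beta_0 = model.intercept_[0]   # intercept
beta_1 = model.coef_[0][0]     # coefficient for Hours_Studied

# Hypothetical new student who studied 4.5 hours
new_student = pd.DataFrame({'Hours_Studied': [4.5]})
p_sklearn = model.predict_proba(new_student)[0, 1]

# Same probability computed by hand from the model equation
p_manual = 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * 4.5)))

# Number of hours at which the predicted probability crosses 0.5
decision_boundary = -beta_0 / beta_1

print("Intercept:", beta_0)
print("Coefficient:", beta_1)
print("P(pass | 4.5 hours):", p_sklearn, "manual:", p_manual)
print("Decision boundary (hours):", decision_boundary)
```

model.predict returns class 1 whenever this probability is at least 0.5, which happens for hours above the decision boundary.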

Practical Exercises

Exercise 1: Implement Logistic Regression on a Different Dataset

Task: Use the Titanic dataset to predict the survival of passengers based on features like age, sex, and class.

Solution:

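import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score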

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Data preprocessing
titanic = titanic[['survived', 'pclass', 'sex', 'age', 'fare']].dropna()
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1})

# Splitting the data into training and testing sets
X = titanic[['pclass', 'sex', 'age', 'fare']]
y = titanic['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardizing the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Creating and training the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print("Confusion Matrix:\n", conf_matrix)
print("Accuracy:", accuracy)

Common Mistakes and Tips

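Ignoring Multicollinearity: Check that predictor variables are not highly correlated with each other. Strongly correlated predictors make the coefficient estimates unstable and hard to interpret.
Feature Scaling: Standardize features when they have different scales. Regularization penalizes coefficient size, so unscaled features are penalized unevenly, and solvers tend to converge more slowly.
Overfitting: Use regularization techniques like L1 (Lasso) or L2 (Ridge) to prevent overfitting. scikit-learn's LogisticRegression applies L2 regularization by default, with strength controlled by the C parameter (smaller C means stronger regularization).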
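Interpreting Coefficients: Each coefficient is the change in the log-odds of the outcome for a one-unit increase in its predictor, holding the other predictors fixed. Exponentiating a coefficient gives the corresponding odds ratio; a coefficient is not a direct change in probability.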
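
The tips above can be sketched on a small synthetic dataset. Everything about the data here is an assumption made for illustration: the column names hours, income, and hours_copy are hypothetical, and the two C values are arbitrary choices picked to contrast strong and weak regularization.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: two informative features on very different
# scales, plus a third that is almost a copy of the first
rng = np.random.default_rng(0)
n = 500
hours = rng.normal(5, 2, n)
income = rng.normal(50000, 15000, n)
hours_copy = hours + rng.normal(0, 0.1, n)
true_logit = 0.8 * (hours - 5) + 0.00005 * (income - 50000)
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))
X = pd.DataFrame({'hours': hours, 'income': income, 'hours_copy': hours_copy})

# 1. Multicollinearity: look for pairs with |correlation| close to 1
corr = X.corr()
print(corr.round(3))
X_reduced = X.drop(columns=['hours_copy'])   # drop the redundant feature

# 2 and 3. Scale the features, then fit with L2 regularization.
# In scikit-learn, a smaller C means stronger regularization.
strong = make_pipeline(StandardScaler(), LogisticRegression(C=0.01))
weak = make_pipeline(StandardScaler(), LogisticRegression(C=100.0))
strong.fit(X_reduced, y)
weak.fit(X_reduced, y)

coef_strong = strong.named_steps['logisticregression'].coef_[0]
coef_weak = weak.named_steps['logisticregression'].coef_[0]
print("Coefficients with C=0.01:", coef_strong)
print("Coefficients with C=100:", coef_weak)

# 4. exp(coefficient) is the odds ratio for a one-unit increase in the
# feature; after standardizing, one unit is one standard deviation
odds_ratios = np.exp(coef_weak)
print("Odds ratios per standard deviation:", odds_ratios)
```

Stronger regularization shrinks the coefficients toward zero. Putting the scaler inside the pipeline means it is fitted only on the data passed to fit, which avoids leaking test-set statistics when the pipeline is used with a train-test split.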

Conclusion

Logistic Regression is a powerful and widely used technique for binary classification problems. By understanding its key concepts, equation, and practical implementation, you can effectively apply logistic regression to various datasets. In the next topic, we will explore Decision Trees, another popular supervised learning algorithm.
