Introduction

In this project, we will build a machine learning model that predicts housing prices from features such as the number of bedrooms, square footage, and location. The project applies the concepts from the previous modules: data preprocessing, model training, evaluation, and deployment.

Objectives

  1. Understand the problem and the dataset.
  2. Perform data preprocessing.
  3. Train different machine learning models.
  4. Evaluate the models.
  5. Select the best model and fine-tune it.
  6. Deploy the model.

Step 1: Understanding the Problem and the Dataset

Problem Statement

We aim to predict the prices of houses based on various features. This is a regression problem where the target variable is continuous.

Dataset

We will use a dataset that contains information about various houses. The dataset includes features such as:

  • Number of bedrooms
  • Number of bathrooms
  • Square footage
  • Location (latitude and longitude)
  • Year built
  • Lot size

Sample Data

Bedrooms  Bathrooms  Square Footage  Location (Lat, Long)  Year Built  Lot Size  Price
3         2          1500            (37.77, -122.42)      1990        5000      750000
4         3          2000            (37.78, -122.43)      2000        6000      850000
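
To experiment without the full CSV file, the two sample rows above can be placed in a small DataFrame. The column names below, including the separate Latitude and Longitude columns, are assumptions chosen to mirror the table headers; the real file may name them differently.

```python
import pandas as pd

# Two sample rows from the table above.
# Column names are assumed for illustration, not taken from the real CSV.
sample = pd.DataFrame({
    "Bedrooms": [3, 4],
    "Bathrooms": [2, 3],
    "Square Footage": [1500, 2000],
    "Latitude": [37.77, 37.78],
    "Longitude": [-122.42, -122.43],
    "Year Built": [1990, 2000],
    "Lot Size": [5000, 6000],
    "Price": [750000, 850000],
})

# Inspect column types and basic statistics.
print(sample.dtypes)
print(sample.describe())
```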

Step 2: Data Preprocessing

Loading the Data

import pandas as pd

# Load the dataset
data = pd.read_csv('housing_data.csv')
print(data.head())

Handling Missing Data

# Check for missing values
print(data.isnull().sum())

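# Fill missing values with the previous row's value (forward fill).
# fillna(method='ffill') is deprecated in recent pandas versions; use ffill() instead.
data = data.ffill()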
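
Forward fill copies the value from the previous row, which suits ordered data such as time series. When each row is an independent house, filling a numeric column with its median is a common alternative. The sketch below uses a small made-up frame for illustration, not the project dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps, for illustration only.
toy = pd.DataFrame({
    "Bedrooms": [3, np.nan, 4, 2],
    "Lot Size": [5000, 6000, np.nan, 4000],
})

# Compute the median of each numeric column, then fill the gaps with it.
medians = toy.median(numeric_only=True)
filled = toy.fillna(medians)

print(filled)
```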

Data Transformation

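# Location holds coordinates rather than categories, so split it into numeric columns
# (assumes the values are stored as strings such as "(37.77, -122.42)")
data[['Latitude', 'Longitude']] = data['Location'].str.strip('()').str.split(',', expand=True).astype(float)
data = data.drop(columns='Location')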

Normalization and Standardization

from sklearn.preprocessing import StandardScaler

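# Standardize the size columns to zero mean and unit variance.
# For an unbiased test score, the scaler should be fitted on the training split only.
scaler = StandardScaler()
data[['Square Footage', 'Lot Size']] = scaler.fit_transform(data[['Square Footage', 'Lot Size']])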
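
Standardization rescales each column as z = (x - mean) / std, so the transformed column has mean 0 and unit variance. As a quick check on made-up values, the sketch below compares StandardScaler with the manual formula; the scaler uses the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up square-footage values, for illustration only.
sqft = np.array([[1500.0], [2000.0], [2500.0]])

scaled = StandardScaler().fit_transform(sqft)

# Manual z-score using the population standard deviation (ddof=0).
manual = (sqft - sqft.mean(axis=0)) / sqft.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```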

Step 3: Train Different Machine Learning Models

Splitting the Data

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X = data.drop('Price', axis=1)
y = data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
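
Scaling before the split lets statistics from the test rows influence the training features. One way to avoid this, sketched here on synthetic data rather than the project dataset, is to put the scaler and the model in a scikit-learn Pipeline so the scaler is fitted on the training split only. The column subset and the generated prices below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; values are random, not real housing records.
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "Bedrooms": rng.integers(1, 6, n),
    "Square Footage": rng.uniform(800, 3500, n),
    "Lot Size": rng.uniform(2000, 10000, n),
})
demo["Price"] = (
    200 * demo["Square Footage"] + 20 * demo["Lot Size"] + rng.normal(0, 10000, n)
)

X_demo = demo.drop("Price", axis=1)
y_demo = demo["Price"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scale only the two size columns; pass the remaining columns through unchanged.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["Square Footage", "Lot Size"])],
    remainder="passthrough",
)
pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

# fit() learns the scaling statistics from the training split only.
pipe.fit(X_tr, y_tr)
print(f"Test R2: {pipe.score(X_te, y_te):.3f}")
```

Because the pipeline applies the same fitted scaler inside predict(), new rows can be passed in unscaled.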

Linear Regression

from sklearn.linear_model import LinearRegression

# Train the model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict on the test set (metrics are computed in Step 4)
lr_pred = lr_model.predict(X_test)

Decision Tree

from sklearn.tree import DecisionTreeRegressor

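# Train the model (random_state fixed so results are reproducible)
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

# Predict on the test set (metrics are computed in Step 4)
dt_pred = dt_model.predict(X_test)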

Random Forest

from sklearn.ensemble import RandomForestRegressor

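# Train the model (random_state fixed so results are reproducible)
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)

# Predict on the test set (metrics are computed in Step 4)
rf_pred = rf_model.predict(X_test)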
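
A single train/test split can favor one model by chance. As a complementary check, the three regressors can be compared with k-fold cross-validation. The sketch below uses synthetic data from make_regression as a stand-in for the housing features, and the choice of 5 folds and 50 trees is an assumption for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=10.0, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# 5-fold cross-validated R2 for each model.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```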

Step 4: Evaluate the Models

Evaluation Metrics

from sklearn.metrics import mean_squared_error, r2_score

# Evaluate Linear Regression
lr_mse = mean_squared_error(y_test, lr_model.predict(X_test))
lr_r2 = r2_score(y_test, lr_model.predict(X_test))

# Evaluate Decision Tree
dt_mse = mean_squared_error(y_test, dt_model.predict(X_test))
dt_r2 = r2_score(y_test, dt_model.predict(X_test))

# Evaluate Random Forest
rf_mse = mean_squared_error(y_test, rf_model.predict(X_test))
rf_r2 = r2_score(y_test, rf_model.predict(X_test))

# Print the results
print(f"Linear Regression - MSE: {lr_mse}, R2: {lr_r2}")
print(f"Decision Tree - MSE: {dt_mse}, R2: {dt_r2}")
print(f"Random Forest - MSE: {rf_mse}, R2: {rf_r2}")
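
MSE is expressed in squared price units, which makes it hard to read directly. Its square root (RMSE) and the mean absolute error (MAE) are in the same units as the price. The sketch below computes all three on made-up prices, which are for illustration only and are not results from the project dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices, for illustration only.
actual = np.array([750000, 850000, 600000, 900000])
predicted = np.array([760000, 820000, 630000, 880000])

mse = mean_squared_error(actual, predicted)   # squared price units
rmse = np.sqrt(mse)                           # same units as the price
mae = mean_absolute_error(actual, predicted)  # same units as the price
r2 = r2_score(actual, predicted)              # unitless

print(f"MSE: {mse:.0f}, RMSE: {rmse:.2f}, MAE: {mae:.2f}, R2: {r2:.4f}")
```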

Step 5: Select the Best Model and Fine-Tune It

Hyperparameter Tuning for Random Forest

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Best parameters
print(grid_search.best_params_)

# Best model
best_rf_model = grid_search.best_estimator_
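
Grid search trains one model per parameter combination per fold, so the cost grows quickly with the grid size. The sketch below counts the fits for the grid above using ParameterGrid; with the default refit=True, GridSearchCV also performs one final fit on the full training set.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning step above.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}

cv_folds = 3
n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 combinations
total_fits = n_candidates * cv_folds           # fits during cross-validation

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
```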

Step 6: Deploy the Model

Saving the Model

import joblib

# Save the model
joblib.dump(best_rf_model, 'best_rf_model.pkl')

Loading and Using the Model

# Load the model
loaded_model = joblib.load('best_rf_model.pkl')

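# Make predictions for one new house. The column names must match the training
# features; 'Latitude' and 'Longitude' assume Location was split into two numeric columns.
new_data = pd.DataFrame([{
    'Bedrooms': 3, 'Bathrooms': 2, 'Square Footage': 1500, 'Latitude': 37.77,
    'Longitude': -122.42, 'Year Built': 1990, 'Lot Size': 5000,
}])
new_data = new_data[loaded_model.feature_names_in_]  # match the training column order
# The scaler was fitted on two columns only, so transform just those
new_data[['Square Footage', 'Lot Size']] = scaler.transform(new_data[['Square Footage', 'Lot Size']])
price_prediction = loaded_model.predict(new_data)
print(f"Predicted Price: {price_prediction[0]:,.2f}")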
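
A saved model is only usable if the same preprocessing is applied at prediction time. One option, sketched here on synthetic data, is to save the scaler and the model together as a single Pipeline so they cannot drift apart. The file name housing_pipeline.pkl and the toy features are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not real housing records.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(100, 3))
y_demo = X_demo @ np.array([3.0, 2.0, 1.0])

# Bundle preprocessing and model so both are saved in one file.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])
pipe.fit(X_demo, y_demo)

# Round-trip through a temporary file (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "housing_pipeline.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = np.allclose(pipe.predict(X_demo[:5]), restored.predict(X_demo[:5]))
print(same)
```

Files written by joblib should be loaded with the same scikit-learn version that produced them, and only from trusted sources, because loading a pickle can execute arbitrary code.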

Conclusion

In this project, we successfully built a machine learning model to predict housing prices. We went through the entire process of understanding the problem, preprocessing the data, training different models, evaluating them, and finally deploying the best model. This project provided hands-on experience with various machine learning concepts and techniques.

Key Takeaways

  • Data preprocessing is crucial for building effective machine learning models.
  • Different models can be trained and evaluated to find the best one.
  • Hyperparameter tuning can significantly improve model performance.
  • Model deployment involves saving the trained model and loading it for future predictions.

This project sets the foundation for more complex machine learning tasks and prepares you for real-world applications.

© Copyright 2024. All rights reserved