Introduction
In this project, we will build a machine learning model to predict housing prices from features such as the number of bedrooms, square footage, and location. The project will help you apply the concepts learned in the previous modules, including data preprocessing, model training, evaluation, and deployment.
Objectives
- Understand the problem and the dataset.
- Perform data preprocessing.
- Train different machine learning models.
- Evaluate the models.
- Select the best model and fine-tune it.
- Deploy the model.
Step 1: Understanding the Problem and the Dataset
Problem Statement
We aim to predict the prices of houses based on various features. This is a regression problem where the target variable is continuous.
Dataset
We will use a dataset that contains information about various houses. The dataset includes features such as:
- Number of bedrooms
- Number of bathrooms
- Square footage
- Location (latitude and longitude)
- Year built
- Lot size
Sample Data
| Bedrooms | Bathrooms | Square Footage | Location (Lat, Long) | Year Built | Lot Size | Price |
|---|---|---|---|---|---|---|
| 3 | 2 | 1500 | (37.77, -122.42) | 1990 | 5000 | 750000 |
| 4 | 3 | 2000 | (37.78, -122.43) | 2000 | 6000 | 850000 |
Step 2: Data Preprocessing
Loading the Data
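The loading code is not shown here, so the following is a minimal sketch using pandas; the file name housing.csv is an assumption and should be replaced with the actual path to your dataset.

```python
import pandas as pd

# Load the dataset into a DataFrame (the file name is an assumption)
data = pd.read_csv('housing.csv')

# Inspect the first rows and the column types
print(data.head())
print(data.info())
```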
Handling Missing Data
```python
# Check for missing values
print(data.isnull().sum())

# Fill missing values by forward-filling from the previous row
data = data.ffill()
```
Data Transformation
Normalization and Standardization
```python
from sklearn.preprocessing import StandardScaler

# Standardize the numeric features
scaler = StandardScaler()
data[['Square Footage', 'Lot Size']] = scaler.fit_transform(data[['Square Footage', 'Lot Size']])
```
Step 3: Train Different Machine Learning Models
Splitting the Data
```python
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X = data.drop('Price', axis=1)
y = data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Linear Regression
```python
from sklearn.linear_model import LinearRegression

# Train the model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict on the test set
lr_pred = lr_model.predict(X_test)
```
Decision Tree
```python
from sklearn.tree import DecisionTreeRegressor

# Train the model (random_state fixed for reproducibility)
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

# Predict on the test set
dt_pred = dt_model.predict(X_test)
```
Random Forest
```python
from sklearn.ensemble import RandomForestRegressor

# Train the model (random_state fixed for reproducibility)
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)

# Predict on the test set
rf_pred = rf_model.predict(X_test)
```
Step 4: Evaluate the Models
Evaluation Metrics
```python
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate Linear Regression
lr_mse = mean_squared_error(y_test, lr_pred)
lr_r2 = r2_score(y_test, lr_pred)

# Evaluate Decision Tree
dt_mse = mean_squared_error(y_test, dt_pred)
dt_r2 = r2_score(y_test, dt_pred)

# Evaluate Random Forest
rf_mse = mean_squared_error(y_test, rf_pred)
rf_r2 = r2_score(y_test, rf_pred)

# Print the results
print(f"Linear Regression - MSE: {lr_mse}, R2: {lr_r2}")
print(f"Decision Tree - MSE: {dt_mse}, R2: {dt_r2}")
print(f"Random Forest - MSE: {rf_mse}, R2: {rf_r2}")
```
Step 5: Select the Best Model and Fine-Tune It
Hyperparameter Tuning for Random Forest
```python
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search with 3-fold cross-validation
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Best parameters
print(grid_search.best_params_)

# Best model
best_rf_model = grid_search.best_estimator_
```
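Not part of the original listing, but a quick way to confirm that tuning helped is to score the tuned model on the held-out test set, reusing the metrics imported in Step 4:

```python
# Evaluate the tuned Random Forest on the test set
best_pred = best_rf_model.predict(X_test)
print(f"Tuned Random Forest - MSE: {mean_squared_error(y_test, best_pred)}, R2: {r2_score(y_test, best_pred)}")
```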
Step 6: Deploy the Model
Saving the Model
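The saving code is not shown here, so the following is a minimal sketch using joblib, writing the tuned model to the best_rf_model.pkl file that the loading step below expects.

```python
import joblib

# Persist the tuned Random Forest so it can be reloaded later without retraining
joblib.dump(best_rf_model, 'best_rf_model.pkl')
```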
Loading and Using the Model
```python
import joblib

# Load the model
loaded_model = joblib.load('best_rf_model.pkl')

# New observation; column names and order must match the training features
new_data = pd.DataFrame([[3, 2, 1500, 37.77, -122.42, 1990, 5000]], columns=X.columns)

# Apply the same scaling used during training (the scaler was fit only on these two columns)
new_data[['Square Footage', 'Lot Size']] = scaler.transform(new_data[['Square Footage', 'Lot Size']])

# Make a prediction
price_prediction = loaded_model.predict(new_data)
print(f"Predicted Price: {price_prediction[0]}")
```
Conclusion
In this project, we successfully built a machine learning model to predict housing prices. We went through the entire process of understanding the problem, preprocessing the data, training different models, evaluating them, and finally deploying the best model. This project provided hands-on experience with various machine learning concepts and techniques.
Key Takeaways
- Data preprocessing is crucial for building effective machine learning models.
- Different models can be trained and evaluated to find the best one.
- Hyperparameter tuning can significantly improve model performance.
- Model deployment involves saving the trained model and loading it for future predictions.
This project sets the foundation for more complex machine learning tasks and prepares you for real-world applications.