Predicting Airbnb Rental Prices with Machine Learning
"Machine learning is like farming. You sow algorithms, cultivate data, and reap insights."
THE MAIN IDEA
Introduction
The main idea of this project is to predict the price of an Airbnb listing from a set of attributes. Accurate predictions help hosts set competitive prices, and they help guests align their expectations with actual pricing. To do this, we use the listing price, provided in the data as 'log_price', as our target variable.
THE DATA
Overview of the Data 🔎
For the analysis and training we were given several CSV files. The one we focus on here is train_data.csv, which contains the following columns. We analyze, preprocess, and train on them to obtain the predictions.
id: Unique identifier of the property.
log_price: The rental price of the property in log format. VARIABLE TO PREDICT
property_type: Property type (e.g. apartment, house, etc.).
room_type: Type of room (e.g. private room, house/complete apartment, etc.).
amenities: Amenities offered at the property.
accommodates: Maximum number of guests the property can accommodate.
bathrooms: Number of bathrooms in the property.
bed_type: Type of bed (e.g., double bed, sofa bed, etc.).
cancellation_policy: Cancellation policy for reservations.
cleaning_fee: If a cleaning fee is charged (True/False).
city: City where the property is located.
description: Description of the property.
first_review: Date of first review.
host_has_profile_pic: If the host has a profile picture (True/False).
host_identity_verified: If the host identity is verified (True/False).
host_response_rate: Host response rate.
host_since: Date the host joined Airbnb.
instant_bookable: Whether the property is instantly bookable (T/F).
last_review: Date of the last review.
latitude: Geographical latitude of the property.
longitude: Geographical longitude of the property.
name: Name of the property.
neighborhood: The neighborhood where the property is located.
number_of_reviews: Total number of reviews of the property.
review_scores_rating: Overall score of reviews.
thumbnail_url: URL of the thumbnail.
PREPARING DATA
Data Preprocessing ⚙️
To successfully train an accurate model, preprocessing and cleaning the dataframe is crucial, so I prepared the data in the following ways:
Creating a New Column
Introduced a new feature, 'num_amenities', counting the number of amenities an Airbnb listing has based on the existing 'amenities' column.
Handling Missing Values
Identified and quantified missing values and applied appropriate strategies for filling or handling them.
Dropping Irrelevant Columns
Dropped columns with a high number of missing values or those not immediately relevant, such as 'first_review' and 'host_response_rate', among others.
Categorical and Numeric Features
Numeric Features: Standard scaling using StandardScaler.
Categorical Features: One-hot encoding using OneHotEncoder.
Utilized ColumnTransformer to streamline the preprocessing process.
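The steps above can be sketched roughly as follows. This is a minimal illustration, not the exact project code: the toy rows, the brace-delimited format of the 'amenities' strings, the choice of feature columns, and the `handle_unknown="ignore"` setting are all assumptions made for the example, and `count_amenities` is a hypothetical helper.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows standing in for train_data.csv (assumed values and amenities format).
df = pd.DataFrame({
    "amenities": ['{TV,"Wireless Internet",Kitchen}', "{}", "{Heating}"],
    "accommodates": [2, 4, 1],
    "bathrooms": [1.0, 2.0, 1.0],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "city": ["NYC", "LA", "NYC"],
})


def count_amenities(raw: str) -> int:
    """Hypothetical helper: count comma-separated entries between the braces."""
    inner = raw.strip().strip("{}")
    # An empty amenities set leaves an empty string, which counts as zero.
    return len([item for item in inner.split(",") if item.strip()])


# New feature derived from the existing 'amenities' column.
df["num_amenities"] = df["amenities"].apply(count_amenities)

# Assumed split of columns into numeric and categorical groups.
numeric_features = ["accommodates", "bathrooms", "num_amenities"]
categorical_features = ["room_type", "city"]

# Scale numeric columns and one-hot encode categorical ones in a single step.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

X = preprocessor.fit_transform(df[numeric_features + categorical_features])
print(df["num_amenities"].tolist(), X.shape)
```

Note that a simple comma split would miscount any amenity whose name itself contains a comma, so a real implementation may need a more careful parser.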
Train, Test, Split ✂️
TRAIN (80%): Data for training.
TEST (20%): Data to test against to check accuracy.
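An 80/20 split like this one is commonly done with scikit-learn's `train_test_split`. The sketch below uses placeholder arrays rather than the Airbnb data, and the `random_state` value is an arbitrary assumption added only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and target: 10 rows, 2 feature columns.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 holds out 20% of the rows; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```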
MODEL SELECTION
RANDOM FOREST REGRESSION 🌳
This model's robustness against overfitting, suitability for regression tasks, and effectiveness at capturing non-linear patterns make it a good fit for predicting Airbnb rental prices. It is versatile and needs little hyperparameter tuning, which makes it a pragmatic choice for this project. With this model, an RMSE of 0.4195 was obtained.
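As a rough illustration of how a random forest regressor is fitted and scored with RMSE, here is a small sketch on synthetic data. The data, the `n_estimators` value, and the seeds are assumptions for the example, so the RMSE it prints has no relation to the 0.4195 reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression problem (not the Airbnb data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the forest on the training split only.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# RMSE is the square root of the mean squared error on the held-out split.
preds = model.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, preds)))
print(f"RMSE: {rmse:.4f}")
```

Because the target in this project is a log price, an RMSE computed this way would be an error on the log scale rather than in currency units.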
PROBLEM
Test data Preprocessing 🧹✨
Columns deemed irrelevant, such as 'neighbourhood', 'thumbnail_url', and 'zipcode', were dropped.
To handle date data, missing values in 'first_review' and 'last_review' were filled with a placeholder date of '1900-01-01'.
'host_response_rate' was converted to a float by removing the percentage symbol, and its null values were then filled with the mean.
Missing values in the 'bathrooms', 'bedrooms', and 'beds' columns were filled with their respective medians.
Remaining missing values in categorical columns ('host_has_profile_pic' and 'host_identity_verified') were imputed with their respective modes.
Lastly, missing values in 'review_scores_rating' and 'host_since' were filled with the mean and mode, respectively.
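A few of these test-data steps might look like the following in pandas. The toy rows and their values are assumptions for illustration, and only a subset of the columns mentioned above is shown.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the test data (assumed values).
test_df = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 2.0],
    "host_response_rate": ["90%", None, "100%"],
    "first_review": ["2016-05-01", None, "2017-01-15"],
    "host_has_profile_pic": ["t", "t", None],
    "review_scores_rating": [90.0, np.nan, 100.0],
    "thumbnail_url": ["a", "b", "c"],
})

# Drop a column deemed irrelevant.
test_df = test_df.drop(columns=["thumbnail_url"])

# Numeric column: fill with the median.
test_df["bathrooms"] = test_df["bathrooms"].fillna(test_df["bathrooms"].median())

# Percentage strings: strip '%', convert to float, then fill with the mean.
rate = pd.to_numeric(test_df["host_response_rate"].str.rstrip("%"))
test_df["host_response_rate"] = rate.fillna(rate.mean())

# Date column: fill with the placeholder date.
test_df["first_review"] = test_df["first_review"].fillna("1900-01-01")

# Categorical column: fill with the most frequent value (mode).
pic_mode = test_df["host_has_profile_pic"].mode()[0]
test_df["host_has_profile_pic"] = test_df["host_has_profile_pic"].fillna(pic_mode)

# Review score: fill with the mean.
rating_mean = test_df["review_scores_rating"].mean()
test_df["review_scores_rating"] = test_df["review_scores_rating"].fillna(rating_mean)

print(test_df)
```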
Model Prediction Results ✅
With the clean test_data dataframe, the pipeline and model were applied to predict the prices of these Airbnb listings.
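One way to apply a fitted preprocessing pipeline and model to test data is sketched below. The toy dataframes, the two feature columns, and the submission layout (an 'id' column plus a 'log_price' column) are assumptions for the example, not the competition's confirmed format.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training and test rows (assumed values).
train_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "accommodates": [2, 4, 1, 6],
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Entire home/apt"],
    "log_price": [4.2, 5.1, 3.6, 5.5],
})
test_df = pd.DataFrame({
    "id": [10, 11],
    "accommodates": [3, 2],
    # "Hotel room" never appears in training, so it is encoded as all zeros.
    "room_type": ["Entire home/apt", "Hotel room"],
})

features = ["accommodates", "room_type"]

# Preprocessing and model chained so the test data gets identical treatment.
pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", StandardScaler(), ["accommodates"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
    ])),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipeline.fit(train_df[features], train_df["log_price"])

# Assumed submission layout: one predicted log_price per test id.
submission = pd.DataFrame({
    "id": test_df["id"],
    "log_price": pipeline.predict(test_df[features]),
})
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file path such as "submission.csv" to `to_csv` would write the file to disk instead of returning the text.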
SUBMISSION AND CONCLUSION
The results are in. To check the RMSE of the submission.csv file we created from our predictions, we upload it to Kaggle and check the score there. A score of 0.4179 was achieved, placing us #5 in the competition. Not bad! In the future we could try different models and hyperparameters to lower our error further.
Predicting Airbnb Prices with Machine Learning
Paco Perez
Created on November 26, 2023
Machine Learning competition to predict rental prices on Airbnb.
Machine Learning Project 🤖
Thank You
Happy Learning!
Connect with me!