Predicting Airbnb Prices with Machine Learning

Paco Perez

Created on November 26, 2023

Machine Learning competition to predict rental prices on Airbnb.

Transcript

Machine Learning Project 🤖

Predicting Airbnb Rental Prices with Machine Learning

"Machine learning is like farming. You sow algorithms, cultivate data, and reap insights."

THE MAIN IDEA

Introduction

The main idea of this project is to predict the price of an Airbnb listing from a set of attributes. Accurate predictions help hosts set competitive prices, and guests can use them to align their expectations with actual pricing. For this, we set the price (stored as 'log_price' in the data) as our target variable.

THE DATA

Overview of the Data 🔎

For the analysis and training we were given several CSV files. The one we focus on here is train_data.csv, which contains the following columns; these are the basis for our analysis, preprocessing, and model training.

  • id: Unique identifier of the property.
  • log_price: The rental price of the property in log format. VARIABLE TO PREDICT
  • property_type: Property type (e.g. apartment, house, etc.).
  • room_type: Type of room (e.g. private room, house/complete apartment, etc.).
  • amenities: Amenities offered at the property.
  • accommodates: Maximum number of guests the property can accommodate.
  • bathrooms: Number of bathrooms in the property.
  • bed_type: Type of bed (e.g., double bed, sofa bed, etc.).
  • cancellation_policy: Cancellation policy for reservations.
  • cleaning_fee: If a cleaning fee is charged (True/False).
  • city: City where the property is located.
  • description: Description of the property.
  • first_review: Date of first review.
  • host_has_profile_pic: If the host has a profile picture (True/False).
  • host_identity_verified: If the host identity is verified (True/False).
  • host_response_rate: Host response rate.
  • host_since: Date the host joined Airbnb.
  • instant_bookable: Whether the property is instantly bookable (T/F).
  • last_review: Date of the last review.
  • latitude: Geographical latitude of the property.
  • longitude: Geographical longitude of the property.
  • name: Name of the property.
  • neighborhood: The neighborhood where the property is located.
  • number_of_reviews: Total number of reviews of the property.
  • review_scores_rating: Overall score of reviews.
  • thumbnail_url: URL of the thumbnail.

PREPARING DATA

Data Preprocessing ⚙️

To successfully train an accurate model, preprocessing and cleaning of our dataframe is crucial, so I took care of our data in the following ways:

Creating a New Column

Introduced a new feature, 'num_amenities', counting the number of amenities an Airbnb listing has, based on the existing 'amenities' column.

Handling Missing Values

Identified and quantified missing values and applied appropriate strategies for filling or otherwise handling them.
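The two steps above can be sketched roughly as follows. This is a minimal illustration on a toy dataframe, not the exact code used in the project: the brace-wrapped, comma-separated format of 'amenities' and the helper name count_amenities are assumptions.

```python
import pandas as pd

# Toy stand-in for train_data.csv; the 'amenities' string format is an assumption.
df = pd.DataFrame({
    "amenities": ['{TV,"Wireless Internet",Kitchen}', "{}", "{Heating}"],
    "bathrooms": [1.0, None, 2.0],
    "review_scores_rating": [95.0, None, None],
})


def count_amenities(raw: str) -> int:
    """Count comma-separated entries inside a brace-wrapped string (hypothetical helper)."""
    inner = raw.strip().strip("{}")
    if not inner:
        return 0
    return len(inner.split(","))


# New feature: number of amenities per listing.
df["num_amenities"] = df["amenities"].apply(count_amenities)

# Quantify missing values per column, largest first.
missing_counts = df.isna().sum().sort_values(ascending=False)
print(missing_counts)
```

Note that a plain comma split would miscount any amenity whose name itself contains a comma, so a real dataset may need a more careful parser.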

Dropping Irrelevant Columns

Dropped columns with a high number of missing values or those not immediately relevant. Example: dropping columns such as 'first_review', 'host_response_rate', etc.

Categorical and Numeric Features

  • Numeric Features: Standard scaling using StandardScaler().
  • Categorical Features: One-hot encoding using OneHotEncoder.
  • Utilized ColumnTransformer to streamline the preprocessing process.
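A minimal sketch of how StandardScaler, OneHotEncoder, and ColumnTransformer might be combined is shown here. The choice of which columns count as numeric or categorical, the toy values, and the handle_unknown="ignore" setting are assumptions for illustration, not details confirmed by the project.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy feature table using a few column names from the dataset description.
X = pd.DataFrame({
    "accommodates": [2, 4, 6, 3],
    "bathrooms": [1.0, 1.5, 2.0, 1.0],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt", "Shared room"],
    "city": ["NYC", "LA", "NYC", "SF"],
})

numeric_features = ["accommodates", "bathrooms"]   # assumed split
categorical_features = ["room_type", "city"]       # assumed split

preprocessor = ColumnTransformer(
    transformers=[
        # Scale numeric columns to zero mean and unit variance.
        ("num", StandardScaler(), numeric_features),
        # One-hot encode categories; unseen categories at predict time become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

X_transformed = preprocessor.fit_transform(X)
print(X_transformed.shape)
```

Keeping both transformations inside one ColumnTransformer means the same fitted preprocessing can later be reused on the test data without refitting.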

Train, Test, Split ✂️

  • TRAIN (80%): Data for training
  • TEST (20%): Data to test against to check accuracy
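An 80/20 split like this one can be sketched with scikit-learn's train_test_split. The synthetic data and the random_state value below are assumptions, used only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the log_price target.
rng = np.random.default_rng(0)
X = pd.DataFrame({"accommodates": rng.integers(1, 9, size=100)})
y = pd.Series(rng.normal(4.8, 0.7, size=100), name="log_price")

# Hold out 20% of the rows for testing; random_state is an arbitrary choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```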

MODEL SELECTION

RANDOM FOREST REGRESSION 🌳

This model's robustness against overfitting, suitability for regression tasks, and effectiveness in capturing non-linear patterns make it a good fit for predicting Airbnb rental prices. Its versatility and minimal need for hyperparameter tuning add to its accessibility, making it a pragmatic choice for this project. With this model, an RMSE of 0.4195 was obtained.
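As a rough sketch of how a random forest can be trained and scored with RMSE, the example below chains preprocessing and RandomForestRegressor in a scikit-learn Pipeline. The synthetic data, the feature names chosen, and the hyperparameters (n_estimators=100, random_state=42) are assumptions, so the score it prints is unrelated to the 0.4195 reported above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: log_price grows with the number of guests, plus noise.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n),
    "room_type": rng.choice(["Private room", "Entire home/apt"], size=n),
})
y = 4.0 + 0.15 * X["accommodates"] + rng.normal(0, 0.3, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["accommodates"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
])

# Preprocessing and model in one object, so both are fitted on training data only.
model = Pipeline([
    ("preprocess", preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=100, random_state=42)),
])
model.fit(X_train, y_train)

# RMSE: square root of the mean squared difference between truth and prediction.
pred = model.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
print(f"RMSE on held-out data: {rmse:.4f}")
```

Because the target is a log price, an RMSE computed this way is measured in log units rather than in currency.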

PROBLEM

Test data Preprocessing 🧹✨

  • Missing values in the 'bathrooms', 'bedrooms', and 'beds' columns were filled with their respective medians.
  • 'host_response_rate' was converted to a float by removing the percentage symbol, and its null values were then filled with the mean.
  • To handle date data, missing values in 'first_review' and 'last_review' were filled with a placeholder date of '1900-01-01'.
  • Columns deemed irrelevant, such as 'neighbourhood', 'thumbnail_url', and 'zipcode', were dropped.
  • Remaining missing values in categorical columns ('host_has_profile_pic' and 'host_identity_verified') were imputed with their respective modes.
  • Lastly, missing values in 'review_scores_rating' and 'host_since' were filled with the mean and mode, respectively.
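The cleaning steps listed above could look roughly like this in pandas. The toy values, the "t"/"f" encoding of the boolean columns, and the order of operations are assumptions; only the column names and the fill strategies come from the list.

```python
import pandas as pd

# Toy stand-in for the test data with gaps in every column of interest.
test_df = pd.DataFrame({
    "bathrooms": [1.0, None, 2.0, 1.0],
    "bedrooms": [1.0, 2.0, None, 1.0],
    "beds": [1.0, 2.0, 3.0, None],
    "host_response_rate": ["100%", None, "80%", "90%"],
    "first_review": ["2016-05-01", None, "2017-01-15", None],
    "last_review": ["2017-09-01", None, "2017-03-02", None],
    "host_has_profile_pic": ["t", "t", None, "f"],
    "host_identity_verified": ["t", None, "f", "t"],
    "review_scores_rating": [90.0, None, 100.0, 80.0],
    "host_since": ["2015-01-01", "2015-01-01", None, "2016-06-01"],
    "neighbourhood": ["A", "B", None, "C"],
    "thumbnail_url": [None, None, None, None],
    "zipcode": ["10001", None, "90210", "94103"],
})

# Numeric counts: fill with the column median.
for col in ["bathrooms", "bedrooms", "beds"]:
    test_df[col] = test_df[col].fillna(test_df[col].median())

# "95%" -> 95.0, then fill gaps with the mean.
rate = test_df["host_response_rate"].str.rstrip("%").astype(float)
test_df["host_response_rate"] = rate.fillna(rate.mean())

# Dates: fill with a placeholder date.
for col in ["first_review", "last_review"]:
    test_df[col] = test_df[col].fillna("1900-01-01")

# Drop columns deemed irrelevant.
test_df = test_df.drop(columns=["neighbourhood", "thumbnail_url", "zipcode"])

# Categorical flags and host_since: fill with the most frequent value.
for col in ["host_has_profile_pic", "host_identity_verified", "host_since"]:
    test_df[col] = test_df[col].fillna(test_df[col].mode()[0])

# Review score: fill with the mean.
test_df["review_scores_rating"] = test_df["review_scores_rating"].fillna(
    test_df["review_scores_rating"].mean()
)

print(test_df.isna().sum().sum())
```

In practice, the fill values (medians, means, modes) are often taken from the training data rather than the test data, so that both sets are transformed consistently.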

Model Prediction Results ✅

With the cleaned test_data dataframe, the pipeline and model were applied to predict the prices of those Airbnb listings.
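A minimal sketch of the prediction step follows, assuming a two-column submission file with 'id' and 'log_price'. That layout, the toy data, and the use of a bare RandomForestRegressor in place of the full pipeline are all assumptions made to keep the example short.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins for the cleaned training and test data.
train_df = pd.DataFrame({
    "accommodates": [1, 2, 4, 6, 8, 3],
    "bathrooms": [1.0, 1.0, 1.5, 2.0, 3.0, 1.0],
    "log_price": [4.0, 4.3, 4.9, 5.4, 6.0, 4.6],
})
test_df = pd.DataFrame({
    "id": [101, 102, 103],
    "accommodates": [2, 5, 7],
    "bathrooms": [1.0, 2.0, 2.5],
})

features = ["accommodates", "bathrooms"]
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(train_df[features], train_df["log_price"])

# One predicted log_price per test listing, keyed by its id.
submission = pd.DataFrame({
    "id": test_df["id"],
    "log_price": model.predict(test_df[features]),
})
submission.to_csv("submission.csv", index=False)
print(submission)
```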

SUBMISSION AND CONCLUSION

The results are in. To check the RMSE of the submission.csv file we created from our predictions, we uploaded it to Kaggle and checked the score there. A score of 0.4179 was achieved, placing us #5 in the competition. Not bad! In the future we could try different models and hyperparameters to lower our error further.

Thank You

Happy Learning!

Connect with me!