How to make the Airbnb booking of your life - A Story of Data.

Roman Nagy
6 min read · Jun 6, 2021

Maybe you are one of the lucky ones who can visit any city in the world with unlimited time and budget. If not, this story is for you. It will help you answer some basic questions before you book a place to stay, using a data science approach.

When you book accommodation in the city of your choice, you probably want to know the best time of year to visit, which location or district to choose, and which property type is easiest on your wallet. You don’t want to pay too much, but you still want a certain level of quality. You can make these choices based on intuition, or you can use the available data.

This story will help you answer these questions using data science. It uses Airbnb data to help you plan your next stay in Munich, the capital of Bavaria in Germany. However, the same approach can be applied to any city for which a similar Airbnb dataset is available. Check InsideAirbnb to get the data for your city. And now … let’s see what to book in Munich and when to do it!

The Airbnb dataset contains more than 5000 bookable properties spread across the city, divided into 25 city districts.

Besides the properties themselves, the dataset provides daily prices over the whole year, the availability of each property, and reviews. Looking at the prices over the year, you see a clear pattern of extremely high prices from mid-September to the beginning of October: welcome to the Oktoberfest! If this kind of fun is not for you, you had better pick a different part of the year for your trip. Prices look quite reasonable in April or in December.
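For the curious, here is a minimal sketch of how such a yearly price curve could be computed with pandas. It is not the code from my project. It assumes a calendar table with `listing_id`, `date`, and `price` columns, where the price is stored as text such as "$1,150.00". The helper names and the four sample rows are invented purely for illustration.

```python
import pandas as pd


def parse_price(series: pd.Series) -> pd.Series:
    """Turn price strings such as '$1,150.00' into floats."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    return cleaned.astype(float)


def mean_price_per_day(calendar: pd.DataFrame) -> pd.Series:
    """Average listed price across all properties for each calendar date."""
    df = calendar.copy()
    df["date"] = pd.to_datetime(df["date"])
    df["price"] = parse_price(df["price"])
    return df.groupby("date")["price"].mean().sort_index()


# Invented sample rows (assumption: column names match the real calendar file).
toy_calendar = pd.DataFrame({
    "listing_id": [1, 2, 1, 2],
    "date": ["2021-04-10", "2021-04-10", "2021-09-25", "2021-09-25"],
    "price": ["$80.00", "$100.00", "$250.00", "$1,150.00"],
})

daily = mean_price_per_day(toy_calendar)
print(daily)
```

Plotting `daily` as a line chart over the real data should make a seasonal spike like the one described above visible at a glance.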

You probably noticed the peaks within each month. They are caused by different prices for each day of the week. As the following plot shows, Friday and Saturday are the most expensive days. Conclusion: if you don’t want to party over the weekend, you know what to do…
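The weekday comparison can be sketched in a few lines as well. This is only an illustration under assumptions: the calendar table has a `date` column and an already numeric `price` column, and the sample values are made up.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_price_per_weekday(calendar: pd.DataFrame) -> pd.Series:
    """Average price per day of the week, ordered Monday to Sunday."""
    df = calendar.copy()
    df["weekday"] = pd.to_datetime(df["date"]).dt.day_name()
    # reindex keeps the natural weekday order; weekdays without data become NaN
    return df.groupby("weekday")["price"].mean().reindex(WEEKDAYS)


# Invented sample rows: two Fridays, one Monday, one Tuesday.
toy_calendar = pd.DataFrame({
    "date": ["2021-06-04", "2021-06-07", "2021-06-11", "2021-06-08"],
    "price": [120.0, 90.0, 140.0, 100.0],
})

by_weekday = mean_price_per_weekday(toy_calendar)
print(by_weekday)
```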

The next question is where exactly to book your accommodation. This choice can be made based on different criteria. I believe the best location combines a good price with good quality. If you look at the average prices per city district, you see the most expensive ones in the middle of the city:

Altstadt-Lehel, Ludwigsvorstadt-Isarvorstadt, and Schwanthalerhöhe are the most expensive districts in the city. Are they really worth the money? It is difficult to measure the quality of accommodation objectively, because everyone has different criteria. One possible way is to use the review scores of rated properties. Let’s assume that well-rated properties offer good-quality accommodation. In that case, the average review score per district gives a slightly different district ranking than the plot above:

It’s obvious that the most expensive districts are not the best-rated ones. If you want well-rated accommodation that is not too expensive, it is best to book a property near the centre of the city but not directly in the middle of it. The districts in the north of the city (Moosach, Milbertshofen) are pretty cheap, but their rating scores are not very high. If you are an undemanding visitor with low quality expectations, or you want to optimise your trip for budget only, they might be a good choice for you. Otherwise, I would choose Laim or Sendling-Westpark, the two best-rated districts of the city. Both sit in the middle of the price ranking.
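The two district rankings discussed above can be put side by side with a single group-by. The sketch below is an assumption-laden illustration, not my project code: it expects a listings table with `neighbourhood_cleansed`, `price` (numeric), and `review_scores_rating` columns, and it uses placeholder districts with invented numbers.

```python
import pandas as pd


def district_summary(listings: pd.DataFrame) -> pd.DataFrame:
    """Mean price and mean review score per district, with a rank for each."""
    summary = listings.groupby("neighbourhood_cleansed").agg(
        mean_price=("price", "mean"),
        # mean() skips missing values, so unrated listings are ignored here
        mean_rating=("review_scores_rating", "mean"),
        n_listings=("price", "size"),
    )
    # Rank 1 = most expensive / best rated
    summary["price_rank"] = summary["mean_price"].rank(ascending=False, method="min")
    summary["rating_rank"] = summary["mean_rating"].rank(ascending=False, method="min")
    return summary.sort_values("rating_rank")


# Placeholder districts and invented values, for illustration only.
toy_listings = pd.DataFrame({
    "neighbourhood_cleansed": ["District A", "District A",
                               "District B", "District B",
                               "District C", "District C"],
    "price": [200.0, 180.0, 90.0, 110.0, 60.0, 70.0],
    "review_scores_rating": [4.5, 4.6, 4.9, 4.8, 4.4, 4.2],
})

summary = district_summary(toy_listings)
print(summary)
```

In this toy example the most expensive district is not the best-rated one, which is the same kind of mismatch the two plots reveal.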

The remaining question is which property types are best to book. To estimate this, the average price of each of the 20 most common property types was calculated and compared to the average price of all properties in the same district (to avoid location bias). The deviation from the district average for each property type is shown in the following plot:

As you can see, private rooms in a condo cost ~45% less than the average price of all properties in the same district. On the other hand, an entire house is pretty expensive. The plot shows the evaluation for all property types; from here on, the choice is yours!
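One possible way to compute such a district-adjusted deviation is sketched below. It is my reading of the description above rather than the exact project code: each listing's price is expressed relative to the mean price of its district, and these relative deviations are then averaged per property type. The column names `neighbourhood_cleansed`, `property_type`, and `price` are assumptions, and the sample rows are invented.

```python
import pandas as pd


def property_type_deviation(listings: pd.DataFrame, top_n: int = 20) -> pd.Series:
    """Mean relative price deviation from the district average, per property type."""
    df = listings.copy()
    # Mean price of the district each listing belongs to
    district_mean = df.groupby("neighbourhood_cleansed")["price"].transform("mean")
    # -0.45 would mean "45% cheaper than the district average"
    df["rel_dev"] = df["price"] / district_mean - 1.0
    top_types = df["property_type"].value_counts().head(top_n).index
    return (
        df[df["property_type"].isin(top_types)]
        .groupby("property_type")["rel_dev"]
        .mean()
        .sort_values()
    )


# Placeholder districts and invented prices, for illustration only.
toy_listings = pd.DataFrame({
    "neighbourhood_cleansed": ["District A", "District A", "District B", "District B"],
    "property_type": ["Private room", "Entire home", "Private room", "Entire home"],
    "price": [50.0, 150.0, 80.0, 120.0],
})

deviation = property_type_deviation(toy_listings)
print(deviation)
```

Comparing against the district mean instead of the city-wide mean is what keeps an expensive location from making a property type look expensive.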

I hope this post has helped you make some data-driven decisions for your next trip to Munich. If it also showed you how to use data to solve similar problems instead of relying on gut feeling, even better. And if you are interested in data science and want to dig even deeper into the data, the next section is definitely for you.

Now that you have hopefully picked the best time, location, and property type for your next trip to Munich, let’s have a closer look at the data and at how to predict the price of a property from the data we have. All details, together with the source code used for the data analysis above and for the price prediction, can be found in my GitHub project.

I used a linear regression model to predict the price of the properties. After the first round of feature selection I ended up with a model using 63 features. It achieved an R² score of ~0.38 on the training data (train score) and ~0.37 on the test data (test score). With more than 63 features, the train score continued to improve but the test score started to drop: the model began to overfit. This is not helpful, because at the end of the day we need a model that performs well on data it has never seen before. You can clearly see this development in the following plot:
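To make the train/test comparison concrete, here is a small self-contained sketch. It deliberately uses plain NumPy least squares and random synthetic data instead of the model and dataset from my project, and it assumes the feature columns are already sorted from most to least useful. All function names are hypothetical. The point is only to show how train and test R² can be tracked while features are added one by one.

```python
import numpy as np


def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def with_intercept(X: np.ndarray) -> np.ndarray:
    """Prepend a column of ones so the model can learn an intercept."""
    return np.column_stack([np.ones(len(X)), X])


def fit_linear(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Ordinary least squares fit; returns intercept plus coefficients."""
    coef, *_ = np.linalg.lstsq(with_intercept(X), y, rcond=None)
    return coef


def scores_by_feature_count(X_train, y_train, X_test, y_test):
    """Fit on the first k features for k = 1..n; collect (k, train R2, test R2)."""
    rows = []
    for k in range(1, X_train.shape[1] + 1):
        coef = fit_linear(X_train[:, :k], y_train)
        train_r2 = r2_score(y_train, with_intercept(X_train[:, :k]) @ coef)
        test_r2 = r2_score(y_test, with_intercept(X_test[:, :k]) @ coef)
        rows.append((k, train_r2, test_r2))
    return rows


# Synthetic data: only the first two of 30 features carry signal, the rest is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=60)
X_train, X_test, y_train, y_test = X[:40], X[40:], y[:40], y[40:]

scores = scores_by_feature_count(X_train, y_train, X_test, y_test)
for k, train_r2, test_r2 in scores:
    print(f"{k:2d} features: train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

The train score can never get worse when a feature is added to a least-squares fit, so it is the test score that tells you when to stop.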

What helped to achieve further improvements was removing outliers in some relevant features correlated with the price. The plot below shows the top 20 of these features. They are ranked, and the correlation with the price is shown in the first column on the left. Red means positive correlation (the price tends to increase as the feature value increases); blue means the opposite. Both positively and negatively correlated features are important for the price prediction, though.
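A ranking like this can be produced roughly as follows. Treat it as a sketch under assumptions: the table has a numeric `price` column, features are ranked by the absolute value of their Pearson correlation with the price, and both `other_feature` and the sample values are invented.

```python
import pandas as pd


def top_price_correlations(df: pd.DataFrame, target: str = "price",
                           top_n: int = 20) -> pd.Series:
    """Features most strongly correlated with the target, sign preserved."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    # Rank by strength, so strong negative correlations are kept as well
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(top_n)


# Invented values; `other_feature` is a hypothetical column.
toy = pd.DataFrame({
    "price": [50.0, 100.0, 150.0, 200.0],
    "accommodates": [1, 2, 3, 4],
    "other_feature": [9, 7, 6, 2],
    "property_type": ["a", "b", "a", "b"],  # non-numeric, ignored
})

top = top_price_correlations(toy)
print(top)
```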

After reducing outliers in the number of guests a property accommodates, the number of beds, and the number of bedrooms, the model performance increased by ~10% (the train score was ~0.43 and the test score was ~0.40). These values are pretty good already, but there is definitely room for further improvement.
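If you want to experiment with outlier removal yourself, one common option is the interquartile range (IQR) rule sketched below. This is an assumption on my side for illustration and not necessarily the rule used in the project. The column names `accommodates`, `beds`, and `bedrooms` are assumed, and the sample rows are made up.

```python
import pandas as pd


def drop_iqr_outliers(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # between() is inclusive; rows with missing values are dropped too
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


# Invented sample: the last row is an unusually large property.
toy = pd.DataFrame({
    "accommodates": [2, 2, 3, 4, 16],
    "beds": [1, 1, 2, 2, 12],
    "bedrooms": [1, 1, 1, 2, 8],
    "price": [60.0, 70.0, 90.0, 120.0, 900.0],
})

cleaned = drop_iqr_outliers(toy, ["accommodates", "beds", "bedrooms"])
print(cleaned)
```

A larger `k` keeps more of the unusual properties, so it is worth trying a few values and watching the test score.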

If you are familiar with data science and would like to get some practice, feel free to download the source code and try to improve the model. I’m really curious to see how far you get and which set of features works best for you. You can even re-run the whole data analysis described above for any city of your choice.

Great, you’ve made it! If you liked the post or have suggestions for further accommodation-related questions to answer (or for similar data analyses), please don’t hesitate to let me know. Have a nice day and stay safe!
