Data

The dataset we will work with covers the demand for hotel bookings. It is available on Kaggle as jessemostipak/hotel-booking-demand.

As per the documentation on the website, the data is described as follows.


Content

This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.

All personally identifying information has been removed from the data.

Acknowledgements

The data is originally from the article Hotel Booking Demand Datasets, written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.

The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020.

Metadata

We will work with a reduced version of this dataset that keeps only the ten columns described below.

DataType ColumnName ColumnDescription
string hotel Hotel (H1 = Resort Hotel or H2 = City Hotel)
int is_canceled Value indicating if the booking was canceled (1) or not (0)
int arrival_date_year Year of arrival date
string arrival_date_month Month of arrival date
int arrival_date_day_of_month Day of arrival date
int adults Number of adults
int children Number of children
int babies Number of babies
string country Country of origin. Categories are represented as three-letter ISO 3166 country codes (e.g. PRT, GBR)
decimal adr Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
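
To make the adr definition concrete, here is a minimal sketch of the calculation. The function name `average_daily_rate` and the nightly amounts are hypothetical and chosen only for illustration; they do not come from the dataset.

```python
def average_daily_rate(lodging_transactions, staying_nights):
    """Sum of all lodging transactions divided by the total number of staying nights."""
    if staying_nights <= 0:
        raise ValueError("staying_nights must be positive")
    return sum(lodging_transactions) / staying_nights


# Hypothetical booking: three nights charged at 90, 100 and 110
adr_example = average_daily_rate([90.0, 100.0, 110.0], staying_nights=3)
print(adr_example)  # 100.0
```

A booking with no lodging charges gives an adr of 0.00, which is what the first rows of the imported data show.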

Let's start by importing this data.

Before doing so, we need to interact with the Kaggle API to download the dataset. The steps are taken from this blog and are outlined below.

# import packages
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
from kaggle.api.kaggle_api_extended import KaggleApi

# display multiple outputs in same cell
InteractiveShell.ast_node_interactivity = "all"

# initialise and authenticate
api = KaggleApi()
api.authenticate()
# list dataset files
api.dataset_list_files(dataset = 'jessemostipak/hotel-booking-demand').files
[hotel_bookings.csv]
# download single file
api.dataset_download_file(dataset = 'jessemostipak/hotel-booking-demand', file_name = 'hotel_bookings.csv', path = '../../_data')
True
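
The `api.authenticate()` call above only succeeds if Kaggle credentials are available. To my understanding of the kaggle package, it reads them either from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables or from a `kaggle.json` file, by default in `~/.kaggle`. The sketch below checks for both before authenticating; the helper name `kaggle_credentials_available` is hypothetical and not part of the Kaggle API.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, config_dir=None):
    """Return True if Kaggle credentials appear to be configured.

    Assumption: credentials live either in the KAGGLE_USERNAME / KAGGLE_KEY
    environment variables or in a kaggle.json file under config_dir
    (default: ~/.kaggle).
    """
    env = os.environ if env is None else env
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    config_dir = Path.home() / ".kaggle" if config_dir is None else Path(config_dir)
    return (config_dir / "kaggle.json").is_file()


if not kaggle_credentials_available():
    print("No Kaggle credentials found; api.authenticate() is likely to fail.")
```

Running a check like this first gives a clearer message than the error raised when authentication fails.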

Now that we have downloaded the dataset, we can import it into our session.

# import data
data_hotel = pd.read_csv(filepath_or_buffer = '../../_data/hotel_bookings.csv.zip')
data_hotel = data_hotel.loc[:,['hotel', 'is_canceled', 'arrival_date_year', 'arrival_date_month', 'arrival_date_day_of_month',
                               'adults', 'children', 'babies', 'country', 'adr']]
# store and pass variables between notebooks
%store data_hotel

data_hotel
Stored 'data_hotel' (DataFrame)
hotel is_canceled arrival_date_year arrival_date_month arrival_date_day_of_month adults children babies country adr
0 Resort Hotel 0 2015 July 1 2 0.0 0 PRT 0.00
1 Resort Hotel 0 2015 July 1 2 0.0 0 PRT 0.00
2 Resort Hotel 0 2015 July 1 1 0.0 0 GBR 75.00
3 Resort Hotel 0 2015 July 1 1 0.0 0 GBR 75.00
4 Resort Hotel 0 2015 July 1 2 0.0 0 GBR 98.00
... ... ... ... ... ... ... ... ... ... ...
119385 City Hotel 0 2017 August 30 2 0.0 0 BEL 96.14
119386 City Hotel 0 2017 August 31 3 0.0 0 FRA 225.43
119387 City Hotel 0 2017 August 31 2 0.0 0 DEU 157.71
119388 City Hotel 0 2017 August 31 2 0.0 0 GBR 104.40
119389 City Hotel 0 2017 August 29 2 0.0 0 DEU 151.20

119390 rows × 10 columns
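
The column selection above can also be done while reading the file, using the `usecols` argument of `pd.read_csv`. The sketch below is self-contained: instead of the downloaded file it reads a tiny in-memory CSV built from three of the rows shown above, plus a hypothetical `extra_column` standing in for the columns we drop.

```python
import io

import pandas as pd

COLUMNS = ['hotel', 'is_canceled', 'arrival_date_year', 'arrival_date_month',
           'arrival_date_day_of_month', 'adults', 'children', 'babies',
           'country', 'adr']

# Tiny stand-in for hotel_bookings.csv; extra_column is hypothetical
sample_csv = io.StringIO(
    "hotel,is_canceled,arrival_date_year,arrival_date_month,"
    "arrival_date_day_of_month,adults,children,babies,country,adr,extra_column\n"
    "Resort Hotel,0,2015,July,1,2,0.0,0,PRT,0.00,x\n"
    "Resort Hotel,0,2015,July,1,1,0.0,0,GBR,75.00,y\n"
    "City Hotel,0,2017,August,31,3,0.0,0,FRA,225.43,z\n"
)

# usecols limits what is parsed; indexing with COLUMNS fixes the column order
sample = pd.read_csv(sample_csv, usecols=COLUMNS)[COLUMNS]

# compare the inferred types with the metadata table
print(sample.dtypes)
```

Note that `children` is read as a float here, matching the `0.0` values in the output above, even though the metadata lists it as an int.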