The dataset we will work with covers the demand for hotel bookings. It can be found on Kaggle here.
As per the documentation on the website, the data is described as follows.
Content
This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.
All personally identifying information has been removed from the data.
Acknowledgements
The data is originally from the article Hotel Booking Demand Datasets, written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.
The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020.
Metadata
We will be working with a reduced version of this dataset that keeps only a subset of the columns. These columns are described below.
DataType | ColumnName | ColumnDescription |
---|---|---|
string | hotel | Hotel (H1 = Resort Hotel or H2 = City Hotel) |
int | is_canceled | Value indicating if the booking was canceled (1) or not (0) |
int | arrival_date_year | Year of arrival date |
string | arrival_date_month | Month of arrival date |
int | arrival_date_day_of_month | Day of arrival date |
int | adults | Number of adults |
int | children | Number of children |
int | babies | Number of babies |
string | country | Country of origin. Categories are represented in the ISO 3155–3:2013 format |
decimal | adr | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
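To make the `adr` definition concrete, here is a minimal sketch of the calculation in Python. The function name `average_daily_rate` and the transaction amounts are made up for illustration; they are not taken from the dataset.

```python
def average_daily_rate(lodging_transactions, staying_nights):
    """Divide the sum of all lodging transactions by the number of staying nights."""
    if staying_nights <= 0:
        raise ValueError("staying_nights must be positive")
    return sum(lodging_transactions) / staying_nights


# Hypothetical booking: three nights with three lodging transactions
adr_example = average_daily_rate([120.0, 95.5, 110.0], 3)
print(adr_example)  # prints 108.5
```

The guard against non-positive night counts is a design choice of this sketch, so that a division by zero surfaces as a clear error.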
Let's start by importing this data.
Before doing so, we need to interact with the Kaggle API to download the dataset. The steps are taken from this blog and are outlined below.
# import packages
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
from kaggle.api.kaggle_api_extended import KaggleApi
# display multiple outputs in same cell
InteractiveShell.ast_node_interactivity = "all"
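# initialise and authenticate (credentials are read from ~/.kaggle/kaggle.json
# or from the KAGGLE_USERNAME and KAGGLE_KEY environment variables)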
api = KaggleApi()
api.authenticate()
# list dataset files
api.dataset_list_files(dataset='jessemostipak/hotel-booking-demand').files
# download single file
api.dataset_download_file(dataset='jessemostipak/hotel-booking-demand',
                          file_name='hotel_bookings.csv',
                          path='../../_data')
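Before moving on, it can help to confirm that the download actually landed in the target folder. The sketch below assumes the file may be saved either under its plain name or with a `.zip` suffix (the import step in this post reads `hotel_bookings.csv.zip`). The helper name `find_downloaded_file` is hypothetical and is not part of the Kaggle API.

```python
from pathlib import Path


def find_downloaded_file(directory, file_name):
    """Return the path of the downloaded file, or None if it is missing.

    Checks for the plain file first, then for a zipped copy.
    """
    base = Path(directory)
    for candidate in (base / file_name, base / (file_name + ".zip")):
        if candidate.is_file():
            return candidate
    return None


# Same folder and file name as in the download call above
downloaded = find_downloaded_file("../../_data", "hotel_bookings.csv")
print(downloaded)
```

If this prints `None`, the download step is worth re-running before trying to read the file.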
Now that we have downloaded the dataset, we can import it into our session.
# import data (pandas infers the zip compression from the file extension)
data_hotel = pd.read_csv(filepath_or_buffer='../../_data/hotel_bookings.csv.zip')
# keep only the columns described in the metadata table
data_hotel = data_hotel.loc[:, ['hotel', 'is_canceled', 'arrival_date_year', 'arrival_date_month', 'arrival_date_day_of_month',
                                'adults', 'children', 'babies', 'country', 'adr']]
# store and pass variables between notebooks
%store data_hotel
data_hotel
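As a quick sanity check, the column names of the loaded frame can be compared against the metadata table. The sketch below works on a plain list of names so it runs without the dataset; `EXPECTED_COLUMNS`, `missing_columns`, and the example header are illustrative names and values, not part of the original notebook.

```python
# Column names listed in the metadata table
EXPECTED_COLUMNS = [
    "hotel", "is_canceled", "arrival_date_year", "arrival_date_month",
    "arrival_date_day_of_month", "adults", "children", "babies",
    "country", "adr",
]


def missing_columns(column_names, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from column_names."""
    present = set(column_names)
    return [name for name in expected if name not in present]


# Hypothetical header that misspells "children"
example_header = [
    "hotel", "is_canceled", "arrival_date_year", "arrival_date_month",
    "arrival_date_day_of_month", "adults", "childern", "babies",
    "country", "adr",
]
print(missing_columns(example_header))  # prints ['children']
```

In the notebook, `missing_columns(data_hotel.columns)` should return an empty list once the column selection has run.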