In this section, we will wrangle our dataset so that it is suitable for modelling and analysing as a time-series.
The chief changes we will make are:
- [x] Computing total number of guests,
number_guests
.- This is done by adding the columns
adults
,children
andbabies
.
- This is done by adding the columns
- [x] Convert the month name given to a month number,
arrival_date_month
.- This is so we can create a date column.
- [x] Create a date column,
arrival_date
.- This is done by concatenating the columns
arrival_date_year
,arrival_date_month
andarriva_date_day_of_month
.
- This is done by concatenating the columns
- [x] Filtering out cancelled bookings because these bookings were not fulfilled.
- [x] Narrowing down categories to compute proportions by,
country
->region
. - [x] Transform data to become compositional by computing proportions.
- [x] Transform data so our data is appropriate for modelling.
We will also bring in lookup information country codes to their subregions of the world. This is so we can reduce the number of categories to analyse by.
The data that has the lookup information country codes is on GitHub here.
import pandas as pd
# display multiple outputs in same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# pass in variable from other notebook
%store -r data_hotel
# load in lookup .csv file
data_lookup = pd.read_csv(filepath_or_buffer = '../../_data/data_lookup.csv')
# compute 'number_guests'
data_hotel['number_guests'] = data_hotel['adults'] + data_hotel['children'] + data_hotel['babies']
# convert month name to month number
data_hotel['arrival_date_month'] = pd.to_datetime(data_hotel['arrival_date_month'], format = '%B').dt.month
# create date column
data_hotel['arrival_date'] = data_hotel['arrival_date_year'].astype(str) + '-' + data_hotel['arrival_date_month'].astype(str) + '-' + data_hotel['arrival_date_day_of_month'].astype(str)
data_hotel['arrival_date'] = pd.to_datetime(data_hotel['arrival_date'])
# filter out cancelled bookings
data_hotel = data_hotel.query('is_canceled == 0')
# select only columns necessary
data_hotel = data_hotel.loc[:,['hotel', 'arrival_date',
'number_guests', 'country', 'adr']]
# look at two datasets
data_hotel
data_lookup
Break-points: Searching for possibles one online
Now, let's check the dates to see if there are any candidate dates which we can pose as our break-point before analysing for a structural break. In particular, we want to see the date range so we can search the news for global shocks that would have impacted the hotel bookings industry during this period.
From a cursory search on Google, nothing arose that really stood out, thus we will look within our data to see if there are any candidate break-points.
data_hotel['arrival_date'].min(), data_hotel['arrival_date'].max()
# join data to get region
data_join = pd.merge(left = data_hotel, right = data_lookup,
how = 'left', left_on = 'country', right_on = 'alpha-3',
validate = 'many_to_one')
# select only useful columns
data_join = data_join.loc[:,['hotel', 'arrival_date', 'number_guests', 'country', 'adr',
'name', 'region', 'sub-region']]
# rename columns
data_join = data_join.rename(columns = {'name':'country_name', 'country':'country_code', 'sub-region':'sub_region'})
data_join
Computing Variables: Proportions
As we are seeking to analyse the proportion of total guests across our data, then we need to compute the proportions. This will take the following steps:
- Narrow down
country
into smaller categories to get meaningful proportions.- This will be through mapping it against
region
from a separate dataset.
- This will be through mapping it against
- Calculate
total_guests
byarrival_date
andregion
. - Use the
total_guests
field as our denominator to computeproportion_guests
.
Let's check how many subcategories we have now, so we know whether we need to further narrow this down.
data_join['region'].unique()
The number of categories looks fine, so we can proceed with grouping by these and summarising the number_guests
column. Then we will compute the proportion of guests per region to start getting a dataset that is suitable for compositional time-series modelling.
Appreciate that computing the proportions of th enumber of guests in each subregion might hide specific region-based factors that influence their ability to travel, but as this analysis is for demonstrating compositional time-series modelling, we abstract away from these concerns.
# group by 'arrival_date' and 'region' before summing 'number_guests'
data_total = data_join.groupby(['arrival_date', 'region'])['number_guests'].agg(func = ['sum'])
# rename 'sum' column to 'total_guests'
data_total = data_total.rename(columns = {'sum': 'total_guests'})
data_total
Now that we have the total number of guests, as grouped by the arrival_date
and region
, we need the proportions because we want to test our compositional time-series analysis.
# compute proportions
## where `level = 0` means grouping by first level of index, rather than by one of th columns,
## so are grouping by `arrival_date` and `region`
data_proportion = data_total.groupby(level = 0).apply(lambda x:
x / float(x.sum()))
data_proportion = data_proportion.rename(columns = {'total_guests': 'proportion_guests'})
data_proportion
# join totals with proportions
## un-groupby so we get previous grouped index as columns
data_total = data_total.reset_index()
data_proportion = data_proportion.reset_index()
## merge the two together
data_join = pd.merge(left = data_total, right = data_proportion,
how = 'inner', left_on = ('arrival_date', 'region'), right_on = ('arrival_date', 'region'),
validate = 'one_to_one')
# store and pass variables between notebooks
%store data_join
data_join