Data Wrangling

In this section, we will wrangle our dataset so that it is suitable for modelling and analysing as a time-series.

The chief changes we will make are:

[x] Computing total number of guests, number_guests.
- This is done by adding the columns adults, children and babies.
[x] Convert the month name given to a month number, arrival_date_month.
- This is so we can create a date column.
[x] Create a date column, arrival_date.
- This is done by concatenating the columns arrival_date_year, arrival_date_month and arriva_date_day_of_month.
[x] Filtering out cancelled bookings because these bookings were not fulfilled.
[x] Narrowing down categories to compute proportions by, country -> region.
[x] Transform data to become compositional by computing proportions.
[x] Transform data so our data is appropriate for modelling.

We will also bring in lookup information country codes to their subregions of the world. This is so we can reduce the number of categories to analyse by.

The data that has the lookup information country codes is on GitHub here.

import pandas as pd

# display multiple outputs in same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# pass in variable from other notebook
%store -r data_hotel

# load in lookup .csv file
data_lookup = pd.read_csv(filepath_or_buffer = '../../_data/data_lookup.csv')

# compute 'number_guests'
data_hotel['number_guests'] = data_hotel['adults'] + data_hotel['children'] + data_hotel['babies']
# convert month name to month number
data_hotel['arrival_date_month'] = pd.to_datetime(data_hotel['arrival_date_month'], format = '%B').dt.month
# create date column
data_hotel['arrival_date'] = data_hotel['arrival_date_year'].astype(str) + '-' + data_hotel['arrival_date_month'].astype(str) + '-' + data_hotel['arrival_date_day_of_month'].astype(str)
data_hotel['arrival_date'] = pd.to_datetime(data_hotel['arrival_date'])
# filter out cancelled bookings
data_hotel = data_hotel.query('is_canceled == 0')
# select only columns necessary
data_hotel = data_hotel.loc[:,['hotel', 'arrival_date', 
                               'number_guests', 'country', 'adr']]

# look at two datasets
data_hotel
data_lookup

	hotel	arrival_date	number_guests	country	adr
0	Resort Hotel	2015-07-01	2.0	PRT	0.00
1	Resort Hotel	2015-07-01	2.0	PRT	0.00
2	Resort Hotel	2015-07-01	1.0	GBR	75.00
3	Resort Hotel	2015-07-01	1.0	GBR	75.00
4	Resort Hotel	2015-07-01	2.0	GBR	98.00
...	...	...	...	...	...
119385	City Hotel	2017-08-30	2.0	BEL	96.14
119386	City Hotel	2017-08-31	3.0	FRA	225.43
119387	City Hotel	2017-08-31	2.0	DEU	157.71
119388	City Hotel	2017-08-31	2.0	GBR	104.40
119389	City Hotel	2017-08-29	2.0	DEU	151.20

75166 rows × 5 columns

	name	alpha-2	alpha-3	country-code	iso_3166-2	region	sub-region	intermediate-region	region-code	sub-region-code	intermediate-region-code
0	Afghanistan	AF	AFG	4	ISO 3166-2:AF	Asia	Southern Asia	NaN	142.0	34.0	NaN
1	Åland Islands	AX	ALA	248	ISO 3166-2:AX	Europe	Northern Europe	NaN	150.0	154.0	NaN
2	Albania	AL	ALB	8	ISO 3166-2:AL	Europe	Southern Europe	NaN	150.0	39.0	NaN
3	Algeria	DZ	DZA	12	ISO 3166-2:DZ	Africa	Northern Africa	NaN	2.0	15.0	NaN
4	American Samoa	AS	ASM	16	ISO 3166-2:AS	Oceania	Polynesia	NaN	9.0	61.0	NaN
...	...	...	...	...	...	...	...	...	...	...	...
244	Wallis and Futuna	WF	WLF	876	ISO 3166-2:WF	Oceania	Polynesia	NaN	9.0	61.0	NaN
245	Western Sahara	EH	ESH	732	ISO 3166-2:EH	Africa	Northern Africa	NaN	2.0	15.0	NaN
246	Yemen	YE	YEM	887	ISO 3166-2:YE	Asia	Western Asia	NaN	142.0	145.0	NaN
247	Zambia	ZM	ZMB	894	ISO 3166-2:ZM	Africa	Sub-Saharan Africa	Eastern Africa	2.0	202.0	14.0
248	Zimbabwe	ZW	ZWE	716	ISO 3166-2:ZW	Africa	Sub-Saharan Africa	Eastern Africa	2.0	202.0	14.0

249 rows × 11 columns

Break-points: Searching for possibles one online

Now, let's check the dates to see if there are any candidate dates which we can pose as our break-point before analysing for a structural break. In particular, we want to see the date range so we can search the news for global shocks that would have impacted the hotel bookings industry during this period.

From a cursory search on Google, nothing arose that really stood out, thus we will look within our data to see if there are any candidate break-points.

data_hotel['arrival_date'].min(), data_hotel['arrival_date'].max()

(Timestamp('2015-07-01 00:00:00'), Timestamp('2017-08-31 00:00:00'))

Joining: Country mapping to region

Bring together our hotels data with the lookup data so we can bring in region information that will be able to narrow our long list of country into a smaller list of categories, region.

# join data to get region
data_join = pd.merge(left = data_hotel, right = data_lookup,
                    how = 'left', left_on = 'country', right_on = 'alpha-3',
                    validate = 'many_to_one')
# select only useful columns
data_join = data_join.loc[:,['hotel', 'arrival_date', 'number_guests', 'country', 'adr', 
                             'name', 'region', 'sub-region']]
# rename columns
data_join = data_join.rename(columns = {'name':'country_name', 'country':'country_code', 'sub-region':'sub_region'})
data_join

	hotel	arrival_date	number_guests	country_code	adr	country_name	region	sub_region
0	Resort Hotel	2015-07-01	2.0	PRT	0.00	Portugal	Europe	Southern Europe
1	Resort Hotel	2015-07-01	2.0	PRT	0.00	Portugal	Europe	Southern Europe
2	Resort Hotel	2015-07-01	1.0	GBR	75.00	United Kingdom of Great Britain and Northern I...	Europe	Northern Europe
3	Resort Hotel	2015-07-01	1.0	GBR	75.00	United Kingdom of Great Britain and Northern I...	Europe	Northern Europe
4	Resort Hotel	2015-07-01	2.0	GBR	98.00	United Kingdom of Great Britain and Northern I...	Europe	Northern Europe
...	...	...	...	...	...	...	...	...
75161	City Hotel	2017-08-30	2.0	BEL	96.14	Belgium	Europe	Western Europe
75162	City Hotel	2017-08-31	3.0	FRA	225.43	France	Europe	Western Europe
75163	City Hotel	2017-08-31	2.0	DEU	157.71	Germany	Europe	Western Europe
75164	City Hotel	2017-08-31	2.0	GBR	104.40	United Kingdom of Great Britain and Northern I...	Europe	Northern Europe
75165	City Hotel	2017-08-29	2.0	DEU	151.20	Germany	Europe	Western Europe

75166 rows × 8 columns

Computing Variables: Proportions

As we are seeking to analyse the proportion of total guests across our data, then we need to compute the proportions. This will take the following steps:

Narrow down country into smaller categories to get meaningful proportions.
- This will be through mapping it against region from a separate dataset.
Calculate total_guests by arrival_date and region.
Use the total_guests field as our denominator to compute proportion_guests.

Let's check how many subcategories we have now, so we know whether we need to further narrow this down.

data_join['region'].unique()

array(['Europe', 'Americas', nan, 'Asia', 'Oceania', 'Africa'],
      dtype=object)

The number of categories looks fine, so we can proceed with grouping by these and summarising the number_guests column. Then we will compute the proportion of guests per region to start getting a dataset that is suitable for compositional time-series modelling.

Appreciate that computing the proportions of th enumber of guests in each subregion might hide specific region-based factors that influence their ability to travel, but as this analysis is for demonstrating compositional time-series modelling, we abstract away from these concerns.

# group by 'arrival_date' and 'region' before summing 'number_guests'
data_total = data_join.groupby(['arrival_date', 'region'])['number_guests'].agg(func = ['sum'])
# rename 'sum' column to 'total_guests'
data_total = data_total.rename(columns = {'sum': 'total_guests'})

data_total

		total_guests
arrival_date	region
2015-07-01	Americas	2.0
2015-07-01	Europe	181.0
2015-07-02	Americas	4.0
	Asia	2.0
	Europe	67.0
...	...	...
2017-08-30	Europe	100.0
2017-08-31	Africa	2.0
	Americas	4.0
	Asia	6.0
	Europe	169.0

2715 rows × 1 columns

Now that we have the total number of guests, as grouped by the arrival_date and region, we need the proportions because we want to test our compositional time-series analysis.

# compute proportions
## where `level = 0` means grouping by first level of index, rather than by one of th columns,
## so are grouping by `arrival_date` and `region`
data_proportion = data_total.groupby(level = 0).apply(lambda x: 
                                                           x / float(x.sum()))
data_proportion = data_proportion.rename(columns = {'total_guests': 'proportion_guests'})

data_proportion

		proportion_guests
arrival_date	region
2015-07-01	Americas	0.010929
2015-07-01	Europe	0.989071
2015-07-02	Americas	0.054795
	Asia	0.027397
	Europe	0.917808
...	...	...
2017-08-30	Europe	0.833333
2017-08-31	Africa	0.011050
	Americas	0.022099
	Asia	0.033149
	Europe	0.933702

2715 rows × 1 columns

# join totals with proportions
## un-groupby so we get previous grouped index as columns
data_total = data_total.reset_index()
data_proportion = data_proportion.reset_index()
## merge the two together
data_join = pd.merge(left = data_total, right = data_proportion,
                    how = 'inner', left_on = ('arrival_date', 'region'), right_on = ('arrival_date', 'region'),
                    validate = 'one_to_one')

# store and pass variables between notebooks
%store data_join

data_join

Stored 'data_join' (DataFrame)

	arrival_date	region	total_guests	proportion_guests
0	2015-07-01	Americas	2.0	0.010929
1	2015-07-01	Europe	181.0	0.989071
2	2015-07-02	Americas	4.0	0.054795
3	2015-07-02	Asia	2.0	0.027397
4	2015-07-02	Europe	67.0	0.917808
...	...	...	...	...
2710	2017-08-30	Europe	100.0	0.833333
2711	2017-08-31	Africa	2.0	0.011050
2712	2017-08-31	Americas	4.0	0.022099
2713	2017-08-31	Asia	6.0	0.033149
2714	2017-08-31	Europe	169.0	0.933702

2715 rows × 4 columns