Search
Data Wrangling

In this section, we will wrangle our dataset so that it is suitable for modelling and analysing as a time-series.

The chief changes we will make are:

  • [x] Computing total number of guests, number_guests.
    • This is done by adding the columns adults, children and babies.
  • [x] Convert the month name given to a month number, arrival_date_month.
    • This is so we can create a date column.
  • [x] Create a date column, arrival_date.
    • This is done by concatenating the columns arrival_date_year, arrival_date_month and arriva_date_day_of_month.
  • [x] Filtering out cancelled bookings because these bookings were not fulfilled.
  • [x] Narrowing down categories to compute proportions by, country -> region.
  • [x] Transform data to become compositional by computing proportions.
  • [x] Transform data so our data is appropriate for modelling.

We will also bring in lookup information country codes to their subregions of the world. This is so we can reduce the number of categories to analyse by.

The data that has the lookup information country codes is on GitHub here.

import pandas as pd

# display multiple outputs in same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# pass in variable from other notebook
%store -r data_hotel

# load in lookup .csv file
data_lookup = pd.read_csv(filepath_or_buffer = '../../_data/data_lookup.csv')

# compute 'number_guests'
data_hotel['number_guests'] = data_hotel['adults'] + data_hotel['children'] + data_hotel['babies']
# convert month name to month number
data_hotel['arrival_date_month'] = pd.to_datetime(data_hotel['arrival_date_month'], format = '%B').dt.month
# create date column
data_hotel['arrival_date'] = data_hotel['arrival_date_year'].astype(str) + '-' + data_hotel['arrival_date_month'].astype(str) + '-' + data_hotel['arrival_date_day_of_month'].astype(str)
data_hotel['arrival_date'] = pd.to_datetime(data_hotel['arrival_date'])
# filter out cancelled bookings
data_hotel = data_hotel.query('is_canceled == 0')
# select only columns necessary
data_hotel = data_hotel.loc[:,['hotel', 'arrival_date', 
                               'number_guests', 'country', 'adr']]

# look at two datasets
data_hotel
data_lookup
hotel arrival_date number_guests country adr
0 Resort Hotel 2015-07-01 2.0 PRT 0.00
1 Resort Hotel 2015-07-01 2.0 PRT 0.00
2 Resort Hotel 2015-07-01 1.0 GBR 75.00
3 Resort Hotel 2015-07-01 1.0 GBR 75.00
4 Resort Hotel 2015-07-01 2.0 GBR 98.00
... ... ... ... ... ...
119385 City Hotel 2017-08-30 2.0 BEL 96.14
119386 City Hotel 2017-08-31 3.0 FRA 225.43
119387 City Hotel 2017-08-31 2.0 DEU 157.71
119388 City Hotel 2017-08-31 2.0 GBR 104.40
119389 City Hotel 2017-08-29 2.0 DEU 151.20

75166 rows × 5 columns

name alpha-2 alpha-3 country-code iso_3166-2 region sub-region intermediate-region region-code sub-region-code intermediate-region-code
0 Afghanistan AF AFG 4 ISO 3166-2:AF Asia Southern Asia NaN 142.0 34.0 NaN
1 Åland Islands AX ALA 248 ISO 3166-2:AX Europe Northern Europe NaN 150.0 154.0 NaN
2 Albania AL ALB 8 ISO 3166-2:AL Europe Southern Europe NaN 150.0 39.0 NaN
3 Algeria DZ DZA 12 ISO 3166-2:DZ Africa Northern Africa NaN 2.0 15.0 NaN
4 American Samoa AS ASM 16 ISO 3166-2:AS Oceania Polynesia NaN 9.0 61.0 NaN
... ... ... ... ... ... ... ... ... ... ... ...
244 Wallis and Futuna WF WLF 876 ISO 3166-2:WF Oceania Polynesia NaN 9.0 61.0 NaN
245 Western Sahara EH ESH 732 ISO 3166-2:EH Africa Northern Africa NaN 2.0 15.0 NaN
246 Yemen YE YEM 887 ISO 3166-2:YE Asia Western Asia NaN 142.0 145.0 NaN
247 Zambia ZM ZMB 894 ISO 3166-2:ZM Africa Sub-Saharan Africa Eastern Africa 2.0 202.0 14.0
248 Zimbabwe ZW ZWE 716 ISO 3166-2:ZW Africa Sub-Saharan Africa Eastern Africa 2.0 202.0 14.0

249 rows × 11 columns


Break-points: Searching for possibles one online

Now, let's check the dates to see if there are any candidate dates which we can pose as our break-point before analysing for a structural break. In particular, we want to see the date range so we can search the news for global shocks that would have impacted the hotel bookings industry during this period.

From a cursory search on Google, nothing arose that really stood out, thus we will look within our data to see if there are any candidate break-points.

data_hotel['arrival_date'].min(), data_hotel['arrival_date'].max()
(Timestamp('2015-07-01 00:00:00'), Timestamp('2017-08-31 00:00:00'))

Joining: Country mapping to region

Bring together our hotels data with the lookup data so we can bring in region information that will be able to narrow our long list of country into a smaller list of categories, region.

# join data to get region
data_join = pd.merge(left = data_hotel, right = data_lookup,
                    how = 'left', left_on = 'country', right_on = 'alpha-3',
                    validate = 'many_to_one')
# select only useful columns
data_join = data_join.loc[:,['hotel', 'arrival_date', 'number_guests', 'country', 'adr', 
                             'name', 'region', 'sub-region']]
# rename columns
data_join = data_join.rename(columns = {'name':'country_name', 'country':'country_code', 'sub-region':'sub_region'})
data_join
hotel arrival_date number_guests country_code adr country_name region sub_region
0 Resort Hotel 2015-07-01 2.0 PRT 0.00 Portugal Europe Southern Europe
1 Resort Hotel 2015-07-01 2.0 PRT 0.00 Portugal Europe Southern Europe
2 Resort Hotel 2015-07-01 1.0 GBR 75.00 United Kingdom of Great Britain and Northern I... Europe Northern Europe
3 Resort Hotel 2015-07-01 1.0 GBR 75.00 United Kingdom of Great Britain and Northern I... Europe Northern Europe
4 Resort Hotel 2015-07-01 2.0 GBR 98.00 United Kingdom of Great Britain and Northern I... Europe Northern Europe
... ... ... ... ... ... ... ... ...
75161 City Hotel 2017-08-30 2.0 BEL 96.14 Belgium Europe Western Europe
75162 City Hotel 2017-08-31 3.0 FRA 225.43 France Europe Western Europe
75163 City Hotel 2017-08-31 2.0 DEU 157.71 Germany Europe Western Europe
75164 City Hotel 2017-08-31 2.0 GBR 104.40 United Kingdom of Great Britain and Northern I... Europe Northern Europe
75165 City Hotel 2017-08-29 2.0 DEU 151.20 Germany Europe Western Europe

75166 rows × 8 columns


Computing Variables: Proportions

As we are seeking to analyse the proportion of total guests across our data, then we need to compute the proportions. This will take the following steps:

  1. Narrow down country into smaller categories to get meaningful proportions.
    • This will be through mapping it against region from a separate dataset.
  2. Calculate total_guests by arrival_date and region.
  3. Use the total_guests field as our denominator to compute proportion_guests.

Let's check how many subcategories we have now, so we know whether we need to further narrow this down.

data_join['region'].unique()
array(['Europe', 'Americas', nan, 'Asia', 'Oceania', 'Africa'],
      dtype=object)

The number of categories looks fine, so we can proceed with grouping by these and summarising the number_guests column. Then we will compute the proportion of guests per region to start getting a dataset that is suitable for compositional time-series modelling.

Appreciate that computing the proportions of th enumber of guests in each subregion might hide specific region-based factors that influence their ability to travel, but as this analysis is for demonstrating compositional time-series modelling, we abstract away from these concerns.

# group by 'arrival_date' and 'region' before summing 'number_guests'
data_total = data_join.groupby(['arrival_date', 'region'])['number_guests'].agg(func = ['sum'])
# rename 'sum' column to 'total_guests'
data_total = data_total.rename(columns = {'sum': 'total_guests'})

data_total
total_guests
arrival_date region
2015-07-01 Americas 2.0
Europe 181.0
2015-07-02 Americas 4.0
Asia 2.0
Europe 67.0
... ... ...
2017-08-30 Europe 100.0
2017-08-31 Africa 2.0
Americas 4.0
Asia 6.0
Europe 169.0

2715 rows × 1 columns

Now that we have the total number of guests, as grouped by the arrival_date and region, we need the proportions because we want to test our compositional time-series analysis.

# compute proportions
## where `level = 0` means grouping by first level of index, rather than by one of th columns,
## so are grouping by `arrival_date` and `region`
data_proportion = data_total.groupby(level = 0).apply(lambda x: 
                                                           x / float(x.sum()))
data_proportion = data_proportion.rename(columns = {'total_guests': 'proportion_guests'})

data_proportion
proportion_guests
arrival_date region
2015-07-01 Americas 0.010929
Europe 0.989071
2015-07-02 Americas 0.054795
Asia 0.027397
Europe 0.917808
... ... ...
2017-08-30 Europe 0.833333
2017-08-31 Africa 0.011050
Americas 0.022099
Asia 0.033149
Europe 0.933702

2715 rows × 1 columns

# join totals with proportions
## un-groupby so we get previous grouped index as columns
data_total = data_total.reset_index()
data_proportion = data_proportion.reset_index()
## merge the two together
data_join = pd.merge(left = data_total, right = data_proportion,
                    how = 'inner', left_on = ('arrival_date', 'region'), right_on = ('arrival_date', 'region'),
                    validate = 'one_to_one')

# store and pass variables between notebooks
%store data_join

data_join
Stored 'data_join' (DataFrame)
arrival_date region total_guests proportion_guests
0 2015-07-01 Americas 2.0 0.010929
1 2015-07-01 Europe 181.0 0.989071
2 2015-07-02 Americas 4.0 0.054795
3 2015-07-02 Asia 2.0 0.027397
4 2015-07-02 Europe 67.0 0.917808
... ... ... ... ...
2710 2017-08-30 Europe 100.0 0.833333
2711 2017-08-31 Africa 2.0 0.011050
2712 2017-08-31 Americas 4.0 0.022099
2713 2017-08-31 Asia 6.0 0.033149
2714 2017-08-31 Europe 169.0 0.933702

2715 rows × 4 columns