Data Computation

Transformation: From $[0,1]$ space to $\mathbb{R}$ space

We want to assess whether the guests arriving after the break-point are representative of those arriving before it, and thereby whether there has been a structural change in our hotel bookings system. To do this we model proportions, proportion_guests, rather than counts. We expect the absolute number of guests to change after a potential structural change, but if the proportions are the same, or at least similar, then we can still make inferences from our data before and after the change.

Similar proportions suggest that, whilst the absolute number of guests making hotel bookings has changed, the share of guests from each region has not. Our data after the break-point is then still representative of the data before the break-point, and any inferences still hold for the same population.

You can think of it this way: before the break-point, our data captured our target population. If the proportions/compositions of guests from each region are similar after the break-point, then we still have a representative sample of that target population.

However, we cannot model proportion/compositional data directly, because compositional data is bounded to the interval $[0,1]$. A model fitted on this scale can return values outside the interval, and such values are meaningless because they cannot be interpreted as proportions.
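To make the risk concrete, here is a minimal sketch on made-up numbers (not the hotel data). A straight line is fitted to a declining proportion and extrapolated a few periods ahead. The series `p` and the use of `np.polyfit` are illustrative assumptions only.

```python
import numpy as np

# hypothetical proportions for one region over six periods (illustrative only)
t = np.arange(6)
p = np.array([0.30, 0.24, 0.19, 0.12, 0.07, 0.02])

# ordinary least-squares line fitted directly on the bounded [0, 1] scale
slope, intercept = np.polyfit(t, p, deg=1)

# extrapolate three periods ahead
t_future = np.arange(6, 9)
p_future = slope * t_future + intercept

# every extrapolated "proportion" is negative, so none can be interpreted
print(p_future)
print((p_future < 0).all())
```

Nothing in the linear model knows about the bounds, which is why the predictions leave $[0,1]$ as soon as the trend is extended.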

Instead, we can transform our compositional data by mapping it into the real number space, $\mathbb{R}$. There are three well-characterised isomorphisms that do this:

Additive logratio (alr)
Centre logratio (clr)
Isometric logratio (ilr)

Source: Wikipedia
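For reference, the usual definitions of the three transformations are sketched below for a composition $x = (x_1, \dots, x_D)$ with strictly positive parts that sum to $1$. Taking the last part $x_D$ as the alr denominator and the particular basis matrix $V$ are conventions assumed here, not something fixed by the method.

$$
\begin{aligned}
\operatorname{alr}(x) &= \left( \ln\frac{x_1}{x_D}, \dots, \ln\frac{x_{D-1}}{x_D} \right) \in \mathbb{R}^{D-1}, \\
\operatorname{clr}(x) &= \left( \ln\frac{x_1}{g(x)}, \dots, \ln\frac{x_D}{g(x)} \right), \qquad g(x) = \Big( \prod_{i=1}^{D} x_i \Big)^{1/D}, \\
\operatorname{ilr}(x) &= \operatorname{clr}(x)\, V \in \mathbb{R}^{D-1},
\end{aligned}
$$

where $V$ is a $D \times (D-1)$ matrix whose columns are orthonormal and each sum to zero. The clr coordinates always sum to zero, so clr maps onto a $(D-1)$-dimensional hyperplane of $\mathbb{R}^D$, whereas alr and ilr each return $D-1$ coordinates.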

Alternatively, we can apply the following transformation after adding a very small value to any $0$ values in our proportions. It is also commonly used to bring data closer to a normal distribution.

Source: Feng et al., "Log-transformation and its implications for data analysis"

Log transform

All of these transformations require that our data contains no zeroes, because the logarithm of $0$ is undefined. We will therefore use a multiplicative replacement strategy to replace zeroes with a small, positive $\delta$, in a way that ensures each composition still adds up to $1$.

Source: J.A. Martin Fernandez, "Dealing with Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation"
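As a rough illustration of the idea, the sketch below applies multiplicative replacement to a single composition with plain NumPy. The helper name `replace_zeros` is hypothetical and is not part of any library; the row is the first row of the pivoted data shown further down, and `delta = 0.04` is chosen to match the replaced values in the output shown further down.

```python
import numpy as np

def replace_zeros(row, delta):
    """Multiplicative zero replacement for one composition that sums to 1."""
    row = np.asarray(row, dtype=float)
    is_zero = row == 0
    # zeros become delta; non-zero parts are shrunk so the total stays at 1
    shrink = 1 - delta * is_zero.sum()
    return np.where(is_zero, delta, row * shrink)

# Africa, Americas, Asia, Europe, Oceania on 2015-07-01
row = [0.0, 0.010929, 0.0, 0.989071, 0.0]
replaced = replace_zeros(row, delta=0.04)

print(replaced)
print(replaced.sum())
```

Because the non-zero parts are all multiplied by the same factor, the ratios between them are preserved, which is the property the log-ratio transformations rely on.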

import pandas as pd
import numpy as np

from skbio.stats.composition import multiplicative_replacement
from skbio.stats.composition import clr
from skbio.stats.composition import ilr

from functools import reduce

# display multiple outputs in same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# put in a python script
# create custom additive log-ratio function
def func_alr(mat, div):
    """Additive log-ratio: log(x_i / x_div) for every column i except `div`."""
    mat = np.asarray(mat, dtype = float)

    # check that logs can be taken: every entry must be strictly positive
    if np.any(mat <= 0):
        raise ValueError('`mat` must be strictly positive; replace zeroes first')

    # take vectors from input array `mat`, excluding column index, `div`
    numerator = np.delete(arr = mat, obj = div, axis = 1)
    # take vector for `div`
    denominator = mat[:, div]

    # take logs (np is available here through the module-level import)
    lnum = np.log(numerator)
    lden = np.log(denominator)

    # subtract log of `div` vector from every 'column' in the logged matrix
    # https://stackoverflow.com/questions/26333005/numpy-subtract-every-row-of-matrix-by-vector
    output = (lnum.transpose() - lden).transpose()

    return output
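As a quick sanity check on what the additive log-ratio computes, the sketch below runs a compact, hypothetical stand-in (`alr_sketch`, which uses broadcasting instead of the double transpose) on a made-up matrix and compares it with the definition $\ln(x_i / x_D)$. The toy values are assumptions for illustration.

```python
import numpy as np

def alr_sketch(mat, div):
    """Hypothetical stand-in for the ALR function: log(x_i / x_div) per row."""
    logs = np.log(mat)
    # drop the denominator column, then subtract its log from every other column
    return np.delete(logs, div, axis=1) - logs[:, [div]]

# made-up compositions: each row sums to 1 and contains no zeros
toy = np.array([[0.2, 0.3, 0.5],
                [0.1, 0.6, 0.3]])

out = alr_sketch(toy, div=2)
print(out.shape)  # (2, 2): one column fewer than the input

# direct check against the definition log(x_i / x_D)
expected = np.log(toy[:, :2] / toy[:, [2]])
print(np.allclose(out, expected))  # True
```

The output has one column fewer than the input because the denominator column is used up as the reference.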

# pass in variable from other notebook
%store -r data_join

# pivot to wide format (one column per region) so we can apply transformations
data_pivot = data_join.pivot(index = 'arrival_date', columns = 'region', values = 'proportion_guests')

# replace NaNs with 0s so can transform
data_pivot = data_pivot.loc[:, 'Africa':'Oceania'].fillna(value = 0, axis = 1)

## un-groupby so we get previous grouped index as columns
data_pivot = data_pivot.reset_index()
data_pivot

region	arrival_date	Africa	Americas	Asia	Europe	Oceania
0	2015-07-01	0.000000	0.010929	0.000000	0.989071	0.0
1	2015-07-02	0.000000	0.054795	0.027397	0.917808	0.0
2	2015-07-03	0.000000	0.103896	0.000000	0.896104	0.0
3	2015-07-04	0.000000	0.000000	0.000000	1.000000	0.0
4	2015-07-05	0.000000	0.077922	0.000000	0.922078	0.0
...	...	...	...	...	...	...
788	2017-08-27	0.000000	0.094862	0.023715	0.881423	0.0
789	2017-08-28	0.000000	0.049180	0.012295	0.938525	0.0
790	2017-08-29	0.000000	0.027586	0.096552	0.875862	0.0
791	2017-08-30	0.016667	0.100000	0.050000	0.833333	0.0
792	2017-08-31	0.011050	0.022099	0.033149	0.933702	0.0

793 rows × 6 columns

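Let's start by dealing with the $0$s in our data. The multiplicative replacement strategy replaces each $0$ with a small value $\delta$ and rescales the non-zero parts so that our proportions/compositions still sum to $1$. Since no delta is passed below, skbio falls back to its default of $\delta = (1/D)^2$ for $D$ components, which is $0.04$ for our five regions. This is larger than some of the observed non-zero proportions, so a smaller delta may be worth considering.

We first extract the region columns of the dataframe, which are the part we will apply our transformation on.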

# 1. extract only part we want to apply transformation on
x = data_pivot.loc[:, 'Africa':'Oceania']

# 2. store column names for later when re-creating dataframe
col_names = list(x)
col_names_alr = list(x)

# 3. apply multiplicative replacement strategy to replace 0s
x = multiplicative_replacement(x)

# 3.1 note, returns an array
x

array([[0.04      , 0.00961749, 0.04      , 0.87038251, 0.04      ],
       [0.04      , 0.05041096, 0.02520548, 0.84438356, 0.04      ],
       [0.04      , 0.09142857, 0.04      , 0.78857143, 0.04      ],
       ...,
       [0.04      , 0.02537931, 0.08882759, 0.8057931 , 0.04      ],
       [0.016     , 0.096     , 0.048     , 0.8       , 0.04      ],
       [0.01060773, 0.02121547, 0.0318232 , 0.89635359, 0.04      ]])

# get index of 'Oceania' column, which is the ALR denominator
index_denominator = col_names_alr.index('Oceania')
# remove this index from list of column names
del col_names_alr[index_denominator]

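Now, let's perform the ALR, CLR and ILR transformations.

Note that the column names for ALR and ILR are misleading. Each ALR column is a ratio against the denominator region and each ILR column is a balance between regions, so they should really be: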

ALR: Africa/Oceania, Americas/Oceania, ...
ILR: Africa~Americas, Americas~Asia, ...

Source: StackExchange answer by marc1s
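To see why an ILR coordinate does not belong to a single region, the sketch below builds the coordinates by hand with NumPy. The Helmert-type basis in `helmert_basis` is one possible orthonormal basis and is an assumption here; skbio's default basis may differ, so the numbers need not match the `ilr` output in this notebook. The helper names and the label format are hypothetical.

```python
import numpy as np

def clr_np(x):
    """Centred log-ratio: log of each part minus the row mean of the logs."""
    logs = np.log(x)
    return logs - logs.mean(axis=1, keepdims=True)

def helmert_basis(d):
    """Assumed orthonormal basis, D x (D-1); every column sums to zero."""
    v = np.zeros((d, d - 1))
    for j in range(1, d):
        # column j-1 contrasts the first j parts against part j+1
        v[:j, j - 1] = 1.0 / j
        v[j, j - 1] = -1.0
        v[:, j - 1] *= np.sqrt(j / (j + 1.0))
    return v

regions = ['Africa', 'Americas', 'Asia', 'Europe', 'Oceania']

# last row of the zero-replaced array shown above
x = np.array([[0.01060773, 0.02121547, 0.0318232, 0.89635359, 0.04]])

V = helmert_basis(len(regions))
ilr_coords = clr_np(x) @ V

# more honest labels: ratios for ALR, group-versus-part balances for this ILR basis
alr_labels = [f'{r}/Oceania' for r in regions[:-1]]
ilr_labels = [f"{'+'.join(regions[:j])} vs {regions[j]}" for j in range(1, len(regions))]

print(alr_labels)
print(dict(zip(ilr_labels, ilr_coords[0].round(4))))
```

With this basis the first coordinate compares Africa with Americas, the second compares Africa and Americas together against Asia, and so on, which is why a label such as a single region name with an ilr suffix says little about what the column measures.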

# CAN DO ALL THIS IN A FUNCTION AND THEN LOOP ON FUNCTION

# 1. apply transformations
data_alr = func_alr(mat = x, div = index_denominator)
data_clr = clr(mat = x)
data_ilr = ilr(mat = x)
data_log = np.log(x)

# store in list so the same steps can be applied to each transform
data_frames = [data_alr, data_clr, data_ilr, data_log]

# 2. convert to dataframe
data_frames[0] = pd.DataFrame(data = data_alr, columns = col_names_alr)
data_frames[1] = pd.DataFrame(data = data_clr, columns = col_names)
data_frames[2] = pd.DataFrame(data = data_ilr, columns = col_names_alr)
data_frames[3] = pd.DataFrame(data = data_log, columns = col_names)

# 3. rename columns for creating dataframe
data_frames[0].columns += '_alr'
data_frames[1].columns += '_clr'
data_frames[2].columns += '_ilr'
data_frames[3].columns += '_log'

# 4. add `arrival_date` back in (ASSUMES ROW ORDERING IS PRESERVED)
data_frames[0]['arrival_date'] = data_pivot['arrival_date']
data_frames[1]['arrival_date'] = data_pivot['arrival_date']
data_frames[2]['arrival_date'] = data_pivot['arrival_date']
data_frames[3]['arrival_date'] = data_pivot['arrival_date']

# 5. merge these dataframes together
data_transform = reduce(lambda left, right: pd.merge(left, right, on = ['arrival_date'], how = 'outer'), data_frames)


# unpivot
data_transform = data_transform.melt(id_vars = ['arrival_date'], var_name = 'region', value_name = 'transform_guests')

# split the `region` column into two
col_split = data_transform['region'].str.split(pat = '_', expand = True) 
data_transform['region'] = col_split[0]
data_transform['transform_type'] = col_split[1]

# pivot on `transform_type`
data_transform = data_transform.pivot_table(index = ['arrival_date', 'region'], 
                                            columns = 'transform_type', 
                                            values = 'transform_guests')
data_transform = data_transform.reset_index()

# merge with data_join
data_join = pd.merge(left = data_join, right = data_transform,
                    how = 'left', left_on = ('arrival_date', 'region'), right_on = ('arrival_date', 'region'),
                    validate = 'one_to_one')

# store and pass variables between notebooks
%store data_join

data_join

Stored 'data_join' (DataFrame)

	arrival_date	region	total_guests	proportion_guests	alr	clr	ilr	log
0	2015-07-01	Americas	2.0	0.010929	-1.425297	-1.756248	-0.581875	-4.644172
1	2015-07-01	Europe	181.0	0.989071	3.080053	2.749102	0.370015	-0.138822
2	2015-07-02	Americas	4.0	0.054795	0.231329	-0.332519	0.471513	-2.987547
3	2015-07-02	Asia	2.0	0.027397	-0.461818	-1.025666	-2.707678	-3.680694
4	2015-07-02	Europe	67.0	0.917808	3.049727	2.485880	0.630401	-0.169148
...	...	...	...	...	...	...	...	...
2710	2017-08-30	Europe	100.0	0.833333	2.995732	2.368286	0.701506	-0.223144
2711	2017-08-31	Africa	2.0	0.011050	-1.327296	-1.511161	-0.490129	-4.546172
2712	2017-08-31	Americas	4.0	0.022099	-0.634149	-0.818014	-0.614037	-3.853025
2713	2017-08-31	Asia	6.0	0.033149	-0.228684	-0.412549	-3.325103	-3.447560
2714	2017-08-31	Europe	169.0	0.933702	3.109456	2.925590	0.205568	-0.109420

2715 rows × 8 columns
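The melt, split and pivot steps used above to get one column per transform can be easier to follow on a tiny example. The frame below is made up for illustration and only mimics the `region_transform` column naming used in this notebook.

```python
import pandas as pd

# made-up wide frame: one column per region and transform, suffix marks the transform
wide = pd.DataFrame({
    'arrival_date': ['2015-07-01', '2015-07-02'],
    'Asia_clr': [-1.0, -0.5],
    'Europe_clr': [2.0, 1.5],
    'Asia_log': [-3.0, -2.5],
    'Europe_log': [-0.1, -0.2],
})

# 1. unpivot to one row per (date, original column name)
long = wide.melt(id_vars=['arrival_date'], var_name='region', value_name='value')

# 2. split e.g. 'Asia_clr' into region 'Asia' and transform type 'clr'
parts = long['region'].str.split(pat='_', expand=True)
long['region'] = parts[0]
long['transform_type'] = parts[1]

# 3. pivot so each transform type becomes its own column
tidy = (long.pivot_table(index=['arrival_date', 'region'],
                         columns='transform_type', values='value')
            .reset_index())

print(tidy)
```

The split in step 2 relies on region names containing no underscore of their own; a name that did would be cut in the wrong place.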