Transformation: From $[0,1]$ space to $\mathbb{R}$ space
We want to assess whether the guests arriving before and after the break-point are representative of one another, and thereby whether there is a structural change in our hotel bookings system. To do this we model proportions, `proportion_guests`, rather than counts. We expect the absolute number of guests to change after a potential structural change, but if the proportions are the same, or at least similar, then we can still make inferences from our data before and after the change.
Similar proportions suggest that, whilst the absolute number of guests making hotel bookings has changed, the share of guests from each group, `region`, has not. Our data after the break-point is then still representative of the data before it, and any inferences still hold for the same population.
You can think of it this way: before the break-point, our data captured our target population. If the proportions/compositions of guests from each `region` are similar after the break-point, then we still have a representative sample of that target population.
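As a small illustration of this idea, the sketch below uses made-up guest counts (not taken from our data) for three hypothetical regions. The totals differ a lot before and after the break-point, yet the proportions are identical.

```python
import numpy as np

# hypothetical guest counts per region (made-up numbers, not from our data)
counts_before = np.array([120, 60, 20])
counts_after = np.array([30, 15, 5])

# convert counts to proportions so that each vector sums to 1
prop_before = counts_before / counts_before.sum()
prop_after = counts_after / counts_after.sum()

print(prop_before)
print(prop_after)

# the two compositions agree even though the totals are 200 and 50
same_composition = np.allclose(prop_before, prop_after)
print(same_composition)
```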
However, we cannot model proportion/compositional data directly, because each component is bounded in the interval $[0,1]$. A model applied to such data can produce values outside this interval, which cannot be interpreted as proportions and are therefore meaningless.
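To see the problem concretely, here is a minimal sketch that fits a straight line to a made-up, declining series of proportions and extrapolates one step ahead. The numbers are illustrative assumptions, not our bookings data.

```python
import numpy as np

# made-up proportions for one region over five periods (illustrative only)
t = np.arange(5)
proportion = np.array([0.9, 0.7, 0.5, 0.3, 0.1])

# fit an ordinary least-squares line directly on the bounded scale
slope, intercept = np.polyfit(t, proportion, deg=1)

# extrapolate one period ahead
prediction = slope * 5 + intercept
print(prediction)  # falls below 0, so it cannot be read as a proportion
```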
Instead, we can transform our compositional data by mapping it into real space, $\mathbb{R}$. There are three well-characterised isomorphisms that do this:
- Additive logratio (alr)
- Centre logratio (clr)
- Isometric logratio (ilr)
Source: Wikipedia
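For reference, the standard definitions of these transformations for a composition $x = (x_1, \dots, x_D)$ with strictly positive parts are sketched below. The notation ($D$, $g(x)$, $V$) is introduced here for illustration and does not appear elsewhere in this notebook.

$$
\begin{aligned}
\operatorname{alr}(x) &= \left[\ln\frac{x_1}{x_D}, \dots, \ln\frac{x_{D-1}}{x_D}\right] \\
\operatorname{clr}(x) &= \left[\ln\frac{x_1}{g(x)}, \dots, \ln\frac{x_D}{g(x)}\right], \qquad g(x) = \left(\prod_{i=1}^{D} x_i\right)^{1/D} \\
\operatorname{ilr}(x) &= \operatorname{clr}(x)\, V
\end{aligned}
$$

Here $V$ is a $D \times (D-1)$ matrix whose columns form an orthonormal basis of the hyperplane in which the clr values lie (they sum to $0$). The alr and ilr results therefore have $D-1$ columns, whereas clr keeps $D$. The alr can use any part as its denominator; the last part is shown here.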
Alternatively, we can apply the following transformation, after adding a very small value to any proportions equal to $0$. It is also commonly used with the aim of making data approximately normally distributed.
- Log transform
Source: Feng et al., "Log-transformation and its implications for data analysis"
In all these transformations, it is essential that our data does not contain any zeroes, because $\log(0)$ is undefined. Thus we will use a multiplicative replacement strategy to replace zeroes with a small, positive $\delta$, in a way that ensures the compositions still add up to $1$.
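To make the strategy concrete, here is a minimal NumPy sketch of multiplicative replacement. The function name `replace_zeros` and the default $\delta = (1/D)^2$, where $D$ is the number of parts, are assumptions made for this illustration; in the rest of this notebook we rely on the `multiplicative_replacement` function from `skbio` instead.

```python
import numpy as np

def replace_zeros(mat, delta=None):
    """Hypothetical helper: replace zeros by `delta`, keeping row sums at 1."""
    mat = np.asarray(mat, dtype=float)
    # close each row so that it sums to 1
    mat = mat / mat.sum(axis=1, keepdims=True)
    num_parts = mat.shape[1]
    if delta is None:
        # assumed default: a small value that shrinks with the number of parts
        delta = (1.0 / num_parts) ** 2
    is_zero = mat == 0
    # total mass handed to the zero entries in each row
    added = is_zero.sum(axis=1, keepdims=True) * delta
    # zeros become delta; non-zeros are scaled down by (1 - added)
    return np.where(is_zero, delta, mat * (1 - added))

# made-up compositions: the first row contains zeros, the second does not
toy = np.array([[0.5, 0.5, 0.0, 0.0],
                [0.25, 0.25, 0.25, 0.25]])
replaced = replace_zeros(toy)
print(replaced)
print(replaced.sum(axis=1))
```

Because the non-zero parts in a row are all multiplied by the same factor, the ratios between them are unchanged, which is what the log-ratio transformations depend on.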
import pandas as pd
import numpy as np
from skbio.stats.composition import multiplicative_replacement
from skbio.stats.composition import clr
from skbio.stats.composition import ilr
from functools import reduce
# display multiple outputs in same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# put in a python script
# create custom additive log-ratio function
def func_alr(mat, div):
    """Additive log-ratio: log of each column of `mat` relative to column `div`."""
    mat = np.asarray(mat, dtype = float)
    # check that logs can be taken: log(0) is undefined
    if np.any(mat <= 0):
        raise ValueError("`mat` must be strictly positive; replace zeroes first")
    # take vectors from input array `mat`, excluding column index `div`
    numerator = np.delete(arr = mat, obj = div, axis = 1)
    # take vector for `div`
    denominator = mat[:, div]
    # take logs
    lnum = np.log(numerator)
    lden = np.log(denominator)
    # subtract `div` vector from every column in matrix `mat`
    # https://stackoverflow.com/questions/26333005/numpy-subtract-every-row-of-matrix-by-vector
    output = (lnum.transpose() - lden).transpose()
    return output
# pass in variable from other notebook
%store -r data_join
# pivot so we can apply transformations to the region columns
data_pivot = data_join.pivot(index = 'arrival_date', columns = 'region', values = 'proportion_guests')
# replace NaNs with 0s so can transform
data_pivot = data_pivot.loc[:, 'Africa':'Oceania'].fillna(value = 0, axis = 1)
## un-groupby so we get previous grouped index as columns
data_pivot = data_pivot.reset_index()
data_pivot
Let's start by dealing with the $0$s in our data. The multiplicative replacement strategy replaces each $0$ with a number small enough not to change the overall modelling, whilst rescaling the other values so that our proportions/compositions still sum to $1$.
We also extract the specific part of the dataframe that we will apply our transformations to.
# 1. extract only the part we want to apply transformations on
x = data_pivot.loc[:, 'Africa':'Oceania']
# 2. store column names for later when re-creating dataframe
col_names = list(x)
col_names_alr = list(x)
# 3. apply multiplicative replacement strategy to replace 0s
x = multiplicative_replacement(x)
# 3.1 note, returns an array
x
# get index of 'Oceania' column
index_denominator = col_names_alr.index('Oceania')
# remove this index from list of column names
del col_names_alr[index_denominator]
Now, let's perform the ALR, CLR and ILR transformations.
Note that the column names for ALR and ILR are misleading; they should really be:
- ALR: Africa/Oceania, Americas/Oceania, ...
- ILR: Africa~Americas, Americas~Asia, ...
Source: StackExchange answer by marc1s
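The sketch below shows why the ratio-style names are the more accurate ones for ALR. It uses a made-up composition and a hypothetical list of five regions (only some of these names appear in this notebook), and computes the ALR and CLR by hand in NumPy rather than through `func_alr` or `skbio`.

```python
import numpy as np

# hypothetical region names and a made-up composition (rows sum to 1)
regions = ['Africa', 'Americas', 'Asia', 'Europe', 'Oceania']
x = np.array([[0.10, 0.20, 0.30, 0.25, 0.15],
              [0.05, 0.25, 0.40, 0.20, 0.10]])

# ALR: log of every part relative to the last part, 'Oceania'
denominator = regions[-1]
alr = np.log(x[:, :-1]) - np.log(x[:, [-1]])
alr_labels = [f'{name}/{denominator}' for name in regions[:-1]]

# CLR: log of every part relative to the row's geometric mean
log_x = np.log(x)
clr = log_x - log_x.mean(axis=1, keepdims=True)

print(alr_labels)
# each ALR column is the log of the ratio named in its label
print(np.isclose(alr[0, 0], np.log(x[0, 0] / x[0, -1])))
# CLR values in each row sum to zero
print(np.allclose(clr.sum(axis=1), 0))
```

ALR drops the denominator column, so it has one column fewer than CLR, which keeps a column per region.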
# CAN DO ALL THIS IN A FUNCTION AND THEN LOOP ON FUNCTION
# 1. apply transformations
data_alr = func_alr(mat = x, div = index_denominator)
data_clr = clr(mat = x)
data_ilr = ilr(mat = x)
data_log = np.log(x)
# store in list for efficiency
data_frames = [data_alr, data_clr, data_ilr, data_log]
# 2. convert to dataframe
data_frames[0] = pd.DataFrame(data = data_alr, columns = col_names_alr)
data_frames[1] = pd.DataFrame(data = data_clr, columns = col_names)
data_frames[2] = pd.DataFrame(data = data_ilr, columns = col_names_alr)
data_frames[3] = pd.DataFrame(data = data_log, columns = col_names)
# 3. rename columns for creating dataframe
data_frames[0].columns += '_alr'
data_frames[1].columns += '_clr'
data_frames[2].columns += '_ilr'
data_frames[3].columns += '_log'
# 4. add `arrival_date` back in (ASSUMES ROW ORDERING IS PRESERVED)
data_frames[0]['arrival_date'] = data_pivot['arrival_date']
data_frames[1]['arrival_date'] = data_pivot['arrival_date']
data_frames[2]['arrival_date'] = data_pivot['arrival_date']
data_frames[3]['arrival_date'] = data_pivot['arrival_date']
# 5. merge these dataframes together
data_transform = reduce(lambda left, right: pd.merge(left, right, on = ['arrival_date'], how = 'outer'), data_frames)
# unpivot
data_transform = data_transform.melt(id_vars = ['arrival_date'], var_name = 'region', value_name = 'transform_guests')
# split the `region` column into two
col_split = data_transform['region'].str.split(pat = '_', expand = True)
data_transform['region'] = col_split[0]
data_transform['transform_type'] = col_split[1]
# pivot on `transform_type`
data_transform = data_transform.pivot_table(index = ['arrival_date', 'region'],
                                            columns = 'transform_type',
                                            values = 'transform_guests')
data_transform = data_transform.reset_index()
# merge with data_join
data_join = pd.merge(left = data_join, right = data_transform,
                     how = 'left', on = ['arrival_date', 'region'],
                     validate = 'one_to_one')
# store and pass variables between notebooks
%store data_join
data_join