Bayesian statistics is a fundamentally different way of viewing, approaching, and tackling statistical problems. It involves thinking about situations from a probabilistic viewpoint.
Contrast this with the other main competing school of statistical thought, frequentist statistics, which views situations within the context of repeated events.
These differences manifest themselves in the actual statistical techniques applied to problems, with multi-linear regression being the technique considered in this paper. For each school-of-thought, there is a different way to carry out the statistical technique.
Whereas in most cases, the results of these models should broadly align irrespective of whether you apply a frequentist or Bayesian version of a statistical technique, the way you interpret and communicate these results are different.
Bayesian interpretations of results fall more in-line with how we intuitively think about problems.
Alan Turing, mathematician, computer scientist, and one of the forebears of modern computing, led efforts to unravel the German Enigma code - a development that helped the Allies win the Second World War. Turing employed several technologies and techniques in his work, including Banburismus, a process he invented which used sequential conditional probability to infer information about the likely settings of the Enigma machine.
Banburismus is described as ‘a highly intensive, Bayesian system’ that allowed Turing and colleagues to guess a stretch of letters in an Enigma message, measure their belief in the validity of these guesses - using Bayesian methods to assess the probabilities - and add more clues as they arrived.
Therefore, if it is good enough for Turing, and it is good enough to win a world war, then it must be good enough for us… 💁
We will compare our outputs of applying a frequentist against a Bayesian multi-linear regression using k-fold cross-validation.
By the end, I hope to have provided you with a gentle and accessible introduction to what Bayesian statistics is, how it differs from frequentist statistics, and the benefits of thinking about and viewing problems from a Bayesian perspective, and above all, to have shown how you can practically implement Bayesian statistics in your own work. Hopefully, I can show that it is not rocket-🧪. 😉
As we will remove sex, and other variables highly correlated with sex, from our model, the predictive accuracy of our model may not be optimal. This is to ensure we meet GDS Ethics Framework standards.
There are four forms of “The Open University”, one for each UK region, in our data_qualifiers dataframe. We will need to check whether this is the case for our other dataframes.
As we have chosen only one fuzzy-matching algorithm, and have not benchmarked it against other possible algorithms, our data matching between the HESA SFR and Complete University’s Guide data may not be optimal. To improve our data matching and, consequently, our model performance, we should explore alternative fuzzy-matching algorithms.
We have not performed analysis on all possible independent variables, such as first-degree full-time student count, Entry Standards, and Research Intensity, so we could meaningfully improve our model fit by including these.
When transforming our dataframes to the double datatype, the conversion for ‘Acad Services Speeds’ failed, so we generated NAs. Whilst this does not affect the analysis, as we have not used this data, it is worth bearing in mind for future analysis.
Scenario: You have misplaced your phone somewhere in the house. You thus decide to call your phone so you can use the ringing to direct you to where it is. 4
Frequentist reasoning - You hear your phone ringing and have a mental model of your house that helps you identify the area from which the ringing is coming. Therefore, upon hearing the ringing, you infer the area of your house that you must search to locate your phone.
Bayesian reasoning - You hear your phone ringing. In addition to having a mental model of your house, you also know the locations where you have misplaced your phone in the past. Therefore, you combine the inferences from the ringing and your prior knowledge of where you have misplaced your phone in the past to identify an area you must search to locate your phone.
| Frequentist inference | Bayesian inference |
|---|---|
| Data are a repeatable random sample, so studies can be repeated. | Data are observed from a realised sample, so studies are fixed. |
| Underlying parameters being observed remain constant during this repeatable process. | Underlying parameters are unknown and described probabilistically, so can vary. |
| Parameters are fixed. | Data are fixed. |
A frequentist would say that over repeated sampling of 95% confidence intervals, 95% of these confidence intervals have the true parameter lying in them.
A Bayesian would say over one 95% credible interval, the probability the parameter of interest will lie within this interval is 95%.
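The frequentist claim lends itself to a quick simulation check. Below is a minimal sketch in base R, assuming (purely for illustration) a known true mean of 0 and repeated samples of size 30:

```r
# Simulate repeated sampling: what fraction of 95% confidence
# intervals contain the true mean?
set.seed(42)
true_mean <- 0
n_draws <- 30
coverage <- replicate(1000, {
  sample_i <- rnorm(n = n_draws, mean = true_mean, sd = 1)
  ci <- t.test(sample_i, conf.level = 0.95)$conf.int
  ci[1] <= true_mean & true_mean <= ci[2]
})
mean(coverage)  # close to 0.95
```

Roughly 95% of the simulated intervals cover the true mean - which is a statement about the procedure over repeated sampling, not about any single interval.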
By determining the full conditional distribution for each random variable of the model, then repeatedly generating random samples from these distributions, the samples will eventually converge to random samples from the joint posterior distribution.
In terms of how Bayesian analysis is practically implemented, we use computational resources to sample. In particular, given that our posteriors are probability distributions, we can sample all the parameters from our joint posterior distribution. To do this, we focus on samples from each parameter individually, i.e. from its marginal posterior distribution. Now, whilst integrating over multiple parameters is difficult, Gibbs sampling makes this possible and is employed to generate posterior samples.
Note that the posterior distribution is a probability distribution over the parameter space, and this quantifies how probable it is that a particular value of the parameter(s) has given rise to the observed data. This distribution provides not only the optimal point estimate of the parameters, but also a quantification of the entire parameter space. In this way, our Gibbs sampling method explores the entire parameter space, asking which regions are most probable, given the data observed.
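As a toy illustration of the idea (not the model fitted in this paper), a Gibbs sampler for a bivariate normal with correlation rho alternates draws from each full conditional distribution; the pairs converge to samples from the joint distribution:

```r
# Toy Gibbs sampler for a standard bivariate normal with
# correlation rho. Each variable is drawn from its conditional
# distribution given the current value of the other.
set.seed(1)
rho <- 0.6
n_iter <- 5000
x <- y <- numeric(n_iter)
for (i in 2:n_iter) {
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
  y[i] <- rnorm(1, mean = rho * x[i],     sd = sqrt(1 - rho^2))
}
cor(x, y)  # approximately rho
```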
Our MCMC method explores the parameter space through having an ensemble of “walkers” that move around the space randomly. At each point where a walker steps, the integrand value at that point is counted towards the integral. The walker may then make a number of tentative steps around the area, looking for a place with a reasonably high contribution to the integral to move into next.
Whilst it is not difficult to construct a Markov Chain with the desired properties, the more difficult problem lies in determining how many steps/iterations to run the chain for it to converge to the stationary distribution within an acceptable error. A good chain will have rapid “mixing”, where we reach the stationary distribution quickly from an arbitrary starting position. We empirically assess this convergence to the stationary distribution by running several independently simulated Markov chains, and checking the ratio of inter-chain to intra-chain variance for all the parameters sampled is close to one. In this paper, we use the Gelman-Rubin diagnostic for this.
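A sketch of this check using the coda package's Gelman-Rubin diagnostic, run here on artificial, well-mixed chains rather than on our model's output:

```r
library(coda)

# Two independently simulated chains targeting the same distribution
set.seed(7)
chain1 <- mcmc(rnorm(2000))
chain2 <- mcmc(rnorm(2000))

# Potential scale reduction factor; values close to 1 indicate
# convergence to the stationary distribution
gelman.diag(mcmc.list(chain1, chain2))
```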
If all of this technical explanation makes you feel 😩, do not fret, we flesh this out with an explicit example. 🙏
Cross-validation is a technique to evaluate a model’s predictive power by partitioning the data into a training set to train the model, and a test set to evaluate the model’s accuracy in making predictions.
k-fold cross-validation is a form of cross-validation that takes k randomly partitioned sub-samples of your data; k-1 of these sub-samples then act as the training set to train your model, with the remaining sub-sample being used as your test set to evaluate your model’s performance.
This process is repeated k times so that each of your k sub-samples is used for training and testing/evaluating your model. The k results you generate from running these k models are then averaged to produce a single estimate/metric of how accurate your model is.
k-fold cross-validation is powerful because it allows you to use all your data for both training and testing purposes. However, it can be computationally expensive.
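The mechanics can be sketched in base R; this toy version uses the built-in mtcars data and a simple lm fit purely for illustration:

```r
# Manual 10-fold cross-validation on a built-in dataset
set.seed(123)
k <- 10
df <- mtcars
# Assign each row to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(df)))

err_per_fold <- sapply(1:k, function(i) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  preds <- predict(fit, newdata = test)
  # Root-mean-square error of predictions on the held-out fold
  sqrt(mean((preds - test$mpg)^2))
})
mean(err_per_fold)  # single averaged accuracy metric
```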
We will use k-fold cross-validation when running our multi-linear regression models to ensure our model does not overfit our data. By overfitting, we mean the model is too good at predicting the output, Graduate Prospects 2018, from our dataset, and cannot accurately predict that output on new data not included in our dataset.
Root-mean-square error is a metric for evaluating the performance of your predictive model. A lower value means your model produces predictions that are closer to the observed values.
It measures the differences between the values predicted by your model and the values actually observed. Mathematically, this is formulated by the equation below: \[RMSE = \sqrt{\frac{\sum_{i=1}^n\left( \hat{y_i}-y_i \right)^2}{n}}\]
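The equation translates directly into a short R function (a hypothetical helper for illustration, not part of the project code):

```r
# Root-mean-square error between predictions and observations
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

rmse(predicted = c(2.5, 0.0, 2.1), observed = c(3.0, -0.5, 2.0))
# approximately 0.412
```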
The RMSE can be used to compare the performance of different models to see which one is more accurate in making predictions of a dependent variable.
However, the RMSE is scale-dependent, so it cannot be used to compare models on different sets of data, or models that are based on the same data but where the data for each model have different scales, e.g. one model may be based on income data, whilst the other is based on log-transformed income data.
We will use RMSE to evaluate the performance of our frequentist and Bayesian multi-linear regression models to compare which model gives us more accurate predictions.
The data used in this project comes from the Higher Education Statistics Agency’s (HESA) Statistical First Release (SFR247) and the Complete University Guide’s university rankings table.
In particular, the following datasets are used:
We will investigate the use of these data sources as potential predictors for the graduate prospects of university students.
Whilst our HESA SFR data has a unique identifier field, UKPRN, that we can use to join disparate HESA SFR data together, our Complete University’s Guide web-scraped HTML table does not have a similar unique identifier field. Instead, its unique identifier field is Name, which is the university’s name and is of string data type.
Our HESA SFR data also has a similar field, HE provider, which allows us to join the HESA SFR and the Complete University’s Guide data together. However, given that these fields are of string data type, we will need to do some data cleaning first, as there is a lot of noise in string/text data. The steps we will take to clean the text data are:
| Works well for | Struggles with |
|---|---|
| + Missing words | - Ordering |
| + Typos | - Generic wording |
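These strengths and weaknesses can be demonstrated directly with the LCS method from the stringdist package, using made-up institution names:

```r
library(stringdist)

# Typos cost little under LCS...
stringdist("university of oxford", "universty of oxford", method = "lcs")
# ...but re-ordered words cost a lot
stringdist("university of york", "york university", method = "lcs")
```

The typo produces a much smaller distance than the word re-ordering, even though a human would consider both pairs obvious matches.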
# Clean text data
x <- Corpus(VectorSource(x = temp_CUG$Institution)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>%
  tm_map(stemDocument) %>%
  # Remove additional stopwords of "The University of" to get
  # HESA SFR uni names in line with Complete University's Guide
  tm_map(removeWords, c("The University of"))

# Convert Corpus to dataframe
x <- data.frame(Name = sapply(X = x, FUN = as.character), stringsAsFactors = FALSE)
# Fuzzy-matching
# Create grid of names to match - get all permutations
temp <- expand.grid(data_hesa$Name, data_CUG$Name) %>%
  rename(`hesa_name` = `Var1`, `CUG_name` = `Var2`) %>%
  arrange(`hesa_name`)

# Use Longest Common Subsequence (LCS) algorithm
temp$dist_lcs <- stringdist(temp$hesa_name, temp$CUG_name, method = "lcs")

# Rank the LCS distances - the smaller, the better
temp <- temp %>%
  transform(rank_dist_lcs = ave(`dist_lcs`, `hesa_name`,
                                FUN = function(x) rank(x, ties.method = "first"))) %>%
  # Take only the best rank
  filter(rank_dist_lcs == 1) %>%
  # From visual inspection of matches, any fields with dist_lcs > 8 seem to be wrong,
  # with exception of LSE
  filter(dist_lcs <= 8 | `CUG_name` == "london school econom")

# temp dataframe thus acts as lookup table joining
# data_hesa to data_CUG
data_master <- temp[, c("hesa_name", "CUG_name")] %>%
  left_join(y = data_hesa, by = c("hesa_name" = "Name")) %>%
  left_join(y = data_CUG, by = c("CUG_name" = "Name"))
glimpse(data_master)
## Rows: 121
## Columns: 24
## $ UKPRN <dbl> 10007783, 10007856, 10007857, 10005343, 100…
## $ `HE provider` <chr> "The University of Aberdeen", "Aberystwyth …
## $ `Country of HE provider` <chr> "Scotland", "Wales", "Wales", "Northern Ire…
## $ `Region of HE provider` <chr> "All", "All", "All", "All", "All", "All", "…
## $ `FD FT student count` <dbl> 9725, 5985, 7835, 14725, 19145, 7950, 9095,…
## $ `PG FT student count` <dbl> 2530, 600, 1510, 3055, 5240, 1240, 1840, 11…
## $ `Male Female ratio` <dbl> 0.7795059, 1.1130742, 0.7380688, 0.8580442,…
## $ `EU to non-EU ratio` <dbl> 14.677419, 22.920000, 9.109677, 13.873737, …
## $ `Pass Fail ratio` <dbl> 2.1759259, 2.0714286, 2.2424242, 2.9255814,…
## $ `STEM non-STEM ratio` <dbl> 0.76379691, 0.58199357, 0.80392157, 1.05138…
## $ `State Private ratio` <dbl> 79.6, 95.2, 95.8, 98.1, 85.3, 94.9, 92.9, 9…
## $ Rank <dbl> 42, 87, 62, 33, 35, 72, 35, 92, 19, 82, 29,…
## $ `Entry Standards` <dbl> 441, 289, 342, 385, 399, 315, 426, 337, 485…
## $ `Student Satisfaction` <dbl> 4.00, 4.01, 4.26, 4.13, 4.09, 4.03, 4.19, 4…
## $ `Research Quality` <dbl> 2.97, 2.84, 2.99, 2.99, 3.27, 3.00, 3.03, 2…
## $ `Research Intensity` <dbl> 0.73, 0.76, 0.63, 0.95, 0.62, 0.09, 0.68, 0…
## $ `Student-Staff Ratio` <dbl> 13.5, 19.1, 16.9, 15.1, 13.7, 17.9, 12.6, 2…
## $ `Academic Services Spend` <dbl> 1552, 1106, 1034, 1266, 1245, 1246, 1309, 8…
## $ `Facilities Spend` <dbl> 672, 548, 442, 748, 535, 711, 386, 327, 567…
## $ `Good Honours` <dbl> 77.9, 66.9, 64.2, 78.3, 76.0, 67.5, 77.8, 7…
## $ `Degree Completion` <dbl> 84.3, 88.6, 82.3, 92.6, 92.6, 82.3, 86.4, 8…
## $ `Overall Score` <dbl> 718, 555, 629, 754, 747, 598, 747, 544, 808…
## $ `Graduate Prospects 2017` <dbl> 76.0, 62.3, 67.2, 78.4, 79.8, 59.0, 79.9, 6…
## $ `Graduate Prospects 2018` <dbl> 78.6, 67.9, 66.1, 80.5, 72.3, 70.0, 80.9, 7…
func_plotNAs(data_master)
ggplot(data = data_master, mapping = aes(`Graduate Prospects 2018`)) +
geom_histogram(fill = "#E69F00", colour = "#56B4E9", alpha = 0.4) +
labs(title = "Graduate Prospects 2018 Distribution", x = "Graduate Prospects 2018", y = "Count") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
From the plot below, we have the correlation of each independent variable against the other independent variables. It is important to ensure that our independent variables are not highly correlated with sex, in accordance with the GDS Ethics Framework. Moreover, we should also remove highly-correlated independent variables because they will negatively affect our regression.
From the correlation matrix plot below, we see the following:
Given this, we will drop the State Private ratio and PG FT student count variables from our regressions. This is because keeping correlated independent variables makes our regression coefficient estimates unstable and unreliable.
# Correlation heatmap
ggplot(data = mat_corrIV, mapping = aes(x = `Var2`, y = `Var1`, fill = value)) +
geom_tile(colour = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab", name = "Pearson\nCorrelation\nCoefficient") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(title = "IV Correlation Heatmap", x = "Independent Variables", y = "Independent Variables")
These independent variables are otherwise known as predictors/explanatory variables because they are used to explain the relationship with the dependent variable, Graduate Prospects 2018.
Multi-linear regression is multi/multiple in the sense that it takes several independent variables as inputs to model their relationship with the dependent variable.
Multi-linear regression is linear in the sense that it assumes there is a linear relationship between the independent variables and the dependent variable.
For A4: Independence, it is less important to meet this assumption because we are dealing with non-time-series data.
Whilst not included as an assumption, the impact extreme observations can have on determining the regression line can be significant, and therefore a robust regression model would remove these observations.
# Set-up 10-fold cross-validation
trainSet <- trainControl(method = "cv", number = 10)

# Run linear regression model
model_regFreq <- train(form = `Graduate Prospects 2018` ~ .,
                       data = mat_reg, trControl = trainSet,
                       method = "lm")
A1: Linearity | In the top-left panel, we have the residuals vs. fitted values plot. If we find equally spread residuals around the horizontal red line, this is a good sign that there is a linear relationship between the independent variables and the dependent variable. In this instance, the points/residuals are spread weakly equally/randomly around our red line, so we weakly meet the linearity assumption.
A2: Normal Distribution | In the top-right panel, we have the normal Q-Q plot. If we find the residuals are lined up well along the straight, diagonal line, then this is a good sign that we have normality of residuals. In this instance, our points/residuals follow the straight, diagonal line well, so we meet the normality assumption. 7
A3: Homoscedasticity | In the bottom-left panel, we have the scale-location plot. If we find equally/randomly spread points around the red, horizontal line, then this is a good sign that we have homoscedasticity of the residuals. In this instance, our points/residuals are spread further to the left of the plot, and more concentrated on the right of the plot, causing the red line to be diagonal rather than horizontal. We do not meet the homoscedasticity assumption. However, since we are not dealing with time-series data, this is not of significant importance.
A4: Independence | This assumption can only be validated at the data collection stage. We believe that observations were collected independently of one another, so we satisfy this independence assumption.
A5: Multicollinearity | In the correlation heatmap plot in Section 2: Data Exploration - Independent Variables, we removed State Private ratio and PG FT student count from our set of independent variables because they were correlated with other variables. As such, we meet the no-multicollinearity assumption.
Extreme Points | In the bottom-right panel, we have the residual vs leverage plot. If we find all our observations lying within the red-dotted Cook’s distance lines, then this is a good sign that we do not have any extreme points that greatly influence the regression line. In this instance, all our observations are within the inner Cook’s distance dotted lines, so we do not have any extreme points.
par(mfrow = c(2, 2))
plot(model_regFreq$finalModel)
summary(model_regFreq)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44198 -0.08712 0.00567 0.07955 0.41216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.25196 0.08945 2.817 0.00571 **
## `\\`EU to non-EU ratio\\`` 0.07749 0.09681 0.800 0.42513
## `\\`Pass Fail ratio\\`` 0.20204 0.09746 2.073 0.04040 *
## `\\`STEM non-STEM ratio\\`` 0.11506 0.07713 1.492 0.13852
## `\\`Research Quality\\`` 0.38478 0.08790 4.377 0.0000266 ***
## `\\`Student-Staff Ratio\\`` -0.12663 0.09358 -1.353 0.17864
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1428 on 115 degrees of freedom
## Multiple R-squared: 0.4058, Adjusted R-squared: 0.38
## F-statistic: 15.71 on 5 and 115 DF, p-value: 0.000000000009083
ModelMetrics::rmse(actual = model_regFreq$finalModel$model$.outcome,
                   predicted = model_regFreq$finalModel$fitted.values)
If we were to repeat our sampling 100 times and construct a 95% confidence interval (bounded by the 2.5% and 97.5% quantiles) each time, then, on average, 95 of those intervals would contain the true parameter values.
confint(object = model_regFreq$finalModel, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 0.07 0.43
## EU to non-EU ratio -0.11 0.27
## Pass Fail ratio 0.01 0.40
## STEM non-STEM ratio -0.04 0.27
## Research Quality 0.21 0.56
## Student-Staff Ratio -0.31 0.06
Given that our frequentist multi-linear regression satisfied the linear regression assumptions, we will assume the Bayesian multi-linear regression satisfies them too, on the principle that both use the same dependent and independent variables, and the same data.
In this way, we will forego checking the linear regression assumptions.
# Graduate Prospects 2017 distribution
plotdist(data = temp, histo = TRUE, demp = TRUE)
To test our assumption above, we plot the Cullen and Frey graph below, which allows us to see what common distribution our dependent variable, Graduate Prospects 2017, aligns most closely to.
Our dependent variable, Graduate Prospects 2017, is the blue dot on the graph; it is closest to the triangle and asterisk markers, whilst also lying in the grey region. We will discuss each of these aspects in turn.
\[ U(\alpha,\ \beta) \text{ where } -\infty < \alpha < \beta < \infty \]
\[ N(\mu,\ \sigma^2) \text{ where } \mu \in \mathbb{R} \text{ and } \sigma^2 > 0 \]
\[ B(\alpha,\ \beta) \text{ where } \alpha,\ \beta > 0 \]
# Graduate Prospects 2017 distribution
descdist(data = temp, discrete = FALSE)
## summary statistics
## ------
## min: 0.00000001 max: 1
## median: 0.4547414
## mean: 0.4819571
## estimated sd: 0.2140505
## estimated skewness: 0.07885535
## estimated kurtosis: 2.103885
# Fit the uniform and normal distributions
fitdist(data = temp, distr = "unif", method = "mme",
discrete = F, na.rm = T)
## Fitting of the distribution ' unif ' by matching moments
## Parameters:
## estimate
## 1 0.1127459
## 2 0.8511683
fitdist(data = temp, distr = "norm", method = "mme",
discrete = F, na.rm = T)
## Fitting of the distribution ' norm ' by matching moments
## Parameters:
## estimate
## mean 0.4819571
## sd 0.2131642
par(mfrow = c(2,2))
func_plotGoF(dist = rpt_dist, titles = name_dist)
# Create vector of regression parameters
vec_regParameters <- "beta0"
len <- ncol(x = mat_reg) - 1
for(i in 1:len) vec_regParameters <- append(x = vec_regParameters, values = paste0("beta[", i, "]"))
numRows <- nrow(mat_reg)

# Define DV and IVs for regression
data_regBayes <- list(x = mat_reg[1:numRows, 1:len], y = mat_reg[1:numRows, len + 1])

# Run model
model_regBayes <- run.jags(method = "parallel", model = "bayesModel.bug", monitor = vec_regParameters,
                           data = data_regBayes, n.chains = 4, summarise = FALSE,
                           plots = FALSE, silent.jags = TRUE)
The purpose of running Markov chain Monte Carlo (MCMC) diagnostics is to spot large issues with our chains. This helps inform us whether the number of iterations we run is enough to obtain convergence in our chains, and therefore whether we can believe the samples generated from this process are truly representative for estimating the characteristics of our posterior distribution of interest.
More specifically, there are three main goals we wish to achieve when generating an MCMC sample from our posterior distribution:
In the plots below, we have the trace/time-series plots of the MCMC chains run. They show the values each of our \(\beta\) parameters took during the runtime of each chain. If our MCMC method explores the parameter space effectively and efficiently (that is, it does so randomly) for each of our \(\beta\) regression parameters, then we should expect to see a “hairy caterpillar” in these trace/time-series plots.
More concretely, we want to see no clear pattern nor trend in the plot, meaning no obvious movements towards the top or bottom of the plots. In this way, our MCMC method leads our \(\beta\) regression parameters to consider all possible values they can take, with approximately the same probability of landing on each possible value.
Checking for representativeness, we see that our four parallel chains generally overlap each other for each \(\beta\) parameter, so our chains fulfil this condition.
par(mfrow = c(3,3))
#traceplot(as.mcmc(x=model_regBayes))
traceplot(model_regBayes[[1]])
In the plots below, we have the autocorrelation function (ACF) plots, which display the correlation coefficient at different lags. We want to see the vertical lines in each plot become short quickly across the lags, as this signifies that our chains are not auto-correlated, meaning the steps our “walker” takes when exploring the parameter space for \(\beta_i\) are not substantially influenced by the previous step taken.
What this shows is that for our \(\beta_0\), \(\beta_5\), and \(\beta_6\) parameters, the correlation coefficient does not fall by much over successive lags, and therefore our chains for them are highly auto-correlated – they change gradually step-by-step. Hence, successive samples for these parameters do not provide independent information about our \(\beta_0\), \(\beta_5\), and \(\beta_6\) distributions. For our other parameters, we have less autocorrelation. Therefore, we do not fully satisfy this condition.
# ACF Plot for first chain only
autocorr.plot(model_regBayes[[1]])
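Autocorrelation can also be quantified rather than just eyeballed: coda's effective sample size discounts correlated draws. A sketch on an artificially autocorrelated AR(1) chain (not our model's output):

```r
library(coda)

# An artificially autocorrelated chain: high autocorrelation
# shrinks the effective sample size well below the number of
# iterations actually run
set.seed(11)
ar_chain <- as.numeric(arima.sim(model = list(ar = 0.9), n = 10000))
effectiveSize(mcmc(ar_chain))  # far fewer than 10000 effective draws
```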
When running our chains using parallel computing methods, the run took 2.07 seconds, which is a reasonable time.
From the trace plots in the Chain Representativeness section above, we can set the burn-in for our chains to 6,000 iterations. This allows our \(\beta_i\) parameters to settle to a stationary state faster, thereby improving our efficiency further.
summary(model_regBayes)
##
## Iterations = 5001:15000
## Thinning interval = 1
## Number of chains = 4
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## beta0 0.2538 0.09244 0.0004622 0.003951
## beta[1] 0.0775 0.09893 0.0004947 0.001227
## beta[2] 0.2009 0.09895 0.0004947 0.001637
## beta[3] 0.1143 0.07893 0.0003946 0.001157
## beta[4] 0.3842 0.09106 0.0004553 0.003101
## beta[5] -0.1286 0.09507 0.0004753 0.003067
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## beta0 0.076885 0.19029 0.2536 0.31615 0.43442
## beta[1] -0.117524 0.01105 0.0778 0.14357 0.27211
## beta[2] 0.008482 0.13403 0.2008 0.26803 0.39376
## beta[3] -0.041911 0.06146 0.1150 0.16771 0.26803
## beta[4] 0.202946 0.32306 0.3840 0.44545 0.56491
## beta[5] -0.316176 -0.19215 -0.1279 -0.06417 0.05582
For a single 95% credible interval, there is a 95% probability that the parameter lies within the region bounded by the lower and upper values.
# Take C.I associated with first chain only
#HPDinterval(obj = mcmc.list(as.mcmc(model_regBayes)), prob = 0.95)[[1]]
HPDinterval(obj = mcmc.list(as.mcmc(model_regBayes[[1]])), prob = 0.95)[[1]]
## lower upper
## beta0 0.0972064 0.4530410
## beta[1] -0.1281380 0.2657510
## beta[2] 0.0194725 0.3976300
## beta[3] -0.0429021 0.2649150
## beta[4] 0.1838200 0.5261700
## beta[5] -0.3265410 0.0406598
## attr(,"Probability")
## [1] 0.95
model_regFreqSum
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44198 -0.08712 0.00567 0.07955 0.41216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.25196 0.08945 2.817 0.00571 **
## `\\`EU to non-EU ratio\\`` 0.07749 0.09681 0.800 0.42513
## `\\`Pass Fail ratio\\`` 0.20204 0.09746 2.073 0.04040 *
## `\\`STEM non-STEM ratio\\`` 0.11506 0.07713 1.492 0.13852
## `\\`Research Quality\\`` 0.38478 0.08790 4.377 0.0000266 ***
## `\\`Student-Staff Ratio\\`` -0.12663 0.09358 -1.353 0.17864
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1428 on 115 degrees of freedom
## Multiple R-squared: 0.4058, Adjusted R-squared: 0.38
## F-statistic: 15.71 on 5 and 115 DF, p-value: 0.000000000009083
model_regBayesSum
##
## Iterations = 5001:15000
## Thinning interval = 1
## Number of chains = 4
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## beta0 0.2538 0.09244 0.0004622 0.003951
## beta[1] 0.0775 0.09893 0.0004947 0.001227
## beta[2] 0.2009 0.09895 0.0004947 0.001637
## beta[3] 0.1143 0.07893 0.0003946 0.001157
## beta[4] 0.3842 0.09106 0.0004553 0.003101
## beta[5] -0.1286 0.09507 0.0004753 0.003067
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## beta0 0.076885 0.19029 0.2536 0.31615 0.43442
## beta[1] -0.117524 0.01105 0.0778 0.14357 0.27211
## beta[2] 0.008482 0.13403 0.2008 0.26803 0.39376
## beta[3] -0.041911 0.06146 0.1150 0.16771 0.26803
## beta[4] 0.202946 0.32306 0.3840 0.44545 0.56491
## beta[5] -0.316176 -0.19215 -0.1279 -0.06417 0.05582
model_regFreqci
## 2.5 % 97.5 %
## (Intercept) 0.07 0.43
## EU to non-EU ratio -0.11 0.27
## Pass Fail ratio 0.01 0.40
## STEM non-STEM ratio -0.04 0.27
## Research Quality 0.21 0.56
## Student-Staff Ratio -0.31 0.06
model_regBayesci
## lower upper
## beta0 0.0972064 0.4530410
## beta[1] -0.1281380 0.2657510
## beta[2] 0.0194725 0.3976300
## beta[3] -0.0429021 0.2649150
## beta[4] 0.1838200 0.5261700
## beta[5] -0.3265410 0.0406598
## attr(,"Probability")
## [1] 0.95
For text-data cleaning, we would like to investigate the quanteda package, which is said to be more powerful, faster, and more efficient than existing text-analysis packages such as tm.
We have removed NAs from our datasets so that we can carry out the Bayesian analysis. With more time, we would like to use data imputation methods so that we do not lose data.
Use parallel computing methods for running caret. An example can be found in the documentation.
Run Bayesian multi-linear regression model with a uniform distribution as our prior.
We would like to analyse our MCMC chains more robustly by considering the Gelman-Rubin statistic and plots for chain representativeness; checking whether our multivariate effective sample size is below our minimum effective sample size; and checking the Monte Carlo Standard Error (MCSE) for chain accuracy and stability.
Explore better ways to communicate this report, including rmarkdown slides and incorporating rshiny functionality to make the report and slides interactive.
Use Mean Absolute Error (MAE) instead of Root Mean Squared Error (RMSE) as our metric for evaluating the performance of our multi-linear regression models, because it is more interpretable. Also, each error influences the MAE in direct proportion to the absolute value of the error, which is not the case for the RMSE.
Complete Glossary of Terms.
Improve use of Git so that big features/functionality added are committed to the central repo.
Investigate a plot highlighting differences in C.I. between Bayesian and frequentist.
Flesh out GitHub README.md file.
Experiment with transferring this markdown report into the R package, bookdown.
Try to get pander and plotly to work for pretty outputting of dataframes and plots in markdown.
Implement continuous integration/deployment as part of this.
This paper adopts RAP principles to ensure ease of reproducibility of analysis.
The underlying principle here is the belief that analysis should be reproducible, so that any analyst can take the code used here, and generate the same outputs with minimal manual engineering.
This ensures transparency and accountability of analysis as part of a wider framework of good quality assurance and more effective knowledge management. Additionally, it allows the automation of publications such as SFRs within government, which significantly reduces the production time of publications and ensures more consistent quality.10
For an amusing video that brings the benefits of RAP to life, see this Youtube clip here.
Particular practices adopted to ensure alignment with RAP principles are:
The code shown within the HTML document is not the complete code used for this project. Instead, only the most interesting code will be displayed within the document. To see the complete code underlying this analysis, please refer to the .rmd file and .R scripts within the RProject.
LaTeX was used to write mathematical functions and equations within the HTML report. In particular, the LaTeX code for RMSE was taken from inspecting the element (HTML code) on this webpage from Stanford.
Within the code scripts, the following naming conventions are used:
The theme for this HTML document is journal. A list of alternative themes already included within the rmarkdown package can be found here. For further themes, download the prettydoc package, which gives you more themes to choose from.
The code highlighting for this HTML document is haddock. A further list of options can be found here.
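Both settings are configured in the R Markdown YAML header; a minimal sketch (assuming the standard `html_document` output format) looks like:

```yaml
output:
  html_document:
    theme: journal
    highlight: haddock
```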
The packages used in this analysis are:
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] runjags_2.0.4-6 rjags_4-10 coda_0.19-4 fitdistrplus_1.1-3
## [5] survival_3.1-12 MASS_7.3-51.6 caret_6.0-86 lattice_0.20-41
## [9] rmarkdown_2.6 rvest_0.3.6 xml2_1.3.2 stringdist_0.9.6.3
## [13] SnowballC_0.7.0 tm_0.7-8 NLP_0.2-1 magrittr_2.0.1
## [17] readr_1.4.0 reshape2_1.4.4 ggplot2_3.3.2 dplyr_1.0.2
## [21] tidyr_1.1.2
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 splines_4.0.2 foreach_1.5.1
## [4] prodlim_2019.11.13 assertthat_0.2.1 stats4_4.0.2
## [7] emo_0.0.0.9000 renv_0.12.3 yaml_2.2.1
## [10] slam_0.1-48 ipred_0.9-9 pillar_1.4.6
## [13] glue_1.4.2 pROC_1.16.2 digest_0.6.27
## [16] colorspace_1.4-1 recipes_0.1.15 htmltools_0.5.0
## [19] Matrix_1.2-18 plyr_1.8.6 timeDate_3043.102
## [22] pkgconfig_2.0.3 purrr_0.3.4 scales_1.1.1
## [25] gower_0.2.2 lava_1.6.8.1 tibble_3.0.4
## [28] generics_0.1.0 farver_2.0.3 ellipsis_0.3.1
## [31] withr_2.3.0 nnet_7.3-14 cli_2.1.0
## [34] crayon_1.3.4 evaluate_0.14 fansi_0.4.1
## [37] nlme_3.1-148 class_7.3-17 tools_4.0.2
## [40] data.table_1.13.6 hms_0.5.3 lifecycle_0.2.0
## [43] stringr_1.4.0 munsell_0.5.0 compiler_4.0.2
## [46] rlang_0.4.10 grid_4.0.2 iterators_1.0.13
## [49] rstudioapi_0.11 labeling_0.4.2 gtable_0.3.0
## [52] ModelMetrics_1.2.2.2 codetools_0.2-16 R6_2.5.0
## [55] lubridate_1.7.9.2 knitr_1.30 utf8_1.1.4
## [58] stringi_1.5.3 parallel_4.0.2 Rcpp_1.0.5
## [61] vctrs_0.3.4 rpart_4.1-15 tidyselect_1.1.0
## [64] xfun_0.18
First-degree students are defined as students taking their first undergraduate degree or professional qualification.↩︎
One of the main reasons why frequentist statistics is the dominant mode of statistical thinking is computational power: before recent advances, it was very difficult to perform Bayesian statistics efficiently. Another key reason is centred on a historical, epic, and longstanding war between frequentist statisticians (Ronald Fisher, Jerzy Neyman and Egon Pearson) and Bayesian statisticians (Thomas Bayes, Pierre-Simon Laplace), which was won by the former group in the early 20th Century. 💥↩︎
There can be cases where the two approaches lead to substantially different results.↩︎
Example taken from this StackExchange question. Another excellent Star Wars analogy of Bayesian reasoning can be found at the countbayesie website.↩︎
Indeed, the fact that Bayesian inference is more intuitive was the subject of discussion in Daniel Kahnemann’s best-seller, Thinking, Fast and Slow.↩︎
For a good webpage on how to interpret regression diagnostic plots, visit the University of Virginia library’s webpage.↩︎
If we did not meet this assumption, we could Box-Cox transform our dependent variable to try to make it normally distributed.↩︎
One typically complements the goodness-of-fit plots by computing recommended goodness-of-fit statistics such as the Kolmogorov-Smirnov statistic, the Cramér-von Mises statistic, or the Anderson-Darling statistic, and then uses these statistics to test against the null hypothesis that the empirical (dependent variable) distribution fits an estimated (proposed) distribution. However, since these recommended goodness-of-fit statistics are available only for maximum likelihood estimations, we cannot take this complementary approach.↩︎
We use k-fold cross-validation with 10 folds to compute the RMSE here. The code has been hidden because a custom function was created to compute it, and it is not immediately obvious that the function computes the RMSE, as it returns several objects within a list.↩︎
For more information on RAP, see this government blog post.↩︎
We do not have a live link to an underlying database in this project, but we have web-scraped HTML tables.↩︎
This has now been removed as not all the packages in tidyverse are required, and when downloading these packages on a different workspace, we want to minimise the time it takes to download the required packages for this project. It has instead been replaced by the tidyr, dplyr, ggplot2, readr, reshape2, and magrittr packages.↩︎