Bayesian statistics is a fundamentally different way of viewing, approaching, and tackling statistical problems. It involves thinking about situations from a probabilistic viewpoint.
Contrast this with the other main competing school of statistical thought, frequentist statistics, which views situations within the context of repeated events.
These differences manifest themselves in the actual statistical techniques applied to problems, with multi-linear regression being the technique considered in this paper. For each school-of-thought, there is a different way to carry out the statistical technique.
Whereas in most cases, the results of these models should broadly align irrespective of whether you apply a frequentist or Bayesian version of a statistical technique, the way you interpret and communicate these results are different.
Bayesian interpretations of results fall more in-line with how we intuitively think about problems.
Alan Turing, mathematician, computer scientist, and one of the forebears of modern computing, led efforts to unravel the German Enigma code - a development that helped the Allies win the Second World War. Turing employed several technologies and techniques in his work, including Banburismus, a process he invented which used sequential conditional probability to infer information about the likely settings of the Enigma machine.
Banburismus is described as ‘a highly intensive, Bayesian system’ that allowed Turing and colleagues to guess a stretch of letters in an Enigma message, measure their belief in the validity of these guesses - using Bayesian methods to assess the probabilities - and add more clues as they arrived.
Therefore, if it is good enough for Turing, and it is good enough to win a world war, then it must be good enough for us… 💁
We will compare our outputs of applying a frequentist against a Bayesian multi-linear regression using k-fold cross-validation.
By the end, I hope to have provided you with a gentle and accessible introduction to what Bayesian statistics is, how it differs from frequentist statistics, and the benefits of thinking about and viewing problems from a Bayesian perspective, and above all, to have shown how you can practically implement Bayesian statistics in your own work. Hopefully, I can show that it is not rocket-🧪. 😉
As we will remove sex, and other variables highly correlated with sex, from our model, the predictive accuracy of our model may not be optimal. This is to ensure we meet GDS Ethics Framework standards.
There are four forms of “The Open University”, one for each UK region, in our data_qualifiers dataframe. We will need to check whether this is the case for our other dataframes.
As we have chosen only one fuzzy-matching algorithm, and have not benchmarked it against other possible algorithms, our data matching between the HESA SFR and Complete University’s Guide data may not be optimal. To improve our data matching and, consequently, our model performance, we should explore alternative fuzzy-matching algorithms.
We have not performed analysis on all possible independent variables, such as first-degree full-time student count, Entry Standards, and Research Intensity, so we could meaningfully improve our model fit by including these.
When transforming our dataframes to the double datatype, the conversion for ‘Acad Services Speeds’ failed, so we generated NAs. Whilst this does not affect the analysis, as we have not used this data, it is worth bearing in mind for future analysis.
Scenario: You have misplaced your phone somewhere in the house. You thus decide to call your phone so you can use the ringing to direct you to where it is. 4
Frequentist reasoning - You hear your phone ringing and have a mental model of your house that helps you identify the area from which the ringing is coming. Therefore, upon hearing the ringing, you infer the area of your house that you must search to locate your phone.
Bayesian reasoning - You hear your phone ringing. In addition to having a mental model of your house, you also know the locations where you have misplaced your phone in the past. Therefore, you combine the inferences from the ringing and your prior knowledge of where you have misplaced your phone in the past to identify an area you must search to locate your phone.
| Frequentist inference | Bayesian inference |
|---|---|
| Data are a repeatable random sample, so studies can be repeated. | Data are observed from a realised sample, so studies are fixed. |
| Underlying parameters being observed remain constant during this repeatable process. | Underlying parameters are unknown and described probabilistically, so can vary. |
| Parameters are fixed. | Data are fixed. |
A frequentist would say that over repeated sampling of 95% confidence intervals, 95% of these confidence intervals have the true parameter lying in them.
A Bayesian would say over one 95% credible interval, the probability the parameter of interest will lie within this interval is 95%.
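The frequentist claim lends itself to a quick simulation check. Below is a minimal sketch in base R, assuming (purely for illustration) a known true mean of 0 and repeated samples of size 30:

```r
# Simulate repeated sampling: what fraction of 95% confidence
# intervals contain the true mean?
set.seed(42)
true_mean <- 0
n_draws <- 30
coverage <- replicate(1000, {
  sample_i <- rnorm(n = n_draws, mean = true_mean, sd = 1)
  ci <- t.test(sample_i, conf.level = 0.95)$conf.int
  ci[1] <= true_mean & true_mean <= ci[2]
})
mean(coverage)  # close to 0.95
```

Roughly 95% of the simulated intervals cover the true mean - which is a statement about the procedure over repeated sampling, not about any single interval.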
By determining the full conditional distribution for each random variable of the model, then repeatedly generating random samples from these distributions, the samples will eventually converge to random samples from the joint posterior distribution.
In terms of how Bayesian analysis is practically implemented, we use computational resources to sample. In particular, given that our posteriors are probability distributions, we can sample all the parameters from our joint posterior distribution. To do this, we focus on samples from each parameter individually, i.e. from its marginal posterior distribution. Now, whilst integrating over multiple parameters is difficult, Gibbs sampling makes this possible and is employed to generate posterior samples.
Note that the posterior distribution is a probability distribution over the parameter space, and this quantifies how probable it is that a particular value of the parameter(s) has given rise to the observed data. This distribution provides not only the optimal point estimate of the parameters, but also a quantification of the entire parameter space. In this way, our Gibbs sampling method explores the entire parameter space, asking which regions are most probable, given the data observed.
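As a toy illustration of the idea (not the model fitted in this paper), a Gibbs sampler for a bivariate normal with correlation rho alternates draws from each full conditional distribution; the pairs converge to samples from the joint distribution:

```r
# Toy Gibbs sampler for a standard bivariate normal with
# correlation rho. Each variable is drawn from its conditional
# distribution given the current value of the other.
set.seed(1)
rho <- 0.6
n_iter <- 5000
x <- y <- numeric(n_iter)
for (i in 2:n_iter) {
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
  y[i] <- rnorm(1, mean = rho * x[i],     sd = sqrt(1 - rho^2))
}
cor(x, y)  # approximately rho
```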
Our MCMC method explores the parameter space through having an ensemble of “walkers” that move around the space randomly. At each point where a walker steps, the integrand value at that point is counted towards the integral. The walker may then make a number of tentative steps around the area, looking for a place with a reasonably high contribution to the integral to move into next.
Whilst it is not difficult to construct a Markov Chain with the desired properties, the more difficult problem lies in determining how many steps/iterations to run the chain for it to converge to the stationary distribution within an acceptable error. A good chain will have rapid “mixing”, where we reach the stationary distribution quickly from an arbitrary starting position. We empirically assess this convergence to the stationary distribution by running several independently simulated Markov chains, and checking the ratio of inter-chain to intra-chain variance for all the parameters sampled is close to one. In this paper, we use the Gelman-Rubin diagnostic for this.
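A sketch of this check using the coda package's Gelman-Rubin diagnostic, run here on artificial, well-mixed chains rather than on our model's output:

```r
library(coda)

# Two independently simulated chains targeting the same distribution
set.seed(7)
chain1 <- mcmc(rnorm(2000))
chain2 <- mcmc(rnorm(2000))

# Potential scale reduction factor; values close to 1 indicate
# convergence to the stationary distribution
gelman.diag(mcmc.list(chain1, chain2))
```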
If all of this technical explanation makes you feel 😩, do not fret, we flesh this out with an explicit example. 🙏
Cross-validation is a technique to evaluate a model’s predictive power by partitioning the data into a training set to train the model, and a test set to evaluate the model’s accuracy in making predictions.
k-fold cross-validation is a form of cross-validation that takes k randomly partitioned sub-samples of your data; k-1 of these sub-samples then act as the training set to train your model, with the remaining sub-sample being used as your test set to evaluate your model’s performance.
This process is repeated k times so that each of your k sub-samples is used for training and testing/evaluating your model. The k results you generate from running these k models are then averaged to produce a single estimate/metric of how accurate your model is.
k-fold cross-validation is powerful because it allows you to use all your data for both training and testing purposes. However, it can be computationally expensive.
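The mechanics can be sketched in base R; this toy version uses the built-in mtcars data and a simple lm fit purely for illustration:

```r
# Manual 10-fold cross-validation on a built-in dataset
set.seed(123)
k <- 10
df <- mtcars
# Assign each row to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(df)))

err_per_fold <- sapply(1:k, function(i) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  preds <- predict(fit, newdata = test)
  # Root-mean-square error of predictions on the held-out fold
  sqrt(mean((preds - test$mpg)^2))
})
mean(err_per_fold)  # single averaged accuracy metric
```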
We will use k-fold cross-validation when running our multi-linear regression models to ensure our model does not overfit our data. By overfitting, we mean the model is too good at predicting the output, Graduate Prospects 2018, from our dataset, and cannot accurately predict that output on new data not included in our dataset.
Root-mean-square error is a metric for evaluating the performance of your predictive model. A lower value means your model produces predictions that are closer to the observed values.
It measures the differences between the values predicted by your model and the values actually observed. Mathematically, this is formulated by the equation below: \[RMSE = \sqrt{\frac{\sum_{i=1}^n\left( \hat{y_i}-y_i \right)^2}{n}}\]
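The equation translates directly into a short R function (a hypothetical helper for illustration, not part of the project code):

```r
# Root-mean-square error between predictions and observations
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

rmse(predicted = c(2.5, 0.0, 2.1), observed = c(3.0, -0.5, 2.0))
# approximately 0.412
```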
The RMSE can be used to compare the performance of different models to see which one is more accurate in making predictions of a dependent variable.
However, the RMSE is scale-dependent, so it cannot be used to compare models on different sets of data, or models that are based on the same data but where the data for each model have different scales, e.g. one model may be based on income data, whilst the other is based on log-transformed income data.
We will use RMSE to evaluate the performance of our frequentist and Bayesian multi-linear regression models to compare which model gives us more accurate predictions.
The data used in this project comes from the Higher Education Statistics Agency’s (HESA) Statistical First Release (SFR247) and the Complete University Guide’s university rankings table.
In particular, the following datasets are used:
We will investigate the use of these data sources as potential predictors for the graduate prospects of university students.
Whilst our HESA SFR data has a unique identifier field, UKPRN, that we can use to join disparate HESA SFR data together, our Complete University’s Guide web-scraped HTML table does not have a similar unique identifier field. Instead, its unique identifier field is Name, which is the university’s name and is of string data type.
Our HESA SFR data also has a similar field, HE provider, which allows us to join the HESA SFR and the Complete University’s Guide data together. However, given that these fields are of string data type, we will need to do some data cleaning first, as there is a lot of noise in string/text data. The steps we will take to clean the text data are:
| Works well for | Struggles with |
|---|---|
| + Missing words | - Ordering |
| + Typos | - Generic wording |
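These strengths and weaknesses can be demonstrated directly with the LCS method from the stringdist package, using made-up institution names:

```r
library(stringdist)

# Typos cost little under LCS...
stringdist("university of oxford", "universty of oxford", method = "lcs")
# ...but re-ordered words cost a lot
stringdist("university of york", "york university", method = "lcs")
```

The typo produces a much smaller distance than the word re-ordering, even though a human would consider both pairs obvious matches.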
# Clean text data
x <- Corpus(VectorSource(x = temp_CUG$Institution)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>%
  tm_map(stemDocument) %>%
  # Remove additional stopwords of "The University of" to get
  # HESA SFR uni names in line with Complete University's Guide
  tm_map(removeWords, c("The University of"))

# Convert Corpus to dataframe
x <- data.frame(Name = sapply(X = x, FUN = as.character), stringsAsFactors = FALSE)
# Fuzzy-matching
# Create grid of names to match - get all permutations
temp <- expand.grid(data_hesa$Name, data_CUG$Name) %>%
  rename(`hesa_name` = `Var1`, `CUG_name` = `Var2`) %>%
  arrange(`hesa_name`)

# Use Longest Common Subsequence (LCS) algorithm
temp$dist_lcs <- stringdist(temp$hesa_name, temp$CUG_name, method = "lcs")

# Rank the LCS distances - the smaller, the better
temp <- temp %>%
  transform(rank_dist_lcs = ave(`dist_lcs`, `hesa_name`,
                                FUN = function(x) rank(x, ties.method = "first"))) %>%
  # Take only the best rank
  filter(rank_dist_lcs == 1) %>%
  # From visual inspection of matches, any fields with dist_lcs > 8 seem to be wrong,
  # with exception of LSE
  filter(dist_lcs <= 8 | `CUG_name` == "london school econom")

# temp dataframe thus acts as lookup table joining
# data_hesa to data_CUG
data_master <- temp[, c("hesa_name", "CUG_name")] %>%
  left_join(y = data_hesa, by = c("hesa_name" = "Name")) %>%
  left_join(y = data_CUG, by = c("CUG_name" = "Name"))
glimpse(data_master)
## Rows: 121
## Columns: 24
## $ UKPRN <dbl> 10007783, 10007856, 10007857, 10005343, 100…
## $ `HE provider` <chr> "The University of Aberdeen", "Aberystwyth …
## $ `Country of HE provider` <chr> "Scotland", "Wales", "Wales", "Northern Ire…
## $ `Region of HE provider` <chr> "All", "All", "All", "All", "All", "All", "…
## $ `FD FT student count` <dbl> 9725, 5985, 7835, 14725, 19145, 7950, 9095,…
## $ `PG FT student count` <dbl> 2530, 600, 1510, 3055, 5240, 1240, 1840, 11…
## $ `Male Female ratio` <dbl> 0.7795059, 1.1130742, 0.7380688, 0.8580442,…
## $ `EU to non-EU ratio` <dbl> 14.677419, 22.920000, 9.109677, 13.873737, …
## $ `Pass Fail ratio` <dbl> 2.1759259, 2.0714286, 2.2424242, 2.9255814,…
## $ `STEM non-STEM ratio` <dbl> 0.76379691, 0.58199357, 0.80392157, 1.05138…
## $ `State Private ratio` <dbl> 79.6, 95.2, 95.8, 98.1, 85.3, 94.9, 92.9, 9…
## $ Rank <dbl> 42, 87, 62, 33, 35, 72, 35, 92, 19, 82, 29,…
## $ `Entry Standards` <dbl> 441, 289, 342, 385, 399, 315, 426, 337, 485…
## $ `Student Satisfaction` <dbl> 4.00, 4.01, 4.26, 4.13, 4.09, 4.03, 4.19, 4…
## $ `Research Quality` <dbl> 2.97, 2.84, 2.99, 2.99, 3.27, 3.00, 3.03, 2…
## $ `Research Intensity` <dbl> 0.73, 0.76, 0.63, 0.95, 0.62, 0.09, 0.68, 0…
## $ `Student-Staff Ratio` <dbl> 13.5, 19.1, 16.9, 15.1, 13.7, 17.9, 12.6, 2…
## $ `Academic Services Spend` <dbl> 1552, 1106, 1034, 1266, 1245, 1246, 1309, 8…
## $ `Facilities Spend` <dbl> 672, 548, 442, 748, 535, 711, 386, 327, 567…
## $ `Good Honours` <dbl> 77.9, 66.9, 64.2, 78.3, 76.0, 67.5, 77.8, 7…
## $ `Degree Completion` <dbl> 84.3, 88.6, 82.3, 92.6, 92.6, 82.3, 86.4, 8…
## $ `Overall Score` <dbl> 718, 555, 629, 754, 747, 598, 747, 544, 808…
## $ `Graduate Prospects 2017` <dbl> 76.0, 62.3, 67.2, 78.4, 79.8, 59.0, 79.9, 6…
## $ `Graduate Prospects 2018` <dbl> 78.6, 67.9, 66.1, 80.5, 72.3, 70.0, 80.9, 7…
func_plotNAs(data_master)
ggplot(data = data_master, mapping = aes(`Graduate Prospects 2018`)) +
geom_histogram(fill = "#E69F00", colour = "#56B4E9", alpha = 0.4) +
labs(title = "Graduate Prospects 2018 Distribution", x = "Graduate Prospects 2018", y = "Count") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
From the plot below, we have the correlation of each independent variable against the other independent variables. It is important to ensure that our independent variables are not highly correlated with sex, in accordance with the GDS Ethics Framework. Moreover, we should also remove highly-correlated independent variables because they will negatively affect our regression.
From the correlation matrix plot below, we see the following:
Given this, we will drop the State Private ratio and PG FT student count variables from our regressions. This is because keeping correlated independent variables makes our regression coefficient estimates unstable and unreliable.
# Correlation heatmap
ggplot(data = mat_corrIV, mapping = aes(x = `Var2`, y = `Var1`, fill = value)) +
geom_tile(colour = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab", name = "Pearson\nCorrelation\nCoefficient") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(title = "IV Correlation Heatmap", x = "Independent Variables", y = "Independent Variables")
These independent variables are otherwise known as predictors/explanatory variables because they are used to explain the relationship with the dependent variable, Graduate Prospects 2018.
Multi-linear regression is multi/multiple in the sense that it takes several independent variables as inputs to model their relationship with the dependent variable.
Multi-linear regression is linear in the sense that it assumes there is a linear relationship between the independent variables and the dependent variable.
For A4: Independence, it is less important to meet this assumption because we are dealing with non-time-series data.
Whilst not included as an assumption, the impact extreme observations can have on determining the regression line can be significant, and therefore a robust regression model would remove these observations.
# Set-up 10-fold cross-validation
trainSet <- trainControl(method = "cv", number = 10)

# Run linear regression model
model_regFreq <- train(form = `Graduate Prospects 2018` ~ .,
                       data = mat_reg, trControl = trainSet,
                       method = "lm")
A1: Linearity | In the top-left panel, we have the residuals vs. fitted values plot. If we find equally spread residuals around the horizontal red line, this is a good sign that there is a linear relationship between the independent variables and the dependent variable. In this instance, the points/residuals are spread weakly equally/randomly around our red line, so we weakly meet the linearity assumption.
A2: Normal Distribution | In the top-right panel, we have the normal Q-Q plot. If we find the residuals are lined up well along the straight, diagonal line, then this is a good sign that we have normality of residuals. In this instance, our points/residuals follow the straight, diagonal line well, so we meet the normality assumption. 7
A3: Homoscedasticity | In the bottom-left panel, we have the scale-location plot. If we find equally/randomly spread points around the red, horizontal line, then this is a good sign that we have homoscedasticity of the residuals. In this instance, our points/residuals are spread further to the left of the plot, and more concentrated on the right of the plot, causing the red line to be diagonal rather than horizontal. We do not meet the homoscedasticity assumption. However, since we are not dealing with time-series data, this is not of significant importance.
A4: Independence | This assumption can only be validated at the data collection stage. We believe that observations were collected independently of one another, so we satisfy this independence assumption.
A5: Multicollinearity | In the correlation heatmap plot in Section 2: Data Exploration - Independent Variables, we removed State Private ratio and PG FT student count from our set of independent variables because they were correlated with other variables. As such, we meet the no-multicollinearity assumption.
Extreme Points | In the bottom-right panel, we have the residual vs leverage plot. If we find all our observations lying within the red-dotted Cook’s distance lines, then this is a good sign that we do not have any extreme points that greatly influence the regression line. In this instance, all our observations are within the inner Cook’s distance dotted lines, so we do not have any extreme points.
par(mfrow = c(2, 2))
plot(model_regFreq$finalModel)
summary(model_regFreq)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44198 -0.08712 0.00567 0.07955 0.41216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.25196 0.08945 2.817 0.00571 **
## `\\`EU to non-EU ratio\\`` 0.07749 0.09681 0.800 0.42513
## `\\`Pass Fail ratio\\`` 0.20204 0.09746 2.073 0.04040 *
## `\\`STEM non-STEM ratio\\`` 0.11506 0.07713 1.492 0.13852
## `\\`Research Quality\\`` 0.38478 0.08790 4.377 0.0000266 ***
## `\\`Student-Staff Ratio\\`` -0.12663 0.09358 -1.353 0.17864
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1428 on 115 degrees of freedom
## Multiple R-squared: 0.4058, Adjusted R-squared: 0.38
## F-statistic: 15.71 on 5 and 115 DF, p-value: 0.000000000009083
ModelMetrics::rmse(actual = model_regFreq$finalModel$model$.outcome,
                   predicted = model_regFreq$finalModel$fitted.values)
If we were to repeat our sampling 100 times and construct a 95% confidence interval (bounded by the 2.5% and 97.5% quantiles) each time, then, on average, 95 of those intervals would contain the true parameter values.
confint(object = model_regFreq$finalModel, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 0.07 0.43
## EU to non-EU ratio -0.11 0.27
## Pass Fail ratio 0.01 0.40
## STEM non-STEM ratio -0.04 0.27
## Research Quality 0.21 0.56
## Student-Staff Ratio -0.31 0.06
Given that our frequentist multi-linear regression satisfied the linear regression assumptions, we will assume the Bayesian multi-linear regression satisfies them too, on the principle that both use the same dependent and independent variables, and the same data.
In this way, we will forego checking the linear regression assumptions.
# Graduate Prospects 2017 distribution
plotdist(data = temp, histo = TRUE, demp = TRUE)
To test our assumption above, we plot the Cullen and Frey graph below, which allows us to see what common distribution our dependent variable, Graduate Prospects 2017, aligns most closely to.
Our dependent variable, Graduate Prospects 2017, is the blue dot on the graph; it is closest to the triangle and asterisk markers, whilst also lying in the grey region. We will discuss each of these aspects in turn.
\[ U(\alpha,\ \beta) \text{ where } -\infty < \alpha < \beta < \infty \]
\[ N(\mu,\ \sigma^2) \text{ where } \mu \in \mathbb{R} \text{ and } \sigma^2 > 0 \]
\[ B(\alpha,\ \beta) \text{ where } \alpha,\ \beta > 0 \]
# Graduate Prospects 2017 distribution
descdist(data = temp, discrete = FALSE)
## summary statistics
## ------
## min: 0.00000001 max: 1
## median: 0.4547414
## mean: 0.4819571
## estimated sd: 0.2140505
## estimated skewness: 0.07885535
## estimated kurtosis: 2.103885
# Fit the uniform and normal distributions
fitdist(data = temp, distr = "unif", method = "mme",
discrete = F, na.rm = T)
## Fitting of the distribution ' unif ' by matching moments
## Parameters:
## estimate
## 1 0.1127459
## 2 0.8511683
fitdist(data = temp, distr = "norm", method = "mme",
discrete = F, na.rm = T)
## Fitting of the distribution ' norm ' by matching moments
## Parameters:
## estimate
## mean 0.4819571
## sd 0.2131642
par(mfrow = c(2,2))
func_plotGoF(dist = rpt_dist, titles = name_dist)
# Create vector of regression parameters
vec_regParameters <- "beta0"
len <- ncol(x = mat_reg) - 1
for(i in 1:len) vec_regParameters <- append(x = vec_regParameters, values = paste0("beta[", i, "]"))
numRows <- nrow(mat_reg)

# Define DV and IVs for regression
data_regBayes <- list(x = mat_reg[1:numRows, 1:len], y = mat_reg[1:numRows, len + 1])

# Run model
model_regBayes <- run.jags(method = "parallel", model = "bayesModel.bug", monitor = vec_regParameters,
                           data = data_regBayes, n.chains = 4, summarise = FALSE,
                           plots = FALSE, silent.jags = TRUE)
The purpose of running Markov chain Monte Carlo (MCMC) diagnostics is to spot large issues with our chains. This helps inform us whether the number of iterations we run is enough to obtain convergence in our chains, and therefore whether we can believe the samples generated from this process are truly representative for estimating the characteristics of our posterior distribution of interest.
More specifically, there are three main goals we wish to achieve when generating an MCMC sample from our posterior distribution:
In the plots below, we have the trace/time-series plots of the MCMC chains run. They show the values each of our \(\beta\) parameters took during the runtime of each chain. If our MCMC method explores the parameter space effectively and efficiently (that is, it does so randomly) for each of our \(\beta\) regression parameters, then we should expect to see a “hairy caterpillar” in these trace/time-series plots.
More concretely, we want to see no clear pattern nor trend in the plot, meaning no obvious movements towards the top or bottom of the plots. In this way, our MCMC method leads our \(\beta\) regression parameters to consider all possible values they can take, with approximately the same probability of landing on each possible value.
Checking for representativeness, we see that our four parallel chains generally overlap each other for each \(\beta\) parameter, so our chains fulfil this condition.
par(mfrow = c(3,3))
#traceplot(as.mcmc(x=model_regBayes))
traceplot(model_regBayes[[1]])
In the plots below, we have the autocorrelation function (ACF) plots, which display the correlation coefficient at different lags. We want to see the vertical lines in each plot become short quickly across the lags, as this signifies that our chains are not auto-correlated, meaning the steps our “walker” takes when exploring the parameter space for \(\beta_i\) are not substantially influenced by the previous step taken.
What this shows is that for our \(\beta_0\), \(\beta_5\), and \(\beta_6\) parameters, the correlation coefficient does not fall by much over successive lags, and therefore our chains for them are highly auto-correlated – they change gradually step-by-step. Hence, successive samples for these parameters do not provide independent information about our \(\beta_0\), \(\beta_5\), and \(\beta_6\) distributions. For our other parameters, we have less autocorrelation. Therefore, we do not fully satisfy this condition.
# ACF Plot for first chain only
autocorr.plot(model_regBayes[[1]])
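Autocorrelation can also be quantified rather than just eyeballed: coda's effective sample size discounts correlated draws. A sketch on an artificially autocorrelated AR(1) chain (not our model's output):

```r
library(coda)

# An artificially autocorrelated chain: high autocorrelation
# shrinks the effective sample size well below the number of
# iterations actually run
set.seed(11)
ar_chain <- as.numeric(arima.sim(model = list(ar = 0.9), n = 10000))
effectiveSize(mcmc(ar_chain))  # far fewer than 10000 effective draws
```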
When running our chains using parallel computing methods, the run took 2.07 seconds, which is a reasonable time.
From the trace plots in the Chain Representativeness section above, we can set the burn-in for our chains to 6,000 iterations. This allows our \(\beta_i\) parameters to settle to a stationary state faster, thereby improving our efficiency further.
summary(model_regBayes)
##
## Iterations = 5001:15000
## Thinning interval = 1
## Number of chains = 4
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## beta0 0.2538 0.09244 0.0004622 0.003951
## beta[1] 0.0775 0.09893 0.0004947 0.001227
## beta[2] 0.2009 0.09895 0.0004947 0.001637
## beta[3] 0.1143 0.07893 0.0003946 0.001157
## beta[4] 0.3842 0.09106 0.0004553 0.003101
## beta[5] -0.1286 0.09507 0.0004753 0.003067
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## beta0 0.076885 0.19029 0.2536 0.31615 0.43442
## beta[1] -0.117524 0.01105 0.0778 0.14357 0.27211
## beta[2] 0.008482 0.13403 0.2008 0.26803 0.39376
## beta[3] -0.041911 0.06146 0.1150 0.16771 0.26803
## beta[4] 0.202946 0.32306 0.3840 0.44545 0.56491
## beta[5] -0.316176 -0.19215 -0.1279 -0.06417 0.05582
For a single 95% credible interval, there is a 95% probability that the parameter lies within the region bounded by the lower and upper values.
# Take C.I associated with first chain only
#HPDinterval(obj = mcmc.list(as.mcmc(model_regBayes)), prob = 0.95)[[1]]
HPDinterval(obj = mcmc.list(as.mcmc(model_regBayes[[1]])), prob = 0.95)[[1]]
## lower upper
## beta0 0.0972064 0.4530410
## beta[1] -0.1281380 0.2657510
## beta[2] 0.0194725 0.3976300
## beta[3] -0.0429021 0.2649150
## beta[4] 0.1838200 0.5261700
## beta[5] -0.3265410 0.0406598
## attr(,"Probability")
## [1] 0.95
model_regFreqSum
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44198 -0.08712 0.00567 0.07955 0.41216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.25196 0.08945 2.817 0.00571 **
## `\\`EU to non-EU ratio\\`` 0.07749 0.09681 0.800 0.42513
## `\\`Pass Fail ratio\\`` 0.20204 0.09746 2.073 0.04040 *
## `\\`STEM non-STEM ratio\\`` 0.11506 0.07713 1.492 0.13852
## `\\`Research Quality\\`` 0.38478 0.08790 4.377 0.0000266 ***
## `\\`Student-Staff Ratio\\`` -0.12663 0.09358 -1.353 0.17864
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1428 on 115 degrees of freedom
## Multiple R-squared: 0.4058, Adjusted R-squared: 0.38
## F-statistic: 15.71 on 5 and 115 DF, p-value: 0.000000000009083
model_regBayesSum
##
## Iterations = 5001:15000
## Thinning interval = 1
## Number of chains = 4
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## beta0 0.2538 0.09244 0.0004622 0.003951
## beta[1] 0.0775 0.09893 0.0004947 0.001227
## beta[2] 0.2009 0.09895 0.0004947 0.001637
## beta[3] 0.1143 0.07893 0.0003946 0.001157
## beta[4] 0.3842 0.09106 0.0004553 0.003101
## beta[5] -0.1286 0.09507 0.0004753 0.003067
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## beta0 0.076885 0.19029 0.2536 0.31615 0.43442
## beta[1] -0.117524 0.01105 0.0778 0.14357 0.27211
## beta[2] 0.008482 0.13403 0.2008 0.26803 0.39376
## beta[3] -0.041911 0.06146 0.1150 0.16771 0.26803
## beta[4] 0.202946 0.32306 0.3840 0.44545 0.56491
## beta[5] -0.316176 -0.19215 -0.1279 -0.06417 0.05582
model_regFreqci
## 2.5 % 97.5 %
## (Intercept) 0.07 0.43
## EU to non-EU ratio -0.11 0.27
## Pass Fail ratio 0.01 0.40
## STEM non-STEM ratio -0.04 0.27
## Research Quality 0.21 0.56
## Student-Staff Ratio -0.31 0.06
model_regBayesci
## lower upper
## beta0 0.0972064 0.4530410
## beta[1] -0.1281380 0.2657510
## beta[2] 0.0194725 0.3976300
## beta[3] -0.0429021 0.2649150
## beta[4] 0.1838200 0.5261700
## beta[5] -0.3265410 0.0406598
## attr(,"Probability")
## [1] 0.95
For text-data cleaning, we would like to investigate the quanteda package, which is said to be more powerful, faster, and more efficient than existing text-analysis packages such as tm.
We have removed NAs from our datasets so that we can carry out the Bayesian analysis. With more time, we would like to use data imputation methods so that we do not lose data.
Use parallel computing methods for running caret. An example can be found in the documentation.
Run Bayesian multi-linear regression model with a uniform distribution as our prior.
We would like to analyse our MCMC chains more robustly by considering the Gelman-Rubin statistic and plots for chain representativeness; checking whether our multivariate effective sample size is below our minimum effective sample size; and checking the Monte Carlo Standard Error (MCSE) for chain accuracy and stability.
Explore better ways to communicate this report, including rmarkdown slides and incorporating rshiny functionality to make the report and slides interactive.
Use Mean Absolute Error (MAE) instead of Root Mean Squared Error (RMSE) as our metric for evaluating the performance of our multi-linear regression models, because it is more interpretable. Also, each error influences the MAE in direct proportion to the absolute value of the error, which is not the case for the RMSE.
Complete Glossary of Terms.
Improve use of Git so that big features/functionality added are committed to the central repo.
Investigate a plot highlighting differences in C.I. between Bayesian and frequentist.
Flesh out GitHub README.md file.
Experiment with transferring this markdown report into the R package, bookdown.
Try to get pander and plotly to work for pretty outputting of dataframes and plots in markdown.
Implement continuous integration/deployment as part of this.
This paper adopts RAP principles to ensure ease of reproducibility of analysis.
The underlying principle here is the belief that analysis should be reproducible, so that any analyst can take the code used here, and generate the same outputs with minimal manual engineering.
This ensures transparency and accountability of analysis as part of a wider framework of good quality assurance and more effective knowledge management. Additionally, it allows the automation of publications such as SFRs within government, which significantly reduces the production time of publications and ensures more consistent quality.10
For an amusing video that brings the benefits of RAP to life, see this Youtube clip here.
Particular practices adopted to ensure alignment with RAP principles are:
The code shown within the HTML document is not the complete code used for this project. Instead, only the most interesting code will be displayed within the document. To see the complete code underlying this analysis, please refer to the .rmd file and .R scripts within the RProject.
LaTeX was used to write mathematical functions and equations within the HTML report. In particular, the LaTeX code for RMSE was taken from inspecting the element (HTML code) on this webpage from Stanford.
Within the code scripts, the following naming conventions are used:
The theme for this HTML document is journal. A list of alternative themes already included within the rmarkdown package can be found here. For further themes, download the prettydoc package, which gives you more themes to choose from.
The code highlighting for this HTML document is haddock. A further list of options can be found here.
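Both settings are configured in the R Markdown YAML header; a minimal sketch (assuming the standard `html_document` output format) looks like:

```yaml
output:
  html_document:
    theme: journal
    highlight: haddock
```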
The packages used in this analysis are:
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] runjags_2.0.4-6 rjags_4-10 coda_0.19-4 fitdistrplus_1.1-3
## [5] survival_3.1-12 MASS_7.3-51.6 caret_6.0-86 lattice_0.20-41
## [9] rmarkdown_2.6 rvest_0.3.6 xml2_1.3.2 stringdist_0.9.6.3
## [13] SnowballC_0.7.0 tm_0.7-8 NLP_0.2-1 magrittr_2.0.1
## [17] readr_1.4.0 reshape2_1.4.4 ggplot2_3.3.2 dplyr_1.0.2
## [21] tidyr_1.1.2
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 splines_4.0.2 foreach_1.5.1
## [4] prodlim_2019.11.13 assertthat_0.2.1 stats4_4.0.2
## [7] emo_0.0.0.9000 renv_0.12.3 yaml_2.2.1
## [10] slam_0.1-48 ipred_0.9-9 pillar_1.4.6
## [13] glue_1.4.2 pROC_1.16.2 digest_0.6.27
## [16] colorspace_1.4-1 recipes_0.1.15 htmltools_0.5.0
## [19] Matrix_1.2-18 plyr_1.8.6 timeDate_3043.102
## [22] pkgconfig_2.0.3 purrr_0.3.4 scales_1.1.1
## [25] gower_0.2.2 lava_1.6.8.1 tibble_3.0.4
## [28] generics_0.1.0 farver_2.0.3 ellipsis_0.3.1
## [31] withr_2.3.0 nnet_7.3-14 cli_2.1.0
## [34] crayon_1.3.4 evaluate_0.14 fansi_0.4.1
## [37] nlme_3.1-148 class_7.3-17 tools_4.0.2
## [40] data.table_1.13.6 hms_0.5.3 lifecycle_0.2.0
## [43] stringr_1.4.0 munsell_0.5.0 compiler_4.0.2
## [46] rlang_0.4.10 grid_4.0.2 iterators_1.0.13
## [49] rstudioapi_0.11 labeling_0.4.2 gtable_0.3.0
## [52] ModelMetrics_1.2.2.2 codetools_0.2-16 R6_2.5.0
## [55] lubridate_1.7.9.2 knitr_1.30 utf8_1.1.4
## [58] stringi_1.5.3 parallel_4.0.2 Rcpp_1.0.5
## [61] vctrs_0.3.4 rpart_4.1-15 tidyselect_1.1.0
## [64] xfun_0.18
First-degree students are defined as students taking their first undergraduate degree or professional qualification.↩︎
One of the main reasons why frequentist statistics is the dominant mode of statistical thinking is computational power: before recent advances, it was very difficult to perform Bayesian statistics efficiently. Another key reason is centred on a historical, epic, and longstanding war between frequentist statisticians (Ronald Fisher, Jerzy Neyman and Egon Pearson) and Bayesian statisticians (Thomas Bayes, Pierre-Simon Laplace), which was won by the former group in the early 20th Century. 💥↩︎
There can be cases where the two approaches lead to substantially different results.↩︎
Example taken from this StackExchange question. Another excellent Star Wars analogy of Bayesian reasoning can be found at the countbayesie website.↩︎
Indeed, the fact that Bayesian inference is more intuitive was the subject of discussion in Daniel Kahnemann’s best-seller, Thinking, Fast and Slow.↩︎
For a good webpage on how to interpret regression diagnostic plots, visit the University of Virginia library’s webpage.↩︎
If we did not meet this assumption, we could Box-Cox transform our dependent variable to try to make it normally distributed.↩︎
One typically complements the goodness-of-fit plots by computing recommended goodness-of-fit statistics such as the Kolmogorov-Smirnov statistic, the Cramér-von Mises statistic, or the Anderson-Darling statistic, and then uses these statistics to test against the null hypothesis that the empirical (dependent variable) distribution fits an estimated (proposed) distribution. However, since these recommended goodness-of-fit statistics are available only for maximum likelihood estimations, we cannot take this complementary approach.↩︎
We use k-fold cross-validation with 10 folds to compute the RMSE here. The code has been hidden because a custom function was created to compute it, and it is not immediately obvious that the function computes the RMSE, as it returns several objects within a list.↩︎
For more information on RAP, see this government blog post.↩︎
We do not have a live link to an underlying database in this project, but we have web-scraped HTML tables.↩︎
This has now been removed as not all the packages in tidyverse are required, and when downloading these packages on a different workspace, we want to minimise the time it takes to download the required packages for this project. It has instead been replaced by the tidyr, dplyr, ggplot2, readr, reshape2, and magrittr packages.↩︎