# Packages
library(GLMsData) # datasets foror GLMs
library(tidyverse) # data wrangling and visualising
library(brms) # fitting Bayesian models
library(bayesplot) # visualise MCMC and posterior predictive checks
library(tidybayes) # wrangling and visualising Bayesian models
# Conflicten tussen packages
conflicted::conflicts_prefer(dplyr::filter)
conflicted::conflicts_prefer(dplyr::lag)
conflicted::conflicts_prefer(tidyr::extract)
conflicted::conflicts_prefer(brms::ar)
conflicted::conflicts_prefer(brms::dstudent_t)
conflicted::conflicts_prefer(brms::pstudent_t)
conflicted::conflicts_prefer(brms::qstudent_t)
conflicted::conflicts_prefer(brms::rstudent_t)
conflicted::conflict_prefer("rhat", "brms")
# reduce html export file size by specifying graphics size
knitr::opts_chunk$set(dev = "jpeg", dpi = 56,
fig.width = 4, fig.height = 3,
out.width = "60%")
In this section we briefly review some basic principles of Bayesian statistics. We would also like to refer to the presentation of October 24, 2023 (infomoment 24/10/2023).
data.frame(
what = c(
"interpretation of probability",
"used knowledge",
"hypothesis tests",
"terminology",
"parameters and data"
),
Frequentist = c(
"long term relative frequency",
"data only",
"chance of data that is at least as extreme as the collected data",
"p-value, confidence interval, type 1 and type 2 error ...",
"parameters are fixed and unknown, data is random"
),
Bayesian = c(
"relative probability",
"data and prior knowledge",
"probability that the hypothesis is correct",
"prior, likelihood, posterior, credible interval ...",
"parameters are random and have a distribution, data is fixed"
)
) %>%
kable(booktabs = TRUE,
col.names = c("", "Frequentist", "Bayesian"))
Frequentist | Bayesian | |
---|---|---|
interpretation of probability | long term relative frequency | relative probability |
used knowledge | data only | data and prior knowledge |
hypothesis tests | chance of data that is at least as extreme as the collected data | probability that the hypothesis is correct |
terminology | p-value, confidence interval, type 1 and type 2 error … | prior, likelihood, posterior, credible interval … |
parameters and data | parameters are fixed and unknown, data is random | parameters are random and have a distribution, data is fixed |
Why and when should you use Bayesian models?
Statistical inference is concerned with trying to learn something about a (population) parameter based on data. Suppose we are interested in the number of ant species found in a research site of a pre-specified size. Numbers are often specified by a Poisson distribution since this distribution only has a probability mass at positive integers.
\[ X \sim Poisson(\lambda)\] with \(X\) the number of ant species and \(\lambda\) the parameter we want to estimate. In a Poisson distribution, there is only one parameter; \(\lambda\) is equal to the mean and variance of the distribution. We do not know the true value of \(\lambda\) a priori. Our model defines a collection of probability models; one for each possible value of \(\lambda\). We call this collection of models the likelihood.
In the figure below we show three examples of a potential likelihood. There are three probability density functions of Poisson distributions, each with a different parameter \(\lambda\).
likelihood <- data.frame(x = rep(seq(0, 25), 3),
lambda = c(rep(5, 26), rep(10, 26), rep(15, 26))) %>%
mutate(prob = stats::dpois(x, lambda = lambda),
lambda = as.factor(lambda))
likelihood %>%
ggplot(aes(x = x, y = prob, color = lambda)) +
geom_point() +
geom_errorbar(aes(ymin = 0, ymax = prob), width = 0) +
xlab("number of ant species at a site") +
ylab("probability mass")
Suppose we find 8 different species of ants on one particular site. Our likelihood tells us that there are an infinite number of values for \(\lambda\), which gives us the result 8. The figure below shows that the probability of \(X = 8\) is different in each of the three shown Poisson distributions.
likelihood %>%
ggplot(aes(x = x, y = prob, color = lambda)) +
geom_point() +
geom_errorbar(aes(ymin = 0, ymax = prob), width = 0) +
geom_bar(data = likelihood %>% filter(x == 8),
color = "grey", stat = "identity", alpha = 0.7) +
xlab("Number of ant species at a site") +
ylab("probability mass") +
facet_grid(cols = vars(lambda))
Each of these models, with a different value of \(\lambda\), can give us the value of \(X=8\) with a different probability.
The likelihood function gives us, for each potential value of the parameter, the probability of the data. The figure below shows this for our data with only one observation \(x=8\).
ll <- data.frame(lambda = seq(0, 20, 0.2)) %>%
mutate(likelihood = dpois(8, lambda))
ll %>%
ggplot(aes(x = lambda, y = likelihood)) +
geom_line() +
geom_point(data = ll %>% dplyr::filter(lambda %in% c(5, 10, 15)),
aes(color = as.factor(lambda),x = lambda, y = likelihood),
size = 2) +
scale_color_discrete(name = "lambda") +
geom_vline(aes(xintercept = 8), color = "red") +
geom_label(aes(label = "ML", x = 8, y = 0.16), fill = "red",
alpha = 0.5)
With statistical inference we actually want to invert the likelihood \(p(X|\lambda) \rightarrow p(\lambda|X)\). In this way, we hope to find out which model of all potential models is the most meaningful or probable. Both frequentist and Bayesian methods essentially invert \(p(X|\lambda) \rightarrow p(\lambda|X)\), but the method differs.
The maximum likelihood method in frequentist statistics will choose the value of lambda where the likelihood function is at its maximum as the parameter estimate (\(\lambda = 8\)). A 90% confidence interval can then be calculated around this parameter estimate \((4.8 \leq \lambda \leq 14.4)\). The decision space for the parameter \(\lambda\) is thus divided into two parts: (1) values that are likely and (2) values that are not likely.
Bayesian inference, on the other hand, uses Bayes’ rule:
\[ p(\lambda|X) = \frac{p(X|\lambda) * p(\lambda)}{p(X)} \]
This ensures that we will have a probability for every potential value of \(\lambda\).
The parameters in Bayesian models therefore have a distribution. This contrasts with frequentist inference where the solution space for \(\lambda\) is split into two parts: values that are likely or not likely.
In our example, the following meaning is given to each of the elements in Bayes’ rule:
On this website you will find a nice illustration of how prior and likelihood are combined to obtain a posterior (tutorial).
An simple example of Bayes’ rule
Suppose you go to the doctor, and he says you test positive for a rare disease. However, he also tells you that the test cannot always be trusted. In Belgium, only 0.1% of all people actually have the disease. If you have the disease, the test is positive in \(\frac{99}{100}\) tests. But even if you do not have the disease, the test will be positive in \(\frac{5}{100}\) tests. You would like to know what the probability is that you actually have the disease, given that positive test, and you use Bayes’ theorem and the “Law of total probability”:
\[ p(\text{you have the disease}|T^+) = \frac{p(T^+|\text{you have the disease})*p(\text{you have the disease})}{p(T^+)} \\ = \frac{p(T^+|\text{you have the disease})*p(\text{you have the disease})}{p(T^+|\text{you have the disease})*p(\text{you have the disease}) + p(T^+|\text{you do not have the disease})* p(\text{you do not have the disease)}}\\ =\frac{0.99*0.001}{0.99*0.001 + 0.05*0.9999}\\ =1.94\% \] Bayes’ rule allows us to update our prior knowledge about the disease (only 0.1% have the disease in the general population) with the new information (the data: a positive test) to obtain the posterior. Although, at first glance, the test seems quite reliable, Bayes’ rule shows that the probability that you have the disease, given that you tested positive, is still quite low.
Remember Bayes’ rule: \[ p(\lambda|X) = \frac{p(X|\lambda) * p(\lambda)}{p(X)} \]
We already mentioned that the denominator of this equation is very difficult to calculate. There are three possible solutions:
We will focus on MCMC in this tutorial.
MCMC stands for ‘Markov chain Monte Carlo’. A Markov chain is an algorithm to simulate a random walk and Monte Carlo is an algorithm to sample from the parameter space.
The posterior distribution of a parameter can be compared to a mountain (see figure below). The top of the mountain is where the parameter value is most likely after combining the likelihood of your data with your prior belief. However, the location and shape of the mountain peak is unknown. The goal of an MCMC algorithm is to efficiently explore the mountain peak and its surroundings.
An MCMC algorithm takes random samples from the posterior and generates a series of steps that explores the entire range of the posterior distribution. The rules ensure that the sampler stays near the top of the mountain the entire time. It is a random walk in the parameter space where the posterior distribution is highest (the top of the mountain). The size and direction of each step varies.
When taking a step uphill, the MCMC always continues. If the step is downhill, the MCMC does not always continue. The probability of going downhill depends on how deep the step down is. Small steps downwards are often taken. If the step downwards is very steep, the MCMC usually will not go into that direction and sample a new step.
The absolute height of the mountain (posterior likelihood) in itself does not matter. Only the relative differences or shape of the mountain matters. This means that we do not need to calculate the denominator of Bayes’ rule, which is the reason why the algorithm is frequently used in Bayesian statistics.
A simple MCMC algorithm is the Metropolis algorithm. It is described by these formal rules:
Since the probability of accepting or not accepting parameter values with a lower posterior varies with the difference in posterior value, an MCMC converges in the limit (after many iterations) to the posterior distribution.
For simple cases, scripting MCMC algorithm is not that complex. Below is a simple example.
A linear regression with two parameters to be estimated:
\[ \bar{Y} = \beta_0 + \beta_1 * X \]
\(\beta_0\) = intercept
\(\beta_1\) = slope
Suppose we have a dataset with 5 measurements of \(X\) with response \(Y\).
y <- c(1.3, 5.2, 9.7, 12.8, 14.9)
x <- c(-2.2, -1.3, 0.3, 2.1, 4.9)
df_data <- data.frame(y = y, x = x)
ggplot(df_data, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE)
# grootte van de plot beperken
A linear regression with LM yields:
lm1 <- lm(formula = y ~ x, data = df_data)
lm1$coefficients
## (Intercept) x
## 7.366086 1.860413
With confidence interval (\(\alpha = 0.1\))
confint(lm1, level = 0.9)
## 5 % 95 %
## (Intercept) 5.173664 9.558508
## x 1.032228 2.688598
We source some short functions to calculate the (log) likelihood and the prior and to execute the MCMC metropolis algorithm.
source(file = here("static", "code", "brms_modeling", "mcmc_functions.R"))
For this simple model with a small data set, we can calculate and plot the posterior for a large number of combinations of ‘beta_0’ and ‘beta_1’.
df <- expand.grid(beta_0 = seq(-1, 15, 0.5), beta_1 = seq(-3, 8, 0.5))
df$post <- NA
for (i in 1:nrow(df)) {
df[i, "post"] <- ll.fun(beta_0 = df[i, "beta_0"],
b = df[i, "beta_1"]) +
prior.fun(beta_0 = df[i, "beta_0"],
beta_1 = df[i, "beta_1"])
}
ggplot(df, aes(x = beta_0, y = beta_1, z = post)) +
geom_raster(aes(fill = post)) + geom_contour(colour = "black") +
scale_fill_gradientn(colours = rainbow(n = 4))
To start the MCMC, we need to choose starting values for \(\beta_0\) and \(\beta_1\):
mcmc_small <- MCMC_metro(x = x, # vector x waarden
y = y, # vector y waarden
N = 150, # lengte van de MCMC
sigma = 1, # stapgrootte (sd)
init_b0 = 12, # initiële waarde beta_0 (intercept)
init_b1 = 6) # initiële waarde beta_1 (helling)
ggplot(df, aes(x = beta_0, y = beta_1, z = post)) +
geom_raster(aes(fill = post)) +
geom_contour(binwidth = 20, colour = "black") +
scale_fill_gradientn(colours = rainbow(n = 4)) +
geom_path(data = mcmc_small,
aes(x = beta_0, y = beta_1, z = NA), colour = "black") +
geom_point(data = mcmc_small,
aes(x = beta_0, y = beta_1, z = NA), colour = "red") +
coord_cartesian(xlim = c(-1, 14.5), ylim = c(-3, 7)) +
theme(legend.position = "none")
The starting value is quite far from the maximum likelihood. It takes a while for the MCMC to stabilize in the vicinity of the top of the mountain. Therefore, the first part of the MCMC is never used. This is called the ‘warmup’, ‘burn-in’ or ‘tuning’.
We now run the same model, but with a much longer MCMC (more iterations):
mcmc_l <- MCMC_metro(x = x,
y = y,
N = 5000,
sigma = 1,
init_b0 = 12,
init_b1 = 6)
p1 <- ggplot(df, aes(x = beta_0, y = beta_1, z = post)) +
geom_raster(aes(fill = post)) +
geom_contour(binwidth = 20, colour = "black") +
scale_fill_gradientn(colours = rainbow(n = 4)) +
geom_path(data = mcmc_l,
aes(x = beta_0, y = beta_1, z = NA), colour = "black") +
geom_point(data = mcmc_l,
aes(x = beta_0, y = beta_1, z = NA), colour = "red") +
coord_cartesian(xlim = c(-1, 14.5), ylim = c(-3, 7)) +
theme(legend.position = "none")
p2 <- ggplot(data = mcmc_l, aes(x = beta_0)) + geom_histogram(bins = 50) +
scale_x_continuous(limits = c(-1, 14.5), breaks = seq(-1, 14.5, 1)) +
theme(axis.title.x = element_blank())
p3 <- ggplot(data = mcmc_l, aes(x = beta_1)) + geom_histogram(bins = 50) +
coord_flip() +
scale_x_continuous(limits = c(-3, 7), breaks = seq(-3, 7, 1)) +
theme(axis.title.y = element_blank())
empty <- ggplot()
gridExtra::grid.arrange(p2, ggplot(), p1, p3, ncol = 2, nrow = 2,
widths = c(4, 1), heights = c(1, 4))
The MCMC for each parameter can be displayed in a so called ‘trace plot’ where we visualise the sampled parameter values over time.
ann_text <- tibble(param = c("beta_0", "beta_1"),
x = 210,
y = c(max(mcmc_l$beta_0), max(mcmc_l$beta_1)),
lab = "warmup")
mcmc_l %>%
dplyr::select(iter, beta_0, beta_1) %>%
pivot_longer(cols = c("beta_0", "beta_1"), names_to = "param") %>%
ggplot(aes(x = iter, y = value)) + geom_line() +
facet_wrap(~param, nrow = 2, scales = "free_y") +
annotate("rect", xmin = 0, xmax = 200, ymin = -Inf, ymax = +Inf,
alpha = 0.3, fill = "gold") +
geom_label(data = ann_text, aes(label = lab, x = x, y = y),
hjust = "left", vjust = "top",
fill = "gold")
Summary statistics of beta_0
(intercept)
quantile(mcmc_l$beta_0[200:5000], probs = c(0.05, 0.5, 0.95))
## 5% 50% 95%
## 5.529344 7.314135 9.168863
Summary statistics of beta_1
(slope)
quantile(mcmc_l$beta_1[200:5000], probs = c(0.05, 0.5, 0.95))
## 5% 50% 95%
## 1.237901 1.842052 2.520324
The samples in the MCMC are consecutive. To obtain more independent samples, we could save only every \(x^{th}\) value of the MCMC (thinning). This saves memory space, while increasing the quality (independence) of the sample. The final number of iterations however should remain large enough to correctly calculate summary statistics.
There are more sophisticated samplers such as WinBugs, Jags or Stan, but the MCMC principles are always the same.
The brms package in R (see further) uses the Stan sampler by default.
Once we know the posterior, there are several possibilities to obtain point estimates in Bayesian inference (see also figures below):
There are also different options for quantifying uncertainty:
For symmetric, unimodal distributions, HPDI and CPI are the same (see first figure below). Otherwise there may be quite a difference between the two intervals (see figures 2 and 3 below).
A good example of bimodal data is the size of stag beetles where males are much larger than females. If we would plot the size of all stag beetles, we would obtain something close to the second and third figure below. The second figure below shows the HPDI. It can be seen that there are little beetles with a size between 34.1mm and 37.5mm: this is slightly too large for a female and too small for a male. These sizes of beetles are therefor excluded from the HPDI. The equal-tailed credible interval would yield two boundaries; 5% of the beetles are smaller than the lower interval (27.4mm) and 5% are larger than the upper limit of the interval (46.9mm) as can be seen in the third figure below. However, there are sizes of beetles outside this interval that have a higher posterior then some sizes that are in this interval. For example, while 27.3mm is outside the interval, it has a higher posterior density that 46.8mm which is in the interval.
# specify a unimodal, symmetric distribution
unimodal <- data.frame(
x = seq(10, 50)
) %>%
mutate(prob = dnorm(x, mean = 30, sd = 5),
cum_sum = pnorm(x, mean = 30, sd = 5))
limits <- data.frame(type = c("CPI", "CPI", "HPDI", "HPDI"),
x = c(22, 38, 22, 38))
p1 <- ggplot(unimodal) +
geom_line(aes(x = x, y = prob)) +
geom_vline(data = limits, aes(xintercept = x, color = type, lty = type,
size = type),
alpha = 0.7) +
scale_color_manual(breaks = c("CPI", "HPDI"), values = c("black", "red")) +
scale_linetype_manual(breaks = c("CPI", "HPDI"),
values = c("solid", "dashed"))
# specify a bimodal distribution
bimodal <- data.frame(
x = seq(20, 60, 0.1)
) %>%
mutate(prob = (dnorm(x, mean = 43, sd = 3) +
dnorm(x, mean = 30, sd = 2))/2,
cum_sum = (pnorm(x, mean = 43, sd = 3) + pnorm(x, mean = 30, sd = 2))/2)
measures <- data.frame(measures = c("mean", "mode", "median"),
x = c(weighted.mean(x = bimodal$x, w = bimodal$prob),
30, 35.2))
limits <- data.frame(type = c("CPI", "CPI", "HPDI", "HPDI", "HPDI", "HPDI"),
x = c(27.4, 46.9, 25.9, 34.1, 37.5, 48.5))
p2 <- ggplot(bimodal) +
geom_vline(data = limits %>% filter(type == "HPDI"), aes(xintercept = x)) +
geom_bar(data = bimodal %>%
filter((x >= 25.9 & x <= 34.1) | (x >= 37.5 & x <= 48.5)),
aes(x = x, y = prob),
stat = "identity", color = "grey", fill = "grey") +
ggtitle("highest posterior density interval for a bimodal distribution") +
geom_line(aes(x = x, y = prob)) +
geom_vline(data = measures, aes(xintercept = x), color = "red") +
geom_label(data = measures,
aes(label = measures, x = x, y = c(0.1, 0.09, 0.08)), fill = "red",
alpha = 0.5) +
geom_hline(aes(yintercept = 0.0123), lty = 2)
p3 <- ggplot(bimodal) +
geom_vline(data = limits %>% filter(type == "CPI"), aes(xintercept = x)) +
geom_bar(data = bimodal %>% filter(x >= 27.4 & x <= 46.9),
aes(x = x, y = prob),
stat = "identity", color = "grey", fill = "grey") +
ggtitle("equal tail credible interval for a bimodal distribution") +
geom_line(aes(x = x, y = prob))
p1
p2
p3
There exist a number of software programs that apply MCMC for parameter estimates. Jags, Nimble and Stan are three examples for which there is an R interface. A recent comparison between these three shows that each of these three has strengths and weaknesses but that Stan, overall, is the best performer.
In this course we choose the R package brms which provides a convenient, simple interface for Stan. Some alternatives also exist:
Although the package brms depends on rstan as a default backend, there is another software tool worth mentioning: cmdstanr.
Compared to rstan, cmdstanr benefits from more frequent updates and some performance advantages, yet it requires extra installation steps.
In the brm(...)
function introduced below, you can switch these two with the backend = ...
keyword.
We load a dataset on the number of ant species in New England (USA). Type ?ants
into the console for more info.
# Load data
data(ants)
# Set column names to lower case
ants_df <- ants %>%
rename(sp_rich = Srich) %>%
rename_with(tolower)
# What does the data look like?
glimpse(ants_df)
## Rows: 44
## Columns: 5
## $ site <fct> TPB, HBC, CKB, SKP, CB, RP, PK, OB, SWR, ARC, BH, QP, HAW, W…
## $ sp_rich <int> 6, 16, 18, 17, 9, 15, 7, 12, 14, 9, 10, 10, 4, 5, 7, 7, 4, 6…
## $ habitat <fct> Forest, Forest, Forest, Forest, Forest, Forest, Forest, Fore…
## $ latitude <dbl> 41.97, 42.00, 42.03, 42.05, 42.05, 42.17, 42.19, 42.23, 42.2…
## $ elevation <int> 389, 8, 152, 1, 210, 78, 47, 491, 121, 95, 274, 335, 543, 32…
We have a number of sites (site
) where the number of ant species has been counted (sp_rich
). Some variables are present: habitat
, latitude
and elevation
. Let’s look at some summary statistics. Two counts were made at each site: one in a bog and one in a forest.
# Summary statistics dataset
summary(ants_df)
## site sp_rich habitat latitude elevation
## ARC : 2 Min. : 2.000 Bog :22 Min. :41.97 Min. : 1.0
## BH : 2 1st Qu.: 4.000 Forest:22 1st Qu.:42.17 1st Qu.: 95.0
## CAR : 2 Median : 6.000 Median :42.56 Median :223.0
## CB : 2 Mean : 7.023 Mean :43.02 Mean :232.7
## CHI : 2 3rd Qu.: 8.250 3rd Qu.:44.29 3rd Qu.:353.0
## CKB : 2 Max. :18.000 Max. :44.95 Max. :543.0
## (Other):32
We visualise the data.
# Frequency of the number of species in each habitat
ants_df %>%
ggplot(aes(x = sp_rich)) +
geom_bar() +
scale_x_continuous(limits = c(0, NA)) +
facet_wrap(~habitat)
# Boxplots of the number of species per habitat
ants_df %>%
ggplot(aes(y = sp_rich, x = habitat)) +
geom_boxplot() +
geom_line(aes(colour = site, group = site), alpha = 0.5) +
geom_point(aes(colour = site, group = site)) +
scale_y_continuous(limits = c(0, NA))
# Scatter plot of the number of species, related to the latitude per habitat
ants_df %>%
ggplot(aes(y = sp_rich, x = latitude)) +
geom_point() +
geom_smooth(method = "loess", formula = "y ~ x", colour = "firebrick",
level = 0.9) +
facet_wrap(~habitat)
# Scatter plot of the number of species, related to the height per habitat
ants_df %>%
ggplot(aes(y = sp_rich, x = elevation)) +
geom_point() +
geom_smooth(method = "loess", formula = "y ~ x", colour = "firebrick",
level = 0.9) +
facet_wrap(~habitat)
As an exercise, we will create a model to compare the number of species between both habitats. From the data exploration, we already saw that the number of species seems to be higher in forests and that sites with a higher number in bogs often also have a higher number in forests.
brms
We want a model that allows us to investigate the difference in number of ant species between bog and forest habitats. Consider the response variable \(Y\), the number of ants, and \(X_{habitat}\), a dummy variable equal to 0 for bogs and 1 for forests. We assume that \(Y\) follows a Normal distribution (equivalent to a linear regression with categorical variable (= ANOVA)).
\[ Y \sim N(\mu, \sigma^2) \]
with
\[ \mu = \beta_0 + \beta_1X_{habitat} \]
So we need to estimate three parameters: \(\beta_0\), \(\beta_1\) and \(\sigma\). How do we specify this in brms?
First of all we decide which MCMC parameters we will use. Type ?brm
to see what the default settings are for these parameters. It is recommended to use multiple chains (nchains
) and run them in parallel on different cores of your computer (nparallel
). Since this is a relatively simple model, we don’t need many iterations and no thinning.
# Set MCMC parameters
nchains <- 3 # number of chains
niter <- 2000 # number of iterations (incl. warmup, see next)
warmup <- niter / 4 # number of initial samples to remove (= warmup)
nparallel <- nchains # number of cores for parallel computing
thinning <- 1 # thinning factor (here 1 = no thinning)
The model is fitted using the brm()
function. The syntax is very similar to functions used in frequentist statistics such as lm()
and glm()
. The brm()
function contains many arguments (see ?brm
). Useful arguments that we do not use here are, for example:
inits
to set the initial values of MCMC algorithm. If these are not specified, the function uses default values.prior
to provide prior distributions. If these are not specified, the function uses default (non-informative) priors. The argument sample_prior
can be used to sample from the prior so that you can visualise whether the used prior meets your expectations (also useful for prior predictive checks) .file
and file_refit
to save the model object after it has been fitted. If you run the code again and the model has already been saved, brm()
will simply load this model instead of refitting it.# Fit Normal model
fit_normal1 <- brm(
formula = sp_rich ~ habitat, # specify the model
family = gaussian(), # we use the Normal distribution
data = ants_df, # specify data
chains = nchains, # MCMC parameters
warmup = warmup,
iter = niter,
cores = nparallel,
thin = thinning,
seed = 123) # seed for reproducibility
Before we look at the results, we first check whether the model converges well.
There are several ways to check convergence of the MCMC algorithm for each parameter. The warmup samples are not taken into account. First and foremost, you have visual controls.
We can obtain the MCMC samples with the as_draws()
functions or visualise them at once via the bayesplot package that is compatible with brmsfit objects.
# Set colour palette for clear visualisation in bayesplot
color_scheme_set("mix-blue-red")
# Which parameters to look at?
parameters <- c("b_Intercept", "b_habitatForest", "sigma")
trace plot
Trace plots indicate how the sampling of the posterior distribution progresses over time. The idea is that each chain converges to the same distribution for each parameter (this is called mixing). If this plot looks like a fuzzy caterpillar, it means the convergence is good.
# Visualisation by extracting samples with as_draws_df()
as_draws_df(fit_normal1, variable = parameters) %>%
# long format
pivot_longer(cols = all_of(parameters), names_to = "parameter",
values_to = "value") %>%
# visualise with ggplot()
ggplot(aes(y = value, x = .iteration, colour = factor(.chain))) +
geom_line() +
facet_wrap(~parameter, nrow = 3, scales = "free")
# Visualise with bayesplot package
mcmc_trace(fit_normal1, pars = parameters)
running mean/quantile/… plot
In addition to a trace plot, we can also look at how certain statistics, such as the average or the quantiles, vary over time (= with increasing number of iterations). You want these statistics to stabilize after a number of iterations because then you know that you have used enough iterations.
# Code to calculate cumulative quantiles
# source: https://rdrr.io/cran/cumstats/src/R/cumquant.R
cumquant <- function(x, p) {
out <- sapply(seq_along(x),
function(k, z) quantile(z[1:k], probs = p[1], names = FALSE),
z = x)
return(out)
}
# Extract samples and calculate cumulative statistics
as_draws_df(fit_normal1, variable = parameters) %>%
mutate(across(all_of(parameters), ~cummean(.x),
.names = "{.col}_mean"),
across(all_of(parameters), ~cumquant(.x, 0.05),
.names = "{.col}_q05"),
across(all_of(parameters), ~cumquant(.x, 0.95),
.names = "{.col}_q95")) %>%
select(-all_of(parameters)) %>%
# long format for visualisation
pivot_longer(cols = starts_with(parameters), names_to = "name",
values_to = "value") %>%
# split columns names at the last underscore
tidyr::extract(name, into = c("parameter", "statistic"), "(.*)_([^_]+)$") %>%
ggplot(aes(y = value, x = .iteration, colour = factor(.chain))) +
geom_line(aes(linetype = statistic), linewidth = 0.8) +
facet_wrap(~parameter, nrow = 3, scales = "free")
posterior density plot
The shape of the posterior distributions can also be informative to check whether the chains have converged to the same distributions. If we see a bimodal distribution (a camel’s ridge with two peaks) for example, then there could also be something wrong with the specification of our model.
# Visualise density plot of the posterior for each of the chains separately
# via bayesplot package.
mcmc_dens_overlay(fit_normal1, pars = parameters)
It is even easier to use the plot()
function on the brmsfit object. This immediately shows the trace plots and the posterior density plots side by side for each parameter.
# Trace plots and posterior density for each parameter.
plot(fit_normal1)
autocorrelation plot
If two variables are correlated, it means that there is a relationship between them. This can be a positive or negative relationship. An example of positively correlated variables is people’s size and weight; the taller a person is, the more they will typically weigh. An example of negatively correlated variables is the amount of sunshine and the moisture content in the top layer of the soil.
Autocorrelation can be understood as correlation between elements in a series of values. A good example is the variation in the outside temperature over the day. Suppose you measure the temperature every ten minutes. The current temperature is very strongly correlated with the temperature half an hour ago, strongly correlated with the temperature 2 hours ago and weakly correlated with the temperature 8 hours ago. To discover the posterior, we would have to do a complete random walk through the parameter space where the location on iteration \(i\) is completely independent of the location on iteration \(i-1\). In an MCMC there will often be some autocorrelation and it is important to check whether this autocorrelation is not too large. The so-called lag indicates how much correlation there is between values that are 1, 2, 3 … iterations apart. Autocorrelation can, for example, be the result of a poorly chosen step size, a poorly chosen start location, or multimodality in the posterior. If the autocorrelation is high, the Markov chain will be inefficient, i.e. we will need many iterations to obtain a good approximation of the posterior distribution (see also effective sample size below).
If there is too much autocorrelation, thinning is often used whereby some iterations are removed from the chain. This will reduce the autocorrelation but also the number of posterior samples we have left.
# Visualisation of autocorrelation plots via bayesplot package
mcmc_acf(fit_normal1, pars = parameters)
In addition to visual checks, there are also diagnostic checks.
Gelman-Rubin diagnostics
Also called \(\hat{R}\) (R-hat) or potential scale reduction factor. One way to check whether a chain has converged is to compare its behaviour with other randomly initialized chains. The \(\hat{R}\) statistic takes the ratio of the average variance of samples within each chain to the variance of the pooled samples across all chains. If all chains converge to a common distribution, these ratios will equal 1. If the chains have not converged to a common distribution, the \(\hat{R}\) statistic will be greater than 1. We can extract the \(\hat{R}\) statistics via the rhat()
function and then visualise them if desired with the mcmc_rhat()
function of the bayesplot package.
# Get R-hats
rhats_fit_normal1 <- rhat(fit_normal1)[parameters]
print(rhats_fit_normal1)
## b_Intercept b_habitatForest sigma
## 0.999769 1.000314 1.000759
# Plot R-hats via bayesplot package
mcmc_rhat(rhats_fit_normal1) + yaxis_text(hjust = 1)
effective sample size
The effective sample size (\(n_{eff}\)) is an estimate of the number of independent draws from the posterior distribution of each parameter. Since the draws within a Markov chain are not independent when there is autocorrelation, the effective sample size is usually smaller than the total sample size, namely the number of iterations (\(N\)). The larger the ratio \(n_{eff}/N\) the better. The bayesplot package provides a neff_ratio()
extractor function. The mcmc_neff()
function can then be used to plot the ratios.
# Get ratio of effective sample size
ratios_fit_normal1 <- neff_ratio(fit_normal1)[parameters]
print(ratios_fit_normal1)
## b_Intercept b_habitatForest sigma
## 0.7153819 0.6962991 0.6844347
# Plot ratio via bayesplot package
mcmc_neff(ratios_fit_normal1) + yaxis_text(hjust = 1)
Other packages that may be useful for checking MCMC convergence are mcmcplots and coda. For these packages you must first convert the MCMC samples from your brms model to an mcmc object.
# example
install.packages("mcmcplots")
mcmcplots::mcmcplot(as.mcmc(fit_normal1, pars = parameters))
In general, we can conclude that MCMC convergence is good for all parameters in our Normal model. If this were not the case, the MCMC parameters may have to be chosen differently, for example by increasing the number of iterations. Now that we know that the parameters have been estimated correctly, we can check whether the model fits the data well.
The posterior predictive check (PPC) is a good way to check whether the model fits the data well. The idea behind the PPC is that, if a model fits well, the predicted values based on the model are very similar to the data we used to fit the model. To generate the predictions used for PPCs, we simulate several times from the posterior predictive distribution (the posterior distribution of the response variable). visualisation is possible with the bayesplot package. The thick line shows the data and the thin lines show different simulations based on the model.
# Visualise model fit via bayesplot package
pp_check(fit_normal1, type = "dens_overlay_grouped", ndraws = 100,
group = "habitat")
We see that the observed data and the predictions based on our model do not quite match. Our choice for the Normal distribution may have been wrong. Numbers are discrete, positive values, while the Normal distribution can predict continuous and both positive and negative values. The Poisson distribution is a discrete distribution that is always positive. It is often used to model numbers recorded over a given time interval, distance, area, volume…
Remark:
Note that MCMC convergence and model fit are two different things. If MCMC convergence is good, this does not necessarily mean that model fit is also good. MCMC convergence is to check whether the MCMC algorithm has been able to correctly approximate the distributions of the parameters. Model fit checks whether the model we have adopted (normality assumption, linear relationship average…) fits the data well.
We now assume that \(Y\) now follows a Poisson distribution
\[ Y \sim Pois(\lambda) \]
with
\[ \ln(\lambda) = \beta_0 + \beta_1X_{habitat} \]
So we need to estimate two parameters: \(\beta_0\) and \(\beta_1\) We use the same MCMC parameters as before. The only thing we need to adjust is the choice family = poisson()
.
# Fit Poisson model
fit_poisson1 <- brm(
formula = sp_rich ~ habitat, # specify the model
family = poisson(), # we use the Poisson distribution
data = ants_df, # specify the data
chains = nchains, # MCMC parameters
warmup = warmup,
iter = niter,
cores = nparallel,
thin = thinning,
seed = 123) # seed for reproducibility
Convergence looks good.
# Visualise MCMC convergence of the Poisson model
plot(fit_poisson1)
# Extract and visualise R-hat values
rhats_fit_poisson1 <- rhat(fit_poisson1)[c("b_Intercept", "b_habitatForest")]
mcmc_rhat(rhats_fit_poisson1) + yaxis_text(hjust = 1)
We see that all predictions are now positive, but the fit is still not perfect.
# Visualise model fit via bayesplot package
pp_check(fit_poisson1, type = "dens_overlay_grouped", ndraws = 100,
group = "habitat")
How can we further improve model fit?
We know that each site has been visited twice. Once in bog and once in forest. It is of course possible that the number of ants in the bog and forest of the same site will be correlated (site effect). Indeed, this is something we noticed during data exploration as well. Sites with higher numbers of species in bogs often also have higher numbers in forests and vice versa. We can correct this by adding a random intercept for each site ... + (1|site)
.
We assume that the number of species \(Y\) per site \(j\) (\(j = 1, ..., J\)) follows a Poisson distribution
\[ Y_j \sim Pois(\lambda_j) \]
such that
\[ \ln(\lambda_j) = \beta_0 + \beta_1X_{habitat} + b_{0,j} \] with
\[ b_0 \sim N(0, \sigma_b) \]
# Fit Poisson model with random intercept per site
fit_poisson2 <- brm(
formula = sp_rich ~ habitat + (1|site),
family = poisson(),
data = ants_df,
chains = nchains,
warmup = warmup,
iter = niter,
cores = nparallel,
thin = thinning,
seed = 123)#seed for reproducibility
Convergence looks good.
# Visualise MCMC convergence
plot(fit_poisson2)
# Extract and visualise R-hat values
parameters2 <- c("b_Intercept", "b_habitatForest", "sd_site__Intercept")
rhats_fit_poisson2 <- rhat(fit_poisson2)[parameters2]
mcmc_rhat(rhats_fit_poisson2) + yaxis_text(hjust = 1)
Model fit looks even better than before.
# Visualise model fit of the Poisson model with random intercept via
# bayesplot package
pp_check(fit_poisson2, type = "dens_overlay_grouped", ndraws = 100,
group = "habitat")
How can we objectively compare these models?
Based on the PPCs we can already see which model fits the data best. Furthermore, there are some functions that brms provides to compare different models. With the function add_criterion()
you can add model fit criteria to model objects. Type ?add_criterion()
to see which ones are available. See also https://mc-stan.org/loo/articles/online-only/faq.html
Cross-validation (CV) is a family of techniques that attempts to estimate how well a model would predict unknown data through predictions of the model fitted to the known data. You do not necessarily have to collect new data for this. You can split your own data into a test and training dataset. You fit the model to the training dataset and then use that model to estimate how well it can predict the data in the test dataset. With leave-one-out CV (LOOCV) you leave out one observation each time as test dataset and refit the model based on all other observations (= training dataset).
# Add leave-one-out model fit criterium to model objects
fit_normal1 <- add_criterion(fit_normal1,
criterion = c("loo"))
fit_poisson1 <- add_criterion(fit_poisson1,
criterion = c("loo"))
fit_poisson2 <- add_criterion(fit_poisson2,
criterion = c("loo"))
To determine the difference between the model’s prediction based on the training dataset with the omitted data, we must define a so-called utility or loss function. We use the ‘expected log pointwise predictive density’ (ELPD) here. For this tutorial, it is mainly important to understand that this is a measure of how well your Bayesian model predicts new data points. It is calculated by taking the log-likelihood of each data point from the posterior predictive distribution of your model and then averaging it. A higher ELPD indicates better model fit and predictive accuracy. For this tutorial, it is mainly important to understand that this is a measure of how well the model can predict unknown data (elpd_loo
and the standard error se_elpd_loo
). p_loo
is a measure of the complexity of the model. \(\text{looic} = -2*\text{elpd_loo}\) With the function loo_compare()
you can compare multiple models and the difference in ELPD is also calculated.
# Make comparison leave-one-out cross-validation and print
comp_loo <- loo_compare(fit_normal1, fit_poisson1, fit_poisson2,
criterion = "loo")
print(comp_loo, simplify = FALSE, digits = 3)
## elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic
## fit_poisson2 0.000 0.000 -106.077 3.896 11.399 1.569 212.154
## fit_poisson1 -13.685 5.728 -119.762 8.309 3.813 0.903 239.524
## fit_normal1 -15.874 3.597 -121.951 5.329 3.007 0.841 243.902
## se_looic
## fit_poisson2 7.792
## fit_poisson1 16.618
## fit_normal1 10.659
If you assume that this difference is normally distributed, you can calculate a confidence interval based on the uncertainty. We see that the second Poisson model scores best and that the other models score significantly lower.
# Calculate confidence interval to compare models with loocv
comp_loo %>%
as.data.frame() %>%
select(elpd_diff, se_diff) %>%
mutate(ll_diff = elpd_diff + qnorm(0.05) * se_diff,
ul_diff = elpd_diff + qnorm(0.95) * se_diff)
## elpd_diff se_diff ll_diff ul_diff
## fit_poisson2 0.00000 0.000000 0.00000 0.000000
## fit_poisson1 -13.68487 5.727656 -23.10603 -4.263713
## fit_normal1 -15.87376 3.596654 -21.78973 -9.957793
With K-fold cross-validation, the data is split into \(K\) groups. We will use \(K = 10\) groups (= folds) here. So instead of leaving out a single observation each time, as with leave-one-out CV, we will leave out one \(10^{th}\) of the data here. Via the arguments folds = "stratified"
and group = "habitat"
we ensure that the relative frequencies of habitat are preserved for each group. This technique will therefore be less precise than the previous one, but will be faster to calculate if you work with a lot of data.
# Add K-fold model fit criterium to model objects
fit_normal1 <- add_criterion(fit_normal1,
criterion = c("kfold"),
K = 10,
folds = "stratified",
group = "habitat")
fit_poisson1 <- add_criterion(fit_poisson1,
criterion = c("kfold"),
K = 10,
folds = "stratified",
group = "habitat")
fit_poisson2 <- add_criterion(fit_poisson2,
criterion = c("kfold"),
K = 10,
folds = "stratified",
group = "habitat")
We get the same statistics as before.
# Get k-fold cross-validation and print
comp_kfold <- loo_compare(fit_normal1, fit_poisson1, fit_poisson2,
criterion = "kfold")
print(comp_kfold, simplify = FALSE, digits = 3)
## elpd_diff se_diff elpd_kfold se_elpd_kfold p_kfold se_p_kfold
## fit_poisson2 0.000 0.000 -106.516 3.890 11.837 1.561
## fit_poisson1 -12.646 5.808 -119.162 8.227 3.212 1.025
## fit_normal1 -15.101 3.616 -121.617 5.148 2.672 0.717
The results are the same, but the differences are slightly smaller than before.
# Calculate confidence interval to compare models with K-fold CV
comp_kfold %>%
as.data.frame() %>%
select(elpd_diff, se_diff) %>%
mutate(ll_diff = elpd_diff + qnorm(0.05) * se_diff,
ul_diff = elpd_diff + qnorm(0.95) * se_diff)
## elpd_diff se_diff ll_diff ul_diff
## fit_poisson2 0.00000 0.000000 0.00000 0.000000
## fit_poisson1 -12.64595 5.807731 -22.19882 -3.093082
## fit_normal1 -15.10095 3.616047 -21.04882 -9.153085
The Widely Applicable Information Criterion (WAIC) does not use cross-validation but is a computational way to estimate the ELPD. How this happens exactly is beyond the purpose of this tutorial. It is yet another measure to apply model selection.
# Add WAIC model fit criterion to model objects
fit_normal1 <- add_criterion(fit_normal1,
criterion = c("waic"))
fit_poisson1 <- add_criterion(fit_poisson1,
criterion = c("waic"))
fit_poisson2 <- add_criterion(fit_poisson2,
criterion = c("waic"))
We get similar results as before.
# Get waic and print
comp_waic <- loo_compare(fit_normal1, fit_poisson1, fit_poisson2,
criterion = "waic")
print(comp_waic, simplify = FALSE, digits = 3)
## elpd_diff se_diff elpd_waic se_elpd_waic p_waic se_p_waic
## fit_poisson2 0.000 0.000 -104.809 3.678 10.131 1.344
## fit_poisson1 -14.934 5.851 -119.743 8.305 3.794 0.899
## fit_normal1 -17.104 3.620 -121.913 5.313 2.969 0.824
## waic se_waic
## fit_poisson2 209.618 7.355
## fit_poisson1 239.486 16.609
## fit_normal1 243.827 10.626
# Calculate confidence interval to compare models with WAIC
comp_waic %>%
as.data.frame() %>%
select(waic, elpd_diff, se_diff) %>%
mutate(ll_diff = elpd_diff + qnorm(0.05) * se_diff,
ul_diff = elpd_diff + qnorm(0.95) * se_diff)
## waic elpd_diff se_diff ll_diff ul_diff
## fit_poisson2 209.6176 0.00000 0.000000 0.00000 0.00000
## fit_poisson1 239.4857 -14.93407 5.850998 -24.55810 -5.31003
## fit_normal1 243.8265 -17.10447 3.620256 -23.05926 -11.14968
Both based on the PPC and the comparisons with different model selection criteria, we can conclude that the second Poisson model with random intercepts fits the data best. In principle, we could have expected this based on our own intuition and the design of the study, i.e. the use of the Poisson distribution to model numbers and the use of random intercepts to control for a hierarchical design (habitats nested within sites).
When we look at the model fit object, we see results that are similar to results we see when we fit a frequentist model. On the one hand we get an estimate of all parameters with their uncertainty, but on the other hand we see that this is clearly the output of a Bayesian model. We get information about the parameters we used for the MCMC algorithm, we get a 95% credible interval (CI) instead of a confidence interval and we also get the \(\hat{R}\) value for each parameter as discussed earlier.
# Look at the fit object of the Poisson model with random effects
fit_poisson2
## Family: poisson
## Links: mu = log
## Formula: sp_rich ~ habitat + (1 | site)
## Data: ants_df (Number of observations: 44)
## Draws: 3 chains, each with iter = 2000; warmup = 500; thin = 1;
## total post-warmup draws = 4500
##
## Multilevel Hyperparameters:
## ~site (Number of levels: 22)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.38 0.09 0.23 0.60 1.00 1733 2561
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 1.51 0.13 1.24 1.75 1.00 2452 2883
## habitatForest 0.64 0.12 0.41 0.87 1.00 5062 3193
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
A useful package for visualising the results of our final model is the tidybayes package. Through this package, you can work with the posterior distributions as you would work with any dataset through the tidyverse package.
With the function gather_draws()
you can take a certain number of samples from the posterior distributions of certain parameters and convert them into a long format table. You usually do not want to select all posterior samples because there are sometimes unnecessarily many. By specifying a ‘seed’ you ensure that these are the same samples every time you run the script again. You can then calculate certain summary statistics via the classic dplyr functions.
fit_poisson2 %>%
# gather 1000 posterior samples for 2 parameters in long format
gather_draws(b_Intercept, b_habitatForest, ndraws = 1000, seed = 123) %>%
# calculate summary statistics for each variable
group_by(.variable) %>%
summarise(min = min(.value),
q_05 = quantile(.value, probs = 0.05),
q_20 = quantile(.value, probs = 0.20),
mean = mean(.value),
median = median(.value),
q_80 = quantile(.value, probs = 0.80),
q_95 = quantile(.value, probs = 0.95),
max = max(.value))
## # A tibble: 2 × 9
## .variable min q_05 q_20 mean median q_80 q_95 max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 b_Intercept 1.06 1.29 1.41 1.52 1.52 1.62 1.72 1.95
## 2 b_habitatForest 0.275 0.454 0.544 0.641 0.639 0.742 0.827 1.07
Useful functions of the tidybayes package are also median_qi()
, mean_qi()
… after gather_draws()
which you can use instead of group_by()
and summarise()
.
We would now like to visualise the estimated number of species per habitat type with associated uncertainty. With the function spread_draws()
you can take a certain number of samples from the posterior distribution and convert them into a wide format table. The average number of species in bogs according to our model is \(\exp(\beta_0)\) and in forests \(\exp(\beta_0+\beta_1)\). We show the posterior distributions with the posterior median and 60 and 90% credible intervals.
fit_poisson2 %>%
# spread 1000 posterior samples for 2 parameters in wide format
spread_draws(b_Intercept, b_habitatForest, ndraws = 1000, seed = 123) %>%
# calculate average numbers and convert to long format for visualisation
mutate(bog = exp(b_Intercept),
forest = exp(b_Intercept + b_habitatForest)) %>%
pivot_longer(cols = c("bog", "forest"), names_to = "habitat",
values_to = "sp_rich") %>%
# visualise via ggplot()
ggplot(aes(y = sp_rich, x = habitat)) +
stat_eye(point_interval = "median_qi", .width = c(0.6, 0.9)) +
scale_y_continuous(limits = c(0, NA))
In addition to stat_eye()
you will find here some nice ways to visualise posterior distributions .
We see a clear difference in the number of species between the two habitats. Is there a significant difference between the number of species in bogs and forests? We test the hypothesis that numbers are equal in bogs and forests.
\[ \exp(\beta_0) = \exp(\beta_0+\beta_1)\\ \Rightarrow \beta_0 = \beta_0 + \beta_1\\ \Rightarrow \beta_1 = 0\\ \]
This can easily be done via the hypothesis()
function of the brms package.
The argument alpha
specifies the size of the credible interval.
This allows hypothesis testing in a similar way to the frequentist null hypothesis testing framework.
# Test hypothesis difference between habitats
hyp <- hypothesis(fit_poisson2, "habitatForest = 0", alpha = 0.1)
hyp
## Hypothesis Tests for class b:
## Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob
## 1 (habitatForest) = 0 0.64 0.12 0.45 0.83 NA NA
## Star
## 1 *
## ---
## 'CI': 80%-CI for one-sided and 90%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 90%;
## for two-sided hypotheses, the value tested against lies outside the 90%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
# Plot posterior distribution hypothesis
plot(hyp)
We can conclude that there is a significant difference since 0 is not included in the 90% credible interval.
Finally, we visualise the random effects of the sites. We sort them from high to low species richness.
# Take the mean of SD of random effects
# to add to figure later
sd_mean <- fit_poisson2 %>%
spread_draws(sd_site__Intercept, ndraws = 1000, seed = 123) %>%
summarise(mean_sd = mean(sd_site__Intercept)) %>%
pull()
# Take random effects and plot
fit_poisson2 %>%
spread_draws(r_site[site,], ndraws = 1000, seed = 123) %>%
ungroup() %>%
mutate(site = reorder(site, r_site)) %>%
ggplot(aes(x = r_site, y = site)) +
geom_vline(xintercept = 0, color = "darkgrey", linewidth = 1) +
geom_vline(xintercept = c(sd_mean * qnorm(0.05), sd_mean * qnorm(0.95)),
color = "darkgrey", linetype = 2) +
stat_halfeye(point_interval = "median_qi", .width = 0.9, size = 2/3,
fill = "cornflowerblue")
Let’s go back to our very first model where we used the Normal distribution. This was equivalent to a linear regression with categorical variable. A linear regression with categorical variable is also called ANOVA and if there are only two groups, an ANOVA is equivalent to a t-test. We can therefore take the opportunity to compare the results of our first model (a Bayesian model) with the results of a classical (frequentist) t-test.
# Extract summary statistics from the Bayesian model
sum_fit_normal1 <- summary(fit_normal1, prob = 0.9)
diff_bog1 <- sum_fit_normal1$fixed$Estimate[2]
ll_diff_bog1 <- sum_fit_normal1$fixed$`l-90% CI`[2]
ul_diff_bog1 <- sum_fit_normal1$fixed$`u-90% CI`[2]
sum_fit_normal1
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: sp_rich ~ habitat
## Data: ants_df (Number of observations: 44)
## Draws: 3 chains, each with iter = 2000; warmup = 500; thin = 1;
## total post-warmup draws = 4500
##
## Regression Coefficients:
## Estimate Est.Error l-90% CI u-90% CI Rhat Bulk_ESS Tail_ESS
## Intercept 4.80 0.81 3.46 6.11 1.00 4196 3219
## habitatForest 4.33 1.14 2.42 6.15 1.00 4502 3133
##
## Further Distributional Parameters:
## Estimate Est.Error l-90% CI u-90% CI Rhat Bulk_ESS Tail_ESS
## sigma 3.73 0.41 3.12 4.46 1.00 3557 3080
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
# Perform t-test and extract summary statistics
t_test_normal1 <- t.test(sp_rich ~ habitat, data = ants_df, conf.level = 0.9)
diff_bog2 <- t_test_normal1$estimate[2] - t_test_normal1$estimate[1]
ll_diff_bog2 <- -t_test_normal1$conf.int[2]
ul_diff_bog2 <- -t_test_normal1$conf.int[1]
t_test_normal1
##
## Welch Two Sample t-test
##
## data: sp_rich by habitat
## t = -3.8881, df = 37.034, p-value = 0.0004043
## alternative hypothesis: true difference in means between group Bog and group Forest is not equal to 0
## 90 percent confidence interval:
## -6.191854 -2.444510
## sample estimates:
## mean in group Bog mean in group Forest
## 4.863636 9.181818
We see that this indeed produces almost exactly the same results. Our Bayesian model estimates that on average 4.329 more ant species occur in forests than in bogs (90% credible interval: 2.423 to 6.149). The t-test estimates that on average 4.318 more ant species occur in forests than in bogs (90% confidence interval: 2.445 to 6.192).
rstan
The brms
package is a convenience wrapper for the rstan
package, which in turn ports stan
functionality to R.
Stan is a modeling framework written in the C
programming language, which implements many probabilistic (“Bayesian”) modeling tools.
More info can be found on the Stan website.
The advantage of brms
is usability: many functions work out-of-the-box, with reasonable default values, and a syntax that is similar to what many statisticians are habituated to.
However, the relative ease-of-use comes at the cost of flexibility, and do some degree, readability.
In contrast, Stan and rstan
lean more to the mathematical formulation of models.
Every aspect of the model has to be explicitly set, which can be an advantage (e.g. if you face non-standard use cases), or disadvantage (e.g. if you specify models in non-optimal ways).
To briefly give an impression, we will build the same models as above, using the Stan framework.
library("rstan")
conflicted::conflicts_prefer(rstan::extract)
conflicted::conflicts_prefer(brms::loo)
RMarkdown can handle stan
code chunks, though more general model definition is outsourced to a separate “*.stan” file (e.g. here).
Alternatively, you can define your model in a big text block, as shown below.
The simple poisson model resembles one of the stan
-dard examples, which you can refer to for all further details and more.
stan_poisson_model_code = '
data {
int<lower=1> N; // sample size
vector[N] habitat_obs; // habitat
array[N] int<lower=0> sp_rich; // outcome variable: number of ants
}
parameters {
real intercept_rich; // intercept
real habitat_effect; // slope
}
model {
// priors
intercept_rich ~ normal(0, 1);
habitat_effect ~ normal(0, 1);
// "posterior", "likelihood", ... give it a name!
sp_rich ~ poisson_log(intercept_rich + habitat_effect * habitat_obs);
}
'
Take this as a “look behind the scenes”!
You have to explicitly define the model structure, priors, even the data types of input variables.
Yet note how even Stan is not without convenience functions: we can use the poisson_log
posterior to model log rates.
When working outside RStudio/RMarkdown, you might prefer loading the model from a file:
stan_poisson_model <- stan_model(
# file = here("code", "brms_modeling", "poisson_model.stan"),
model_code = stan_poisson_model_code,
model_name = "stan poisson model"
)
Sampling does pretty much the same as above, since at the core, brms
is just stan
.
stan_poisson_fit <- sampling(
stan_poisson_model,
list(
N = nrow(ants_df),
habitat_obs = as.integer(ants_df$habitat)-1,
sp_rich = ants_df$sp_rich
),
iter = niter,
chains = nchains,
cores = nparallel
)
stan_poisson_fit
A convenient way of quickly inspecting model fits is shinystan
.
It opens a browser with shiny plots.
library("shinystan")
launch_shinystan(stan_poisson_fit)
Behold: model outcome is exactly as above.
In the present case, the benefit from turning to stan
is very limited.
In other cases, it might pay off.
Know that Stan is there for you, do not hesitate to turn to its extensive documentation, and do not fear to give it a try!
To take your modeling skills even further, you may implement and sample the “random intercept” model. In “Bayesian” terms, the general terminology is “hierarchical” model.
Below is the code which considers site-specific intercepts.
It works slightly different from the +(1|site)
approach above: the upper one is an “offset” parametrization, whereas this time we use a hyperprior.
These are two common strategies often worth interchanging, with marginal or substantial effects on sampler performance.
data {
int<lower=1> N; // sample size
int<lower=1> L; // number of sites
vector[N] habitat_obs; // habitat
array[N] int<lower=1, upper=L> site_obs; // site
array[N] int<lower=0> sp_rich; // outcome variable: number of ants
}
parameters {
real intercept_rich; // "global" intercept
vector[L] site_effect; // the sr intercept at each site
real<lower=0> site_variation; // variation on site level
real habitat_effect; // habitat slope
}
model {
// priors
intercept_rich ~ normal(0, 1);
site_variation ~ cauchy(0, 1);
habitat_effect ~ normal(0, 1);
// site-wise intercept
for (i in 1:L) {
site_effect[i] ~ normal(intercept_rich, site_variation);
}
// "posterior", "likelihood", ... give it a name!
sp_rich ~ poisson_log(site_effect[site_obs] + habitat_effect * habitat_obs);
}
stan_poisson_fit <- sampling(
stan_poisson_model,
list(
N = nrow(ants_df),
L = length(levels(ants_df$site)),
habitat_obs = as.integer(ants_df$habitat)-1,
site_obs = as.integer(ants_df$site),
sp_rich = ants_df$sp_rich
),
iter = niter,
chains = nchains,
cores = nparallel
)
stan_poisson_fit
With Stan, po(i)ssibilities are almost endless - don’t get lost in model building!