- Multilevel models from a frequentist perspective
- The Bayesian approach
- Markov Chain Monte Carlo (MCMC)
- Examples

GDR Vision - Paris 4/10/2018


- The dependent variable \(y\) is modeled as a weighted combination of the independent variables, plus an additive error \(\epsilon\) \[ y_i=\beta_0 + \beta_1x_{1i} + \ldots +\beta_nx_{ni} + \epsilon_i \\ \epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right) \]

- The dependent variable \(y\) is modeled as a weighted combination of the independent variables, plus an additive error \(\epsilon\) \[ \textbf{Y} = \textbf{X}\beta + \epsilon \\ \epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right) \]

Matrix notation:

\[ \textbf{Y} = \textbf{X}\beta + \epsilon \] \[ \left( \begin{array}{c} y_1 \\ \vdots \\ y_m \end{array} \right) = \left( \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1n}\\ \vdots & \vdots & \vdots & \vdots\\ 1 & x_{m1} & \ldots & x_{mn} \end{array} \right) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_n \end{array} \right) + \left( \begin{array}{c} \epsilon_1 \\ \vdots \\ \epsilon_m \end{array} \right) \]

Matrix multiplication: \[ \left( \begin{array}{cc} a & b \\ c& d \end{array} \right) \left( \begin{array}{c} 1 \\ 2 \end{array} \right) = 1 \left( \begin{array}{c} a \\ c \end{array} \right) + 2 \left( \begin{array}{c} b \\ d \end{array} \right) = \left( \begin{array}{c} a + 2b \\ c+ 2d \end{array} \right) \]
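To make the matrix form concrete, here is a minimal R sketch (with made-up data and coefficients) that builds the design matrix \(\textbf{X}\) with `model.matrix()`, generates \(\textbf{Y} = \textbf{X}\beta + \epsilon\), and recovers \(\beta\) with `lm()`:

``` r
# Made-up data: two predictors, 5 observations
d <- data.frame(x1 = c(0.5, 1.2, 2.1, 3.3, 4.0),
                x2 = c(1, 0, 1, 0, 1))
beta <- c(2, 3, -1)  # "true" beta_0, beta_1, beta_2 (invented)

X <- model.matrix(~ 1 + x1 + x2, data = d)  # first column is the intercept
set.seed(1)
d$y <- as.vector(X %*% beta) + rnorm(5, sd = 0.1)  # Y = X beta + epsilon

coef(lm(y ~ 1 + x1 + x2, data = d))  # estimates close to beta
```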

- 'Classical' linear models are *fixed-effects* only:
  - the independent variables are experimentally manipulated (they are not random);
  - the only random source of variation is the residual error \(\epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)\).

- However, *observations* (e.g. trials) are often grouped by *observational cluster* (e.g. subjects): random samples from a larger population, on which we'd like to make inferences.
- In order to properly generalize from the sample to the population, these variables should be treated as *random effects*. Mixed-effects models allow this by explicitly modelling the population distribution.

- A model containing both fixed and random effects is usually called a *mixed-effects* model (but also: hierarchical, multilevel, etc.).
- Random effects are treated as random variations around a population mean. These random variations are usually assumed to have a Gaussian distribution.
- A simple example: a *random-intercept* model. Regressions have the same slope in each of the \(J\) groups (\(j=1,\dots,J\)), but random variations in intercept \[ y_{ij} = \beta_0 + b_j + \beta_1x_{ij} + \epsilon_{ij}\\ \epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)\\ b \sim \mathcal{N}\left( 0, \sigma_b^2 \right) \]
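To make this concrete, here is a minimal simulation sketch of a random-intercept model (all numbers invented for illustration): each subject gets its own intercept offset \(b_j\), drawn from \(\mathcal{N}(0, \sigma_b^2)\).

``` r
set.seed(42)
J <- 10; n <- 20             # 10 subjects, 20 trials each (made up)
beta0 <- 250; beta1 <- 10    # fixed effects
sigma_b <- 25; sigma <- 30   # between-subject and residual SDs

b <- rnorm(J, mean = 0, sd = sigma_b)  # one intercept offset per subject
d <- expand.grid(trial = 1:n, subj = 1:J)
d$x <- runif(nrow(d), 0, 9)
d$y <- beta0 + b[d$subj] + beta1 * d$x + rnorm(nrow(d), sd = sigma)
```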

General formulation in matrix notation \[ \textbf{Y} = \textbf{X}\beta + \textbf{Z}b + \epsilon \] where \(\textbf{X}\) and \(\textbf{Z}\) are the known fixed-effects and random-effects regressor matrices.

The components of the residual error vector \(\epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)\) are assumed to be *i.i.d.* (independent and identically distributed).

The random-effect components \(b \sim \mathcal{N}\left( 0, \Omega \right)\) are assumed to be normally distributed with mean 0; however, they are not necessarily independent (the components \(b_j\) can be correlated, and the correlations can be estimated). Example: \[ \left[ \begin{array}{c} b_0 \\ b_1 \end{array} \right] \sim \mathcal{N} \left( \left[ \begin{array}{c} 0 \\ 0 \end{array} \right], \Omega = \left[ \begin{array}{cc} \operatorname{Var}\left( b_0 \right) & \operatorname{cov}\left( b_0, b_1 \right)\\ \operatorname{cov}\left( b_0, b_1 \right) & \operatorname{Var}\left( b_1 \right) \end{array} \right] \right) \]

Parameters (\(\beta\), \(\sigma^2\) and \(\Omega\)) are estimated by maximizing the likelihood function, which is the probability of the data, given the parameters (but treated as a function of the parameters, keeping the data fixed).

The probability of the data *conditional on the random effects* is integrated with respect to the marginal density of the random effects, to obtain the marginal density of the data \[ L\left(\beta,\sigma^2,\Omega \mid \text{data}\right) = \int p \left(\text{data} \mid \beta,\sigma^2, b\right) \, p\left(b \mid \Omega\right) \, db \]
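To see what this integral does, here is a toy numerical sketch in R (all numbers made up): for a single cluster with a random intercept, the marginal likelihood averages the conditional likelihood over the population distribution of the random effect.

``` r
# Toy data: 3 observations from one cluster (made-up numbers)
y     <- c(1.2, 0.8, 1.5)
beta0 <- 1; sigma <- 0.5; sigma_b <- 1  # assumed parameter values

# p(y | b) * p(b | sigma_b), vectorized over b for integrate()
integrand <- function(b) {
  sapply(b, function(bj) {
    prod(dnorm(y, mean = beta0 + bj, sd = sigma)) * dnorm(bj, 0, sigma_b)
  })
}

# Marginal likelihood of this cluster: the random effect is integrated out
integrate(integrand, lower = -Inf, upper = Inf)$value
```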

The \(\textsf{R}\) library `lme4` provides two methods for estimating parameters: Maximum Likelihood (ML) and Restricted Maximum Likelihood (REML). ML estimates tend to be biased and underestimate the variance components (e.g. \(\Omega\)).

REML provides less biased variance estimates: conceptually, it can be seen as analogous to Bessel's correction for the sample variance (using \(n-1\) instead of \(n\) in the denominator).
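In `lme4` the choice is controlled by the `REML` argument of `lmer()` (default `TRUE`). A quick way to see the difference, sketched here with the `sleepstudy` data introduced below, is to fit the same model both ways and compare the variance components:

``` r
library(lme4)

m_reml <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, REML = TRUE)
m_ml   <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, REML = FALSE)

# The ML variance components come out smaller than the REML ones
VarCorr(m_reml)
VarCorr(m_ml)
```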

- Improved estimates for *repeated sampling* (e.g. repeated-measures designs).
- Improved estimates for *imbalances in sampling* (e.g. unequal numbers of trials across subjects).
- Avoid averaging (pre-averaging of data removes variation and can manufacture false confidence).
- Subject-specific standard errors are taken into account in group-level estimates.
- Variation among groups or individuals is modelled explicitly.
- Outperform classical methods in predictive ability.

`sleepstudy` is a dataset in the `lme4` package, with reaction-time data from 18 subjects who were restricted to 3 hours of sleep per night for 10 days.

Questions:

- how do reaction times change with each sleep-deprived night?
- are individual differences in baseline response times related to individual differences in the effect of sleep deprivation?

``` r
str(sleepstudy)
```

```
'data.frame':   180 obs. of  3 variables:
 $ Reaction: num  250 259 251 321 357 ...
 $ Days    : num  0 1 2 3 4 5 6 7 8 9 ...
 $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1 1 1 1 ...
```

- The model we want to fit includes both random intercepts and random slopes \[ y_{ij} = \beta_0 + b_{0j} + \left( \beta_1 + b_{1j} \right) \times {\rm{Days}}_{ij} + \epsilon_{ij} \\ \epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right) \\ b\sim\mathcal{N} \left( 0, \Omega\right) \]

- In `lme4`:

``` r
library(lme4)
m.1 <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject), sleepstudy)
```

``` r
summary(m.1)
```

```
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ 1 + Days + (1 + Days | Subject)
   Data: sleepstudy

REML criterion at convergence: 1743.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9536 -0.4634  0.0231  0.4634  5.1793 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 612.09   24.740       
          Days         35.07    5.922   0.07
 Residual             654.94   25.592       
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept)  251.405      6.825   36.84
Days          10.467      1.546    6.77

Correlation of Fixed Effects:
     (Intr)
Days -0.138
```

The function `confint()` makes it easy to compute bootstrapped 95% CIs of the parameters

``` r
CI_fixef <- confint(m.1, method = "boot", nsim = 500, oldNames = FALSE)
```

``` r
print(CI_fixef, digits = 2)
```

```
                              2.5 % 97.5 %
sd_(Intercept)|Subject        12.81  35.63
cor_Days.(Intercept)|Subject  -0.53   0.81
sd_Days|Subject                3.57   8.32
sigma                         22.61  28.44
(Intercept)                  237.02 264.76
Days                           7.54  13.20
```

- Tests of fixed and random effects can also be conducted using likelihood ratio tests, by comparing the likelihoods of two *nested* models.
- If \(L_1\) and \(L_2\) are the maximised likelihoods of two nested models with \(k_1 < k_2\) parameters, the test statistic is \(2\log\left( L_2/ L_1 \right)\), which is approximately \(\chi^2\) with \(k_2 - k_1\) degrees of freedom.
- Example: test whether there is substantial variation across subjects in the effect of `Days`
``` r
# m.1 <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject),
#             sleepstudy)
m.2 <- lmer(Reaction ~ 1 + Days + (1 | Subject), sleepstudy)
anova(m.1, m.2)
```

```
refitting model(s) with ML (instead of REML)
Data: sleepstudy
Models:
m.2: Reaction ~ 1 + Days + (1 | Subject)
m.1: Reaction ~ 1 + Days + (1 + Days | Subject)
    Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
m.2  4 1802.1 1814.8 -897.04   1794.1                             
m.1  6 1763.9 1783.1 -875.97   1751.9 42.139      2  7.072e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

*Shrinkage*: the *predicted* \({\hat b}_j\) (conditional modes of the random effects) can be seen as a "compromise" between the within-subject estimates and the population mean.
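To see the shrinkage, a minimal sketch (assuming the model `m.1` above) compares per-subject least-squares fits with the subject-level estimates from the mixed model; the latter are pulled towards the population mean:

``` r
# Per-subject OLS estimates (no pooling)
ols <- coef(lme4::lmList(Reaction ~ Days | Subject, sleepstudy))

# Subject-level estimates from the mixed model (partial pooling)
blup <- coef(m.1)$Subject

# The mixed-model intercepts lie between the OLS fits and the fixed effect
head(cbind(ols_int = ols[, 1], blup_int = blup[, 1]))
fixef(m.1)
```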

- So far, we've seen multilevel models from a **frequentist** point of view:
  - parameters are seen as unknown *fixed* quantities.
  - the goal is to find best guesses (point estimates) of the parameters.
  - uncertainty in the parameters' values is not treated probabilistically.
  - uncertainty in the estimates is characterized in relation to the data-generating process (think of the highly un-intuitive definition of a confidence interval).
  - probability is interpreted as the expected *long-run frequency* of an event in an (imaginary) very large sample.

- What is different in the Bayesian approach?
  - probability is interpreted as the *degree of belief* in the occurrence of an event, or in a proposition.
  - parameters are seen as unknown random variables, with uncertainty in their values represented by a *prior* probability distribution.
  - the goal is to update the prior distribution on the basis of the available data, leading to a *posterior* probability distribution.
  - this is done with Bayes' theorem; however, the Bayesian approach *is not* distinguished by Bayes' theorem (which is just a trivial implication of probability theory). *"The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical analysis"* (Gelman et al. 2013).

- In a typical Bayesian setting we have a likelihood \(p\left(\text{data} \mid \theta\right)\), conditional on the parameter \(\theta\), and a prior distribution \(p\left(\theta\right)\). The posterior distribution for the parameter is \[ p\left(\theta \mid \text{data}\right) = \frac{p\left(\text{data} \mid \theta\right) p\left(\theta\right)}{p\left( \text{data} \right)} = \frac{p\left(\text{data} \mid \theta\right)p\left(\theta\right)}{\int p\left( \text{data} \mid \theta \right) p\left(\theta\right) d\theta}\\ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{average likelihood}} \]

- Since the denominator (the average likelihood) does not depend on \(\theta\), we can work with the un-normalized posterior \[ p\left(\theta \mid \text{data}\right) \propto p\left(\text{data} \mid \theta\right)p\left(\theta\right)\\ \text{posterior} \propto \text{likelihood} \times \text{prior} \]
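As a concrete illustration (a made-up coin-flip example, not from the models above), the posterior can be computed on a grid by multiplying likelihood and prior pointwise and normalizing:

``` r
# Made-up example: 7 successes out of 10 Bernoulli trials
theta <- seq(0, 1, length.out = 1000)        # grid of parameter values
prior <- dbeta(theta, 2, 2)                  # a weakly informative prior
lik   <- dbinom(7, size = 10, prob = theta)  # likelihood at each grid point

post <- lik * prior                          # un-normalized posterior
post <- post / sum(post * diff(theta)[1])    # normalize (approximates the integral)

theta[which.max(post)]                       # posterior mode, ~0.67
```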

- In the case of a multilevel model, each cluster \(j\) has a parameter \(\theta_j\), which is an independent sample from a population distribution governed by some *hyperparameter* \(\phi\).
- \(\phi\) is also not known, and thus has its own prior distribution \(p\left(\phi\right)\) (the *hyperprior*).
- We have a joint prior distribution: \(p\left(\phi,\theta\right)=p\left(\phi\right) \prod_{j=1}^{J} p\left(\theta_j \mid \phi\right)\)
- and a joint posterior distribution: \[p\left(\phi,\theta \mid \text{data}\right) \propto \underbrace{p\left(\phi\right) \prod_{j=1}^{J} p\left(\theta_j \mid \phi\right)}_\text{prior} \times \underbrace{ p\left(\text{data} \mid \theta\right)}_\text{likelihood} \]

- The information we want about a parameter \(\phi\) is contained in its marginal posterior distribution \(p\left(\phi \mid \text{data} \right)=\int p\left(\phi,\theta \mid \text{data} \right) d\theta\).
- The calculation can be difficult once a model has many parameters (to describe the uncertainty in a parameter of interest one must average, or *marginalise*, over the uncertainty in all other parameters).
- Monte Carlo techniques and computer simulations (Markov chain Monte Carlo sampling, MCMC) can be used to draw samples from the posterior distribution.
- Once you have samples from the posterior distribution, summarizing any marginal posterior becomes simply a matter of counting values.

Say you want to sample from a density \(P({\bf x})\), but you can only evaluate a function \(f({\bf x})\) that is proportional to the density \(P({\bf x})\). The algorithm works as follows (a minimal implementation in R is sketched after this list):

- Initialization:
  - choose an arbitrary starting point \({\bf x}_0\);
  - choose a probability density for generating proposals, \(g({\bf x}_{n+1} \mid {\bf x}_n)\) (the *proposal density*).
- For each iteration \(i\):
  - generate a candidate \({\bf x'}\) by sampling from \(g({\bf x}_{i+1} \mid {\bf x}_{i})\);
  - calculate the *acceptance ratio* \(a = \frac{f({\bf x}')}{f({\bf x}_{i})}\);
  - *accept/reject*: generate a uniform random number \(u\) in \([0,1]\);
    - if \(u \le a\), accept and set \({\bf x}_{i+1}={\bf x'}\) (*"move to the new value"*);
    - if \(u > a\), reject and set \({\bf x}_{i+1}={\bf x}_{i}\) (*"stay where you are"*).
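Here is a minimal sketch of this sampler in R (my own illustration, assuming a standard normal target and a symmetric Gaussian proposal):

``` r
# Target: any function proportional to the density we want to sample from
f <- function(x) exp(-x^2 / 2)  # proportional to a standard normal

metropolis <- function(f, x0, n_iter = 5000, proposal_sd = 1) {
  x <- numeric(n_iter)
  x[1] <- x0
  for (i in 1:(n_iter - 1)) {
    x_prop <- rnorm(1, mean = x[i], sd = proposal_sd)  # symmetric proposal
    a <- f(x_prop) / f(x[i])                           # acceptance ratio
    if (runif(1) <= a) {
      x[i + 1] <- x_prop  # accept: "move to the new value"
    } else {
      x[i + 1] <- x[i]    # reject: "stay where you are"
    }
  }
  x
}

samples <- metropolis(f, x0 = 0)
hist(samples, breaks = 50)  # approximates a standard normal
```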

Let's use the sampling algorithm to infer parameter values for one participant of the `sleepstudy` dataset.

- To sample from the posterior distribution of the parameters, it is sufficient to compute the un-normalized posterior: \(\text{prior} \times \text{likelihood}\).

First, write a function that computes the (log-)likelihood, keeping the data fixed.

``` r
x <- sleepstudy$Days[sleepstudy$Subject == "308"]
y <- sleepstudy$Reaction[sleepstudy$Subject == "308"]

loglik <- function(par) {
  # par = [intercept, slope, log(residual SD)]
  pred <- par[1] + par[2] * x
  sum(dnorm(y, mean = pred, sd = exp(par[3]), log = TRUE))
}
```

Next, choose your priors.
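For illustration, here is one possible choice of vague priors (these particular distributions and values are assumptions for the sketch, not prescribed above), combined with `loglik` into the un-normalized log-posterior the sampler needs:

``` r
# One possible choice of vague priors (assumed here for illustration)
logprior <- function(par) {
  dnorm(par[1], mean = 250, sd = 100, log = TRUE) +  # intercept
  dnorm(par[2], mean = 0,   sd = 50,  log = TRUE) +  # slope
  dnorm(par[3], mean = 3,   sd = 2,   log = TRUE)    # log(residual SD)
}

# Un-normalized log-posterior: log(prior) + log(likelihood)
logpost <- function(par) logprior(par) + loglik(par)
```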