Gaussian Processes

Youngsoo Baek

Mar 05, 2026

Recalling the longitudinal analysis

Is Linear Fit the Right Way?

There are various ways to fit a model to time-varying behavior of the curves.

A brief agenda

By the end of this lecture you should:

  1. Understand the basic concepts of Gaussian processes (GP) and their usefulness in statistical modeling.

  2. Compute posterior and predictive inference for a simple GP model.

  3. Use GP as a building block in modeling time-correlated data.

Motivation

Let \(i\) index distinct eyes (\(i=1,2,\ldots,n\)) and \(t\) index within-eye data in time (\(t=1,2,\ldots,n_i\)).

Suppose we are interested in effects of time-invariant predictors (age and iop):

\[Y_{it} = \underbrace{\mathbf{x}_i\boldsymbol{\beta}}_{\text{Fixed in time}} + \underbrace{\eta_{it}}_{\text{Varying in time}} + \epsilon_{it}\]

In a previous lecture, we included only time (\(X_{it}\)) and fit a random slope regression.

\[\eta_{it} = \theta_{0i} + \underbrace{X_{it}}_{=Time}\theta_{1i}.\]

What if we want a nonlinear model for time effects?

Specifying basis functions

Linear increase/decrease in time is too limited! What if their visual loss recovers, or fluctuates?

\[\eta_{it} = \theta_{0i} + \sum_{k=1}^K X_{it}^{\color{red} k}\theta_{1i}^{(k)}.\]

Bypassing the choice of bases

Choosing the “right” type/number of bases is difficult (polynomials are usually not great choices).

In a continuous time, \(\eta_{it} = \underbrace{\eta_i}_{i-\text{th function}}(t)\) is useful:

  • Different people/groups have observations spanning different grids.

  • Makes easier the formulation of a prediction task: new time \(t^{pred}\) not in your dataset?

Can we introduce a probabilistic model for a whole collection of random functions \((\eta_i)_{i=1}^{n}\)? How can we place a “prior” on random effects over all times?

What do I want the function to look like?

Denote by \(f\) a random function with mean zero. A distribution of a random function is determined by knowing every possible finite-dimensional marginals:

\[\mathbf{f} = \begin{pmatrix} f(t_{1})\\ \vdots \\ f(t_{m}) \end{pmatrix} {\sim} N_{m}(\mathbf{0},\boldsymbol{\Sigma}(t_1,\ldots,t_m)),\]

where the \(st\)-th entry of the matrix is \(= \mathbb{C}[f(s), f(t)]\) for a given covariance function \(\mathbb{C}\).

We call such random function a Gaussian process.

What are my prior assumptions about the covariance?

If I write that covariance as \(C(t,s)\), then we must decide what the function \(C\) looks like.

  1. Hypothesis 1 : The marginal variability should stay the same.

  2. Hypothesis 2 : Time correlation should decay in, and only depend on, the amount of time elapsed.

  3. Hypothesis 3 : We have expectations about the span of correlation and smoothness of the process.

These are all a priori hypotheses. The data may come from something very different!

Stationary Gaussian processes

Based on our considerations, let’s choose \(C\) that only depends on the lag between two time points.

\[ \mathbb{C}(f_s, f_t) = \sigma^2 C(|s-t|). \]

We want \(C(0) = 1\) and \(C(h)\to 0\) as \(h = |t-s|\to\infty\).

Different choices of \(C\) yields sampled functions with varying smoothness.

  • Exponential kernel: \(C(h) = \sigma^2\exp(-h/\rho)\)

  • Square exponential kernel: \(C(h) = \sigma^2\exp\{-h^2/(2\rho^2)\}\)

  • Matérn kernels

Gaussian Process Covariance Functions in Stan

A list of available kernels is available here.

The squared exponential kernel (exponentiated quadratic kernel) is given by:

\[C(\mathbf{x}_i,\mathbf{x}_j) = \sigma^2 \exp\left(-\frac{|\mathbf{x}_i - \mathbf{x}_j|^2}{2\rho^2}\right)\]

matrix gp_exp_quad_cov(array[] real x, real sigma, real length_scale)

For us, \(\mathbf{x}\) is 1D (time)…but it does not have to be!

Prior Sampling in Stan

Once we have a GP model, sampling from the prior is equivalent to sampling from a multivariate normal.

data {
  int<lower=1> N;
  array[N] real x;
  real<lower=0> sigma;
  real<lower=0> l;
}
transformed data {
  // 0. A "small nugget" to stabilize matrix root computation
  // This is for.better numerical stability in taking large matrix roots
  real delta = 1e-9;

  // 1. Compute the squared exponential kernel matrix
  // x is the time variable
  vector[N] mu = rep_vector(0, N);
  matrix[N, N] R_C;
  matrix[N, N] C = gp_exp_quad_cov(x, sigma, l);
  for (i in 1:N) {
    C[i, i] = C[i, i] + delta;
  }

  // 2. Compute the root of C by Cholesky decomposition
  // (Stan wants the "root" of C, rather than C)
  R_C = cholesky_decompose(C);
}
generated quantities {
  // 3. Sample from the prior: multivariate_normal(0, C)
  f ~ multi_normal_cholesky(mu, R_C)
}

Visualizing the process

Default Prior Choices

\(\rho\) determines the “natural length scale” of correlations between time.

  • When you have longitudinal data, there are theoretical justifications for using \(\rho^{-1} \sim Gamma(5,5)\).

  • A priori the prior is concentrated around 1. For a prior mean, distance increase of 1 corrresponds to a multiplicative decay of correlation by \(e^{-1}\sim 37\%\).

  • Beware of the units! A year correlation is very different from that over a day.

Combining GP with Random Effects Regression

For each \(i\)-th eye: time effect vector is given an independent prior based on the GP model.

\[\boldsymbol{\eta}_i = (\eta_{i1},\ldots,\eta_{in_i})^\top \stackrel{ind}{\sim} N_{n_i}(\mathbf{0},\mathbf{C}_i)\]

Each \(\eta_{it}\) is a pointwise value of the process \(\eta_i(t) = \eta_{it}\).

\[\mathbf{C}_i = \begin{bmatrix} C(0) & C(|t_{i1}-t_{i2}|) & \cdots & C(|t_{i1} - t_{in_i}|)\\ C(|t_{i1} - t_{i2}|) & C(0) & \cdots & C(|t_{i,2} - t_{in_i}|)\\ \vdots & \vdots & \ddots & \vdots \end{bmatrix}\]

We can use a squared-exponential kernel \(C(|s-t|) = \alpha^2e^{-|s-t|^2/2\rho^2}\) and also draw posterior samples of \((\alpha,\rho)\).

The full model

For \(i\) (\(i = 1,\ldots,n\)) and \(t\) (\(t = 1,\ldots,n_i\)),

\[\begin{align*} Y_{it} &= \mathbf{x}_{i}\boldsymbol{\beta} + \eta_{i}(t) + \epsilon_{it}, \quad \epsilon_{it} \stackrel{iid}{\sim} N(0,\sigma^2),\\ \boldsymbol{\eta}_i &\stackrel{ind}{\sim} N_{n_i}(\mathbf{0},\mathbf{C}_i),\\ \boldsymbol{\Omega} &\sim f(\boldsymbol{\Omega}). \end{align*}\]

  • \(\boldsymbol{\Omega}\) represents the population-level unknown parameters: \((\boldsymbol{\beta},\sigma^2,\alpha,\rho)\).

First Look at the Data

head(dataset)
  pat_id eye_id mean_deviation      time      age      iop
1      1      1          -7.69 0.0000000 51.55616 10.87303
2      1      1          -9.95 0.5753425 51.55616 10.87303
3      1      1          -9.58 1.0547945 51.55616 10.87303
4      1      1          -9.53 1.5726027 51.55616 10.87303
5      1      1          -9.18 2.0136986 51.55616 10.87303
6      1      1          -9.63 2.5671233 51.55616 10.87303

Model Specification

We want to fit a model estimating the effects of age and iop on mean deviation, adjusting for nonlinear time effects:

\[\begin{align*} Y_{it} &= \underbrace{\beta_0 + \text{age}_i\beta_1 + \text{iop}_i\beta_2}_{=\mathbf{x}_{i}\boldsymbol{\beta}} + \eta_{it} + \epsilon_{it}\\ \boldsymbol{\eta}_i &= \begin{bmatrix} \eta_{i1}\\ \vdots\\ \eta_{in_i} \end{bmatrix} \stackrel{iid}{\sim} N(0,\mathbf{C}_i)\\ \epsilon_{it} &\stackrel{iid}{\sim} N(0,\sigma^2) \end{align*}\]

Our model for \(\eta_{it}\) is nonlinear in time: time enters the covariance structure of \(\mathbf{C}_i\) producing a function \(\eta_i(t) = \eta_{it}\).

Preparing the Data

A few data processing steps are needed to handle the data structure with more ease.

  • \(N\) will denote the number of all observations: \(N = \sum_{i=1}^{n} n_i\).

  • \(n\) will denote the total number of eyes.

  • \(p\) will denote the number of predictors fixed across time (\(p=3\): intercept, slopes for age and iop).

  • A separate vector of the number of observations (\(n_i\)) for each eye will be stored as s.

Preparing the Data

library(dplyr)

# Group predictors fixed across time
fixed_df <- dataset %>% 
  group_by(eye_id) %>% 
  reframe(age = unique(age), iop = unique(iop))
Xmat <- model.matrix(~age + iop, data = fixed_df)

# Number of measurements for each eye
groupsizes <- dataset %>% 
  group_by(eye_id) %>% 
  summarise(n = n()) %>% 
  pull(n)

stan_data <- list(
  N = dim(dataset)[1],
  n = max(dataset$eye_id),
  p = dim(Xmat)[2],
  t = dataset$time,
  Y = dataset$mean_deviation,
  s = groupsizes,
  X = Xmat
)

Conditional GP Model

data {
  int<lower=0> N;       // total number of observations
  int<lower=1> n;       // number of eyes
  int<lower=1> p;       // fixed effects dimension
  vector[N] Y;          // observation
  matrix[n, p] X;        // fixed effects predictors
  array[N] real t;      // obs time points
  array[n] int s;       // sizes of within-pt obs
}
transformed data {
  real delta = 1e-9;
}
parameters {
  // Fixed effects model
  vector[p] beta;
  real<lower=0> sigma;
  // GP parameters
  vector[N] z;
  real<lower=0> alpha;
  real<lower=0> rho;
}
transformed parameters {
  vector[n] mu = X * beta;
}
model {
  beta ~ normal(0,3);
  z ~ std_normal();
  alpha ~ std_normal();
  sigma ~ std_normal();
  rho ~ inv_gamma(5,5);

  vector[N] mu_rep;
  vector[N] eta;
  
  int pos;
  pos = 1;
  // Ragged loop computing the mean for each time obs
  for (i in 1:n) {
    // GP covariance for the k-th eye
    int n_i = s[i];
    int pos_end = pos + n_i - 1;
    matrix[n_i, n_i] R_C;
    matrix[n_i, n_i] C = gp_exp_quad_cov(segment(t, pos, n_i), alpha, rho);
    for (j in 1:n_i) {
      // adding a small term to the diagonal entries
      C[j, j] = C[j, j] + delta;
    }
    R_C = cholesky_decompose(C);
    
    // Mean of data at each time
    mu_rep[pos:pos_end] = rep_vector(mu[i], n_i);
    // GP for the i-th eye
    eta[pos:pos_end] = R_C * segment(z, pos, n_i);
    pos = pos_end + 1;
  }
  // Normal observation model centered at mu_rep + eta
  Y ~ normal(mu_rep + eta, sigma);
}

Assessing Convergence

traceplot(gp_condl_fit, pars = c("beta", "sigma", "alpha", "rho"))

Assessing Convergence

bayesplot::mcmc_acf(gp_condl_fit, regex_pars = c("beta", "sigma", "alpha", "rho"))

Marginal Model Specification

The conditional model is pretty slow (can be improved) and results in poor mixing (can be run longer). Marginalizing the random effect can stabilize the computation.

In vector notation, our observation model for the \(i\)-th eye is

\[ \mathbf{Y}_i = \begin{bmatrix} Y_{i1}\\ \vdots \\ Y_{in_i} \end{bmatrix} = \mathbf{x}_i\boldsymbol{\beta} + \boldsymbol{\eta}_i + \boldsymbol{\epsilon}_i \] By marginalizing out \(\boldsymbol{\eta}_i\), we obtain

\[ \mathbf{Y}_i \sim N_{n_i}(\mathbf{x}_i\boldsymbol{\beta},\mathbf{C}_i + \sigma^2\mathbf{I}_{n_i}) \]

Fitting a Marginal Model

Only the model segment has to be changed.

model {
  beta ~ normal(0,3);
  alpha ~ std_normal();
  sigma ~ std_normal();
  rho ~ inv_gamma(5,5);

  int pos;
  pos = 1;
  // Ragged loop computing joint likelihood for each eye
  for (i in 1:n) {
    // GP covariance for the k-th eye
    int n_i = s[i];
    vector[n_i] mu_rep;
    
    matrix[n_i, n_i] R_C;
    matrix[n_i, n_i] C = gp_exp_quad_cov(segment(t, pos, n_i), alpha, rho);
    for (j in 1:n_i) {
      // Add noise variance to the diagonal entries
      C[j, j] = C[j, j] + square(sigma);
    }
    R_C = cholesky_decompose(C);
    // Marginal model for the i-th eye
    mu_rep = rep_vector(mu[i], n_i);
    target += multi_normal_cholesky_lpdf(to_vector(segment(Y, pos, n_i)) | mu_rep, R_C);
    pos = pos + n_i;
  }
}

Assessing Convergence

traceplot(gp_marginal_fit, pars = c("beta", "sigma", "alpha", "rho"))

Assessing Convergence

bayesplot::mcmc_acf(gp_marginal_fit, regex_pars = c("beta", "sigma", "alpha", "rho"))

Estimates of eye-level Predictors

Learned GP hyperparameter

  • The posterior mean of \(\rho\) (in year) is very large: the pattern of mean deviation change is very gradual over time.

  • A posteriori we believe mean deviations of an eye, 10 years apart and adjusted for age and iop, are still strongly correlated (\(e^{-5/6}\sim 43\%\)).

Random effects covariance structure for the first 5 years

Predictive inference

The model is continuous in nature and defined, in principle, for all times. This is very convenient for visualiztion, interpolation, and forecasting.

\[ \mathbf{Y}_i = \underbrace{\mathbf{f}_{i}}_{ = \mathbf{x}_{i}\boldsymbol{\beta} + \boldsymbol{\eta}_{i} } + \boldsymbol{\epsilon}_i \]

The vector \(\mathbf{f}_{i}\) may be interpreted as the denoised latent process of an eye-specific mean deviation over time: \((f_{i1},\ldots,f_{in_i})^\top\).

\[ \mathbb{E}[f_{it}] = \mathbf{x}_i\boldsymbol{\beta},\; \mathbb{C}(f_{it}, f_{it'}) = \mathbb{C}(\eta_{it},\eta_{it'}) \]

Predictive Inference using Normal Conditioning

Say \(\mathbf{f}^{pred}_i\) is the values of \(f_i\) at \(m_i\) new time points which we want to predict, based on our GP model.

Key is that the random effects at observed times and new prediction time points are not independent.

\[\begin{align*} \begin{pmatrix} \mathbf{Y}_i \\ \mathbf{f}^{pred}_i \end{pmatrix} \sim N_{n_i + m_i}\left( \begin{pmatrix} \mathbf{x}_i\boldsymbol{\beta}\\ \mathbf{x}_i\boldsymbol{\beta} \end{pmatrix}, \begin{pmatrix} \mathbf{C}_i + \sigma^2\mathbf{I}_{n_i} & \mathbf{k}_i^\top \\ \mathbf{k}_i & \mathbf{C}^{pred}_i\end{pmatrix} \right), \end{align*}\] where \(\mathbf{k}_i = \mathbb{C}(\mathbf{f}^{pred}_i, \mathbf{f}_i)\) (“cross-covariances”).

An analytical formula exists for the conditional distribution of \(\mathbf{f}^{pred}\) given observed \(Y_{it}\) at \(n_i\) time points.

\[ \mathbf{f}^{pred}_i | \mathbf{Y}_i \sim N\left(\mathbf{x}_i\boldsymbol{\beta} + \mathbf{k}_i(\mathbf{C}_i + \sigma^2\mathbf{I})^{-1}(\mathbf{Y}_i - \mathbf{x}_i\boldsymbol{\beta}), \mathbf{C}_i^{pred}-\mathbf{k}_i(\mathbf{C}_i + \sigma^2\mathbf{I})^{-1}\mathbf{k}_i^\top\right) \]

New Data for Predictive Inference

Suppose we are interested in understanding the mean deviation trend for the first 5 years from baseline (\(t^{pred}\in [0,5]\)).

stan_data <- list(
  N = dim(dataset)[1],
  n = max(dataset$eye_id),
  p = dim(Xmat)[2],
  Y = dataset$mean_deviation,
  t = dataset$time,
  s = groupsizes,
  X = Xmat,
  # Time window for which we want predictive
  t_pred = seq(0, 5, by = 0.25),
  # Total length of this window
  N_pred = 21
)

Stan Implementation

See the Stan Help page for details.

functions {
  // Analytical formula for latent GP conditional on Gaussian observations
  vector gp_pred_rng(array[] real x_pred,
                     vector Y,
                     array[] real x,
                     real mu,
                     real alpha,
                     real rho,
                     real sigma,
                     real delta) {
    int N1 = rows(Y);
    int N2 = size(x_pred);
    vector[N2] f_pred;
    {
      matrix[N1, N1] L_Sigma;
      vector[N1] Sigma_div_y;
      matrix[N1, N2] C_x_xpred;
      matrix[N1, N2] v_pred;
      vector[N2] fpred_mu;
      matrix[N2, N2] cov_fpred;
      matrix[N2, N2] diag_delta;
      matrix[N1, N1] Sigma;
      Sigma = gp_exp_quad_cov(x, alpha, rho);
      for (n in 1:N1) {
        Sigma[n, n] = Sigma[n, n] + square(sigma);
      }
      L_Sigma = cholesky_decompose(Sigma);
      Sigma_div_y = mdivide_left_tri_low(L_Sigma, Y - mu);
      Sigma_div_y = mdivide_right_tri_low(Sigma_div_y', L_Sigma)';
      C_x_xpred = gp_exp_quad_cov(x, x_pred, alpha, rho);
      fpred_mu = (C_x_xpred' * Sigma_div_y);
      v_pred = mdivide_left_tri_low(L_Sigma, C_x_xpred);
      cov_fpred = gp_exp_quad_cov(x_pred, alpha, rho) - v_pred' * v_pred;
      diag_delta = diag_matrix(rep_vector(delta, N2));

      f_pred = multi_normal_rng(fpred_mu, cov_fpred + diag_delta);
    }
    return f_pred;
  }
}
//...
generated quantities {
  matrix[n,Np] f_pred;
  matrix[Np,Np] Cp;
  Cp = gp_exp_quad_cov(t_pred, alpha, rho);
  // Posterior predictive on fixed time grid for all eyes
  int pos;
  pos = 1;
  for (i in 1:n) {
    int n_i = s[i];
    f_pred[i,] = mu[i] + 
      gp_pred_rng(
          t_pred,
          segment(Y, pos, n_i),
          segment(t, pos, n_i),
          mu[i], 
          alpha,
          rho,
          sigma,
          delta
        )';
    pos = pos + n_i;
  }
}

Visualizing latent effects using predictive formulae

Stan speed concerns

  • GP is notorious for not being scalable. Tl;dr is the need for matrix root / inverse computation.

  • For datasets covered in this class, Stan works well. For research purposes, worth exploring other specialized toolkits.

  1. Do a bit of software research on Wikipedia

  2. Code MCMC yourself to make it faster (e.g., using Rcpp)

  3. Scalable Approximations (Activee research area)

A summary

  • GP is a flexible, high-dimensional model for handling correlated measurements over time.

  • With some basic knowledge about conditioning Gaussian random variables, we can implement posterior computation and prediction / interpolation.

  • Programming in Stan is straightforward but can be expensive with large number of observations.

Prepare for next class

  • Work on HW 04

  • Have a nice spring break!