Estimation of Gaussian covariance kernels using covariate information

Authors

Wicher Bergsma (London School of Economics and Political Science)

Haziq Jamil (Universiti Brunei Darussalam)

Abstract

Assuming the covariance kernel to be in a conditional RKHS, we propose a methodology to estimate it and its hyperparameters. The present work can be viewed as an extension of the I-prior methodology for regression (Bergsma, 2019; Bergsma and Jamil, 2023) to covariance estimation. Applications are not only in time series analysis (when time is the covariate), but also in regression when we want a flexible estimation of the error covariance.

Introduction

Consider a multivariate Gaussian distribution with mean 0 and precision matrix Θ:

$$Y \sim \mathrm{N}_m(0, \Theta^{-1})$$

The m×m precision matrix, Θ, is the inverse of the covariance matrix, and hence a symmetric, positive definite matrix. Estimating the precision matrix is of paramount importance in various applications, such as graphical modeling and finance.

Prior distribution

A conjugate prior for the precision matrix of a multivariate Gaussian distribution is the Wishart distribution, parameterized by an $m \times m$ scale matrix $W_0$ and degrees of freedom $\nu$. The probability density function (pdf) of the Wishart distribution is given by:

$$p(\Theta \mid W_0, \nu) = \frac{|\Theta|^{(\nu - m - 1)/2} \, e^{-\operatorname{tr}(W_0^{-1}\Theta)/2}}{2^{\nu m/2} \, |W_0|^{\nu/2} \, \Gamma_m(\nu/2)},$$

where $m$ is the dimension of the Gaussian distribution, $\operatorname{tr}$ denotes the trace of a matrix, and $\Gamma_m$ is the multivariate gamma function. We write $\Theta \sim \mathrm{W}(W_0, \nu)$.
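As a quick sanity check of this parameterization, one can draw from the prior with stats::rWishart and verify that the sample mean of the draws approaches the prior mean $\nu W_0$. A minimal sketch; the values of m, nu, and W0 here are arbitrary:

# Sketch: the empirical mean of Wishart draws should approach nu * W0
set.seed(1)
m <- 3
W0 <- diag(m)
nu <- 7
draws <- rWishart(5000, df = nu, Sigma = W0)  # m x m x 5000 array
round(apply(draws, c(1, 2), mean), 2)         # approximately nu * W0 = 7 * I_3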

Likelihood

Given a sample of $n$ independent and identically distributed observations $Y = \{y_1, y_2, \ldots, y_n\}$, the likelihood of the data given the precision matrix $\Theta$ is:

$$L(\Theta \mid Y) = \prod_{i=1}^n \frac{|\Theta|^{1/2} \, e^{-\frac{1}{2} y_i^\top \Theta y_i}}{(2\pi)^{m/2}}$$

Posterior distribution

Using Bayes’ theorem, the posterior distribution of Θ given the data and the prior is proportional to the product of the likelihood and the prior:

$$p(\Theta \mid Y) \propto L(\Theta \mid Y) \times p(\Theta \mid W_0, \nu)$$

To derive the posterior distribution in closed form, we need only collect the terms in the likelihood and the prior that involve $\Theta$:

$$p(\Theta \mid Y) \propto |\Theta|^{n/2 + (\nu - m - 1)/2} \, e^{-\frac{1}{2}\sum_{i=1}^n y_i^\top \Theta y_i \, - \, \frac{1}{2}\operatorname{tr}(W_0^{-1}\Theta)}$$

Write $S = n^{-1}\sum_{i=1}^n y_i y_i^\top$ for the sample covariance matrix, so that $\sum_{i=1}^n y_i^\top \Theta y_i = \operatorname{tr}\big(\Theta \sum_{i=1}^n y_i y_i^\top\big) = n\operatorname{tr}(S\Theta)$. Collecting terms, the posterior distribution is also a Wishart distribution with updated parameters:

$$\Theta \mid Y \sim \mathrm{W}(W_n, \nu_n), \quad \text{where } W_n = (W_0^{-1} + nS)^{-1} \text{ and } \nu_n = \nu + n.$$

The posterior mean of $\Theta$ is $\mathrm{E}[\Theta \mid Y] = \nu_n W_n = (\nu + n)(W_0^{-1} + nS)^{-1}$. It is worth noting that while the posterior distribution obtained when specifying an inverse-Wishart prior for the covariance matrix $\Sigma = \Theta^{-1}$ is exactly the same as the one derived above, the posterior means differ. That is, $\mathrm{E}[\Theta \mid Y]^{-1} \neq \mathrm{E}[\Sigma \mid Y]$, and one cannot simply invert the posterior mean of $\Theta$ to obtain the posterior mean of $\Sigma$.
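The conjugate update is straightforward to implement. Below is a minimal sketch in R (the helper name wishart_posterior is ours, not from any package), assuming the observations are stored row-wise in a matrix and the mean is known to be zero:

# Hypothetical helper: conjugate Wishart update for the precision matrix
wishart_posterior <- function(Y, W0, nu) {
  n <- nrow(Y)
  S <- crossprod(Y) / n             # sample covariance (mean zero)
  W_n <- solve(solve(W0) + n * S)   # updated scale matrix
  nu_n <- nu + n                    # updated degrees of freedom
  list(W = W_n, nu = nu_n,
       Theta_mean = nu_n * W_n)     # posterior mean E[Theta | Y]
}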

Adding covariate information

Let $\mathcal{X}$ be a set and let $\mathcal{F}$ be a reproducing kernel Hilbert space (RKHS) on $\mathcal{X}$ with reproducing kernel $h : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Then the tensor product $\mathcal{F} \otimes \mathcal{F}$ is an RKHS on $\mathcal{X} \times \mathcal{X}$ with reproducing kernel $h \otimes h$. Let $\Theta \in \mathcal{F} \otimes \mathcal{F}$ be a random element with prior distribution $p(\Theta)$, say the Wishart distribution. Write

$$\Theta = \sum_{t,u=1}^m w_{tu} \, h(x_t, \cdot) \otimes h(x_u, \cdot).$$

In other words, the $(i,j)$th element of the precision matrix $\Theta$ is given by

$$\Theta_{ij} = \sum_{t,u=1}^m w_{tu} \, h(x_t, x_i) \, h(x_u, x_j).$$

Therefore $\Theta = HWH$, where $H \in \mathbb{R}^{m \times m}$ is the kernel matrix with elements $h(x_i, x_j)$ and $W \in \mathbb{R}^{m \times m}$ is the matrix with elements $w_{tu}$. Since $\Theta$ has a Wishart distribution, necessarily $W$ also has a Wishart distribution. Taking the prior distribution of $W$ to be Wishart with scale matrix $W_0$ and degrees of freedom $\nu$, we have $\Theta \sim \mathrm{W}(HW_0H, \nu)$.
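As a concrete check of the identity $\Theta = HWH$, the short sketch below compares the matrix product against the elementwise formula for one entry (the kernel $h$ is an arbitrary illustrative choice, not the one used later):

# Illustrative check that (H W H)_{ij} equals sum_{t,u} w_{tu} h(x_t, x_i) h(x_u, x_j)
set.seed(2)
m <- 4
x <- rnorm(m)
h <- function(s, t) exp(-(s - t)^2)           # arbitrary kernel for illustration
H <- outer(x, x, h)                           # kernel matrix
W <- crossprod(matrix(rnorm(m^2), m))         # a symmetric positive definite W
Theta <- H %*% W %*% H
sum(W * outer(H[, 1], H[, 2])) - Theta[1, 2]  # zero (up to rounding) for (i, j) = (1, 2)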

Deriving the posterior distribution for W

Let $y = \{y_1, y_2, \ldots, y_n\}$ be independent and identically distributed observations from a multivariate Gaussian distribution with mean 0 and precision matrix $\Theta$, and let $x \in \mathbb{R}^m$ be the vector of covariate values, the same for each $y_i$. The log-likelihood of the data given $W$, with $\Theta = HWH$, is

$$\ell(\Theta \mid y) = -\frac{nm}{2}\log(2\pi) + \frac{n}{2}\log|\Theta| - \frac{n}{2}\operatorname{tr}(S\Theta).$$

The posterior distribution for $\Theta$ is then

$$\Theta \mid y \sim \mathrm{W}\big([(HW_0H)^{-1} + nS]^{-1}, \, \nu + n\big),$$

which suggests that

$$W \mid y \sim \mathrm{W}\big((W_0^{-1} + nHSH)^{-1}, \, \nu + n\big).$$
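The step from the $\Theta$-posterior to the $W$-posterior uses the transformation property of the Wishart distribution: if $\Theta \sim \mathrm{W}(M, \nu)$ and $\Theta = HWH$ with $H$ invertible, then $W = H^{-1}\Theta H^{-1} \sim \mathrm{W}(H^{-1}MH^{-1}, \nu)$. Indeed,

$$H^{-1}\big[(HW_0H)^{-1} + nS\big]^{-1}H^{-1} = \Big[H\big((HW_0H)^{-1} + nS\big)H\Big]^{-1} = \big(W_0^{-1} + nHSH\big)^{-1}.$$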

Setting an I-prior

The covariance kernel of $\Theta$ should be proportional to its Fisher information. For a multivariate normal distribution with mean $\mu$ and precision matrix $\Theta$, the Fisher information for the precision matrix is given by $I(\Theta) = \frac{n}{2}\, \Theta^{-1} \otimes \Theta^{-1}$, where $\otimes$ denotes the Kronecker product.

On the other hand, given $W \sim \mathrm{W}(W_0, \nu)$, we have $\operatorname{Var}(\operatorname{vec}(W)) = \nu\,(I_{m^2} + K_m)(W_0 \otimes W_0)$, where $K_m$ is the commutation matrix; up to symmetrization this is proportional to $W_0 \otimes W_0$. Hence, if it is required that the covariance kernel of the prior be proportional to $I(\Theta)$, the scale matrix should satisfy $W_0 \propto \Theta^{-1}$. An I-prior on $\Theta$ might therefore look like $W \sim \mathrm{W}(\Theta^{-1}, \nu)$, and replacing the unknown $\Theta^{-1}$ by its empirical counterpart amounts to setting $W_0 = S$.
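A quick Monte Carlo check of the variance claim (illustrative only; dimensions and parameter values are arbitrary): for $W \sim \mathrm{W}(W_0, \nu)$, the diagonal case gives $\operatorname{Var}(W_{11}) = 2\nu w_{0,11}^2$.

# Monte Carlo check: Var(W_11) should be near 2 * nu * W0[1,1]^2 = 20
set.seed(3)
W0 <- diag(2)
nu <- 10
draws <- rWishart(20000, df = nu, Sigma = W0)
var(draws[1, 1, ])  # approximately 20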

Experiments

Here the idea is to simulate data from a multivariate normal distribution with a known precision matrix Θ. Then we will try to recover Θ using the I-prior.

Simulate data

The true Θ is made dependent on the covariate x.

library(iprior)    # for kern_fbm()
library(corrplot)  # for corrplot() and COL2()

set.seed(123)  # for reproducibility

m <- 5
x <- rnorm(m)                                 # covariate values
Theta0 <- x %*% t(x) + diag(rep(1e-4, m))     # true precision: rank one in x plus a small ridge
Sigma0 <- solve(Theta0)                       # true covariance
n <- 10000
y <- mvtnorm::rmvnorm(n = n, sigma = Sigma0)  # n draws from N_m(0, Sigma0)
H <- kern_fbm(x)                              # fractional Brownian motion kernel matrix

corrplot(Theta0, is.corr = FALSE, col = COL2('RdBu', 10), addCoef.col = 'black')

Posterior estimate

# Posterior mean of I-prior estimate for Theta
S <- crossprod(y) / n                        # sample covariance (mean known to be zero)
What <- solve(n * H %*% S %*% H + solve(S))  # posterior scale (W0^{-1} + n H S H)^{-1} with W0 = S
nuhat <- nrow(What) + n                      # posterior df nu + n, taking prior nu = m
Theta_hat <- H %*% (What * nuhat) %*% H      # posterior mean: H E[W | y] H = nuhat * H What H

corrplot(Theta_hat, is.corr = FALSE, col = COL2('RdBu', 10), addCoef.col = 'black')
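As a rough numerical check of recovery (not part of the plots above), one can compare the estimate with the truth via the relative Frobenius error:

# Relative Frobenius error of the I-prior estimate
norm(Theta_hat - Theta0, type = "F") / norm(Theta0, type = "F")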