Estimation of Gaussian covariance kernels using covariate information
Assuming the covariance kernel to be in a conditional RKHS, we propose a methodology to estimate it and its hyperparameters. The present work can be viewed as an extension of the I-prior methodology for regression (Bergsma, 2019; Bergsma and Jamil, 2023) to covariance estimation. Applications are not only in time series analysis (when time is the covariate), but also in regression when we want a flexible estimation of the error covariance.
Introduction
Consider a multivariate Gaussian distribution with mean 0 and precision matrix \(\Theta\):
\[ Y \sim \mathcal N_m(0, \Theta^{-1}) \]
The \(m\times m\) precision matrix, \(\Theta\), is the inverse of the covariance matrix, and hence a symmetric, positive definite matrix. Estimating the precision matrix is of paramount importance in various applications, such as graphical modeling and finance.
Prior distribution
A conjugate prior for the precision matrix of a multivariate Gaussian distribution is the Wishart distribution. The Wishart distribution is parameterized by an \(m\times m\) scale matrix \(W_0\) and degrees of freedom \(\nu\). The probability density function (pdf) of the Wishart distribution is given by:
\[ p(\Theta \mid W_0,\nu) = \frac{|\Theta|^{(\nu-m-1)/2} e^{-\operatorname{tr}(W_0^{-1}\Theta)/2}}{2^{\nu m/2}|W_0|^{\nu/2}\Gamma_m(\nu/2)} \]
where \(m\) is the dimension of the Gaussian distribution, \(\operatorname{tr}\) denotes the trace of a matrix, and \(\Gamma_m\) is the multivariate gamma function. We write \(\Theta \sim \mathcal W(W_0, \nu)\).
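As a quick sanity check, the Wishart prior is easy to simulate from in R via rWishart() in the base stats package; a minimal sketch (arbitrary scale matrix and degrees of freedom, chosen for illustration) verifies the prior mean \(\operatorname{E}[\Theta] = \nu W_0\) by Monte Carlo:

set.seed(1)
m <- 3
W0 <- diag(m)    # arbitrary scale matrix for illustration
nu <- 5          # degrees of freedom; the density requires nu > m - 1
draws <- rWishart(10000, df = nu, Sigma = W0)  # an m x m x 10000 array of draws
apply(draws, c(1, 2), mean)                    # approximately nu * W0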
Likelihood
Given a sample of \(n\) independent and identically distributed observations \(\mathcal Y = \{y_1, y_2, \ldots, y_n\}\), the likelihood of the data given the precision matrix \(\Theta\) is:
\[ L(\Theta \mid \mathcal Y) = \prod_{i=1}^{n} \frac{|\Theta|^{1/2} e^{-\frac{1}{2} y_i^\top \Theta y_i}}{(2\pi)^{m/2}} \]
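This likelihood is straightforward to evaluate numerically with mvtnorm::dmvnorm(); a minimal sketch (the helper name loglik_Theta is ours), assuming y is an \(n \times m\) matrix of observations:

# Log-likelihood of the precision matrix Theta for mean-zero Gaussian data
loglik_Theta <- function(Theta, y) {
  sum(mvtnorm::dmvnorm(y, sigma = solve(Theta), log = TRUE))
}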
Posterior distribution
Using Bayes’ theorem, the posterior distribution of \(\Theta\) given the data and the prior is proportional to the product of the likelihood and the prior:
\[ p(\Theta \mid \mathcal Y) \propto L(\Theta \mid \mathcal Y ) \times p(\Theta \mid W_0,\nu) \]
To derive the posterior distribution in closed form, we need only combine the terms in the likelihood and the prior that involve \(\Theta\):
\[ p(\Theta \mid \mathcal Y) \propto |\Theta|^{n/2 + (\nu-m-1)/2} e^{-\frac{1}{2} \sum_{i=1}^{n} y_i^\top \Theta y_i - \frac{1}{2} \operatorname{tr}(W_0^{-1}\Theta)} \] Write \(S=n^{-1}\sum_{i=1}^{n} y_i y_i^\top\) for the sample covariance matrix. Since \(\sum_{i=1}^{n} y_i^\top \Theta y_i = \operatorname{tr}(nS\Theta)\), the exponent collapses to \(-\frac{1}{2}\operatorname{tr}\big[(W_0^{-1} + nS)\Theta\big]\), and the posterior distribution is also a Wishart distribution with updated parameters:
\[ \Theta \mid \mathcal Y \sim \mathcal W( W^*, \nu^*) \] where \[ W^* = (W_0^{-1} + nS)^{-1} \] and \[ \nu^* = \nu + n. \]
The posterior mean of \(\Theta\) is \[ \operatorname{E}[ \Theta \mid \mathcal Y] = \nu^* W^* = (\nu + n)(W_0^{-1} + nS)^{-1}. \] It is worth noting that while the posterior distribution obtained when specifying an inverse-Wishart prior for the covariance matrix \(\Sigma = \Theta^{-1}\) is exactly the same as the one derived above, the posterior means differ. That is, \[ \operatorname{E}[ \Theta \mid \mathcal Y]^{-1} \neq \operatorname{E}[ \Sigma \mid \mathcal Y], \] and one cannot simply invert the posterior mean of \(\Theta\) to obtain the posterior mean of \(\Sigma\).
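The update is a one-line computation. The sketch below (the helper name posterior_Theta and the test inputs are ours) also checks numerically that inversion and posterior expectation do not commute, using the inverse-Wishart posterior mean \(\operatorname{E}[\Sigma \mid \mathcal Y] = (W^*)^{-1}/(\nu^* - m - 1)\):

# Conjugate update: Theta | Y ~ Wishart(Wstar, nustar)
posterior_Theta <- function(W0, nu, y) {
  n <- nrow(y)
  S <- crossprod(y) / n                  # sample covariance (mean-zero model)
  Wstar <- solve(solve(W0) + n * S)
  nustar <- nu + n
  list(Wstar = Wstar, nustar = nustar, mean = nustar * Wstar)
}

set.seed(1)
yy <- mvtnorm::rmvnorm(100, sigma = diag(2))
post <- posterior_Theta(W0 = diag(2), nu = 3, y = yy)
E_Theta <- post$mean
E_Sigma <- solve(post$Wstar) / (post$nustar - 2 - 1)  # inverse-Wishart posterior mean
solve(E_Theta) - E_Sigma    # nonzero: E[Theta | Y]^{-1} != E[Sigma | Y]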
Adding covariate information
Let \(\mathcal X\) be a set and let \(\mathcal F\) be a reproducing kernel Hilbert space (RKHS) on \(\mathcal X\) with reproducing kernel \(h:\mathcal X \times \mathcal X \to \mathbb R\). Then the tensor product \(\mathcal F \otimes \mathcal F\) is an RKHS on \(\mathcal X \times \mathcal X\) with reproducing kernel \(h \otimes h\). Let \(\Theta \in \mathcal F \otimes \mathcal F\) be a random element, and write \[ \Theta = \sum_{t,u=1}^m w_{tu} h(x_t,\cdot) \otimes h(x_u,\cdot). \] In other words, the \((i,j)\)th element of the precision matrix \(\Theta\) is given by \[ \Theta_{ij} = \sum_{t,u=1}^m w_{tu} h(x_t, x_i) h(x_u, x_j). \] Therefore \(\Theta = HWH\), where \(H \in \mathbb R^{m\times m}\) is the (symmetric) kernel matrix with elements \(h(x_i, x_j)\) and \(W \in \mathbb R^{m\times m}\) is the matrix with elements \(w_{tu}\). Give \(W\) a Wishart prior with scale matrix \(W_0\) and degrees of freedom \(\nu\). Since the Wishart family is closed under transformations \(W \mapsto AWA^\top\) for nonsingular \(A\), the matrix \(\Theta = HWH\) also has a Wishart distribution: \[ \Theta \sim \mathcal W ( HW_0H, \nu). \]
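This construction is easy to reproduce numerically. A minimal sketch, assuming the fractional Brownian motion kernel kern_fbm() from the iprior package (as used in the experiments below) and arbitrary prior parameters:

set.seed(1)
m <- 5
x <- rnorm(m)                    # covariate values
H <- iprior::kern_fbm(x)         # kernel matrix with entries h(x_i, x_j)
W <- rWishart(1, df = m, Sigma = diag(m))[, , 1]   # one draw W ~ Wishart(I, m)
Theta <- H %*% W %*% H           # a draw from Wishart(H I H, m)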
Deriving the posterior distribution for \(W\)
Let \(y = \{y_1, y_2, \ldots, y_n\}\) be independent and identically distributed observations from a multivariate Gaussian distribution with mean 0 and precision matrix \(\Theta = HWH\). Let \(x\in\mathbb R^m\) be a covariate, which is the same for each \(y_i\). The log-likelihood of the data given \(W\) is \[ \ell(W \mid y) = -\frac{nm}{2} \log(2\pi) + \frac{n}{2} \log |HWH| - \frac{n}{2} \operatorname{tr}(S\, HWH). \] By the conjugacy result above, the posterior distribution for \(\Theta\) is \[ \Theta \mid y \sim \mathcal W\big ([(HW_0H)^{-1} + nS]^{-1}, \nu + n\big), \] and transforming back via \(W = H^{-1}\Theta H^{-1}\) gives \[ W \mid y \sim \mathcal W \big( (W_0^{-1} + nHSH)^{-1}, \nu + n \big). \]
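The two parameterizations agree, since \(H W^* H = [(HW_0H)^{-1} + nS]^{-1}\) where \(W^* = (W_0^{-1} + nHSH)^{-1}\). A quick numerical check with arbitrary inputs (kern_fbm() from the iprior package again assumed):

set.seed(1)
m <- 4
n <- 50
H <- iprior::kern_fbm(rnorm(m))
W0 <- diag(m)
S <- crossprod(mvtnorm::rmvnorm(n, sigma = diag(m))) / n
Wstar <- solve(solve(W0) + n * H %*% S %*% H)
Thetastar <- solve(solve(H %*% W0 %*% H) + n * S)
max(abs(H %*% Wstar %*% H - Thetastar))   # zero up to rounding error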
Setting an I-prior
The covariance kernel of \(\Theta\) should be proportional to its Fisher information. For a multivariate normal distribution with mean \(\mu\) and precision matrix \(\Theta\), the Fisher information for the precision matrix is given by \(I(\Theta) = \frac{n}{2} \Theta^{-1} \otimes \Theta^{-1}\), where \(\otimes\) here denotes the Kronecker product. (The expression \(\frac{n}{2} \Theta \otimes \Theta\) is instead the Fisher information for the covariance matrix \(\Sigma = \Theta^{-1}\).)
On the other hand, given \(W\sim \mathcal W(W_0, \nu)\), we have \[ \operatorname{Var}(\operatorname{vec}(W)) = \nu\, (I_{m^2} + K_m) (W_0 \otimes W_0), \] where \(K_m\) is the commutation matrix, so the prior covariance of \(W\) is proportional to \(W_0 \otimes W_0\). Hence, if the covariance kernel of the prior on \(W\) is required to be proportional to \(I(\Theta)\), we should take \(W_0 \propto \Theta^{-1}\). An I-prior on \(\Theta\) might therefore look like \[ W \sim \mathcal W \big( \Theta^{-1}, \nu \big). \] Since \(\Theta^{-1}\) is unknown, we can use an empirical plug-in and set \(W_0 = S\), the sample covariance matrix; this is the choice used in the experiments below.
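Elementwise, the Wishart covariance formula gives \(\operatorname{Var}(W_{ij}) = \nu (W_{0,ii} W_{0,jj} + W_{0,ij}^2)\), which is easy to verify by Monte Carlo (arbitrary \(W_0\) and \(\nu\) for illustration):

set.seed(1)
W0 <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
nu <- 7
draws <- rWishart(1e5, df = nu, Sigma = W0)
var(draws[1, 2, ])                          # empirical Var(W_12)
nu * (W0[1, 1] * W0[2, 2] + W0[1, 2]^2)     # theoretical value, 15.75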
Experiments
Here the idea is to simulate data from a multivariate normal distribution with a known precision matrix \(\Theta\). Then we will try to recover \(\Theta\) using the I-prior.
Simulate data
The true \(\Theta\) is made dependent on the covariate \(x\).

library(corrplot)  # corrplot() and COL2()
library(iprior)    # kern_fbm(), the fractional Brownian motion kernel

m <- 5                                         # dimension of Y
x <- rnorm(m)                                  # covariate values
Theta0 <- x %*% t(x) + diag(rep(0.0001, m))    # true precision matrix, a function of x
Sigma0 <- solve(Theta0)                        # true covariance matrix
n <- 10000                                     # sample size
y <- mvtnorm::rmvnorm(n = n, sigma = Sigma0)   # n draws from N(0, Sigma0)
H <- kern_fbm(x)                               # kernel matrix H with entries h(x_i, x_j)
corrplot(Theta0, is.corr = FALSE, col = COL2('RdBu', 10), addCoef.col = 'black')
Posterior estimate
# Posterior mean of I-prior estimate for Theta
S <- crossprod(y) / n                          # sample covariance (model mean is zero)
What <- solve(n * H %*% S %*% H + solve(S))    # W* = (W0^{-1} + n H S H)^{-1}, with W0 = S
nuhat <- nrow(What) + n                        # nu* = nu + n, taking nu = m
Theta_hat <- H %*% (What * nuhat) %*% H        # E[Theta | y] = nu* H W* H
corrplot(Theta_hat, is.corr = FALSE, col = COL2('RdBu', 10), addCoef.col = 'black')
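As a simple check of recovery, we can compare the estimate against the truth, for example via the Frobenius norm:

norm(Theta_hat - Theta0, type = "F")   # Frobenius distance to the true precision matrix
max(abs(Theta_hat - Theta0))           # largest elementwise error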