%\VignetteEncoding{UTF-8}
%\VignetteIndexEntry{a guide to the hetGP package}
%\VignetteEngine{knitr::knitr}
\documentclass[nojss]{jss}
%% -- LaTeX packages and custom commands ---------------------------------------
%% recommended packages
\usepackage{thumbpdf,lmodern}
\usepackage{amsthm}
\usepackage{amsfonts}
\usepackage{amsmath}
%% another package (only for this demo article)
\usepackage{framed}
%% other packages
\usepackage[boxed]{algorithm}
\usepackage{algpseudocode}
\usepackage{algorithmicx}
\usepackage{thumbpdf} %recommanded by jss
%% new custom commands
\newcommand{\class}[1]{`\code{#1}'}
\newcommand{\fct}[1]{\code{#1()}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\xu}{\bar{\mathbf{x}}} %unique design
\newcommand{\yu}{\bar{\mathbf{y}}} % averaged observations at \xu
\newcommand{\R}{\mathbb{R}}
\newcommand{\veck}{\mathbf{k}}
\newcommand{\vecY}{\mathbf{y}}
\newcommand{\vecYu}{\bar{\mathbf{y}}}
\newcommand{\Deltan}{\boldsymbol{\Delta}_n}
\newcommand{\Lan}{\boldsymbol{\Lambda}_n}
\newcommand{\LaN}{\boldsymbol{\Lambda}_N}
\newcommand{\An}{\mathbf{A}_n}
\newcommand{\Cn}{\mathbf{C}_n}
\newcommand{\CN}{\mathbf{C}_N}
\newcommand{\Cg}{\mathbf{C}_{(g)}}
\newcommand{\Kn}{\mathbf{K}_n}
\newcommand{\KN}{\mathbf{K}_N}
\newcommand{\LambdaN}{\mathbf{K}_N}
\newcommand{\SiN}{\boldsymbol{\Sigma}_N}
\newcommand{\Sn}{\mathbf{S}_n}
\newcommand{\SN}{\mathbf{S}_N}
\newcommand{\bm}[1]{\mbox{\boldmath $#1$}}
\newcommand{\ones}{\bm{1}}
% \newcommand{\Un}{\boldsymbol{\Upsilon}_n}
\newcommand{\Un}{\mathbf{K}_n}
% \newcommand{\Ug}{\boldsymbol{\Upsilon}_{(g)}}
\newcommand{\Ug}{\mathbf{K}_{(g)}}
\newcommand{\XN}{\mathbf{X}_N}
\newcommand{\Xn}{\mathbf{X}_n}
%% For Sweave-based articles about R packages:
%% need no \usepackage{Sweave}
<>=
library("knitr")
## cache can be set to TRUE
opts_chunk$set(
engine='R', tidy=FALSE, cache=FALSE, autodep=TRUE
)
render_sweave() # For JSS when using knitr
knitr::opts_chunk$set(fig.pos = 'ht!')
@
<>=
options(prompt = "R> ", continue = "+ ", width = 70, useFancyQuotes = FALSE, scipen = 5)
@
%% -- Article metainformation (author, title, ...) -----------------------------
%% - \author{} with primary affiliation
%% - \Plainauthor{} without affiliations
%% - Separate authors by \And or \AND (in \author) or by comma (in \Plainauthor).
%% - \AND starts a new line, \And does not.
\author{Micka\"el Binois \\ Argonne National Laboratory
\And Robert B. Gramacy \\Virginia Tech}
\Plainauthor{Micka\"el Binois, Robert B. Gramacy}
%% - \title{} in title case
%% - \Plaintitle{} without LaTeX markup (if any)
%% - \Shorttitle{} with LaTeX markup (if any), used as running title
\title{{\tt hetGP}: Heteroskedastic Gaussian Process Modeling and Sequential Design in \proglang{R}}
\Plaintitle{Heteroskedastic Gaussian Process Modeling and Sequential Design in R}
\Shorttitle{Heteroskedastic Gaussian Processes in \proglang{R}}
%% - \Abstract{} almost as usual
\Abstract{
An increasing number of time-consuming simulators exhibit a complex noise
structure that depends on the inputs. For conducting studies with limited
budgets of evaluations, new surrogate methods are required in order to
simultaneously model the mean and variance fields. To this end, we present the
\pkg{hetGP} package, implementing many recent advances in Gaussian process
modeling with input-dependent noise. First, we describe a simple, yet
efficient, joint modeling framework that relies on replication for both speed
and accuracy. Then we tackle the issue of data acquisition leveraging
replication and exploration in a sequential manner for various goals, such as
for obtaining a globally accurate model, for optimization, or for contour
finding. Reproducible illustrations are provided throughout. }
%% - \Keywords{} with LaTeX markup, at least one required
%% - \Plainkeywords{} without LaTeX markup (if necessary)
%% - Should be comma-separated and in sentence case.
\Keywords{input-dependent noise, level-set estimation, optimization, replication, stochastic kriging}
\Plainkeywords{input-dependent noise, level-set estimation, optimization, replication, stochastic kriging}
%% - \Address{} of at least one author
%% - May contain multiple affiliations for each author
%% (in extra lines, separated by \emph{and}\\).
%% - May contain multiple authors for the same affiliation
%% (in the same first line, separated by comma).
\Address{
Micka\"el Binois\\
Mathematics and Computer Science Division\\
Argonne National Laboratory\\
9700 Cass Ave.\\
Lemont, IL 60439, United States of America\\
E-mail: \email{mbinois@mcs.anl.gov}\\
URL: \email{https://sites.google.com/site/mickaelbinoishomepage/}
Robert B. Gramacy\\
Department of Statistics\\
Virginia Tech\\
250 Drillfield Drive\\
Blacksburg, VA 24061, United States of America\\
E-mail: \email{rbg@vt.edu}\\
URL: \url{http://bobby.gramacy.com/}
}
\begin{document}
\section{Introduction}
\label{sec:intro}
Historically, computer experiments have been associated with deterministic
black-box functions; see, for example, \citet{Sacks1989}. Gaussian process
(GP) interpolators are popular in this setting. Recently, determinism has
become less common for more complex simulators, e.g., those relying on agents.
Stochastic simulation has opened up many avenues for inquiry in the applied
sciences. In data in geostatistics \citep{banerjee2004hierarchical} and machine
learning \citep{Rasmussen2006}, where high-fidelity modeling also involves
GPs, not only is noise common but frequently regression tasks involve
signal-to-noise ratios that may be low or changing over the input space. In
this context, whether for stochastic simulation or for data from observational
studies, more samples are needed in order to isolate the signal, and learning
the variance is at least as important as the mean. Yet GPs buckle under the
weight of even modestly big data. Moreover, few options for heteroskedastic
modeling exist.
Replication, meaning repetition of experimental runs at the same design with
different realizations or measurements, provides a powerful tool for
separating signal from noise, offering a pure look at (local) variability.
Replication can also yield computational savings in inference, and thus holds
the potential to capture input-dependent dynamics in variance in GP
regression. For example, in the operations research community, stochastic
kriging \cite[SK;][]{Ankenman2010} offers approximate methods that exploit
large degrees of replication, coupling GP regression modeling on the mean and
the variance, through hybrid likelihood and method-of-moments (MoM) based
inference, toward efficient heteroskedastic modeling. A downside to SK is a
dependence on large degrees of replication in order to support the MoM method.
\citet{hetGP1} provide similar results but relaxing the necessity of having a
large degree of replication, and place inference fully within a likelihood
framework. That methodology is the primary focus of the implementation
described here. Other approaches for achieving a degree of input-dependent
variance include quantile kriging \citep{Plumlee2014}; the use of pseudoinputs
\citep{snelson:ghahramani:2005}, also sometimes called a predictive process
\citep{Banerjee2008}; and (non-GP-based) tree methods \citep{Pratola2017b}.
Although the \pkg{hetGP} package has many aspects in common, and in some cases
is directly inspired by these approaches, none of these important
methodological contributions are, to our knowledge, coupled with an open
source \proglang{R} implementation.
In the computer experiments and machine learning communities, sequential
design of experiments/active learning \citep[e.g.,][]{seo00,gra:lee:2009} and
Bayesian optimization \citep[e.g.,][]{snoek:etal:2012} are a primary goal of
modeling. The \pkg{hetGP} package provides hooks for learning and design to
reduce predictive variance, organically determining an input-dependent degree
of replication required to efficiently separate signal from noise
\citep{hetGP2}; for Bayesian optimization; for contour finding
\citep{Lyu2018}, and for leptokurtic responses \citep{Shah2014,Chung2018}. A
description and detailed examples are provided herein.
Although many \proglang{R} packages are available on CRAN for GP spatial and
computer surrogate modeling \citep[e.g.,][provide a nice empirical review and
comparison]{Erickson2017}, we are not aware of any others that provide an
efficient approach to fully likelihood-based coupled GP mean and variance
modeling for heteroskedastic processes and, moreover, that provide a
comprehensive approach to sequential design in that context. The \pkg{tgp}
\citep{Gramacy2007,Gramacy2010} and
\pkg{laGP} \citep{Gramacy2016} packages offer a limited degree of
nonstationary and heteroskedastic GP modeling features via partitioning, which
means they cannot capture smooth dynamics in mean and variance, nor can they
efficiently handle replication in the design. Methods in the \pkg{hetGP}
squarely target filling that void. The \pkg{DiceKriging} \citep{Roustant2012}
package is often used for SK experiments; however, users must preprocess the
data and calculate their own moment-based variance estimates. The \pkg{mlegp}
package \citep{mlegp} offers a more hands-off SK experience, facilitating
replicate detection, but without coupled modeling and sequential design
features. Other \proglang{R} packages include \pkg{GPfit} \citep{GPfit}, which
focuses expressly on deterministic modeling, whereas those like \pkg{kergp}
\citep{kergp} target flexible covariance kernels, including additive or
qualitative versions, in the homoskedastic setting.
The remainder of this paper is organized as follows. Section \ref{sec:gp}
reviews \pkg{hetGP}'s approach to ordinary (homoskedastic) GP regression, and
linear algebra identities that enable substantial speedups when the design has
a high degree of replication. Section \ref{sec:hetGP} details \pkg{hetGP}'s
latent variance approach to joint GP likelihood-based inference to accommodate
heteroskedastic processes. Section \ref{sec:seqdes} covers sequential design
for \pkg{hetGP} models targeting reduction in prediction error, Bayesian
optimization, and contour finding. Emphasis is on considerations in
implementation, and examples of use in practice, throughout. We conclude in
Section \ref{sec:summary} with a brief summary and discussion.
\section{Gaussian process modeling under replication}
\label{sec:gp}
Here we address computer experiments involving a function $f(\x): \x \in \R^d
\rightarrow \R$ requiring expensive simulation to \emph{approximate} a
real-world (physical) relationship. An {\em emulator} $\hat{f}_N$ is a
regression or response surface fit to input-output pairs $(\x_1, y_1), \dots,
(\x_N, y_N)$, where $y_i = f(\x_i) + \epsilon_i$, with a centered
(Gaussian) random noise $\epsilon_i$, e.g., $\epsilon_i \sim \mathcal{N}(0,
r(\x_i))$. Predictive equations from
$\hat{f}_N$, notated as $\hat{f}_N(\x')$ for a new location $\x'$, may serve as
a cheap {\em surrogate} for $f(\x')$ at new inputs $\x'$ for visualization,
sensitivity analysis, optimization, and so forth. The \pkg{hetGP} package targets
thrifty GP surrogates for stochastic computer model
simulations whose variance and mean are both believed to vary nonlinearly.
Although our emphasis and vocabulary are tilted toward computer surrogate
modeling throughout, many of the techniques we describe are equally well
suited to applications in machine learning and geostatistics or anywhere else
GP regression may be applied. Many of the methods presented here are inspired
by advances in those areas. We begin by reviewing
\pkg{hetGP}'s approach to conventional, homoskedastic GP modeling involving a
large degree of replication relative to the number of unique sites where
computer simulation responses are measured.
\subsection{Gaussian process review}
\label{sec:rev}
{\em Gaussian process regression} is a form of spatial modeling via
multivariate normal (MVN) distributions, assuming that the data comes from a GP denoted by $Y$. In the computer surrogate modeling
community, one commonly takes a mean-zero GP formulation, moving all the
modeling action to the covariance structure, which is usually specified as a
decreasing function of distance. The formulation below uses a scaled
separable (or anisotropic) Gaussian kernel:
\begin{align}
\mathbf{Y}_N &\sim \mathcal{N}_N(0, \nu \KN) \label{eq:k} \\
\mbox{with} \quad \KN&=(K_{\theta}(\x_i, \x_j) + \tau^2 \delta_{i=j})_{1 \leq i,j \leq N}, ~\mbox{and}~
K_{\theta}(\x_i, \x_j) =
\exp\left\{ - \sum_{k=1}^d \frac{(x_k - x'_k)^2}{\theta_k} \right\}. \nonumber
\end{align}
{\em Lengthscale}s $\boldsymbol{\theta} = (\theta_1,\dots,\theta_p)$ govern
the rate of decay of correlation as a function of coordinate-wise distance;
{\em scale} $\nu$ adjusts the amplitude of $\mathbf{Y}_N$ realizations, and
{\em nugget} $\tau^2$ implements an iid (i.e., nonspatial) variance component
for noisy data.
For deterministic computer simulations, a zero nugget ($\tau^2 = 0$) is common, although authors
have argued that sometimes that can be inefficient \citep{gra:lee:2012}. Other
forms for the correlation kernel $K_\theta(\cdot, \cdot)$ are prevalent, and we
shall discuss several others and their ``hyperparameters'' in due course.
Although we shall mostly assume a constant zero trend throughout for notational
conciseness, extensions like polynomial trends are relatively straightforward.
For example, the \pkg{hetGP} package allows a constant mean to be estimated, which can
be helpful when GPs are used to smooth latent variances, as discussed in more
detail in Section \ref{sec:hetGP}.
GPs make popular surrogates because their predictions are accurate, have
appropriate coverage, and can interpolate when the nugget is zero. The beauty
of GP modeling is evident in the form of its predictive equations, which arise
as a simple application of conditioning for MVN joint distributions. Let $D_N
= (\XN, \vecY_N)$ denote the data. GP prediction $Y(\x) \mid D_N$ at a new
location $\x$ follows
\[
Y(\x) \mid D_N \sim \mathcal{N}(\mu_N(\x), \sigma_N^2(\x))
\]
with
\begin{align}
\mbox{mean } \quad \mu_N(\x) &= \veck_N(\x)^\top \KN^{-1} \vecY_N \\
\mbox{and variance } \quad \sigma^2_N(\x) &= \nu K_\theta(\x, \x) - \nu \veck_N(\x)^\top \KN^{-1} \veck_N(\x),
\end{align}
where $\veck_N(\x) = (K_\theta(\x, \x_1), \dots, K_\theta(\x,\x_N))^\top$.
These are the so-called kriging equations in spatial statistics. They
describe the best (minimizing MSPE) linear unbiased predictor (BLUP).
If the covariance structure is hyperparameterized, as it
is in Equation~(\ref{eq:k}), the multivariate structure can emit a log
likelihood that may be utilized for inference for unknown hyperparameters
$(\nu, \boldsymbol{\theta}, \tau^2)$:
\[
\ell = \log L = -\frac{N}{2} \log 2\pi - \frac{N}{2} \log \nu
- \frac{1}{2} \log |\KN| - \frac{1}{2\nu} \vecY_N^\top \KN^{-1} \vecY_N.
\]
To maximize $\ell$ with respect to $\nu$, say, one differentiates and solves:
\begin{align*}
0 \stackrel{\mathrm{set}}{=} \ell'(\nu) \equiv \frac{\partial \ell}{\partial \nu}
&= - \frac{N}{2 \nu} + \frac{1}{2 \nu^2} \vecY_N^\top \KN^{-1} \vecY_N \\
\hat{\nu} &= \frac{\vecY_N^\top \KN^{-1} \vecY_N}{N}.
\end{align*}
The quantity $\hat{\nu}$ is like a mean residual sum of squares. Now, plugging
$\hat{\nu}$ into $\ell$ gives the so-called concentrated log-likelihood,
\begin{align*}
\ell(\tau^2, \boldsymbol{\theta}) &= -\frac{N}{2} \log 2\pi - \frac{N}{2} \log \hat{\nu}
- \frac{1}{2} \log |\KN| - \frac{1}{2\hat{\nu}} \vecY_N^\top \KN^{-1} \vecY_N \\
&= c - \frac{N}{2} \log \vecY_N^\top \KN^{-1} \vecY_N - \frac{1}{2} \log |\KN|.
\end{align*}
Using the chain rule and that
\[
\frac{\partial \KN^{-1}}{\partial \cdot}
= - \KN^{-1} \frac{\partial \KN}{\partial \cdot} \KN^{-1} \quad \mbox{ and } \quad
\frac{\partial \log | \KN | }{\partial \cdot}
= \mathrm{tr} \left \{ \KN^{-1} \frac{\partial \KN}{\partial \cdot} \right\}
\]
yields closed-form expressions for the partial derivatives with respect to ``$\cdot$'',
a place holder for the hyperparameter(s) of interest, e.g.,
$(\theta_1, \dots, \theta_k)$ and $\tau^2$. Unfortunately these cannot be set
to zero and solved analytically, but it can be done via numerical methods.
To illustrate and expose the essence of the implementation automated in
the \pkg{hetGP} package (but for more involved settings discussed
momentarily), we provide the following hand-coded example. The code block
below implements the negative log-likelihood for parameters \code{par}, whose
first \code{ncol(X)} settings correspond to the lengthscales
$\boldsymbol{\theta}$, followed by the nugget $\tau^2$.
<>=
library("hetGP")
nLL <- function(par, X, Y) {
theta <- par[1:ncol(X)]
tau2 <- par[ncol(X) + 1]
n <- length(Y)
K <- cov_gen(X1 = X, theta = theta) + diag(tau2, n)
Ki <- solve(K)
ldetK <- determinant(K, logarithm = TRUE)$modulus
(n / 2) * log(t(Y) %*% Ki %*% Y) + (1 / 2) * ldetK
}
@
The gradient is provided by the code below.
<>=
gnLL <- function(par, X, Y) {
n <- length(Y)
theta <- par[1:ncol(X)]; tau2 <- par[ncol(X) + 1]
K <- cov_gen(X1 = X, theta = theta) + diag(tau2, n)
Ki <- solve(K); KiY <- Ki %*% Y
dlltheta <- rep(NA, length(theta))
for(k in 1:length(dlltheta)) {
dotK <- K * as.matrix(dist(X[, k]))^2 / (theta[k]^2)
dlltheta[k] <- n * t(KiY) %*% dotK %*% KiY / (t(Y) %*% KiY) -
sum(diag(Ki %*% dotK))
}
dlltau2 <- n * t(KiY) %*% KiY / (t(Y) %*% KiY) - sum(diag(Ki))
-c(dlltheta / 2, dlltau2 / 2)
}
@
Consider a $d=2$-dimensional input space and responses observed with noise,
coded to the unit square. The data-generating function coded below was
introduced by \citep{gra:lee:2009} to illustrate challenges in GP regression
when the response surface is nonstationary. (Heteroskedasticity is a form of
nonstationarity; and although here the nonstationarity is in the mean,
\citet[][see Appendix C]{hetGP1} showed that nevertheless \pkg{hetGP} methods
can offer an appropriate quantification of predictive uncertainty on these
data.) In order to help separate signal from noise, degree 2 replication is
used. Replication is an important focus of this manuscript, with further
discussion in Section \ref{sec:rep}. Initial designs are Latin hypercube
samples, generated by using the \pkg{lhs} package \citep{Carnell2018}.
<>=
library("lhs")
X <- 6 * randomLHS(40, 2) - 2
X <- rbind(X, X)
y <- X[, 1] * exp(-X[, 1]^2 - X[, 2]^2) + rnorm(nrow(X), sd = 0.01)
@
We use the L-BFGS-B method (via \code{optim()} in \proglang{R}), which supports bound constraints,
to estimate hyperparameters, plugging in our negative
log-likelihood (for minimizing) and gradient.
For all
parameters, we choose a small (nearly zero) lower bound. For the
lengthscales the upper bound is 10, which is much longer than the square of
the longest distance in the unit square ($\sqrt{2}$); and for the nugget
$\tau^2$ we chose the marginal variance.
<>=
Lwr <- sqrt(.Machine$double.eps); Upr <- 10
out <- optim(c(rep(0.1, 2), 0.1 * var(y)), nLL, gnLL, method = "L-BFGS-B",
lower = Lwr, upper = c(rep(Upr, 2), var(y)), X = X, Y = y)
out$par
@
Apparently, the lengthscale in $x_2$ ($\theta_2$) is about $2\times$ longer
than for $x_1$ ($\theta_1$). Interpreting these estimated quantities is
challenging and often not a primary focus of inference. What is important is
the connection to the predictive quantities. Deriving those equations requires
rebuilding the covariance structure with estimated hyperparameters,
decomposing the covariance structure, and calculating $\hat{\nu}$.
<>=
Ki <- solve(cov_gen(X, theta = out$par[1:2]) + diag(out$par[3], nrow(X)))
nuhat <- drop(t(y) %*% Ki %*% y / nrow(X))
@
Actual prediction requires a set of test inputs. Below a
regular grid in two dimensions is used.
<>=
xx <- seq(-2, 4, length = 40)
XX <- as.matrix(expand.grid(xx, xx))
@
The code below extends that covariance structure to the test set, both
between themselves and with the training set.
<>=
KXX <- cov_gen(XX, theta = out$par[1:2]) + diag(out$par[3], nrow(XX))
KX <- cov_gen(XX, X, theta = out$par[1:2])
mup <- KX %*% Ki %*% y
Sigmap <- nuhat * (KXX - KX %*% Ki %*% t(KX))
@
Using those quantities,
Figure \ref{fig:exp2d} illustrates the resulting predictive surface. The color
palettes used throughout the document come from the \pkg{colorspace} package
on CRAN \citep{Zeileis2009,Ihaka2019}.
<>=
library("colorspace")
sdp <- sqrt(diag(Sigmap))
par(mfrow = c(1,2))
cols <- sequential_hcl(palette = "Viridis", n = 128, l = c(40, 90))
persp(xx, xx, matrix(mup, ncol = length(xx)), theta = -30, phi = 30,
main = "mean surface", xlab = "x1", ylab = "x2", zlab = "y")
image(xx, xx, matrix(sdp, ncol = length(xx)), main = "variance",
xlab = "x1", ylab = "x2", col = cols)
points(X[, 1], X[, 2])
@
A characteristic feature of GP predictive or kriging surfaces is that the
predictive variance is lower at the training data sites and higher elsewhere.
This is exemplified by the darker/indigoer colors near the open circles in the
right panel, and lighter/yellower colors elsewhere. In 1D applications the
error bars derived from predictive mean and variance surfaces take on a
sausage shape, being fat away from the training data sites and thin, or
``pinched,'' at the training locations.
Many libraries, such as those cited in our introduction, offer automations of
these procedures. In this manuscript we highlight features of the
\pkg{hetGP} package, which offers similar calculations as a special case.
The code below provides an illustration.
<>=
fit <- mleHomGP(X, y, rep(Lwr, 2), rep(Upr, 2), known = list(beta0 = 0),
init = c(list(theta = rep(0.1, 2), g = 0.1 * var(y))))
c(fit$theta, fit$g)
@
These estimated parameters are nearly identical to the ones above, obtained
``by hand''. Arguments \code{lower} and \code{upper} (3rd and 4th arguments)
as well as \code{init} of \code{mleHomGP} are used to define the same
optimization problem as above. By default \code{mleHomGP} estimates a constant
trend $\beta_0$, while it is fixed with \code{known} here in order to
facilitate an apples-to-apples comparison. There are subtle differences
between the objective and optimization being performed, which accounts for the
slight differences in higher significant digits. The
\pkg{hetGP} package offers an S3
\code{predict} method that can facilitate prediction exactly in the manner
illustrated above; but rather than duplicate Figure \ref{fig:exp2d} here, we
shall delay an illustration until later.
So to conclude this warm-up, GPs are relatively simple to specify, and
inference involves a few dozen lines of code to define an objective
(log-likelihood) and its gradient, and a library routine for optimization.
Where do the challenges lie? Increasingly, data in geostatistics, machine
learning, and computer simulation experiments involve signal-to-noise ratios
that may be low and/or possibly changing over the input space. Stochastic
(computer) simulators, from physics, business, and epidemiology, may exhibit
both of those features simultaneously, but let's start with the first one
first. With noisy processes, more samples are needed to isolate the signal,
hindering the practical use of GPs.
Evident in the equations above is the need to decompose an $N \times N$ matrix in order to
obtain $\KN^{-1}$ and $|\KN|$, at $\mathcal{O}(N^3)$ cost. What can be done?
\subsection{Speedup from replication}
\label{sec:rep}
Replication can be a powerful device for separating signal from noise and can
yield computational savings as well. If $\yu = (\bar{y}_1, \dots,
\bar{y}_n)^\top$ collects averages of $a_i$ replicates $\left(y_i^{(1)}, \dots, y_i^{(a_i)}\right)$ from $n \ll N$ unique
locations $\bar{\x}_i$,
\[
\bar{y}_i = \frac{1}{a_i} \sum_{j=1}^{a_i} y_i^{(j)} \quad \mbox{and similarly for the variances} \quad
\hat{\sigma}^2_i = \frac{1}{a_i - 1} \sum_{j=1}^{a_i} (y_i^{(j)} - \bar{y}_i)^2,
\]
then the ``unique-$n$'' predictive equations are a BLUP:
\begin{align}
\mu_n(\x) &= \nu \veck_n(\x)^\top (\nu \Cn + \Sn)^{-1} \yu \label{eq:mu} \\
\sigma_n^2(\x) &= \nu K_\theta(\x,\x) - \nu^2 \veck_n(\x)^\top(\nu \Cn + \Sn)^{-1} \veck_n(\x), \label{eq:s2}
\end{align}
where $\veck_n(\x) = (K_\theta(\x, \xu_1), \dots, K_\theta(\x,
\xu_n))^\top$, $\Sn = [\hat{\boldsymbol{\sigma}}_{1:n}^2] \An^{-1} =
\mathrm{Diag}(\hat{\sigma}_1^2/a_1, \dots, \hat{\sigma}_n^2/a_n)$, $\Cn =
\left(K_\theta(\xu_i, \xu_j)_{1 \leq i, j \leq n} \right)$, and $a_i \gg 1$.
This is the basis of the stochastic kriging (SK) predictor
\citep{Ankenman2010}, which is implemented as an option in the
\pkg{DiceKriging} and
\pkg{mleGP} packages on CRAN. The simplicity of this setup is attractive.
Basically, an independent moments-based estimate of variance is used in lieu
of the more traditional, likelihood-based (hyperparametric) alternative. This
could be advantageous if the variance is changing throughout the input space,
as we shall discuss further in Section \ref{sec:hetGP}. Computationally, the
advantage is readily apparent. Only $\mathcal{O}(n^3)$ matrix decompositions
are required, which could represent huge savings compared with
$\mathcal{O}(N^3)$ if the degree of replication is high.
However, independent calculations also have their drawbacks, e.g., lacking the
ability to specify a priori that variance evolves smoothly over the input
space. More fundamentally, the numbers of replicates $a_i$ must be relatively
large in order for the $\hat{\sigma}^2_i$ values to be reliable.
\citet{Ankenman2010} recommend $a_i
\geq 10$ for all $i$, which can be prohibitive. Thinking more
homoskedastically, so as not to get too far ahead of our Section
\ref{sec:hetGP} discussions, the problem with this setup is that it does not
emit a likelihood for inference for the other hyperparameters, such as
lengthscale $\boldsymbol{\theta}$ and scale $\nu$. This is because the $\Sn$,
$\bar{y}_i$ and $a_i$ values do not constitute a set of sufficient statistics,
although they are close to being so.
The fix involves Woodbury linear algebra identities. Although
these are not unfamiliar to the spatial modeling community \citep[see,
e.g.,][]{opsomer:etal:1999,Banerjee2008,ng:yin:2012}, we believe they have not
been used toward precisely this end until recently \citep{hetGP1}. Here, the
goal is to make the SK idea simultaneously more general and more
prescriptive, facilitating full likelihood-based inference. Specifically, the
Woodbury identities give
\begin{align*}
\KN^{-1} = (\mathbf{U} \Cn \mathbf{V}^\top + \mathbf{D})^{-1} &=
\mathbf{D}^{-1} - \mathbf{D}^{-1} \mathbf{U} (\Cn^{-1}
+ \mathbf{V} \mathbf{D}^{-1} \mathbf{U})^{-1} \mathbf{V} \mathbf{D}^{-1} \\
|\KN| = |\mathbf{D} + \mathbf{U} \Cn \mathbf{V}|
&= |\Cn^{-1} + \mathbf{V} \mathbf{D}^{-1} \mathbf{U} |
\times |\Cn| \times |\mathbf{D}|%.
\end{align*}
with
$\mathbf{U} = \mathbf{V}^\top =
\mathrm{Diag}(\ones_{a_1,1},
\dots, \ones_{a_n,1})$ where $\ones_{k, l}$ is a $k \times l$ matrix of ones, $\mathbf{D} =
\tau^2 \mathbb{I}_N = \tau^2 \ones_{N,N}$ (or later $\mathbf{D} = \LaN$ in the heteroskedastic
setting with a diagonal matrix of variances $\LaN$).
Not only is storage of $\Kn$ and $\mathbf{U}$, which may be represented sparsely (or
even implicitly), more compact than $\KN$, but the Woodbury formulas show how
to calculate the requisite inverse and determinants by acting on
$\mathcal{O}(n^2)$ quantities rather than $\mathcal{O}(N^2)$. Pushing that
application of the Woodbury identities through to the predictive equations,
via the mapping $\nu \CN + \SN = \nu (\CN + \LaN) \equiv \nu (\CN + \tau^2 \mathbb{I}_N)$
between SK and our more conventional notation, yields the
equality of predictive identities: $\mu_n(\x) = \mu_N(\x)$ and $\sigma_n^2(\x) = \sigma_N^2(\x)$.
In words, the typical ``full-$N$'' predictive quantities may be calculated
identically via ``unique-$n$'' counterparts, with potentially dramatic savings
in computational time and space. The unique-$n$ predictor is unbiased and
minimizes MSPE by virtue of the fact that those properties hold for the
full-$N$ one. No asymptotic or frequentist arguments are required. Crucially,
no minimum data or numbers of replicates (e.g., $a_i \geq 10$ for SK) are
required, although replication can still be helpful from a statistical
efficiency perspective (see Section \ref{sec:imspe}).
The same trick can be played with the concentrated log-likelihood. Recall that
$\Un = \Cn + \An^{-1} \Lan$; where for now $\Lan = \tau^2
\mathbb{I}_n$. Later we shall generalize $\Lan$ for the heteroskedastic
setting. Using these quantities, we have
\begin{align}
\ell &= c + \frac{N}{2} \log \hat{\nu}_N - \frac{1}{2}
\sum_{i=1}^n [(a_i - 1) \log \lambda_i + \log a_i] - \frac{1}{2} \log | \Un | , \label{eq:ellw} \\
\mbox{where} \;\; \hat{\nu}_N
&= N^{-1} (\vecY^\top \LaN^{-1} \vecY - \vecYu^\top \An \Lan^{-1} \vecYu + \vecYu^\top \Un^{-1} \vecYu). \nonumber
\end{align}
Notice the additional terms in $\hat{\nu}_N$ compared with $\hat{\nu}_n :=
n^{-1} \vecYu^\top \Un^{-1} \vecYu$, comprising of the missing statistics for sufficiency. Since $\Lan$
is diagonal, evaluation of $\ell$ requires just $\mathcal{O}(n^3)$ operations,
which means hyperparameter inference can proceed far more computationally
efficiently than in typical setups. The derivative is available in
$\mathcal{O}(n^3)$ time, too, facilitating numerical schemes such as L-BFGS-B:
\begin{align}
\frac{\partial \ell}{\partial \cdot}
= \frac{1}{2\hat{\nu}_N} \frac{\partial ( \vecY^\top \LaN \vecY - \vecYu \An \Lan^{-1} \vecYu
+ n \hat{\nu}_n)}{\partial \cdot } - \frac{1}{2} \sum_{i=1}^n \left[(a_i - 1) \frac{\partial \log \lambda_i}{\partial \cdot} \right]
- \frac{1}{2} \mathrm{tr} \left(\Un^{-1} \frac{\partial \Un}{\partial \cdot} \right) \label{eq:dellw} .
\end{align}
The potential computational benefit of such a mapping, which can be several
orders of magnitude depending on the proportion of replication, has been
illustrated using \pkg{hetGP} library calls in Appendix A.2 of \citet{hetGP1},
and thus we shall not duplicate these here. However, we will remark that the
implementation, which is not detailed in that Appendix, is similar to that
outlined in Section \ref{sec:rev}: define negative log likelihood (\code{nLL})
and gradient (\code{gnLL}) via Equation~(\ref{eq:ellw}) and
Equation~(\ref{eq:dellw}), plug those into \code{optim} and, conditional on
hyperparameters thus inferred, apply predictive
Equations~(\ref{eq:mu}--\ref{eq:s2}). See \code{mleHomGP} and
\code{predict.homGP}, respectively. Under the hood of those user-focused
functions, you'll find three important subroutines (not all exported to the
namespace, but potentially worth inspecting):
\begin{itemize}
\item \code{find_reps}: pre-processing to find replicates and organize sufficient statistics.
\item \code{hetGP:::logLikHom}: implementing Equation~(\ref{eq:ellw}) using those sufficient
statistics (accessible with \code{logLikH});
\item \code{hetGP:::dlogLikHom}: implementing Equation~(\ref{eq:dellw}) similarly.
\end{itemize}
The latter two are careful to apply an optimized sequence of linear algebra
subroutines, and deploy custom \pkg{Rcpp} subroutines when vectorization
is not immediate. An example is the extraction of the diagonal elements
of a matrix product, required by the
derivative (\ref{eq:dellw}), implemented as \code{hetGP:::fast_diag}.
Besides the SK feature offered by the \pkg{mleGP} package, we are not aware of
any other software package, for \proglang{R} or otherwise, that automatically
preprocesses the data (e.g., via \code{find_reps}) into a form that
can exploit these Woodbury identities to speed calculations in the face of
heavy replication, or with comparable computational advantage. Replication is
a common feature in the sampling of noisy processes, and there is a
substantial computational benefit to handling this form of data efficiently.
\section{Implementing heteroskedastic modeling in hetGP}
\label{sec:hetGP}
The typical GP formulation, utilizing a covariance structure based on
Euclidean distance between inputs, emits a stationary process where features in
input--output relationships are identical throughout the input space.
Many data-generating mechanisms are at odds with the assumption of
stationarity. A process can diverge from stationarity (i.e., be nonstationary)
in various ways, but few are well accommodated by computationally viable
modeling methodology. Even fewer come with public software, such as \pkg{tgp}
\citep{Gramacy2007,Gramacy2010} and \pkg{laGP} \citep{Gramacy2016}. Both of
those packages offer a divide-and-conquer approach to accommodate disparate
covariance structures throughout the input space. Such a mechanism is
computationally thrifty, but it is not without drawbacks such as lack of
continuity of the resulting predictive surfaces.
Input-dependent variance, or heteroskedasticity, is a particular form of
nonstationarity that often arises in practice and that is increasingly
encountered in stochastic computer experiment settings. An example is the
so-called motorcycle accident data, available in \pkg{MASS}
\citep{Venables2002}. These data arise from a series of simulation experiments
measuring the acceleration on the helmet of a motorcycle rider before and
after an impact, triggering a whiplash-like effect. Actually, the motorcycle
data contains replicates; only about two-thirds of the inputs are unique.
Ordinary homoskedastic GPs are notoriously inadequate on this example.
Consider the following fit via \code{mleHomGP}, in comparison to the
heteroskedastic alternative \code{mleHetGP} described next, both with a
Mat\'ern covariance kernel.
<>=
library("MASS")
hom <- mleHomGP(mcycle$times, mcycle$accel, covtype = "Matern5_2")
het <- mleHetGP(mcycle$times, mcycle$accel, covtype = "Matern5_2")
@
The code below comprises of our first demonstration of the generic
\code{predict} method provided by the \pkg{hetGP} package, utilizing a dense
grid from the smallest to the largest values in the input space
(\code{times}).
<>=
Xgrid <- matrix(seq(0, 60, length = 301), ncol = 1)
p <- predict(x = Xgrid, object = hom)
p2 <- predict(x = Xgrid, object = het)
@
Figure \ref{fig:moto1} shows the resulting predictive surface overlaid on a
scatter plot of the data. The thick lines are the predictive mean, and the
thin lines trace out a 90\% predictive interval. The output from
\code{predict} separates variance in terms of mean (\code{p$sd2}) and residual
(nugget-based \code{p$nugs}) estimates, which we combine in the figure to show
the full predictive uncertainty of the response $Y(\x) \mid D_N$.
<>=
plot(mcycle, main = "Predictive Surface", ylim = c(-160, 90),
ylab = "acceleration (g)", xlab = "time (ms)")
lines(Xgrid, p$mean, col = 2, lwd = 2)
lines(Xgrid, qnorm(0.05, p$mean, sqrt(p$sd2 + p$nugs)), col = 2)
lines(Xgrid, qnorm(0.95, p$mean, sqrt(p$sd2 + p$nugs)), col = 2)
lines(Xgrid, p2$mean, col = 4, lwd = 2, lty = 4)
lines(Xgrid, qnorm(0.05, p$mean, sqrt(p2$sd2 + p2$nugs)), col = 4, lty = 4)
lines(Xgrid, qnorm(0.95, p$mean, sqrt(p2$sd2 + p2$nugs)), col = 4, lty = 4)
empSd <- sapply(find_reps(mcycle[, 1], mcycle[, 2])$Zlist, sd)
points(het$X0, het$Z0, pch = 20)
arrows(x0 = het$X0, y0 = qnorm(0.05, het$Z0, empSd),
y1 = qnorm(0.95, het$Z0, empSd), code = 3, angle = 90, length = 0.01)
@
Observe in the figure that the homoskedastic fit is less than desirable.
Specifically, uncertainty is largely overestimated across the first third of
\code{times}, from left to right. This corresponds to the pre-impact regime in the
experiment. Apparently, the predictive surface is overwhelmed by the latter
two-thirds of the data, comprising of the higher-variance whiplash regime.
Since the model assumes stationarity, a compromise must be made between these
two regimes, and the one with stronger support (more data, etc.) wins in this
instance.
Foreshadowing somewhat, consider the heteroskedastic fit also shown in the
figure. As a consequence of being able to better track the signal-to-noise
relationship over the input space, the heteroskedastic estimate of the signal
(particularly for the first third of \code{times}) is better.
Predictive uncertainty is highest for the middle third of the \code{times},
which makes sense because this is where the whiplash effect is most prominent.
The vertical error-bars plotted through each replicated training data point
indicate moment-based estimates of variance obtained from the limited number
of replicates in this example. (There are nowhere near enough for an SK-like
approach.)
The methodology behind \pkg{hetGP} was designed to cope with exactly this sort
of behavior, but in a more ambitious setting: larger experiments, in higher
dimension. If replication is an important tool in separating signal from
noise in homoskedastic settings, it is doubly so in the face of
input-dependent noise. Learning how the variance is changing is key to learning
about mean dynamics. Therefore, handling and prescribing degrees of
replication feature prominently in the methodology, and as a means of
remaining computationally thrifty in implementation and execution.
\subsection{Joint Gaussian process modeling}
\label{sec:joint}
SK, introduced in Section \ref{sec:rep}, accommodates a notion of in-sample
heteroskedasticity ``for free'' via independently calculated moments $\Sn =
\mathrm{Diag}(\hat{\sigma}^2_1, \dots, \hat{\sigma}^2_n)$, but that is useless
out of sample. By fitting each $\hat{\sigma}_i^2$ separately there is no
pooling effect for interpolation. \citet{Ankenman2010} suggest the quick fix of
fitting a second GP to the variance observations with ``data''
%\[
$(\bar{\x}_1, \hat{\sigma}_1^2), (\bar{\x}_2, \hat{\sigma}_2^2), \dots, (\bar{\x}_n, \hat{\sigma}_n^2)$
%\]
to obtain a smoothed variance for use out of sample.
A more satisfying approach from the machine learning literature
\citep{goldberg:williams:bishop:1998} involves introducing latent (log
variance) variables under a GP prior and performing joint inference of all
unknowns, including hyperparameters for both ``mean'' and ``noise'' GPs and
latent variances, via MCMC. The overall method, which is effectively on the
order of $\mathcal{O}(TN^4)$ for $T$ MCMC samples, is totally impractical but
works well on small problems. Several authors have economized on aspects of
this framework \citep{kersting:etal:2007,lazaro-gredilla:tsitas:2011} with
approximations and simplifications of various sorts, but none (to our
knowledge) have resulted in public \proglang{R} software.\footnote{A partial
implementation is available for \proglang{Python} via
\pkg{GPy}: \url{https://sheffieldml.github.io/GPy/}.} The key ingredient in
these works, of latent variance quantities smoothed by a GP, has merits and
can be effective when handled gingerly. The methods implemented in
\pkg{hetGP} are built around the following methodology.
Let $\delta_1, \delta_2, \dots, \delta_n$ be latent variance variables (or
equivalently latent nuggets), each corresponding to one of the $n$ unique
design sites $\bar{\x}_i$ under study. It is important to introduce latents
only for the unique-$n$ locations. A similar approach on the full-$N$ setup,
that is, without exploiting the Woodbury identities, is fraught with
identifiability and numerical stability challenges. Store these latents
diagonally in a matrix $\Deltan$, and place them under a GP prior with mean
$\beta_0,$
\[
\Deltan \sim \mathcal{N}_n(\beta_0, \nu_{(g)} (\Cg + g \An^{-1})),
\]
for the purpose of spatial smoothing, with lengthscales $\boldsymbol{\theta}_{(g)}$.
Then smooth $\lambda_i$ values can be
calculated by plugging the above $\Deltan$ quantities into the mean
predictive equations.
\begin{equation}
\Lan = \Cg (\Cg + g \An^{-1})^{-1} (\Deltan - \beta_0 \mathbb{I}_n) =: \Cg \Ug^{-1} (\Deltan - \beta_0 \mathbb{I}_n)
\label{eq:smooth}
\end{equation}
Smoothly varying $\Lan$ generalize $\Lan = \tau^2 \mathbb{I}_n$ from our
earlier homoskedastic setup, when describing our Woodbury shortcuts under
replication in Section \ref{sec:rep}. They also generalize the MoM estimated $\Sn$ from SK.
A nugget parameter $g$ controls the
smoothness of $\lambda_i$'s relative to the $\delta_i$'s. A nonzero mean is
preferable for this second GP since the predictions tend to revert to this
mean far from the observations. Instead of estimating this additional
hyperparameter or asking the user for a value, we found it better to use the
classical plugin estimator of the mean for a GP, that is, $\hat{\beta}_0 =
\Deltan^\top
\Ug^{-1} \Deltan (\ones_n^\top \Ug^{-1} \ones_n)^{-1}$.
Variances must be positive, and the equations above give nonzero probability
to negative $\delta_i$ and $\lambda_i$ values. One solution is to threshold
values at zero. Another is to model $\log(\Deltan)$ in this way instead, as
originally described by \citet{hetGP1}. Differences in the development are
slight. Here we favor the simpler, more direct version, in part to offer a
complementary presentation to the one in that paper.
Rather than Goldberg's MCMC, \citet{hetGP1} describe how to stay within a
(Woodbury) MLE framework, by defining a joint log-likelihood over both mean
and variance GPs:
\begin{align}
\label{eq:ell}
\tilde{\ell} =
c - \frac{N}{2} \log \hat{\nu}^2_N - \frac{1}{2} \sum\limits_{i=1}^n \left[(a_i - 1)\log \lambda_i
+ \log a_i \right] - \frac{1}{2} \log |\Un|
& - \frac{n}{2} \log \hat{\nu}_{(g)} - \frac{1}{2} \log |\Ug|.
\end{align}
Maximization may be facilitated via closed-form derivatives with respect to
all unknown quantities, all in $\mathcal{O}(n^3)$ time. For example, the
derivative with respect to the latent $\Deltan$ values follows
\begin{align*}
\frac{\partial \tilde{\ell}}{\partial \Deltan} &= \frac{\partial \Lan}{\partial \Deltan}
\frac{\partial \tilde{\ell}}{\partial \Lan} = \Cg \Ug^{-1} \frac{\partial \ell}{\partial \Lan} - \frac{\Ug^{-1}\Deltan}{\hat{\nu}_{(g)}}\\
\mbox{ where } \quad \frac{\partial \ell}{\partial \lambda_i} &= \frac{N}{2}
\times \frac{\frac{ (a_i - 1) \hat{\sigma}_i^2} {\lambda_i^2} + \frac{(\Un^{-1} \bar{\vecY}_n)^2_i}{a_i}}{\hat{\nu}_N}
- \frac{a_i - 1}{2\lambda_i} - \frac{1}{2a_i} (\Un)_{i, i}^{-1}.
\end{align*}
Just like in Section \ref{sec:gp}, implementation is fairly straightforward
once this likelihood and derivative are coded. Wrapper \code{mleHetGP} feeds
lower-level objective \code{hetGP:::logLikHet} and derivative
\code{hetGP:::dlogLikeHet} into \code{optim} after finding replicates with
\code{find_reps}.
\citet{hetGP1} show that $\tilde{\ell}$ is maximized when $\Deltan =
\Lan$ and $g=0$. In other words, smoothing the latent noise variables
(\ref{eq:smooth}) is unnecessary at the MLE. Optimizing the objective
naturally smooths the latent $\Deltan$ values, leading to GP predictions that
interpolate those smoothed values. However, smoothing is useful as a
device in three ways: (1) theoretically: connecting SK to Goldberg's latent
representation; (2) numerically: annealing to avoid local minima; and (3)
practically: yielding a smooth solution when the numerical optimizer is
stopped prematurely, which may be essential in big data (large unique-$n$)
contexts.
\subsection[Student-t variants]{Student-$t$ variants}
\label{sec:student}
Student-$t$ processes generalize GPs, keeping most of their benefits at almost
no extra cost, offering an improved robustness to outliers and larger tail
noise. Several choices exist in the literature; see, for example, the work of
\cite{Wang2017}. For implementation in \pkg{hetGP}, pairing with Woodbury
and input-dependent noise features \cite{Chung2018}, we follow the
methodology described by \cite{Shah2014}. Briefly, assuming a multivariate-$t$
prior, $\mathbf{Y}_N \sim
\mathcal{T}_N(\alpha, 0,
\KN))$ with $\alpha \in (2, \infty]$ being the additional degrees of freedom
parameter, the modified predictive distribution is multivariate-$t$,
\begin{equation}
\label{eq:predTP}
\mathbf{Y}_N(\x) | D_N \sim \mathcal{T}_N \left( \alpha + N, \mu_N(\x), \frac{\alpha + \beta - 2}{\alpha + N -2} \sigma_N^2(\x) \right)
\end{equation}
with $\beta = \vecY_N^\top \KN^{-1} \vecY_N$.
The corresponding likelihood has a closed form:
$$\log(L) = -\frac{N}{2} \log((\alpha - 2) \pi) - \frac{1}{2}\log|\KN| + \log \left(\frac{\Gamma \left(\frac{\alpha+N}{2}\right)}{\Gamma \left(\frac{\alpha}{2} \right)} \right) - \frac{(\alpha + N)}{2} \log \left( 1 + \frac{\beta}{\alpha - 2} \right),$$
and so does its derivatives. These are implemented by \code{mleHetTP} with
\code{hetGP:::logLikHetTP} and \code{hetGP:::dlogLikHetTP} under the hood, again with
similar \code{optim} calls. One additional mathematical detail is worth
noting, namely that we employ a different parameterization of $\KN = \sigma^2
\mathbf{C}_N + \tau^2 \mathbb{I}_N$ in this case, because the plugin value of
$\hat{\nu}$ does not apply here. But this parameterization of the covariance
enables an SK variant of this model (similar to that of
\cite{Xie2017}), by setting \code{log = FALSE} in
\code{settings} and giving the empirical variance estimates in
\code{Delta} with \code{known} (or \code{init} for initialization) in
\code{mleHetTP}.
We note that while TPs are more flexible that GPs, as $N$ goes to infinity
they become equivalent. In addition, the estimation of the parameter $\alpha$
can become unreliable \citep[see, e.g.,][]{Wang2017}, such that pre-specifying
it at a low value may be beneficial.
\subsection{Package structure and implementation details}
\label{sec:implement}
Although several implementation details have been provided already, the
content here is intended to give further insight into how \pkg{hetGP} works,
including pointers to additional features such as fast updates as new data
arrive.
\subsubsection{Package structure}
The main functions in \pkg{hetGP} are either related to GP fitting or their
use in sequential design tasks, which are the subject of Section
\ref{sec:seqdes}. Inferential interface functions \code{mleHomGP},
\code{mleHetGP}, \code{mleHomTP} and \code{mleHetTP} return
homoskedastic/heteroskedastic GP/TP models in the form of \code{S3} objects.
They are completed with \code{S3} methods:
\begin{itemize}
\item \code{plot}, \code{print}, \code{predict};
\item \code{update} to add one or several new observations, processing them to identify replicates.
Warm-started re-estimation of the hyperparameters can be invoked with argument \code{maxit} > 0.
\item \code{strip} to save a bare representation of the model by removing all
pre-computed quantities that enable fast prediction like inverse covariance
matrices. Conversely, the \code{rebuild} method recomputes all missing
quantities. Note that Cholesky decomposition is used in the hyperparameter
optimization for speed, while the generalized inverse via \pkg{MASS}'s
\code{ginv} can optionally be used when rebuilding. The extra stability
offered by \code{ginv} can be important when covariance matrix condition
numbers are small; however at the cost of a more expensive decomposition
computationally.
\end{itemize}
Section \ref{sec:seqdes} sequential design tasks can be carried out based on
any of the model types with \code{crit_IMSPE}, targeting for global predictive
accuracy, or \code{crit_optim} for Bayesian optimization and contour finding.
Based on an initial surrogate model fit, both functions provide the next point
to be evaluated according to the selected criterion. We discuss their options
as well as additional details and functions related to those tasks in due
course.
Data sets are also shipped with the package: the assemble to order \code{ato}
and Bayes factor \code{bfs} data, with examples coming soon in Section
\ref{sec:illus}.
\subsubsection{Focus on hyperparameter tuning}
Three kernel families are available in \pkg{hetGP}, parameterized as follows (univariate form):
\begin{itemize}
\item Gaussian: $k_\mathrm{G}(x, x') = \exp \left(-\frac{(x - x')^2}{\theta} \right)$;
\item Mat\'ern with smoothness parameter $3/2$:
$$k_{\mathrm{M32}}(x, x') = \left(1 + \frac{\sqrt{3} |x - x'|}{\theta} \right)
\exp \left(-\frac{ \sqrt{3} |x - x'|}{\theta} \right);$$
\item Mat\'ern with smoothness parameter $5/2$:
$$k_{\mathrm{M52}}(x, x') = \left(1 + \frac{\sqrt{5} |x - x'|}{\theta} + \frac{5 (x -x')^2}{3 \theta^2} \right)
\exp \left(-\frac{ \sqrt{5} |x - x'|}{\theta} \right).$$
\end{itemize}
In \pkg{hetGP} they are referred to respectively as \code{Gaussian}, \code{Matern3\_2}, and
\code{Matern5_2} and may be specified through the \code{covtype} argument.
Multivariate kernels are defined simply as the product of univariate ones,
with one lengthscale per dimension. Isotropic versions, i.e., with the same
lengthscale parameter in each dimension, can be obtained by providing scalar
values of the bound arguments \code{upper} and \code{lower}.
Selecting an appropriate range for lengthscale hyperparameter bounds (with
\code{lower} and \code{upper}) is difficult: specifications that are too small
lead to basically no learning, while overly large values result in
oscillations and matrix conditioning issues. Rules of thumb may be conditioned
on empirical inter-training point design distance distributions or other a
priori considerations. To automate the process of choosing these bounds,
\pkg{hetGP} offers the following as a pre-processing step for the coded
$[0,1]^d$ input domain unless defaults are manually over-ridden. Lengthscale
lower bounds are chosen such that the correlation for a distance between any two
points equal to the 5\% quantile on distances is at least 1\%. Likewise, an
upper bound is defined such that the correlation between two points at the
95\% quantile distance does not exceed 50\%. These values are further
rescaled if the domain is not $[0,1]^d$. Unless provided, the initial value
for $g$ in a homoskedastic model is based on the variance at replicates if
there are any; otherwise, it is set to $0.1$.
By default, GP (and not TP) models estimate an unknown mean (unless provided by the
user with \code{beta0} via the \code{known} argument). In the computer experiments literature
this setup is known as \emph{ordinary kriging}, as opposed to \emph{simple
kriging} with a zero mean. The plugin value of the mean, via MLE calculation,
may be derived as $\hat{\beta_0} = \frac{\mathbf{1}_N^\top
\KN^{-1} \vecY_N}{\mathbf{1}_N^\top \KN^{-1} \mathbf{1}_N} =
\frac{\mathbf{1}_n^\top \An^{-1} \Kn^{-1} \vecY_n}{\mathbf{1}_n^\top \Kn^{-1}
\mathbf{1}_n}$. Predictive equations may by modified as follows
\begin{align*}
\mu_{n, OK}(\x) &= \hat{\beta_0} + k_n(\x)^\top \Kn^{-1} (\vecY_n - \hat{\beta_0}\mathbf{1}_n), \\
\sigma^2_{n, OK}(\x) &= \sigma^2_{n}(\x) + \frac{(1 - k_n(\x)^\top
\Kn^{-1} \mathbf{1}_n)^2}{\mathbf{1}_n^\top \Kn^{-1} \mathbf{1}_n}.
\end{align*}
In particular, utilizing $\hat{\beta_0}$ results in an extra term in the
predictive variance.
The heteroskedastic setup entertained in \pkg{hetGP} relies on several key
points: careful initialization of the latent variables, linking of lengthscale
hyperparameters between latent GP and main GP, and safeguards on the variance
GP component of the likelihood, namely the variance smoothness
regularization/penalty term. A default initialization procedure, provided in
Algorithm \ref{alg:initGP} and invoked via \code{settings$initStrategy =
"residual"}, utilizes known or initial values of hyperparameters that can be
passed as a list to \code{known} and \code{init}, with elements \code{theta}
for the main GP lengthscale, \code{theta\_g} for the latent GP ones,
\code{g\_H} the nugget parameter for an homoskedastic GP (i.e., $\tau^2$),
\code{g} the smoothing parameter (that ultimately goes to zero), and the
latent variables \code{Delta}. Bounds for the hyperparameters of the noise
process can be passed with \code{noiseControl} as a list.
\begin{algorithm}[ht!]
\caption{Pseudo code for the default hyperparameter initialization procedure in \code{mleHetGP}.}
\label{alg:initGP}
\begin{algorithmic}[1]
\Require $D_N$, plus, optionally, initial values of $\boldsymbol{\theta}$, $\boldsymbol{\theta}_{(g)}$, $g_\mathrm{Hom}$, $g$, $\boldsymbol{\Delta}$.
\If {Initial $\boldsymbol{\theta}$ not provided}
\State $\boldsymbol{\theta} = \sqrt{\boldsymbol{\theta}_{\max} \boldsymbol{\theta}_{\min}}$.
\EndIf
\If {$g_\mathrm{Hom}$ is not provided}
\If{any $a_i$ > 5}
\State $g_\mathrm{Hom} \leftarrow \frac{1}{\VAR(\vecY_N)} \times \left( \sum \limits_{i = 1}^n \delta_{a_i > 5} \hat{\sigma}^2_i \right) / \left( \sum \limits_{i = 1}^n \delta_{a_i > 5} \right)$
\Else
\State $g_\mathrm{Hom} \leftarrow 0.05$
\EndIf
\EndIf
\State Apply \code{mleHomGP} on $D_N$ with $\boldsymbol{\theta}$ and $g_\mathrm{Hom}$, obtain or update $\boldsymbol{\theta}$, $g_\mathrm{Hom}$ and $\hat{\nu}_\mathrm{Hom}$.
\If {$\boldsymbol{\Delta}$ not provided}
\State Compute residuals from homoskedastic fit:
$$\delta_i = \frac{1}{a_i \hat{\nu}_\mathrm{Hom}} \sum \limits_{j = 1}^{a_i} \left(\mu(\x_i) - y_i^{(j)} \right)^2 \quad i=1,\dots,n.$$
\EndIf
\If {$\boldsymbol{\theta}_{(g)}$ or $g$ not provided}
\State Fit homoskedastic GP on $(\x_i, \delta_i)_{1 \leq i \leq n}$ with $\boldsymbol{\theta}$, obtain $\boldsymbol{\theta}_{g}$, $g$.
\EndIf
\State Fit heteroskedastic GP on $D_N$ with initial hyperparameters $\boldsymbol{\theta}_\mathrm{Hom}$, $\boldsymbol{\theta}_{(g)}$, $g$ and $\boldsymbol{\Delta}$.
\end{algorithmic}
\end{algorithm}
To reduce the number of hyperparameters and thus ease maximization of the
joint log-likelihood, by default the lengthscale hyperparameters of the latent
GP \code{theta\_g} are linked to the ones of the main GP \code{theta} by a
scalar \code{k\_theta\_g} in $[1, 100]$. If this assumption that the noise
variance is varying more smoothly than the mean is too limiting, as is the
case in one example below, linking between \code{theta} and
\code{theta\_g} can be removed with \code{settings$linkThetas = "none"}.
Another option is to use \code{settings$linkThetas = "constr"}, to use
$\boldsymbol{\theta}$ estimated in step 11 of Algorithm \ref{alg:initGP} as a
lower bound for \code{theta\_g}.
We have found that the initialization described above is sufficient for
the majority of cases. But there are some pitfalls related to the joint
likelihood objective in Equation~(\ref{eq:ell}). In fact, this setup may be
seen as a bi-objective optimization problem, giving a set of compromise
solutions between the log-likelihoods of the main and latent GPs, with a set
of Pareto optimal solutions. However, that perspective would require selecting
a solution afterwards, putting a human in the loop. If equal weights between
objectives are used, as we do here, a solution on nonconvex parts of the
Pareto front may not exist or may be impossible to find. Consequently, the
solution returned corresponds to high likelihood for the latent GP and low
likelihood for the mean GP. We usually observe this behavior when there is not
much heteroskedasticity in the data.
To circumvent this undesirable behavior, we observe that our goal here is
primarily to improve upon a homoskedastic fit of the data. This is enforced in
two ways. First, if the likelihood of the main GP (without penalty) is lower
than that of its homoskedastic counterpart, the penalty may safely be dropped
from the objective. From the bi-objective point of view, this amounts to
putting a constraint on the mean GP likelihood. Second, at the end of the
joint likelihood optimization, if the likelihood of the mean GP with
heteroskedastic noise is not better than that of the homoskedastic one, then
the homoskedastic model can be returned. This latter check can be deactivated
with \code{settings$checkHom = FALSE}. We also provide the \code{compareGP}
function to compare two models on the same data based on their likelihood.
\paragraph{Updating an existing model.}
In sequential design tasks, data are coming point by point, or in batches.
Instead of re-estimating all hyperparameters every time this happens, the
\code{update} function is engineered to re-use previous values as much as
possible. Specifically, initializing $\boldsymbol{\Delta}_{n+1}$ conditions upon
previous $\Deltan$ values; it is also possible to make this more
robust by fusing $\lambda_{n+1}$'s prediction with the empirical variance estimation, as proposed by
\citet{hetGP2}, with \code{method = "mixed"}. Since \code{g} may already be
very small, we find that increasing it slightly with \code{ginit} when re-optimizing
(\code{maxit} > 0) can be beneficial as a means of perturbing solver
initialization and thus potentially avoiding repeated termination at a sub-optimal local minima.
The overall procedure is summarized in Algorithm \ref{alg:updateGP}.
\begin{algorithm}[ht!]
\caption{Pseudo code for the update of hyperparameters procedure in \code{mleHetGP}.}
\label{alg:updateGP}
\begin{algorithmic}[1]
\Require previous \code{hetGP} model, \code{Xnew}, \code{Znew}
\If{\code{Xnew} is a replicate}
\State (Optional) Use leave-one-out prediction to re-estimate $\delta_i$ and empirical estimate $\sigma_i$
\Else
\State Predict $\lambda_{n+1}$ based on previous ($n$) fit
\State (Optional) Combine $\lambda_{n+1}$ prediction with empirical estimate
\EndIf
\If{\code{maxit} > 0}
\State Fit heteroskedastic GP on $D_{N+1}$ with initial hyperparameters $\boldsymbol{\theta}_\mathrm{Hom}$, $\boldsymbol{\theta}_{(g)}$, $g$ and $\boldsymbol{\Delta}_{n+1}$.
\Else
\State Update internal quantities (e.g., inverse covariance matrices)
\EndIf
\end{algorithmic}
\end{algorithm}
\subsection{Illustrations}
\label{sec:illus}
Here we consider several examples in turn,
introducing three challenging real-world computer model simulation examples.
\paragraph{Susceptible, infected, recovered.}
\citet{hetGP1} consider a 2D problem arising from a simulation of the spread
of an epidemic in a susceptible, infected, recovered (SIR) setting, as
described by \citet{hu2015sequential}. A function generating the data for
standardized inputs in the unit square (corresponding to $S$ and $I$) is
provided by \code{sirEval} in the \pkg{hetGP} package. Consider a
space-filling design of size $n=200$ unique runs, with a random number of
replicates $a_i \in \{1,\dots, 100\}$, for $i=1,\dots,n$. The $x_1$
coordinate represents the initial number of infecteds $I_0$, and the $x_2$
coordinate the initial number of susceptibles $S_0$.
<>=
Xbar <- randomLHS(200, 2)
a <- sample(1:100, nrow(Xbar), replace = TRUE)
X <- matrix(NA, ncol = 2, nrow = sum(a))
nf <- 0
for(i in 1:nrow(Xbar)) {
X[(nf + 1):(nf + a[i]),] <- matrix(rep(Xbar[i,], a[i]), ncol = 2,
byrow = TRUE)
nf <- nf + a[i]
}
@
The result is a full data set with $N = \Sexpr{identity(nf)}$ runs. The code
below gathers our response, which is the expected number of infecteds at the
end of the simulation.
<>=
Y <- apply(X, 1, sirEval)
@
The code below fits a \code{hetGP} model. By default, lengthscales for the
variance GP are linked to those from the mean GP, requiring that the former be
a scalar multiple $k>1$ of the latter. That can be undone, however, as we do
below primarily for illustrative purposes.
<>=
fit <- mleHetGP(X, Y, covtype = "Matern5_2", lower = rep(0.05, 2),
upper = rep(10, 2), settings = list(linkThetas = "none"), maxit = 1e4)
@
Around $\Sexpr{signif(fit$time, 2)}$ seconds are needed to train the model.
<>=
xx <- seq(0, 1, length = 100)
XX <- as.matrix(expand.grid(xx, xx))
p <- predict(fit, XX)
@
Figure \ref{fig:sir} shows the resulting predictive surface on a dense 2D grid.
The left panel in the figure shows the predictive mean surface. The right
panel shows the predicted standard deviation. Text overlaid on the panels
indicates the location of the training data inputs and the number of
replicates observed thereon.
<>=
psd <- sqrt(p$sd2 + p$nugs)
par(mfrow = c(1, 2))
image(xx, xx, matrix(p$mean, 100), xlab = "S0", ylab = "I0", col = cols,
main = "Mean Infected")
text(Xbar, labels = a, cex = 0.75)
image(xx, xx, matrix(psd, 100), xlab = "S0", ylab = "I0", col = cols,
main = "SD Infected")
text(Xbar, labels = a, cex = 0.75)
@
\paragraph{Bayes factor surfaces.} Bayes factor (BF) calculation for model
selection is known to be sensitive to hyperparameter settings, which is
further complicated (and obscured) by Monte Carlo evaluation that interjects
a substantial source of noise. To study BF surfaces in such
settings, \citet{franck:gramacy:2018} propose treating expensive BF
calculations, via MCMC say, as a (stochastic) computer simulation experiment.
The idea is that BF calculations at a space-filling design in the (input)
hyperparameter space can be used to map out the space and better understand
the sensitivity of model selection to those settings.
As a simple warmup example, consider an experiment described by
\citet[][Section 3.3--3.4]{gramacy:pantaleo:2010} involving BF calculations to
determine whether data are leptokurtic (Student-$t$ errors) or not (simply
Gaussian) as a function of the prior parameterization on the Student-$t$
degrees of freedom parameter, which they took to be $\nu \sim
\mathrm{Exp}(\theta = 0.1)$. Their intention was to be diffuse, but
ultimately they lacked an appropriate framework for studying sensitivity to
this choice. \citet{franck:gramacy:2018} created a grid of hyperparameter
values in $\theta$, evenly spaced in $\log_{10}$ space from $10^{-3}$ to
$10^6$ spanning ``solidly Student-$t$'' (even Cauchy) to ``essentially
Gaussian'' in terms of the mean of the prior over $\nu$. For each $\theta_i$
on the grid they ran the RJ-MCMC to approximate
$\mathrm{BF}_{\mathrm{St}\mathcal{N}}$ by feeding in sample likelihood
evaluations provided by \pkg{monomvn}'s \citep{monomvn} \code{blasso} to
compute a BF, following a postprocessing scheme described by
\citet{jacq:polson:rossi:2004}. In order to understand the Monte Carlo
variability in those calculations, ten replicates of the BFs under each
hyperparameter setting were collected. Each
$\mathrm{BF}_{\mathrm{St}\mathcal{N}}$ evaluation, utilizing $T=100000$ MCMC
samples, takes about 36 minutes to obtain on a 4.2 GHz Intel Core i7
processor, leading to a total runtime of about 120 hours to collect all 200
values saved. The data are provided with the \pkg{hetGP} package.
<>=
data("bfs")
thetas <- matrix(bfs.exp$theta, ncol = 1)
bfs <- as.matrix(t(bfs.exp[, -1]))
@
For reasons that will become clear momentarily, \citet{franck:gramacy:2018}
fit a heteroskedastic Student-$t$ process, described briefly in Section
\ref{sec:student}. Even though they fit in log-log space, the process is
still heteroskedastic and has heavy tails. We take this occasion to remark
that data can be provided directly in the structure internally used in
\code{hetGP}, with unique designs, averages and degree of replication.
<>=
bfs1 <- mleHetTP(X = list(X0 = log10(thetas), Z0 = colMeans(log(bfs)),
mult = rep(nrow(bfs), ncol(bfs))), Z = log(as.numeric(bfs)),
lower = 10^(-4), upper = 5, covtype = "Matern5_2")
@
<>=
dx <- seq(0, 1, length = 100)
dx <- 10^(dx * 4 - 3)
p <- predict(bfs1, matrix(log10(dx), ncol = 1))
@
Predictive evaluations were then extracted on a grid in the input space.
The results are shown in Figure \ref{fig:visbf}.
In the figure, each open circle is a $\mathrm{BF}_{\mathrm{St}\mathcal{N}}$
evaluation, plotted in $\log_{10}-\log_e$ space. In the left panel of the
figure, in addition to showing full posterior uncertainty (on $Y(\x)|D_N$),
we highlight a key advantage of the heteroskedastic framework inferential
framework implemented \pkg{hetGP}: the ability to provide error bars on both
the prediction of the (latent random) mean (field) $f$. The right panel in
the figure shows output provided by the \code{plot} method, utilizing
leave-one-out formulas for GPs \citep{Dubrule1983}, here adapted for TPs.
Keeping the hyperparameters estimated on the entire data, it provides the
prediction at existing designs as if it was unevaluated. This can provide
useful information on the quality of the fit, especially on real data sets
where the Gaussian/Student process hypothesis may not be valid.
<>=
par(mfrow = c(1, 2))
matplot(log10(thetas), t(log(bfs)), col = 1, pch = 21, ylab = "log(bf)",
main = "Bayes factor surface")
lines(log10(dx), p$mean, lwd = 2, col = 2)
lines(log10(dx), p$mean + 2 * sqrt(p$sd2 + p$nugs), col = 2, lty = 2,
lwd = 2)
lines(log10(dx), p$mean + 2 * sqrt(p$sd2), col = 4, lty = 3, lwd = 2)
lines(log10(dx), p$mean - 2 * sqrt(p$sd2 + p$nugs), col = 2, lty = 2,
lwd = 2)
lines(log10(dx), p$mean - 2 * sqrt(p$sd2), col = 4, lty = 3, lwd = 2)
legend("topleft", c("hetTP mean", expression(paste("hetTP interval on Y(x)|", D[N])), "hetTP interval on f(x)"), col = c(2,2,4), lty = 1:3,
lwd = 2)
plot(bfs1)
par(mfrow = c(1,1))
@
Clearly the $\mathrm{BF}_{\mathrm{St}\mathcal{N}}$ surface is heteroskedastic,
even after log transform, and has heavy tails. The take-home message from
these plots is that the BF surface is extremely sensitive to
hyperparameterization. When $\theta$ is small, the Student-$t$ (BF below 1) is
essentially a foregone conclusion, whereas if $\theta$ is large, the Gaussian
(BF above 1) is. A seemingly innocuous hyperparameter setting is essentially
determining the outcome of a model selection enterprise.
Although the computational burden involved in this experiment, 120 hours, is
tolerable, extending the idea to higher dimensions is problematic. Suppose one
wished to entertain $\nu \sim \mathrm{Gamma}(\alpha,
\beta)$, where the $\alpha=1$ case reduces to $\nu \sim \mathrm{Exp}(\beta
\equiv \theta)$ above. Over a similarly dense hyperparameter grid, the
runtime would balloon to 100 days, which is clearly unreasonable. Instead it
makes sense to build a surrogate model from a more limited space-filling
design and use the resulting posterior predictive surface to understand
variability in BFs in the hyperparameter space. Responses are gathered on a
space-filling design in $\alpha \times \beta$-space, via a Latin hypercube
sample of size 80, using a recently updated version of the \pkg{monomvn}
library to accommodate the Gamma prior, with replicates obtained at each
input setting, for a total of 400 runs. The data are also provided by the
\code{bfs} data object in \pkg{hetGP}.
<>=
D <- as.matrix(bfs.gamma[, 1:2])
bfs <- as.matrix(t(bfs.gamma[, -(1:2)]))
@
A similar \code{hetTP} fit can be obtained with the following command.
<>=
bfs2 <- mleHetTP(X = list(X0 = log10(D), Z0 = colMeans(log(bfs)),
mult = rep(nrow(bfs), ncol(bfs))), Z = log(as.numeric(bfs)),
lower = rep(10^(-4), 2), upper = rep(5, 2), covtype = "Matern5_2")
@
<>=
DD <- as.matrix(expand.grid(dx, dx))
p <- predict(bfs2, log10(DD))
@
Figure \ref{fig:visbf2} shows the outcome of that experiment, with mean
surface shown on the left and standard deviation surface on the right. The
numbers overlaid on the figure are the average
$\mathrm{BF}_{\mathrm{St}\mathcal{N}}$ obtained for the five replicates at
each input location.
<>=
par(mfrow = c(1, 2))
mbfs <- colMeans(bfs)
image(log10(dx), log10(dx), t(matrix(p$mean, ncol=length(dx))),
col = cols, xlab = "log10 alpha", ylab = "log10 beta",
main = "mean log BF")
text(log10(D[, 2]), log10(D[, 1]), signif(log(mbfs), 2), cex = 0.75)
contour(log10(dx), log10(dx), t(matrix(p$mean, ncol = length(dx))),
levels = c(-5, -3, -1, 0, 1, 3, 5), add = TRUE, col = 4)
image(log10(dx), log10(dx), t(matrix(sqrt(p$sd2 + p$nugs),
ncol = length(dx))), col = cols, xlab = "log10 alpha",
ylab = "log10 beta", main = "sd log BF")
text(log10(D[, 2]), log10(D[, 1]), signif(apply(log(bfs), 2, sd), 2),
cex = 0.75)
@
The story here is much the same as before in terms of $\beta$, which maps to
$\theta$ in the earlier experiment. Near $\alpha = 1$ (i.e., $\log_{10}
\alpha = 0$) the equivalence is exact. The left panel shows that along that
slice one can get just about whatever conclusion one wants. Smaller $\alpha$
values tell a somewhat more nuanced story, however. A rather large range of
smaller $\alpha$ values leads to somewhat less sensitivity in the outcome
because of the $\beta$ hyperparameter setting, except when $\beta$ is quite
large. Apparently, having a small $\alpha$ setting is essential if the data
is going to have any influence on model selection via BF. The right panel
shows that the variance surface is indeed changing over the input space,
justifying the heteroskedastic surrogate.
Another example with \code{hetTP} is provided by \citet{Chung2018}, with data
augmentation based on a latent variable scheme to account for missing data and
to enforce a monotonicity property for solving a challenging class of inverse
problems motivated by an influenza example.
\paragraph{Assemble to order.}
The assemble-to-order (ATO) problem \citep{hong:nelson:2006} involves a
queuing simulation targeting inventory management scenarios. The setup is as
follows. A company manufactures $m$ products. Products are built from base
parts called items, some of which are ``key'' in that the product cannot be
built without them. If a random request comes in for a product that is
missing a key item, a replenishment order is executed, which is filled after a
random delay. Holding items in inventory is expensive, so there is a balance
between inventory costs and revenue. \citeauthor{hong:nelson:2006} built a
\proglang{MATLAB} simulator for this setup, which was
reimplemented by \citet{xie:frazier:chick:2012}. \citet{hetGP1} describe an
out-of-sample experiment based on this latter implementation in its default
setting (originally from \citeauthor{hong:nelson:2006}), specifying item cost
structure, product makeup (their items) and revenue, and distribution of demand
and replenishment time, under target stock vector inputs $b \in
\{0,1,\dots,20\}^8$ for eight items.
Here we provide code that can be used to replicate results from that
experiment, which involved a uniform design of size $n_{\mathrm{tot}}=2000$ in
8D space, with ten replicates for a total design size of
$N_{\mathrm{tot}}=20000$. The \proglang{R} code below loads in that data,
which is stored in the data object \code{ato}, provided with \pkg{hetGP}.
<>=
data("ato")
@
In order to create an out-of-sample test setting, random training--test
partitions were constructed by randomly selecting $n=1000$ unique locations
and a uniform number of replicates $a_i \in
\{1,\dots,10\}$ thereupon. The \code{ato} data object also contains one such
random partition, which is subsequently coded into the unit cube $[0,1]^8$.
Further detail is provided in the package documentation for the
\code{ato} object. Actually the object provides two test sets. One is a
genuine out-of-sample test set, where the test sites do not intersect
with any of the training sets. The other is replicate based, involving
replicates $10-a_i$ not selected for training. The training set is large,
which makes MLE calculations slow, so the \code{ato} object provides a fitted
model for comparison. In the examples section of the \code{ato}
documentation, code is provided to reproduce that fit or to create a new one
based on a new random training--test partition.
<>=
c(n = nrow(Xtrain), N = length(unlist(Ztrain)), time = out$time)
@
The main reason for storing these objects is to enable a fast illustration of
prediction and out-of-sample comparison, in particular to have something to
compare with a more thoughtful sequential design scheme outlined in Section
\ref{sec:seqdes}. The code below performs predictions at the held-out test
locations and then calculates a proper score
\citep[][Equation~(27)]{gneiting:raftery:2007} against the ten replicates
observed at each of those locations. Higher scores are better.
A function \code{scores} in \pkg{hetGP} computes scores and root mean squared error (RMSE).
<>=
sc <- scores(model = out, Xtest = Xtest, Ztest = Ztest)
@
A similar calculation for the held-out training replicates is shown
below. These are technically out of sample, but accuracy is higher since
training data was provided at these locations.
<>=
sc.out <- scores(model = out, Xtest = Xtrain.out, Ztest = Ztrain.out)
@
Combined, the two test sets provide a single score
calculated on the entire corpus of held-out data. As expected, the result is
a compromise between the two score statistics calculated earlier.
<>=
c(test = mean(sc), train = mean(sc.out), combined = mean(c(sc, sc.out)))
@
\citet{hetGP1} repeat that training-test partition 100 times to
produce score boxplots that are then compared with simpler (e.g.,
homoskedastic) GP-based approaches. We refer the reader to Figure 2
in that paper for more details. To make a long story short, fits
accommodating heteroskedasticity in the proper way---via coupled GPs and fully
likelihood-based inference---are superior to all other (computationally
tractable) ways entertained.
\section{Sequential design}
\label{sec:seqdes}
We have been using uniform or space-filling designs in our examples above, and
with replication. The thinking is that coverage in the input space is
sensible and that replication will help separate signal from noise (and yield
computational advantages). Model-based designs, such as maximum entropy and
related criteria, are almost never appropriate in this setting. Such designs,
since they condition on the model, are hyperparameter sensitive; and before
data are collected, we have little to go on to choose good settings for those
values. This situation is exacerbated in the heteroskedastic case,
requiring the setting of latent inputs and additional hyperparameters. A far
better approach is to take things one step at a time: start with a small
space-filling design (small $N$), perhaps with some replication, and choose
the next point, $\x_{N+1}$, by exploring its impact on the model fit, say
through the predictive equations. An interesting question, in such settings,
is how much such criteria would prefer to replicate (repeat existing inputs)
versus explore (try new locations).
In the classical GP setup for computer simulations, with low or no noise and
an assumption of stationarity (i.e., constant stochasticity), one can argue
that replication is of little or no value. When signal-to-noise ratios are
low and/or changing over the input space, however, intuition suggests that
some replication will be valuable. Until recently, however, it was not known
how such choices would manifest in typical sequential design or data
acquisition decisions. Of course, the details depend on the goal of modeling
and prediction. Below we consider two settings: (1) obtaining accurate
predictions globally (Section \ref{sec:imspe}) and (2) optimizing or targeting
particular aspects of the predictive distribution such as level sets or
contours (Section \ref{sec:opt}).
\subsection{IMSPE}
\label{sec:imspe}
\citet{hetGP2} studied the exploration vs.~replication question in the
context of improving predictive accuracy by means of sequential design. A common
criterion in that setting is the {\em integrated mean-square prediction error}
(IMSPE), which is just predictive variance averaged over the input space,
specifically,
\begin{equation}
I_{n+1} \equiv \mathrm{IMSPE}(\xu_1, \dots, \xu_n, \x_{n+1}) = \int_{\x \in D} \sigma^2_{n+1}(\x) \, d\x.
\label{eq:imspe}
\end{equation}
This formula is written in terms of unique-$n$ inputs for reasons that we
shall clarify shortly. It could, however, instead be phrased in terms of full-$N$
equations. The idea would be to choose the next design location, $\x_{N+1}$,
which could be a new unique location $(\xu_{n+1})$ or a repeat of an existing
one (i.e., one of $\xu_1,\dots, \xu_n$), by minimizing $I_{n+1}$. The presentation in
\citeauthor{hetGP2} is more careful, but also more cumbersome, with
the notation in this regard.
In general, the integral in Equation~(\ref{eq:imspe}) requires numerical
evaluation~\citep{seo00,gra:lee:2009,Gauthier2014,Gorodetsky2016,Pratola2017}.
For example, the \pkg{tgp} package (option \code{Ds2x = TRUE}) sums over a
reference set, as well as averaging over the posterior distribution of (treed)
Gaussian process parameterization. However, conditional on GP
hyperparameters, and when the study region $D$ is an easily integrable domain
such as a hyperrectangle, the requisite integration may be calculated in
closed form. Although examples exist in the literature
\citep[e.g.,][]{Ankenman2010,anagnos:gramacy:2013,Burnaev2015,Leatherman2017}
for particular cases (and approximations), we are not aware of examples
providing the level of specificity and generality in derivation, or
implementation in code as provided in \pkg{hetGP}.
Let $X$ be uniformly sampled in $D$. The essence is as follows.
\begin{align}
I_{n+1} & = \E \{ \sigma^2_{n+1}(X) \} = \nu \E \{K_\theta(X, X) - \veck_{n+1}(X)^\top \mathbf{K}_{n+1}^{-1} \veck_{n+1}(X)\} \label{eq:imspecf} \\
&= \nu \E \{K_\theta(X,X)\} - \nu \mathrm{tr}(\mathbf{K}_{n+1}^{-1} \mathbf{W}_{n+1}), \nonumber
\end{align}
where $W_{ij} = \int_{\x \in D} k(\x_i, \x) k(\x_j, \x)\, d\x$.
Closed forms for the $W_{ij}$ exist with $D$ being a
hyper-rectangle, say. \citet{hetGP2} provide forms for several popular
covariance functions, including the Mat\'ern, that are coded in \pkg{hetGP}.
In the case of the Gaussian, we quote as follows:
\[
W_{ij}= \prod_{k=1}^d \dfrac{\sqrt{2\pi \theta_k} }{4} \exp\left\{-\dfrac{(x_{i,k}-x_{j,k})^2}{2 \theta_k}\right\}
\left[\mathrm{erf}\left\{\dfrac{2-(x_{i,k}+x_{j,k})}{\sqrt{2 \theta_k}}\right\}+ \mathrm{erf}\left\{\dfrac{x_{i,k}+x_{j,k}}{\sqrt{ 2\theta_k}}\right\} \right].
\]
Having a closed form is handy for evaluation. Even better is that a closed
form exists for the gradient, which facilitates numerical optimization for
sequential design. We leave the details of that calculation to
\citeauthor{hetGP2}.
To investigate how replication can be favored by IMSPE in choosing
$\x_{n+1}$, consider the following setup. Let $r(\x) =
\VAR[ Y(\x) \mid f(\x)]$ denote a belief about the (otherwise zero mean and
iid) noise process. The form of $r(\x)$ can be arbitrary. Below we set up
two univariate examples that follow splines that agree at five ``knots.'' In
this illustration, the $x$-locations of those knots could represent design
locations $x_i$ where responses have been observed.
<>=
rn <- c(4.5, 5.5, 6.5, 6, 3.5)
X0 <- matrix(seq(0.05, 0.95, length.out = length(rn)))
X1 <- matrix(c(X0, 0.2, 0.4))
Y1 <- c(rn, 5.2, 6.3)
r1 <- splinefun(x = X1, y = Y1, method = "natural")
X2 <- matrix(c(X0, 0.0, 0.3))
Y2 <- c(rn, 7, 4)
r2 <- splinefun(x = X2, y = Y2, method = "natural")
@
Figure \ref{fig:tworsimspe} provides a visual of these two variance surface
hypotheses, evaluated over the predictive grid provided below.
<>=
XX <- matrix(seq(0, 1, by = 0.005))
@
Below we shall refer to these surfaces as ``green'' and ``blue,''
respectively, referencing the colors from Figure \ref{fig:tworsimspe}. The
``knots'' are shown as red open circles.
Code below implements the closed form IMSPE of Equation~(\ref{eq:imspecf})
for a generic variance function \code{r}, like one of our splines from above.
The implementation uses some \code{hetGP} functions such as
\code{Wij} and \code{cov_gen}. (We shall illustrate the intended hooks
momentarily; these low-level functions are of value here in this toy
illustration, and as a window into the spirit of implementation in the package.)
<>=
IMSPE.r <- function(x, X0, theta, r) {
x <- matrix(x, nrow = 1)
Wijs <- Wij(mu1 = rbind(X0, x), theta = theta, type = "Gaussian")
K <- cov_gen(X1 = rbind(X0, x), theta = theta)
K <- K + diag(apply(rbind(X0, x), 1, r))
return(1 - sum(solve(K) * Wijs))
}
@
The next step is to apply this function on a grid for each of the two choices
for $r(\x)$, green and blue.
<>=
imspe1 <- apply(XX, 1, IMSPE.r, X0 = X0, theta = 0.25, r = r1)
imspe2 <- apply(XX, 1, IMSPE.r, X0 = X0, theta = 0.25, r = r2)
xstar1 <- which.min(imspe1)
xstar2 <- which.min(imspe2)
@
Figure \ref{fig:tworsimspe} shows these two surfaces along with the minimizing
values. The $x$-locations of the knots, our design sights $\x_1,\dots, \x_n$,
are shown as dashed red vertical bars.
In the figure the blue variance function hypothesis is minimized at a novel
$\x_{n+1}$ location, not coinciding with any of the previous design sites.
However, the green hypothesis is minimized at $\x_2$, reading from left to
right. The IMSPE calculated on this particular variance function $r(\x)$
prefers replication. That is not a coincidence or fabrication.
\citeauthor{hetGP2} showed that the next point $\x_{N+1}$ will be a replicate,
that is, be one of the existing unique locations $\xu_1,\dots,\xu_n$ rather
than a new $\xu_{n+1}$ when
\begin{align*}
r(\x_{N+1}) \geq \frac{\mathbf{k}_n(\x_{N+1})^\top \Kn^{-1}
\mathbf{W}_n \Kn^{-1} \mathbf{k}_n(\x_{N+1})
- 2 \mathbf{w}_{n+1}^\top \Kn^{-1} \mathbf{k}_n(\x_{N+1}) +
w_{n+1,n+1}}{\mathrm{tr}(\mathbf{B}_{k^*} \mathbf{W}_n)} - \sigma_n^2(\x_{N+1}),
\end{align*}
where $k^* = \mathrm{argmin}_{k \in \{1, \dots, n\}} \mathrm{IMSPE}(\x_k)$
and $\mathbf{B}_k =
\frac{(\Un^{-1})_{.,k} (\Un^{-1})_{k,.} }{\frac{\nu \lambda_k
}{a_k (a_k + 1)} - (\Un)^{-1}_{k,k}}$.
Basically, this relationship says that IMSPE will prefer replication when the
variance is ``large enough.'' Rather than read tea leaves more deeply than
that, let us see it in action in our toy example. The code below utilizes
some of \pkg{hetGP}'s internals to enable evaluation of the right-hand side of
the inequality above, in other words, treating it as an equality.
<>=
rx <- function(x, X0, rn, theta, Ki, kstar, Wijs) {
x <- matrix(x, nrow = 1)
kn1 <- cov_gen(x, X0, theta = theta)
wn <- Wij(mu1 = x, mu2 = X0, theta = theta, type = "Gaussian")
a <- kn1 %*% Ki %*% Wijs %*% Ki %*% t(kn1) - 2 * wn %*% Ki %*% t(kn1)
a <- a + Wij(mu1 = x, theta = theta, type = "Gaussian")
Bk <- tcrossprod(Ki[, kstar], Ki[kstar,]) /
(2 / rn[kstar] - Ki[kstar, kstar])
b <- sum(Bk * Wijs)
sn <- 1 - kn1 %*% Ki %*% t(kn1)
return(a / b - sn)
}
@
Calling that function with the \code{XX} grid defined above, with matching
covariance structure details, commences as follows.
<>=
bestk <- which.min(apply(X0, 1, IMSPE.r, X0 = X0, theta = 0.25, r = r1))
Wijs <- Wij(X0, theta = 0.25, type = "Gaussian")
Ki <- solve(cov_gen(X0, theta = 0.25, type = "Gaussian") + diag(rn))
rx.thresh <- apply(XX, 1, rx, X0 = X0, rn = rn, theta = 0.25, Ki = Ki,
kstar = bestk, Wijs = Wijs)
@
Figure \ref{fig:tworsimspe}, with a gray line indicating the threshold shows in particular
that the green hypothesis is everywhere above that threshold in this instance.
<>=
plot(X0, rn, xlab = "x", ylab = "r(x)", xlim = c(0, 1), ylim = c(2, 8),
col = 2, main = "Two variance hypotheses")
lines(XX, r1(XX), col = 3)
lines(XX, r2(XX), col = 4)
lines(XX, rx.thresh, lty = 2, col = "darkgrey")
points(XX[xstar1], r1(XX[xstar1]), pch = 23, bg = 3)
points(XX[xstar2], r2(XX[xstar2]), pch = 23, bg = 4)
points(X0, rn, col = 2)
@
<>=
plot(XX, imspe1, type = "l", col = 3, ylab = "IMSPE", xlab = "x",
ylim = c(0.6, 0.7), main = "IMSPE for two variances")
lines(XX, imspe2, col = 4)
abline(v = X0, lty = 3, col = 'red')
points(XX[xstar1], imspe1[xstar1], pch = 23, bg = 3)
points(XX[xstar2], imspe2[xstar2], pch = 23, bg = 4)
@
\begin{figure}
\centering
\includegraphics[width=0.49\textwidth, trim = 5 5 5 5, clip = TRUE]{./figure/threersfig-1.pdf}
\includegraphics[width=0.49\textwidth, trim = 3 5 5 5, clip = TRUE]{./figure/threeimspefig-1.pdf}
\caption{Two hypothetical variance functions (left) and the resulting IMSPE surfaces (right) with corresponding minima marked by diamonds.
The dashed grey line represents the replication threshold (left).}
\label{fig:tworsimspe}
\end{figure}
That green hypothesis is, of course, just one example of a variance function
that is above the requisite threshold for replication. Also, the hypotheses
we used were not derived from GP predictive equations. But the example is
designed to illustrate potential. Before turning to a more prescriptive
search for new sites, be they replicates or new unique locations, let us
summarize what we know. Replication (1) is good for the GP calculations ($n^3
\ll N^3$); (2) is preferred by IMSPE under certain (changing) variance regimes; and
(3) helps separate signal from noise. But how often is IMSPE going to ``ask''
for replications? The short answer is, not often enough, at least from an
empirical standpoint.
One challenge is numerical precision in optimization when mixing discrete and
continuous search. Given a continuum of potential new locations, up to
floating-point precision, a particular setting corresponding to a finite
number of replicate sites is not likely to be preferred over all other
settings, such as ones infinitesimally nearby (no matter what the theory
prefers). Another issue, which is more fundamental, is that IMSPE is myopic.
The value the current selection, be it a replicate or new unique location, is
not assessed {\em vis \`a vis} its impact on the future decision landscape.
In general, entertaining such decision spaces is all but impossible, although
that does not stop people from trying. See Section \ref{sec:opt} for a related
discussion in an optimization context.
\subsubsection{Replication-biased lookahead}
\citet{hetGP2} found that a ``replication-biased'' lookahead is manageable.
Specifically, consider a horizon $h$ into the future. Looking ahead over
those choices, one can select a new unique $\xu_{n+1}$ and $h$ replicates,
each at one of the $n+1$ unique locations. Or alternatively, one can take a
replicate first and a unique design element later (and there are $h$ ways to
do that). A longer horizon means a greater chance of replication but also
more calculation: $h+1$ continuous searches for new locations and
$(h+1)(h+2)/2-1$ discrete ones in search of potential replicates. The horizon
$h$ can thus be thought of as a tuning parameter. Before considering choices
for how to set this value, let us look at an example for fixed horizon, $h=5$.
Take $f(x) = 2(\exp(-30(x - 0.25)^2) + \sin(\pi x^2)) - 2$ from \citet{Boukouvalas2009}, which is
implemented as \code{f1d2} in \pkg{hetGP}, and observe $Y(x)\sim
\mathcal{N}(f(x), r(x))$, where the noise variance function is $r(x) = 1/3\exp(\sin(2 \pi x))^2$.
<>=
fn <- function(x) { 1/3 * (exp(sin(2 * pi * x))) }
fr <- function(x) { f1d2(x) + rnorm(length(x), sd = fn(x)) }
@
Consider an initial uniform design of size $N=n=10$ (i.e., without replicates)
and an initial \pkg{hetGP} fit based on a Gaussian kernel.
<>=
X <- seq(0, 1, length = 10)
Y <- fr(X)
mod <- mleHetGP(X = X, Z = Y, lower = 0.0001, upper = 1)
@
Next, let us search via IMSPE with horizon $h=5$ lookahead over replication.
The call to \code{IMSPE_optim} below utilizes the closed form IMSPE
(\ref{eq:imspecf}) and its derivatives within an \code{optim} call using
\code{method="L-BFGS-B"}.
<>=
opt <- IMSPE_optim(mod, h = 5)
c(X, opt$par)
@
Whether or not the chosen location, in position eleven above
(\Sexpr{signif(opt$par, 3)}), is a replicate depends on the random seed used
to compile this document, so it is difficult to provide precise commentary
in-line here. As mentioned earlier in Section \ref{sec:implement},
\pkg{hetGP} provides an efficient updating method,
\code{update}, utilizing $\mathcal{O}(n^2)$ or $\mathcal{O}(n)$ updating
calculations for new data points, depending on whether that point is unique or
a replicate, respectively. Details of those updates are provided by
\citet{hetGP2}, extending existing ones in the literature
\citep[e.g.,][]{gramacy:polson:2011,Chevalier2014,gramacy:apley:2015} to the
heteroskedastic case.
<>=
X <- c(X, opt$par)
Ynew <- fr(opt$par)
Y <- c(Y, Ynew)
mod <- update(mod, Xnew = opt$par, Znew = Ynew, ginit = mod$g * 1.01)
@
That is the basic idea. Let us continue and gather a total of 500 samples in
this way, in order to explore the aggregate nature of the sequential design so
constructed. Periodically (every 25 iterations in the code below), it can be
beneficial to restart the MLE calculations to help ``unstick'' any solutions
found in local modes of the likelihood surface. Gathering 500 points is
somewhat of an overkill for this simple 1D problem, but it helps create a nice
visualization.
<>=
for(i in 1:489) {
opt <- IMSPE_optim(mod, h = 5)
X <- c(X, opt$par)
Ynew <- fr(opt$par)
Y <- c(Y, Ynew)
mod <- update(mod, Xnew = opt$par, Znew = Ynew, ginit = mod$g * 1.01)
if(i %% 25 == 0) {
mod2 <- mleHetGP(X = list(X0 = mod$X0, Z0 = mod$Z0, mult = mod$mult),
Z = mod$Z, lower = 0.0001, upper = 1)
if(mod2$ll > mod$ll) mod <- mod2
}
}
@
<>=
nrow(mod$X0)
@
Of the total of $N=500$, the final design contained $n=\Sexpr{nrow(mod$X0)}$
unique locations. To help visualize and assess the quality of the final
surface with that design, the code below gathers predictive quantities on a
dense grid in the input space.
<>=
xgrid <- seq(0, 1, length = 1000)
p <- predict(mod, matrix(xgrid, ncol = 1))
pvar <- p$sd2 + p$nugs
@
Figure \ref{fig:forr} shows the resulting predictive
surface in red, with additional calculations to show the true surface, via
$f(x)$ and error-bars from $r(x)$, in black. Gray vertical bars help
visualize the degree of replication at each input location.
<>=
plot(xgrid, f1d2(xgrid), type = "l", xlab = "x", ylab = "y",
main="1d example, IMSPE h=5", ylim = c(-4, 5))
lines(xgrid, qnorm(0.05, f1d2(xgrid), fn(xgrid)), col = 1, lty = 2)
lines(xgrid, qnorm(0.95, f1d2(xgrid), fn(xgrid)), col = 1, lty = 2)
points(X, Y)
segments(mod$X0, rep(0, nrow(mod$X0)) - 4, mod$X0, mod$mult * 0.25 - 4,
col = "gray")
lines(xgrid, p$mean, col = 2)
lines(xgrid, qnorm(0.05, p$mean, sqrt(pvar)), col = 2, lty = 2)
lines(xgrid, qnorm(0.95, p$mean, sqrt(pvar)), col = 2, lty = 2)
legend("top", c("truth", "estimate"), col = 1:2, lty = 1:2)
@
Observe that the degree of replication, as well as the density of unique
design locations, is higher in the high-noise region than it is in the
low-noise region. In a batch design setting and in the unique situation where
relative noise levels were known, a rule of thumb of more samples or
replicates in the higher noise regime is sensible. The trouble is that such
regimes are rarely known in advance, and neither are the optimal density
differentials and degrees of replication. Proceeding sequentially allows us
to learn and adapt the design as we go.
\subsubsection{Tuning the horizon}
Above we used horizon $h=5$, but that was rather arbitrary. Although it
is difficult to speculate on details regarding the quality of the surface
obtained in Figure \ref{fig:forr}, improvements are likely possible. Chances are that the
uncertainty is overestimated in some regions and underestimated in others. A
solution lies in adapting the lookahead horizon online. \citet{hetGP2}
proposed two simple on-line adjustments that tune the horizon. The first
adjusts the horizon of the next IMSPE search in order to {\em target} a
desired ratio $\rho = n/N$ and thus manage the surrogate modeling cost:
\[
\mbox{Target:} \quad h_{N+1} \leftarrow \left\{
\begin{array}{lll}
h_N + 1 & \mbox{if } n/N > \rho & \mbox{and a new $\bar{\x}_{n+1}$ is chosen} \\
\max\{h_N-1,-1\} & \mbox{if } n/N < \rho & \mbox{and a replicate is chosen} \\
h_N & \mbox{otherwise}.
\end{array}
\right.
\]
The second attempts to {\em adapt} to minimize IMSPE regardless of computational cost:
\[
\mbox{Adapt:} \quad h_{N+1} \sim \mathrm{Unif} \{ a'_1, \dots, a'_n \} \quad \text{ with }\quad a'_i := \max(0, a_i^* - a_i),
\]
and $a_i^* \propto \sqrt{ r(\bar{x}_i) (\Kn^{-1} \mathbf{W}_n \Kn^{-1})_{i,i}}$ comes from
a criterion in the SK literature \citep{Ankenman2010}.
The code below duplicates the example above with the {\em adapt} heuristic.
Alternatively, a \code{horizon} call can be provided \code{target} and
\code{previous_ratio} arguments to implement the {\em target} heuristic
instead. First, reinitialize the design.
<>=
X <- seq(0, 1, length = 10)
Y <- fr(X)
mod.a <- mleHetGP(X = X, Z = Y, lower = 0.0001, upper = 1)
h <- rep(NA, 500)
@
Next, loop to obtain $N=500$ observations under the adaptive horizon scheme.
<>=
for(i in 1:490) {
h[i] <- horizon(mod.a)
opt <- IMSPE_optim(mod.a, h = h[i])
X <- c(X, opt$par)
Ynew <- fr(opt$par)
Y <- c(Y, Ynew)
mod.a <- update(mod.a, Xnew = opt$par, Znew = Ynew,
ginit = mod.a$g * 1.01)
if(i %% 25 == 0) {
mod2 <- mleHetGP(X = list(X0 = mod.a$X0, Z0 = mod.a$Z0,
mult = mod.a$mult), Z = mod.a$Z, lower = 0.0001, upper = 1)
if(mod2$ll > mod.a$ll) mod.a <- mod2
}
}
@
<>=
p.a <- predict(mod.a, matrix(xgrid, ncol = 1))
pvar.a <- p.a$sd2 + p.a$nugs
@
The left panel of Figure \ref{fig:adapt} shows the adaptively selected
horizon over the iterations of sequential design. The right panel shows the
final design and predictions, versus the truth, matching Figure \ref{fig:forr}
for the fixed horizon ($h=5$) case.
<>=
par(mfrow = c(1, 2))
plot(h, main = "Horizon", xlab = "Iteration")
plot(xgrid, f1d2(xgrid), type = "l", xlab = "x", ylab = "y",
main = "Adaptive Horizon Design", ylim = c(-4, 5))
lines(xgrid, qnorm(0.05, f1d2(xgrid), fn(xgrid)), col = 1, lty = 2)
lines(xgrid, qnorm(0.95, f1d2(xgrid), fn(xgrid)), col = 1, lty = 2)
points(X, Y)
segments(mod.a$X0, rep(0, nrow(mod.a$X0)) - 4, mod.a$X0, mod.a$mult * 0.25 - 4,
col = "gray")
lines(xgrid, p.a$mean, col = 2)
lines(xgrid, qnorm(0.05, p.a$mean, sqrt(pvar.a)), col = 2, lty = 2)
lines(xgrid, qnorm(0.95, p.a$mean, sqrt(pvar.a)), col = 2, lty = 2)
@
The left panel of the figure reveals that a horizon of $h=5$ is indeed not
totally uncommon, but it is also higher than generally preferred by the
adaptive scheme, which had $n=\Sexpr{nrow(mod.a$X0)}$ unique sites---more than
the $h=5$ case, but in the same ballpark compared with the full size of
$N=500$.
<>=
nrow(mod.a$X0)
@
The code below offers an out-of-sample comparison via RMSE (lower is better) to the truth and score (higher is better) against a
noisy analog.
<>=
ytrue <- f1d2(xgrid)
yy <- fr(xgrid)
rbind(rmse = c(h5 = mean((ytrue - p$mean)^2),
ha = mean((ytrue - p.a$mean)^2)),
score = c(h5 = scores(mod, matrix(xgrid), yy),
ha = scores(mod.a, matrix(xgrid), yy)))
@
Although speculating on the precise outcome of this comparison is difficult
because of the noisy nature of sampling and horizon updates, the typical
outcome is that the two comparators are similar on RMSE, which is quite small,
but that score prefers the adaptive scheme. Being able to adjust the horizon
enables the adaptive scheme to better balance exploration versus replication
in order to efficiently learn the signal-to-noise ratio in the data-generating
mechanism.
\subsubsection{A larger example}
For a larger example, let us revisit the ATO application from Section
\ref{sec:joint}. \citet{hetGP2} described a sequential design scheme utilizing
fixed, adaptive, and target horizon schemes. The
\code{ato} data object loaded earlier contained a \code{"hetGP"}-class
model called \code{out.a} that was trained with an adaptive horizon
IMSPE-based sequential design scheme. The size of that design and the time
it took to train are quoted by the output of the \proglang{R} code below.
<>=
c(n = nrow(out.a$X0), N = length(out.a$Z), time = out.a$time)
@
Recall that the earlier experiment involved $n=1000$ unique sites with an
average of five replicates at each, for a total of about $N=5000$ training
data points. The training set here is much smaller, having $N=2000$ at $n=$
\Sexpr{nrow(out.a$X0)} unique locations. Thus, the adaptive design has more
unique locations but still a nontrivial degree of replication, resulting in
many fewer overall runs of the ATO simulator. Utilizing the same out-of-sample
test set from the previous score-based comparison, the code below
calculates predictions and scores with this new sequential design.
<>=
sc.a <- scores(out.a, Xtest = Xtest, Ztest = Ztest)
c(batch = sc, adaptive = sc.a)
@
The sequential design leads to more accurate predictors than the batch design
does, despite having being trained on 40\% as many runs. To illustrate how
that design was built, we first need to ``rebuild'' the
\code{out.a} object. For compact storage, the covariance matrices, inverses,
and so forth have been deleted via the \code{strip} method. An alternative for
compactness is to use \code{return.matrices = FALSE} via \code{settings} in
the \code{mleHetGP} command.
<>=
out.a <- rebuild(out.a)
@
The calculation sequence for each step of the search for this design involves
first determining the horizon and then searching with that horizon via IMSPE.
In code, that amounts to the following.
<>=
Wijs <- Wij(out.a$X0, theta = out.a$theta, type = out.a$covtype)
h <- horizon(out.a, Wijs = Wijs)
control <- list(tol_dist = 1e-4, tol_diff = 1e-4, multi.start = 30)
opt <- IMSPE_optim(out.a, h, Wijs = Wijs, control = control)
@
Precalculating the \code{Wijs} saves a little time since these are needed for
both \code{horizon} and \code{IMSPE_optim}. Modifying the
default setup may be done with \code{control}, a list with elements
\code{multistart = 20} for the number of \code{optim} calls to be run starting
from a maximin LHS design of this size, up to \code{maxit} iterations, and
\code{tol_dist} (resp.\ \code{tol_diff}) defining the minimal distance (resp.\
criterion difference) with respect to existing designs under which replication
is preferred. This task may be performed in parallel with the \code{ncores}
argument.
<>=
opt$par
@
The 8D location above would then be fed into the computer model simulator and
the input output pair subsequently added into the design. An indication of
whether or not the new location is unique (i.e., actually new) or a replicate
is provided with \code{path} along with the \code{IMSPE_optim} output,
\code{par}, and \code{value}. The \code{path} list contains the best sequence
of points found by the lookahead procedure, with elements \code{par},
\code{value}, and \code{new} of the next design plus $h$ further ones selected
successively.
<>=
opt$path[[1]]$new
@
Since ATO inputs are actually on a grid, we would have
to first snap this continuous solution to that grid (after undoing the coding
of the inputs). In so doing we could create a replicate as
part of that discretization process.
\subsection{Contour finding and optimization}
\label{sec:opt}
So far we have focused on constructing a globally accurate surrogate
model, but it is also common to target specific regions of interest. Examples
include global minima or level sets \citep{Bogunovic2016}.
\paragraph{Bayesian optimization.} Starting with optimization, namely finding
$\x^* \in \mathrm{argmin}_{\x} \, f(\x)$, we follow the canonical efficient
global optimization framework \citep{Jones1998} with the expected improvement
(EI) criterion \citep{Mockus1978}. For a review of many variants of this and
other so-called Bayesian optimization (BO) methods, see, for example,
\citet{Shahriari2016} and \citet{Frazier2018}. \citet{Picheny2012} provide
benchmarks specifically for the noisy objective case. In that setting, where
more evaluations are needed to separate signal from noise, EI is appealing
since it does not need extra tuning parameters to balance exploitation and
exploration, does not require numerical approximation, and has a closed-form
derivative.
For a deterministic function $f$, the improvement $I: \x \in \R^d
\mapsto \R$ is defined as
\[
I(\x) = \max\left[\min \limits_{i \in \{1,\dots,n\}}
Y(\x_i) - Y(\x)\right],
\]
a random variable. EI is the expectation of the improvement conditioned on the
observations, which has a closed-form expression:
\[
\mathrm{EI} \equiv \E (I(\x) \mid D_N)
= (y^* - \mu_n(\x)) \Phi \left( \frac{y^* -
\mu_n(\x)}{\sigma_n(\x)} \right)
+ \sigma_n(\x) \phi \left(\frac{y^* -
\mu_n(\x)}{\sigma_n(\x)} \right),
\]
where $y^* = \min_{i \in \{1,\dots,n\}} y_i$ and $\phi$ and $\Phi$ are the
pdf and cdf of the standard normal distribution, respectively. In the noisy
case, $y^*$ defined in this way is no longer a viable option, but it can be
replaced for instance with $\min_{i \in \{1,\dots,n\}} \mu_n(\x_i)$ as
recommended by \citet{Vazquez2008} or $\min_\x \mu_n(\x)$ as described by
\citet{gramacy:lee:2011}.
Lookahead versions of EI have been studied \citep[see,
e.g.,][]{Ginsbourger2010,Gonzalez2016,Lam2016,Huan2016}, but a closed-form
expression exists only for one-step-ahead versions. The reason is that, unlike
IMPSE, future values of the criterion depend on future function values.
Inspired by \citet{Lam2016} and our IMSPE-analog (Section \ref{sec:imspe}), we
introduce here a replication-biased lookahead for EI. To circumvent the
unknown future function values issue, our simple implementation uses $y_{n+1}
\leftarrow
\mu_n(\x_{n+1})$, a kriging ``believer'' type of approach
\citep{Ginsbourger2010a}.
Going back to the 1D example, the code below implements this
replication-biased lookahead scheme in a maximization context (more appealing
on this test function). Notice that
we initialize with a small amount of replication. This is to guard against
initial (and incorrect) noiseless surrogate fits that cause numerical issues.
Initial designs for Bayesian optimization of noisy functions is still an open
area of research \citep{zhang2018distance}. Also, we fix a lookahead horizon of $h=5$; developing an
analog {\em target} and {\em adapt} scheme is left for future research as
well.
First, we reinitialize the design and fit, but this time with a small degree
of replication to stabilize the behavior of this example across \code{knitr}
builds.
<>=
X <- seq(0, 1, length = 10)
X <- c(X, X, X)
Y <- -fr(X)
mod <- mleHetGP(X = X, Z = Y)
@
We have 500 sequential design iterations as before, but this
time with EI. Parallel evaluation is possible via \code{mclapply}
from \pkg{parallel} \cite{RCore2019}.
<>=
library("parallel")
ncores <- 1 # or: detectCores()
for(i in 1:470) {
opt <- crit_optim(mod, crit = "crit_EI", h = 5, ncores = ncores)
X <- c(X, opt$par)
Ynew <- -fr(opt$par)
Y <- c(Y, Ynew)
mod <- update(mod, Xnew = opt$par, Znew = Ynew, ginit = mod$g * 1.01)
if(i %% 25 == 0) {
mod2 <- mleHetGP(X = list(X0 = mod$X0, Z0 = mod$Z0, mult = mod$mult),
Z = mod$Z, lower = 0.0001, upper = 1)
if(mod2$ll > mod$ll) mod <- mod2
}
}
@
Next, we obtain predictions on the grid.
<>=
p <- predict(mod, matrix(xgrid, ncol = 1))
pvar <- p$sd2 + p$nugs
@
<>=
plot(xgrid, f1d2(xgrid), type = "l", xlab = "x", ylab = "y",
ylim = c(-4, 5), main = "1d example with EI, h = 5")
lines(xgrid, qnorm(0.05, f1d2(xgrid), fn(xgrid)), col = 1, lty = 2)
lines(xgrid, qnorm(0.95, f1d2(xgrid), fn(xgrid)), col = 1, lty = 2)
points(X, -Y)
segments(mod$X0, rep(0, nrow(mod$X0)) - 4, mod$X0, mod$mult * 0.5 - 4,
col = "gray")
lines(xgrid, -p$mean, col = 2)
lines(xgrid, qnorm(0.05, -p$mean, sqrt(pvar)), col = 2, lty = 2)
lines(xgrid, qnorm(0.95, -p$mean, sqrt(pvar)), col = 2, lty = 2)
legend("top", c("truth", "estimate"), col = 1:2, lty = 1:2)
@
Figure \ref{fig:ei} provides a visualization similar to our IMSPE-based ones
in Section \ref{sec:imspe}. Observe that this lookahead-biased EI approach
leads to substantial replication in the areas of interest, judiciously
balancing exploration and exploitation in order to pin down the global minima
of the mean surface. The actual number of unique locations is extracted as
follows, which is a small fraction of the total budget of $N=500$.
<>=
nrow(mod$X0)
@
Most of the replication is focused in the areas of primary interest from an
optimization perspective, around $x \approx 0.75$, where we observe an inferior
local maximum with low noise and, more intensively, around the true global
optima located at the other end of the input space. EI is also available for
TPs, and the corresponding modifications are provided in Appendix
\ref{ap:TPmods}.
\paragraph{Contour finding.} A related problem is contour finding, also known
as level-set estimation. The objective is to identify a region of inputs of
interest: $\Gamma_f = \left\{\x \in \R^d : Y(\x) > T \right\}$ with $T$ a
given threshold, often zero without loss of generality. We describe briefly
here the criteria defined by \citet{Lyu2018} for both GPs and TPs that are
implemented in \pkg{hetGP}. Several others can be found in the
literature~\citep{Chevalier2013,Bogunovic2016,Azzimonti2016}, with a selection
of implementations provided by the \pkg{KrigInv} package
\citep{Chevalier2014a,Chevalier2018}.
The simplest criterion is maximum contour uncertainty (MCU), implemented by
\code{crit_MCU} in the \pkg{hetGP} package. MCU is based on the local
probability of misclassification, namely that $f$ is incorrectly predicted to
be below or above the threshold \citep[see also,
e.g.,][]{Bichon2008,Ranjan2008}. A second criterion, contour
stepwise uncertainty reduction (cSUR), implemented by \code{crit_cSUR} in
\pkg{hetGP}, amounts to calculating a one-step-ahead reduction of MCU. A more
computationally intensive, but more global, alternative involves
integrating cSUR over the domain in a manner similar to IMSPE for variance
reduction or so-called integrated expected conditional improvement
\citep[IECI,][]{gramacy:lee:2011} for Bayesian optimization. In practice, the
integral is approximated by a finite sum, which is the approach taken by
\code{crit_ICU} in \pkg{hetGP}. The final criterion available,
targeted mean square error (tMSE) \citep{Picheny2010}, is implemented in
\code{crit_tMSE} and is designed to reduce the variance close to the
limiting contour via a weight function.
For illustration, let us consider finding the zero contour in our simple 1D
example. The code below illustrates the \code{crit_cSUR} criteria, with the
\pkg{hetGP} package's generic \code{crit_optim} interface. Otherwise the setup
and \code{for} loop are identical to those for our IMSPE and EI-based
examples.
<>=
X <- seq(0, 1, length = 10)
X <- c(X, X, X)
Y <- fr(X)
mod <- mleHetGP(X = X, Z = Y)
for(i in 1:470) {
opt <- crit_optim(mod, crit = "crit_cSUR", h = 5, ncores = ncores)
X <- c(X, opt$par)
Ynew <- fr(opt$par)
Y <- c(Y, Ynew)
mod <- update(mod, Xnew = opt$par, Znew = Ynew, ginit = mod$g * 1.01)
if(i %% 25 == 0) {
mod2 <- mleHetGP(X = list(X0 = mod$X0, Z0 = mod$Z0, mult = mod$mult),
Z = mod$Z, lower = 0.0001, upper = 1)
if(mod2$ll > mod$ll) mod <- mod2
}
}
p <- predict(mod, matrix(xgrid, ncol = 1))
pvar <- p$sd2 + p$nugs
@
Again, we see that using the biasing lookahead procedure reduces the number
of unique designs, which is beneficial in terms of speed as well as accuracy
in general.
<>=
nrow(mod$X0)
@
In Figure \ref{fig:contour}, the repartition of unique designs is nontrivial
around locations of crossing of the threshold, with larger spread (and less
replication) where the crossing is in noisy areas. As with EI, this
preliminary setup is likely to be improved upon after further research.
<>=
plot(xgrid, f1d2(xgrid), type = "l", xlab = "x", ylab = "y",
ylim = c(-4, 5), main="1d example with cSUR, h = 5")
lines(xgrid, qnorm(0.05, f1d2(xgrid), fn(xgrid)), col = 1, lty = 2)
lines(xgrid, qnorm(0.95, f1d2(xgrid), fn(xgrid)), col = 1, lty = 2)
points(X, Y)
segments(mod$X0, rep(0, nrow(mod$X0)) - 4, mod$X0, mod$mult * 0.5 - 4,
col="gray")
lines(xgrid, p$mean, col = 2)
lines(xgrid, qnorm(0.05, p$mean, sqrt(pvar)), col = 2, lty = 2)
lines(xgrid, qnorm(0.95, p$mean, sqrt(pvar)), col = 2, lty = 2)
legend("top", c("truth", "estimate"), col = 1:2, lty = 1:2)
abline(h = 0, col = "blue")
@
%% -- Summary/conclusions/discussion -------------------------------------------
\section{Summary and discussion}
\label{sec:summary}
We have introduced the \proglang{R} package \pkg{hetGP} for heteroskedastic
Gaussian process regression, sequential experimental design, and Bayesian
optimization. Although the package is designed for dealing with noise and
changing signal-to-noise ratios in the setting of Gaussian process regression,
it offers a full-featured GP regression approach. Ordinary homoskedastic (and
noise-free) GP modeling is supported. When the data are observed with noise,
the package implements a Woodbury trick to make inference more efficient,
decomposing matrices sized by the number of unique input sites, rather than
ones sized by the full data set. Moreover, key functions of \pkg{hetGP} are
implemented in \proglang{C++} with \pkg{Rcpp} \citep{Eddelbuettel2011,
Eddelbuettel2013, Eddelbuettel2017}. This leads to dramatically faster
inference compared with other GP packages on CRAN when the level of
replication is high.
Working only with unique inputs has other advantages, particularly when it
comes to coupled-GP inference for nonlinearly changing (latent) variance
variables along with the usual mean-based analysis. By creating a unifying
likelihood-based inferential framework, complete with closed-form derivative
expressions, library-based optimization methods (e.g., \code{optim} in
\proglang{R}) can be deployed for efficient heteroskedastic modeling. Whereas
the earlier MCMC-based approaches could at best handle dozens of observations,
\pkg{hetGP} can handle thousands in a reasonable amount of computing time.
Although relevant to machine learning and spatial data applications, the
methods in the \pkg{hetGP} package target stochastic computer modeling
applications, which often exhibit heterogeneous noise effects. Agent-based
models are a good example. In that setting, the design of the experiment is
at least as important as modeling. Here we introduced several sequential
design schemes that work with \pkg{hetGP} model objects to organically grow
the design toward accurate (low-variance) prediction, optimization, and the
search for level sets. In all three scenarios, the scheme is able to balance
exploration, exploitation, and replication as a means of obtaining efficient
predictions for those targets. Extensions are provided to accommodate new
sequential design acquisition strategies toward novel surrogate modeling and
prediction applications.
%% -- Optional special unnumbered sections -------------------------------------
% \section*{Computational details}
%
% \begin{leftbar}
% If necessary or useful, information about certain computational details
% such as version numbers, operating systems, or compilers could be included
% in an unnumbered section. Also, auxiliary packages (say, for visualizations,
% maps, tables, \dots) that are not cited in the main text can be credited here.
% \end{leftbar}
%
% The results in this paper were obtained using
% \proglang{R}~\Sexpr{paste(R.Version()[6:7], collapse = ".")} with the
% \pkg{MASS}~\Sexpr{packageVersion("MASS")} package. \proglang{R} itself
% and all packages used are available from the Comprehensive
% \proglang{R} Archive Network (CRAN) at
% \url{https://CRAN.R-project.org/}.
\section*{Acknowledgments}
\begin{leftbar}
% All acknowledgments (note the AE spelling) should be collected in this
% unnumbered section before the references. It may contain the usual information
% about funding and feedback from colleagues/reviewers/etc. Furthermore,
% information such as relative contributions of the authors may be added here
% (if any).
The work of MB is supported by the U.S. Department of Energy, Office of
Science, Office of Advanced Scientific Computing Research, under Contract No.
DE-AC02-06CH11357. RBG gratefully acknowledges funding from a DOE LAB 17-1697
via subaward from Argonne National Laboratory for SciDAC/DOE Office of Science
ASCR and High Energy Physics, and partial support from National Science
Foundation grants DMS-1849794, DMS-1821258 and DMS-1621746. Many thanks
to D.~Austin Cole for comments on early drafts and to Gail
Pieper for her useful language editing.
\end{leftbar}
%% -- Bibliography -------------------------------------------------------------
%% - References need to be provided in a .bib BibTeX database.
%% - All references should be made with \cite, \citet, \citep, \citealp etc.
%% (and never hard-coded). See the FAQ for details.
%% - JSS-specific markup (\proglang, \pkg, \code) should be used in the .bib.
%% - Titles in the .bib should be in title case.
%% - DOIs should be included where available.
\bibliography{refs}
%% -- Appendix (if any) --------------------------------------------------------
%% - After the bibliography with page break.
%% - With proper section titles and _not_ just "Appendix".
\newpage
\begin{appendix}
\section{TP modifications}
\label{ap:TPmods}
The expression for EI with TPs is given, for example, in the work of \cite{Shah2014}:
\[
\mathrm{EI}_{\mathrm{TP}} = z(\x) \sigma(\x) \Lambda_{\alpha + N}(z(\x)) + \sigma(\x)\left(1 +
\frac{z(\x)^2 - 1}{\alpha + N -1}\right)\lambda_{\nu + N}(z(\x)),
\]
with $z(\x) = \frac{y^* - \mu(\x)}{\sigma(\x)}$ and $\lambda$, $\Lambda$ the pdf and cdf
of the Student-$t$ distribution.
We provide the expression for the corresponding gradient:
$$\nabla \mathrm{EI}_\mathrm{TP}(\x) = \nabla \left\{ z(\x) \sigma(\x) \Lambda_{\alpha + N}(z(\x))\right\} + \nabla \left\{ \sigma(\x)\left(1 + \frac{z(\x)^2 - 1}{\alpha + N -1}\right)\lambda_{\nu + N}(z(\x)) \right\}$$
where
$\nabla \left\{z(\x) \sigma(\x) \Lambda_{\alpha + N}(z(\x)) \right\} = - \nabla m(\x) \lambda(z(\x)) + s(\x) z(\x) \nabla z(\x) \lambda(z(\x))$
and
\begin{align*}
&\nabla \left\{ \sigma(\x)\left(1 + \frac{z(\x)^2 - 1}{b}\right)\lambda_{\nu + N}(z(\x)) \right\} =\\
&s(\x) \left[1 + \frac{z(\x)^2 - 1}{b}\right] \nabla z(\x) \lambda_\alpha'(z(\x)) + \left[ \nabla s(\x) \left[1 + \frac{z(\x)^2 - 1}{b}\right] + 2 s(\x) z(\x) \nabla z(\x)/b \right] \lambda(z(\x))
\end{align*}
with
$b = \alpha + N -1$, and
\[\lambda'_\alpha(z) = \frac{(-\alpha - 1) z \Gamma(\frac{\alpha + 1}{2})
(\frac{\alpha + z^2}{\alpha})^{-\alpha/2-3/2} }{\sqrt{\pi} \alpha^{3/2} \Gamma(\frac{\alpha}{2})}.
\]
\end{appendix}
%% -----------------------------------------------------------------------------
\end{document}