%\VignetteIndexEntry{twinSIR: Individual-level epidemic modeling for a fixed population with known distances}
%\VignetteEngine{knitr::knitr}
%% additional dependencies beyond what is required for surveillance anyway:
%\VignetteDepends{surveillance, quadprog}
<>=
## purl=FALSE => not included in the tangle'd R script
knitr::opts_chunk$set(echo = TRUE, tidy = FALSE, results = 'markup',
fig.path='plots/twinSIR-', fig.width = 8, fig.height = 4.5,
fig.align = "center", fig.scap = NA, out.width = NULL,
cache = FALSE, error = FALSE, warning = FALSE, message = FALSE)
knitr::render_sweave() # use Sweave environments
knitr::set_header(highlight = '') # no \usepackage{Sweave} (part of jss class)
## R settings
options(prompt = "R> ", continue = "+ ", useFancyQuotes = FALSE) # JSS
options(width = 85, digits = 4)
options(scipen = 1) # so that 1e-4 gets printed as 0.0001
## xtable settings
options(xtable.booktabs = TRUE, xtable.size = "small",
xtable.sanitize.text.function = identity,
xtable.comment = FALSE)
@
\documentclass[nojss,nofooter,article]{jss}
\title{%
\vspace{-1.5cm}
\fbox{\vbox{\normalfont\footnotesize
This introduction to the \code{twinSIR} modeling framework of the
\proglang{R}~package \pkg{surveillance} is based on a publication in the
\textit{Journal of Statistical Software} --
\citet[Section~4]{meyer.etal2014} -- which is the suggested reference
if you use the \code{twinSIR} implementation in your own work.}}\\[1cm]
\code{twinSIR}: Individual-level epidemic modeling for a fixed population with known distances}
\Plaintitle{twinSIR: Individual-level epidemic modeling for a fixed population with known distances}
\Shorttitle{Modeling epidemics in a fixed population with known distances}
\author{Sebastian Meyer\thanks{Author of correspondence: \email{seb.meyer@fau.de}}\\Friedrich-Alexander-Universit{\"a}t\\Erlangen-N{\"u}rnberg \And
Leonhard Held\\University of Zurich \And
Michael H\"ohle\\Stockholm University}
\Plainauthor{Sebastian Meyer, Leonhard Held, Michael H\"ohle}
%% Basic packages
\usepackage{lmodern} % successor of CM -> searchable Umlauts (1 char)
\usepackage[english]{babel} % language of the manuscript is American English
%% Math packages
\usepackage{amsmath,amsfonts} % amsfonts defines \mathbb
\usepackage{bm} % \bm: alternative to \boldsymbol from amsfonts
%% Packages for figures and tables
\usepackage{booktabs} % make tables look nicer
\usepackage{subcaption} % successor of subfig, which supersedes subfigure
%% knitr uses \subfloat, which subcaption only provides since v1.3 (2019/08/31)
\providecommand{\subfloat}[2][need a sub-caption]{\subcaptionbox{#1}{#2}}
%% Handy math commands
\newcommand{\abs}[1]{\lvert#1\rvert}
\newcommand{\norm}[1]{\lVert#1\rVert}
\newcommand{\given}{\,\vert\,}
\newcommand{\dif}{\,\mathrm{d}}
\newcommand{\IR}{\mathbb{R}}
\newcommand{\IN}{\mathbb{N}}
\newcommand{\ind}{\mathbb{I}}
\DeclareMathOperator{\Po}{Po}
\DeclareMathOperator{\NegBin}{NegBin}
\DeclareMathOperator{\N}{N}
%% Additional commands
\newcommand{\class}[1]{\code{#1}} % could use quotes (JSS does not like them)
\newcommand{\CRANpkg}[1]{\href{https://CRAN.R-project.org/package=#1}{\pkg{#1}}}
%% Reduce the font size of code input and output
\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl, fontsize=\small}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{fontsize=\small}
%% Abstract
\Abstract{
The availability of geocoded health data and the inherent temporal
structure of communicable diseases have led to
an increased interest in statistical models and software for
spatio-temporal data with epidemic features.
The \proglang{R}~package \pkg{surveillance} can handle
various levels of aggregation at which infective events have been recorded.
This vignette illustrates the analysis of individual-level surveillance data
for a fixed population, of which the complete SIR event history is assumed to be
known. Typical applications for the multivariate, temporal point process model
``\code{twinSIR}'' of \citet{hoehle2009} include the spread of infectious
livestock diseases across farms, household models for childhood diseases,
and epidemics across networks.
%% (For other types of surveillance data, see
%% \code{vignette("twinstim")} and \code{vignette("hhh4\_spacetime")}.)
We first describe the general modeling approach and then exemplify
data handling, model fitting, and visualization
for a particularly well-documented measles outbreak among children of the
isolated German village Hagelloch in 1861.
%% Due to the many similarities with the spatio-temporal point process model
%% ``\code{twinstim}'' described and illustrated in \code{vignette("twinstim")},
%% we condense the \code{twinSIR} treatment accordingly.
}
\Keywords{%
individual-level surveillance data,
endemic-epidemic modeling,
infectious disease epidemiology,
self-exciting point process,
branching process with immigration}
\begin{document}
<>=
## load the "cool" package
library("surveillance")
## Compute everything or fetch cached results?
message("Doing computations: ",
COMPUTE <- !file.exists("twinSIR-cache.RData"))
if (!COMPUTE) load("twinSIR-cache.RData", verbose = TRUE)
@
\section[Model class]{Model class: \code{twinSIR}} \label{sec:twinSIR:methods}
The spatio-temporal point process regression model ``\code{twinstim}''
(\citealp{meyer.etal2011}, illustrated in \code{vignette("twinstim")})
is indexed in a continuous spatial domain, i.e., the set of
possible event locations %(the susceptible ``population'')
consists of the whole observation region and is thus infinite.
In contrast, if infections can only occur at a known discrete set of sites, such
as for livestock diseases among farms,
the conditional intensity function (CIF) of the underlying point process
formally becomes $\lambda_i(t)$. It characterizes the
instantaneous rate of infection of individual $i$ at time $t$,
given the sets $S(t)$ and $I(t)$ of susceptible and infectious individuals,
respectively (just before time $t$).
%In a similar regression view as in \code{vignette("twinstim")},
\citet{hoehle2009} proposed the following endemic-epidemic
multivariate temporal point process model (``\code{twinSIR}''):
\begin{equation} \label{eqn:twinSIR}
\lambda_i(t) = \lambda_0(t) \, \nu_i(t) +
\sum_{j \in I(t)} \left\{ f(d_{ij}) +
\bm{w}_{ij}^\top \bm{\alpha}^{(w)} \right\} \:, %\qquad \text{if } i \in S(t)\:,
\end{equation}
if $i \in S(t)$, i.e., if individual $i$ is currently susceptible,
and $\lambda_i(t) = 0$ otherwise. The rate decomposes into two components. The
first, endemic component consists of a Cox proportional hazards formulation
containing a semi-parametric baseline hazard $\lambda_0(t)$ and a
log-linear predictor $\nu_i(t)=\exp\left( \bm{z}_i(t)^\top \bm{\beta}
\right)$ of covariates modeling infection from external
sources. Furthermore, an additive epidemic component captures
transmission from the set $I(t)$ of currently infectious individuals.
The force of infection of individual $i$ depends on the distance $d_{ij}$ to each
infective source $j \in I(t)$ through a distance kernel
\begin{equation} \label{eqn:twinSIR:f}
f(u) = \sum_{m=1}^M \alpha_m^{(f)} B_m(u) \: \geq 0 \:,
\end{equation}
which is represented by a linear combination of non-negative basis
functions $B_m$ with the $\alpha_m^{(f)}$'s being
the respective coefficients.
For instance, $f$ could be modeled by a B-spline
\citep[Section~8.1]{Fahrmeir.etal2013},
and $d_{ij}$ could refer to the Euclidean distance $\norm{\bm{s}_i - \bm{s}_j}$
between the individuals' locations $\bm{s}_i$ and $\bm{s}_j$, or to the geodesic
distance between the nodes $i$ and $j$ in a network.
The distance-based force of infection
is modified additively by a linear predictor of covariates $\bm{w}_{ij}$ describing
the interaction of individuals $i$ and~$j$ further.
Hence, the whole epidemic component of Equation~\ref{eqn:twinSIR} can be written
as a single linear predictor $\bm{x}_i(t)^\top \bm{\alpha}$
by interchanging the summation order to
\begin{equation} \label{eqn:twinSIR:x}
\sum_{m=1}^M \alpha^{(f)}_m \sum_{j \in I(t)} B_m(d_{ij}) +
\sum_{k=1}^K \alpha^{(w)}_k \sum_{j \in I(t)} w_{ijk}
= \bm{x}_i(t)^\top \bm{\alpha} \:,
\end{equation}
such that $\bm{x}_i(t)$ comprises all epidemic terms summed over $j\in
I(t)$. Note that the use of additive covariates $\bm{w}_{ij}$ on top
of the distance kernel in \eqref{eqn:twinSIR} is
different from \code{twinstim}'s multiplicative approach.
One advantage of the additive approach is that
the subsequent linear decomposition of the distance kernel allows one
to gather all parts of the epidemic component in a single linear predictor.
Hence, the above model represents a CIF extension of what in
the context of survival analysis is known as an
additive-multiplicative hazard
model~\citep{Martinussen.Scheike2006}. As a consequence, the
\code{twinSIR} model could in principle be fitted with the \CRANpkg{timereg}
package, which yields estimates for the cumulative
hazards. However, \citet{hoehle2009} chooses a more direct inferential
approach: To ensure that the CIF $\lambda_i(t)$ is non-negative, all
covariates are encoded such that the components of $\bm{w}_{ij}$ are
non-negative. Additionally, the parameter vector $\bm{\alpha}$ is
constrained to be non-negative. Subsequent parameter inference is then
based on the resulting constrained penalized likelihood which gives
directly interpretable estimates of $\bm{\alpha}$.
Future work could investigate the potential of a multiplicative
approach for the epidemic component in \code{twinSIR}.
\section[Data structure]{Data structure: \class{epidata}} \label{sec:twinSIR:data}
New SIR-type event data typically arrive in the form of a simple data frame with
one row per individual and sequential event time points as columns.
For the 1861 Hagelloch measles epidemic,
which has previously been analyzed by, e.g., \citet{neal.roberts2004},
such a data set of the 188 affected
children is contained in the \pkg{surveillance} package:
<>=
data("hagelloch")
head(hagelloch.df, n = 5)
@
The \code{help("hagelloch")} contains a description of all columns. Here
we concentrate on the event columns \code{PRO} (appearance of prodromes),
\code{ERU} (eruption), and \code{DEAD} (day of death if during the outbreak).
We take the day on which the index case developed first symptoms,
30 October 1861 (\code{min(hagelloch.df$PRO)}), as the start of the epidemic,
i.e., we condition on this case being initially infectious.
% t0 = 1861-10-31 00:00:00
As for \code{twinstim}, the property of point processes that concurrent
events have zero probability requires special treatment. Ties are due to the
interval censoring of the data to a daily basis -- we broke these ties
by adding random jitter to the event times within the given days. The resulting
columns \code{tPRO}, \code{tERU}, and \code{tDEAD} are relative to the defined
start time. Following \citet{neal.roberts2004}, we assume that each child
becomes infectious (S~$\rightarrow$~I event at time \code{tI}) one day before the appearance
of prodromes, and is removed from the epidemic (I~$\rightarrow$~R event at time \code{tR})
three days after the appearance of rash or at the time of death, whichever comes
first.
For further processing of the data, we convert \code{hagelloch.df} to
the standardized \class{epidata} structure for \code{twinSIR}.
This is done by the converter function \code{as.epidata},
which also checks consistency and optionally
pre-calculates the epidemic terms $\bm{x}_i(t)$ of Equation~\ref{eqn:twinSIR:x}
to be incorporated in a \code{twinSIR} model.
The following call generates the \class{epidata} object \code{hagelloch}:
<>=
hagelloch <- as.epidata(hagelloch.df,
t0 = 0, tI.col = "tI", tR.col = "tR",
id.col = "PN", coords.cols = c("x.loc", "y.loc"),
f = list(household = function(u) u == 0,
nothousehold = function(u) u > 0),
w = list(c1 = function (CL.i, CL.j) CL.i == "1st class" & CL.j == CL.i,
c2 = function (CL.i, CL.j) CL.i == "2nd class" & CL.j == CL.i),
keep.cols = c("SEX", "AGE", "CL"))
@
The coordinates (\code{x.loc}, \code{y.loc}) correspond to the location of the
household the child lives in and are measured in meters.
Note that \class{twinSIR} allows for tied locations of individuals, but assumes
the relevant spatial location to be fixed during the entire observation period.
By default, the Euclidean distance between the given coordinates will be used.
Alternatively, \code{as.epidata} also accepts a pre-computed distance matrix via
its argument \code{D} without requiring spatial coordinates.
The argument \code{f} lists distance-dependent basis functions $B_m$ for
which the epidemic terms $\sum_{j\in I(t)} B_m(d_{ij})$
shall be generated. Here, \code{household} ($x_{i,H}(t)$) and
\code{nothousehold} ($x_{i,\bar{H}}(t)$) count for each child the number of
currently infective children in its household and outside its household,
respectively.
Similar to \citet{neal.roberts2004}, we also calculate the covariate-based
epidemic terms \code{c1} ($x_{i,c1}(t)$) and \code{c2} ($x_{i,c2}(t)$)
% from $w_{ijk} = \ind(\code{CL}_i = k, \code{CL}_j = \code{CL}_i)$
counting the number of currently infective classmates.
Note from the corresponding definitions of $w_{ij1}$ and $w_{ij2}$ in \code{w}
that \code{c1} is always zero for children of the second class and \code{c2} is
always zero for children of the first class. For pre-school children, both
variables equal zero over the whole period.
By the last argument \code{keep.cols}, we choose to only keep the covariates
\code{SEX}, \code{AGE}, and school \code{CL}ass from \code{hagelloch.df}.
The first few rows of the generated \class{epidata} object are shown below:
<>=
head(hagelloch, n = 5)
@
The \class{epidata} structure inherits from counting processes as implemented by
the \class{Surv} class of package \CRANpkg{survival} and also
used in \CRANpkg{timereg}.
Specifically, the observation period is split up into consecutive time
intervals (\code{start}; \code{stop}] of constant conditional intensities.
As the CIF $\lambda_i(t)$ of Equation~\eqref{eqn:twinSIR} only changes at time points, where
the set of infectious individuals $I(t)$ or some endemic covariate in $\nu_i(t)$
change, those occurrences define the break points of the time intervals.
Altogether, the \code{hagelloch} event history consists of
\Sexpr{nrow(hagelloch)/nlevels(hagelloch$id)} time \code{BLOCK}s of
\Sexpr{nlevels(hagelloch[["id"]])} rows, where each row describes the state of
individual \code{id} during the corresponding time interval.
The susceptibility status and the I- and R-events are
captured by the columns \code{atRiskY}, \code{event} and
\code{Revent}, respectively. The \code{atRiskY} column indicates if
the individual is at risk of becoming infected in the current interval.
The event columns indicate, which individual was infected or removed at the
\code{stop} time. Note that at most one entry in the \code{event} and
\code{Revent} columns is 1, all others are 0.
Apart from being the input format for \code{twinSIR} models,
the \class{epidata} class has several associated methods
(Table~\ref{tab:methods:epidata}), which are similar in
spirit to the methods described for \class{epidataCS}.
<>=
print(xtable(
surveillance:::functionTable("epidata", list(Display = c("stateplot"))),
caption="Generic and \\textit{non-generic} functions applicable to \\class{epidata} objects.",
label="tab:methods:epidata"), include.rownames = FALSE)
@
For example, Figure~\ref{fig:hagelloch_plot} illustrates the course of
the Hagelloch measles epidemic by counting processes for the
number of susceptible, infectious and removed children,
respectively.
Figure~\ref{fig:hagelloch_households} shows the locations of the households.
An \code{animate}d map can also be produced
to view the households' states over time
and a simple \code{stateplot} shows the changes for a selected unit.
<>=
par(mar = c(5, 5, 1, 1))
plot(hagelloch, xlab = "Time [days]")
@
<>=
par(mar = c(5, 5, 1, 1))
hagelloch_coords <- summary(hagelloch)$coordinates
plot(hagelloch_coords, xlab = "x [m]", ylab = "y [m]",
pch = 15, asp = 1, cex = sqrt(multiplicity(hagelloch_coords)))
legend(x = "topleft", pch = 15, legend = c(1, 4, 8), pt.cex = sqrt(c(1, 4, 8)),
title = "Household size")
@
\section{Modeling and inference} \label{sec:twinSIR:fit}
\subsection{Basic example}
To illustrate the flexibility of \code{twinSIR} we will analyze
the Hagelloch data using class room and household indicators similar to
\citet{neal.roberts2004}. We include an additional endemic background rate
$\exp(\beta_0)$, which allows for multiple outbreaks triggered
by external sources. Consequently, we do not
need to ignore the child that got infected about one month after the end of the
main epidemic (see the last event mark in Figure~\ref{fig:hagelloch_plot}).
% ATM, there is no way to fit a twinSIR without an endemic component.
Altogether, the CIF for a child $i$ is modeled as
\begin{equation} \label{eqn:twinSIR:hagelloch}
\lambda_i(t) = Y_i(t) \cdot \left[ \exp(\beta_0) +
\alpha_H x_{i,H}(t) +
\alpha_{c1} x_{i,c1}(t) + \alpha_{c2} x_{i,c2}(t) +
\alpha_{\bar{H}} x_{i,\bar{H}}(t) \right] \:,
\end{equation}
where $Y_i(t) = \ind(i \in S(t))$ is the at-risk indicator.
By counting the number of infectious classmates separately for both school
classes as described in the previous section, we allow for class-specific
effects $\alpha_{c1}$ and $\alpha_{c2}$ on the force of infection.
The model is estimated by maximum likelihood \citep{hoehle2009}
using the call
<>=
hagellochFit <- twinSIR(~household + c1 + c2 + nothousehold, data = hagelloch)
@
and the fit is summarized below:
<>=
set.seed(1)
summary(hagellochFit)
@
<>=
## drop leading and trailing empty lines
writeLines(tail(head(capture.output({
<>
}), -1), -1))
@
The results show, e.g., a
\Sexpr{sprintf("%.4f",coef(hagellochFit)["c1"])} /
\Sexpr{sprintf("%.4f",coef(hagellochFit)["c2"])} $=$
\Sexpr{format(coef(hagellochFit)["c1"]/coef(hagellochFit)["c2"])}
times higher transmission between individuals in
the 1st class than in the 2nd class.
Furthermore, an infectious housemate adds
\Sexpr{sprintf("%.4f",coef(hagellochFit)["household"])} /
\Sexpr{sprintf("%.4f",coef(hagellochFit)["nothousehold"])} $=$
\Sexpr{format(coef(hagellochFit)["household"]/coef(hagellochFit)["nothousehold"])}
times as much infection pressure as infectious children outside the household.
The endemic background rate of infection in a population with no current measles
cases is estimated to be
$\exp(\hat{\beta}_0) = \exp(\Sexpr{format(coef(hagellochFit)["cox(logbaseline)"])}) = \Sexpr{format(exp(coef(hagellochFit)["cox(logbaseline)"]))}$.
An associated Wald confidence interval (CI) based on the asymptotic normality of
the maximum likelihood estimator (MLE) can be obtained by
\code{exp}-transforming the \code{confint} for $\beta_0$:
<>=
exp(confint(hagellochFit, parm = "cox(logbaseline)"))
@
Note that Wald confidence intervals for the epidemic parameters $\bm{\alpha}$
are to be treated carefully, because their construction does not
take the restricted parameter space into account.
For more adequate statistical inference,
the behavior of the log-likelihood near the MLE can be investigated using the
\code{profile}-method for \class{twinSIR} objects.
For instance, to evaluate the normalized profile log-likelihood of
$\alpha_{c1}$ and $\alpha_{c2}$ on an equidistant grid of 25 points within the
corresponding 95\% Wald CIs, we do:
<>=
prof <- profile(hagellochFit,
list(c(match("c1", names(coef(hagellochFit))), NA, NA, 25),
c(match("c2", names(coef(hagellochFit))), NA, NA, 25)))
@
The profiling result contains 95\% highest likelihood based CIs
for the parameters, as well as the Wald CIs for comparison:
<<>>=
prof$ci.hl
@
The entire functional form of the normalized profile log-likelihood on the
requested grid as stored in \code{prof$lp} can be visualized by:
<>=
plot(prof)
@
The above model summary also reports the one-sided AIC~\citep{hughes.king2003},
which can be used for model selection under positivity constraints on
$\bm{\alpha}$ as described in \citet{hoehle2009}.
The involved parameter penalty is determined by Monte Carlo simulation,
which is why we did \code{set.seed} before the \code{summary} call.
The algorithm is described in
\citet[p.~79, Simulation 3]{Silvapulle.Sen2005} and involves quadratic
programming using package \CRANpkg{quadprog} \citep{R:quadprog}.
If there are less than three constrained parameters in a \code{twinSIR} model,
the penalty is computed analytically.
\subsection{Model diagnostics}
<>=
print(xtable(
surveillance:::functionTable("twinSIR", functions=list(Display = c("checkResidualProcess"))),
caption="Generic and \\textit{non-generic} functions for \\class{twinSIR}. There are no specific \\code{coef} or \\code{confint} methods, since the respective default methods from package \\pkg{stats} apply outright.",
label="tab:methods:twinSIR"), include.rownames = FALSE)
@
Table~\ref{tab:methods:twinSIR} lists all methods for the \class{twinSIR} class.
For example, to investigate how the conditional intensity function decomposes into endemic and epidemic
components over time, we produce Figure~\ref{fig:hagellochFit_plot-1} by:
<>=
par(mar = c(5, 5, 1, 1))
plot(hagellochFit, which = "epidemic proportion", xlab = "time [days]")
checkResidualProcess(hagellochFit, plot = 1)
@
Note that the last infection was necessarily caused by the endemic component
since there were no more infectious children in the observed population which
could have triggered the new case.
We can also inspect temporal Cox-Snell-like \code{residuals} of the fitted point
process using the function \code{checkResidualProcess} as for the
spatio-temporal point process models in \code{vignette("twinstim")}.
The resulting Figure~\ref{fig:hagellochFit_plot-2} reveals some deficiencies of
the model in describing the waiting times between events, which might be related
to the assumption of fixed infection periods.
<>=
knots <- c(100, 200)
fstep <- list(
B1 = function(D) D > 0 & D < knots[1],
B2 = function(D) D >= knots[1] & D < knots[2],
B3 = function(D) D >= knots[2])
@
To illustrate AIC-based model selection,
we may consider a more flexible model for local spread using a
step function for the distance kernel $f(u)$ in Equation \ref{eqn:twinSIR:f}.
An updated model with
<>=
.allknots <- c(0, knots, "\\infty")
cat(paste0("$B_{", seq_along(fstep), "} = ", "I_{", ifelse(seq_along(fstep)==1,"(","["),
.allknots[-length(.allknots)], ";", .allknots[-1], ")}(u)$",
collapse = ", "))
@
can be fitted as follows:
<>=
<>
hagellochFit_fstep <- twinSIR(
~household + c1 + c2 + B1 + B2 + B3,
data = update(hagelloch, f = fstep))
@
<>=
set.seed(1)
AIC(hagellochFit, hagellochFit_fstep)
@
Hence the simpler model with just a \code{nothousehold} component instead
of the more flexible distance-based step function is preferred.
\section{Simulation} \label{sec:twinSIR:simulation}
Simulation from fitted \code{twinSIR} models is described in
detail in~\citet[Section~4]{hoehle2009}. The implementation is made
available by an appropriate \code{simulate}-method for class \class{twinSIR}.
We skip the illustration here and
refer to \code{help("simulate.twinSIR")}.
%--------------
% BIBLIOGRAPHY
%--------------
\bibliography{references}
<>=
save(prof, file = "twinSIR-cache.RData")
@
\end{document}