%\VignetteIndexEntry{hhh4: An endemic-epidemic modelling framework for infectious disease counts}
%\VignetteDepends{surveillance, Matrix}
\documentclass[a4paper,11pt]{article}
\usepackage[T1]{fontenc}
\usepackage[english]{babel}
\usepackage{graphicx}
\usepackage{color}
\usepackage{natbib}
\usepackage{lmodern}
\usepackage{bm}
\usepackage{amsmath}
\usepackage{amsfonts,amssymb}
\setlength{\parindent}{0pt}
\setcounter{secnumdepth}{1}
\newcommand{\Po}{\operatorname{Po}}
\newcommand{\NegBin}{\operatorname{NegBin}}
\newcommand{\N}{\mathcal{N}}
\newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}}
\newcommand{\surveillance}{\pkg{surveillance}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\hhh}{\texttt{hhh4}}
\newcommand{\R}{\textsf{R}}
\newcommand{\sts}{\texttt{sts}}
\newcommand{\example}[1]{\subsubsection*{Example: #1}}
%%% Meta data
\usepackage{hyperref}
\hypersetup{
pdfauthor = {Michaela Paul and Sebastian Meyer},
pdftitle = {'hhh4': An endemic-epidemic modelling framework for infectious disease counts},
pdfsubject = {R package 'surveillance'}
}
\newcommand{\email}[1]{\href{mailto:#1}{\normalfont\texttt{#1}}}
\title{\code{hhh4}: An endemic-epidemic modelling framework for infectious disease counts}
\author{
Michaela Paul and Sebastian Meyer\thanks{Corresponding author: \email{seb.meyer@fau.de} (new affiliation)}\\
Epidemiology, Biostatistics and Prevention Institute\\
University of Zurich, Zurich, Switzerland
}
\date{8 February 2016}
%%% Sweave
\usepackage{Sweave}
\SweaveOpts{prefix.string=plots/hhh4, keep.source=T, strip.white=true}
\definecolor{Sinput}{rgb}{0,0,0.56}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{formatcom={\color{Sinput}},fontshape=sl,fontsize=\footnotesize}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{fontshape=sl,fontsize=\footnotesize}
%%% Initial R code
<>=
library("surveillance")
options(width=75)
## create directory for plots
dir.create("plots", showWarnings=FALSE)
######################################################
## Do we need to compute or can we just fetch results?
######################################################
compute <- !file.exists("hhh4-cache.RData")
message("Doing computations: ", compute)
if(!compute) load("hhh4-cache.RData")
@
\begin{document}
\maketitle
\begin{abstract} \noindent
The \R\ package \surveillance\ provides tools for the visualization,
modelling and monitoring of epidemic phenomena.
This vignette is concerned with the \hhh\ modelling
framework for univariate and multivariate time series of infectious
disease counts proposed by \citet{held-etal-2005}, and further extended by
\citet{paul-etal-2008}, \citet{paul-held-2011}, \citet{held.paul2012}, and
\citet{meyer.held2013}. The implementation is illustrated using several
built-in surveillance data sets.
The special case of \emph{spatio-temporal} \hhh\ models is also covered
in \citet[Section~5]{meyer.etal2014},
which is available as the extra \verb+vignette("hhh4_spacetime")+.
\end{abstract}
\section{Introduction}\label{sec:intro}
To meet the threats of infectious diseases, many countries have established
surveillance systems for the reporting of various infectious diseases.
The systematic and standardized reporting at a national and regional level
aims to recognize all outbreaks quickly, even when aberrant cases are
dispersed in space. Traditionally, notification data, i.e.\ counts of cases
confirmed according to a specific definition and reported daily, weekly or
monthly on a regional or national level, are used for surveillance purposes.
The \R\ package \surveillance\ provides functionality for the retrospective
modelling and prospective aberration detection in the resulting surveillance
time series.
Overviews of the outbreak detection functionality of \surveillance\ are given by
\citet{hoehle-mazick-2010} and \citet{salmon.etal2014}.
This document illustrates the functionality of the function \hhh\ for
the modelling of univariate and multivariate time series of infectious
disease counts. It is part of the \surveillance\ package as
of version 1.3.
The remainder of this vignette unfolds as follows:
Section~\ref{sec:data} introduces the S4 class data structure used to
store surveillance time series data within the package. Access and
visualization methods are outlined by means of built-in data sets. In
Section~\ref{sec:model}, the statistical modelling approach by
\citet{held-etal-2005} and further model extensions are described.
After the general function call and arguments are shown, the detailed
usage of \hhh\ is demonstrated in Section~\ref{sec:hhh} using data
introduced in Section~\ref{sec:data}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Surveillance data}\label{sec:data}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Denote by $\{y_{it}; i=1,\ldots,I,t=1,\ldots,T\}$ the multivariate time series
of disease counts for a specific partition of gender, age and location.
Here, $T$ denotes the length of the time series and $I$ denotes the number
of units (e.g.\ geographical regions or age groups) being monitored.
Such data are represented using objects of the S4 class \sts\ (surveillance
time series).
\subsection[The sts data class]{The \sts\ data class}
The \sts\ class contains the $T\times I$ matrix of counts $y_{it}$ in
a slot \code{observed}. An integer slot \code{epoch} denotes the time index
$1\leq t \leq T$ of each row in \code{observed}. The number of observations
per year, e.g.\ 52 for weekly or 12 for monthly data, is denoted by \code{freq}.
Furthermore, \code{start} denotes a vector of length two containing the start
of the time series as \code{c(year, epoch)}.
For spatially stratified time series, the slot \code{neighbourhood}
denotes an $I \times I$ adjacency matrix with elements 1 if two regions are
neighbors and 0 otherwise. For map visualizations, the slot \code{map}
links the multivariate time series to geographical regions stored in a
\code{"SpatialPolygons"} object (package \pkg{sp}).
Additionally, the slot \code{populationFrac} contains a $T\times I$ matrix
representing population fractions in unit $i$ at time $t$.
The \sts\ data class is also described in
\citet[Section~2.1]{hoehle-mazick-2010},
\citet[Section~1.1]{salmon.etal2014},
\citet[Section~5.2]{meyer.etal2014},
and on the associated help page \code{help("sts")}.
\subsection{Some example data sets}
The package \surveillance\ contains a number of time series in the
\code{data} directory. Most data sets originate from the SurvStat@RKI
database\footnote{\url{https://survstat.rki.de}}, maintained by
the Robert Koch Institute (RKI) in Germany. Selected data sets will be
analyzed in Section~\ref{sec:hhh} and are introduced in the following.
Note that many of the built-in data sets are stored in the
S3 class data structure \mbox{\code{disProg}} used in early versions
of the \surveillance\ package (until 2006). They can be easily converted
into the new S4 \sts\ data structure using the function
\code{disProg2sts}. The resulting \sts\ object can be indexed similarly to
standard \code{matrix} objects and allows easy temporal and spatial
aggregation, as will be shown in the remainder of this section.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\example{Influenza and meningococcal disease, Germany, 2001--2006}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As a first example, the weekly number of influenza and meningococcal disease
cases in Germany is considered.
<>=
# load data
data("influMen")
# convert to sts class and print basic information about the time series
print(fluMen <- disProg2sts(influMen))
@
The univariate time series of meningococcal disease counts can be obtained
with
<>=
meningo <- fluMen[, "meningococcus"]
dim(meningo)
@
The \code{plot} function provides ways to visualize
the multivariate time series in time, space and space-time, as
controlled by the \code{type} argument:
\setkeys{Gin}{width=1\textwidth}
<>=
plot(fluMen, type = observed ~ time | unit, # type of plot (default)
same.scale = FALSE, # unit-specific ylim?
col = "grey") # color of bars
@
See \code{help("stsplot")} for a detailed description of the plot routines.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\example{Influenza, Southern Germany, 2001--2008}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The spatio-temporal spread of influenza in the 140 Kreise (districts)
of Bavaria and Baden-W\"urttemberg is analyzed using the weekly number of
cases reported to the RKI~\citep{survstat-fluByBw} in the years 2001--2008.
An \sts\ object containing the data is created as follows:
<>=
# read in observed number of cases
flu.counts <- as.matrix(read.table(system.file("extdata/counts_flu_BYBW.txt",
package = "surveillance"),
check.names = FALSE))
@
\begin{center}
\setkeys{Gin}{width=.5\textwidth}
<>=
# read in 0/1 adjacency matrix (1 if regions share a common border)
nhood <- as.matrix(read.table(system.file("extdata/neighbourhood_BYBW.txt",
package = "surveillance"),
check.names = FALSE))
library("Matrix")
print(image(Matrix(nhood)))
@
\end{center}
<>=
# read in population fractions
popfracs <- read.table(system.file("extdata/population_2001-12-31_BYBW.txt",
package = "surveillance"),
header = TRUE)$popFrac
# create sts object
flu <- sts(flu.counts, start = c(2001, 1), frequency = 52,
population = popfracs, neighbourhood = nhood)
@
These data are already included as \code{data("fluBYBW")} in \surveillance.
In addition to the \sts\ object created above, \code{fluBYBW} contains
a map of the administrative districts of Bavaria and Baden-W\"urttemberg.
This works by specifying a \code{"SpatialPolygons"} representation of the
districts as an extra argument \code{map} in the above \sts\ call. Such a
\code{"SpatialPolygons"} object can be obtained from, e.g., an external shapefile
using the \pkg{sf} functions \code{st\_read} followed by \code{as\_Spatial}.
A map enables plots and animations of the cumulative number of cases by region.
For instance, a disease incidence map of the year 2001 can be obtained as
follows:
\setkeys{Gin}{width=.5\textwidth}
\begin{center}
<>=
data("fluBYBW")
plot(fluBYBW[year(fluBYBW) == 2001, ], # select year 2001
type = observed ~ unit, # total counts by region
population = fluBYBW@map$X31_12_01 / 100000, # per 100000 inhabitants
colorkey = list(title = "Incidence [per 100'000 inhabitants]"))
@
\end{center}
<>=
# consistency check
local({
fluBYBW@map <- flu@map
stopifnot(all.equal(fluBYBW, flu))
})
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\example{Measles, Germany, 2005--2007}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The following data set contains the weekly number of measles cases in the 16
German federal states, in the years 2005--2007. These data
have been analyzed by \citet{herzog-etal-2010} after aggregation into
bi-weekly periods.
<>=
data("measlesDE")
measles2w <- aggregate(measlesDE, nfreq = 26)
@
\setkeys{Gin}{width=.75\textwidth}
\begin{center}
<>=
plot(measles2w, type = observed ~ time, # aggregate counts over all units
main = "Bi-weekly number of measles cases in Germany")
@
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Model formulation}\label{sec:model}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Retrospective surveillance aims to identify outbreaks and (spatio-)temporal
patterns through statistical modelling. Motivated by a branching process
with immigration, \citet{held-etal-2005} suggest the following model
for the analysis of univariate time series of infectious disease counts
$\{y_{t}; t=1,\ldots,T\}$.
The counts are assumed to be Poisson distributed with conditional mean
\begin{align*}
\mu_{t} = \lambda y_{t-1}+ \nu_{t}, \quad(\lambda,\nu_{t}>0)
\end{align*}
where $\lambda$ and $\nu_t$ are unknown quantities.
The mean incidence is decomposed additively into two components: an
epidemic or \emph{autoregressive} component $\lambda y_{t-1}$, and
an \emph{endemic} component $\nu_t$. The former should be able to capture
occasional outbreaks whereas the latter explains a baseline rate of cases
with stable temporal pattern.
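To make the additive decomposition concrete, the following base-\R\ sketch
simulates a short series from this model. The parameter values
(\code{lambda = 0.5}, a constant \code{nu} of 10) and all object names are
illustrative assumptions and are not taken from any analysis in this vignette.
<<eval=FALSE>>=
# simulate y_t ~ Po(lambda * y_{t-1} + nu_t); all values are assumptions
set.seed(1)
n.sim  <- 200                 # length of the simulated series
lambda <- 0.5                 # autoregressive (epidemic) parameter
nu     <- rep(10, n.sim)      # endemic component, here constant over time
y.sim  <- numeric(n.sim)
y.sim[1] <- rpois(1, nu[1])   # no past counts available at the first time point
for (tt in 2:n.sim) {
  mu.tt <- lambda * y.sim[tt - 1] + nu[tt]   # conditional mean
  y.sim[tt] <- rpois(1, mu.tt)
}
summary(y.sim)
@
Setting \code{lambda} closer to 1 in this sketch produces longer outbreak-like
excursions, whereas \code{lambda = 0} reduces the model to independent Poisson
counts with mean $\nu_t$.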
\citet{held-etal-2005} suggest the following parametric model for the endemic
component:
\begin{align}\label{eq:nu_t}
\log(\nu_t) =\alpha + \beta t +
\left\{\sum_{s=1}^S \gamma_s \sin(\omega_s t) + \delta_s \cos(\omega_s t)\right\},
\end{align}
where $\alpha$ is an intercept, $\beta$ is a trend parameter, and the terms
in curly brackets are used to model seasonal variation. Here, $\gamma_s$ and
$\delta_s$ are unknown parameters, $S$ denotes the number of harmonics to
include, and $\omega_s=2\pi s/$\code{freq} are Fourier frequencies (e.g.\
\code{freq = 52} for weekly data).
For ease of interpretation, the seasonal terms in \eqref{eq:nu_t} can be
written equivalently as
\begin{align*}
\gamma_s \sin(\omega_s t) + \delta_s \cos(\omega_s t)= A_s \sin(\omega_s t +\varphi_s)
\end{align*}
with amplitude $A_s=\sqrt{\gamma_s^2+\delta_s^2}$
describing the magnitude, and phase shift $\varphi_s$, defined via
$\tan(\varphi_s)=\delta_s/\gamma_s$, describing the onset of the sine wave.
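The equivalence of the two parametrizations can be checked numerically. The
sketch below uses arbitrary illustrative coefficients (not estimates from any
model) and resolves the quadrant of $\varphi_s$ with \code{atan2}, since
$\tan(\varphi_s)$ alone determines the phase only up to a shift of $\pi$.
<<eval=FALSE>>=
# illustrative coefficients (assumed values)
gamma1 <- 0.3; delta1 <- -0.4
omega1 <- 2 * pi / 52                 # Fourier frequency for weekly data, s = 1
A1   <- sqrt(gamma1^2 + delta1^2)     # amplitude
phi1 <- atan2(delta1, gamma1)         # phase shift, quadrant-aware
tt  <- 1:52
lhs <- gamma1 * sin(omega1 * tt) + delta1 * cos(omega1 * tt)
rhs <- A1 * sin(omega1 * tt + phi1)
all.equal(lhs, rhs)                   # both parametrizations agree
@
The identity follows from
$\sin(\omega_s t + \varphi_s) = \cos(\varphi_s)\sin(\omega_s t) + \sin(\varphi_s)\cos(\omega_s t)$,
so that $\gamma_s = A_s \cos(\varphi_s)$ and $\delta_s = A_s \sin(\varphi_s)$.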
To account for overdispersion, the Poisson model may be replaced by
a negative binomial model. Then, the conditional mean $\mu_t$ remains
the same but the conditional variance increases to $\mu_t (1+\mu_t \psi)$
with additional unknown overdispersion parameter $\psi>0$.
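As a quick plausibility check of this variance formula, the following sketch
computes the first two moments of a negative binomial distribution directly
from its probability mass function. The values of \code{mu} and \code{psi} are
illustrative assumptions; in \R's parametrization of \code{dnbinom}, the
\code{size} argument corresponds to $1/\psi$.
<<eval=FALSE>>=
mu  <- 5                                    # conditional mean (assumed value)
psi <- 0.4                                  # overdispersion (assumed value)
k   <- 0:5000                               # support, truncated far in the tail
pk  <- dnbinom(k, mu = mu, size = 1 / psi)  # probability mass function
m1  <- sum(k * pk)                          # mean
v1  <- sum((k - m1)^2 * pk)                 # variance
c(mean = m1, variance = v1, formula = mu * (1 + mu * psi))
@
For $\psi \to 0$ the variance approaches $\mu_t$, i.e., the Poisson model is
recovered as a limiting case.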
The model is extended to multivariate time series $\{y_{it}\}$ in
\citet{held-etal-2005} and \citet{paul-etal-2008} by including an additional
\emph{neighbor-driven} component, where past cases in other (neighboring)
units also enter as explanatory covariates. The conditional mean $\mu_{it}$
is then given by
\begin{align} \label{eq:mu_it}
\mu_{it} = \lambda y_{i,t-1} + \phi \sum_{j\neq i} w_{ji} y_{j,t-1} +e_{it} \nu_{t},
\end{align}
where the unknown parameter $\phi$ quantifies the influence of other units $j$
on unit $i$, $w_{ji}$ are weights reflecting between-unit transmission and $e_{it}$
corresponds to an offset (such as population fractions at time $t$ in region $i$).
A simple choice for the weights is $w_{ji}=1$ if units $j$ and $i$ are adjacent
and 0 otherwise. See \citet{paul-etal-2008} for a discussion of alternative
weights, and \citet{meyer.held2013} for how to estimate these weights in the
spatial setting using a parametric power-law formulation based on the order of
adjacency.
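The conditional mean \eqref{eq:mu_it} can be evaluated by hand for a toy
example. The sketch below assumes three units arranged on a line and uses
made-up counts, parameters and offsets; none of the object names belong to
\surveillance.
<<eval=FALSE>>=
# three units on a line (1 - 2 - 3); all numbers are illustrative assumptions
W.toy <- matrix(c(0, 1, 0,
                  1, 0, 1,
                  0, 1, 0), nrow = 3, byrow = TRUE)  # W.toy[j, i] = w_ji
y.prev <- c(4, 0, 10)         # counts y_{j,t-1}
lambda <- 0.3                 # autoregressive parameter
phi    <- 0.1                 # neighbour-driven parameter
e.t    <- c(0.5, 0.3, 0.2)    # offsets e_it (e.g. population fractions)
nu.t   <- 20                  # endemic component nu_t
ne.sum <- drop(crossprod(W.toy, y.prev))  # sum_{j != i} w_ji * y_{j,t-1}
mu.t <- lambda * y.prev + phi * ne.sum + e.t * nu.t
mu.t
@
Because the diagonal of \code{W.toy} is zero, the matrix product automatically
excludes the term $j = i$, which is handled by the autoregressive component.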
When analyzing a specific disease observed in, say, multiple regions or several
pathogens (such as influenza and meningococcal disease), the assumption
of equal incidence levels or disease transmission across units is
questionable. To address such heterogeneity, the unknown quantities
$\lambda$, $\phi$, and $\nu_t$ in \eqref{eq:mu_it} may also depend on unit
$i$. This can be done via
\begin{itemize}
\item unit-specific fixed parameters, e.g.\ $\log(\lambda_i)=\alpha_i$
\citep{paul-etal-2008};
\item unit-specific random effects, e.g.\ $\log(\lambda_i)=\alpha_0 +a_i$,
$a_i \stackrel{\text{iid}}{\sim} \N(0,\sigma^2_\lambda)$ \citep{paul-held-2011};
\item linking parameters with known (possibly time-varying) explanatory
variables, e.g.\ $\log(\lambda_i)=\alpha_0 +x_i\alpha_1$ with
region-specific vaccination coverage $x_i$ \citep{herzog-etal-2010}.
\end{itemize}
In general, the parameters of all three model components may depend on both
time and unit.
A call to \hhh\ fits a Poisson or negative binomial model with conditional mean
\begin{align*}
\mu_{it} = \lambda_{it} y_{i,t-1} + \phi_{it} \sum_{j\neq i} w_{ji} y_{j,t-1} +e_{it} \nu_{it}
\end{align*}
to a (multivariate) time series of counts.
Here, the three unknown quantities are modelled as log-linear predictors
\begin{align}
\log(\lambda_{it}) &= \alpha_0 + a_i +\bm{u}_{it}^\top \bm{\alpha} \tag{\code{ar}}\\
\log(\phi_{it}) &= \beta_0 + b_i +\bm{x}_{it}^\top \bm{\beta} \tag{\code{ne}}\\
\log(\nu_{it}) &= \gamma_0 + c_i +\bm{z}_{it}^\top \bm{\gamma}\tag{\code{end}}
\end{align}
where $\alpha_0,\beta_0,\gamma_0$ are intercepts, $\bm{\alpha},\bm{\beta},\bm{\gamma}$
are vectors of unknown parameters corresponding to covariate vectors
$\bm{u}_{it},\bm{x}_{it},\bm{z}_{it}$, and $a_i,b_i,c_i$ are random effects.
For instance, model~\eqref{eq:nu_t} with $S=1$ seasonal term may be
represented as $\bm{z}_{it}=(t,\sin(2\pi/\code{freq}\;t),\cos(2\pi/\code{freq}\;t))^\top$.
The stacked vector of all random effects
is assumed to follow a normal distribution with mean $\bm{0}$ and covariance
matrix $\bm{\Sigma}$.
In applications, each of the components \code{ar},
\code{ne}, and \code{end} may be omitted in parts or as a whole.
If the model does not contain random effects, standard likelihood inference can
be performed. Otherwise, inference is based on penalized quasi-likelihood as
described in detail in \citet{paul-held-2011}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Function call and control settings}\label{sec:hhh}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The estimation procedure is called with
<>=
hhh4(sts, control)
@
where \code{sts} denotes a (multivariate) surveillance time series and
the model is specified in the argument \code{control}, consistent
with other algorithms in \surveillance.
The \code{control} setting is a list of the following arguments (here with
default values):
<>=
control = list(
ar = list(f = ~ -1, # formula for log(lambda_it)
offset = 1), # optional multiplicative offset
ne = list(f = ~ -1, # formula for log(phi_it)
offset = 1, # optional multiplicative offset
weights = neighbourhood(stsObj) == 1), # (w_ji) matrix
end = list(f = ~ 1, # formula for log(nu_it)
offset = 1), # optional multiplicative offset e_it
family = "Poisson", # Poisson or NegBin model
subset = 2:nrow(stsObj), # subset of observations to be used
optimizer = list(stop = list(tol = 1e-5, niter = 100), # stop rules
regression = list(method = "nlminb"), # for penLogLik
variance = list(method = "nlminb")), # for marLogLik
verbose = FALSE, # level of progress reporting
start = list(fixed = NULL, # list with initial values for fixed,
random = NULL, # random, and
sd.corr = NULL), # variance parameters
data = list(t = epoch(stsObj)-1),# named list of covariates
keep.terms = FALSE # whether to keep the model terms
)
@
The first three arguments \code{ar}, \code{ne}, and \code{end}
specify the model components using \code{formula} objects.
By default, the counts $y_{it}$ are assumed to be Poisson distributed,
but a negative binomial model can be chosen by setting \mbox{\code{family = "NegBin1"}}.
By default, both the penalized and marginal log-likelihoods are maximized using
the quasi-Newton algorithm available via the \R\ function \code{nlminb}.
The methods from \code{optim} may also be used, e.g.,
\mbox{\code{optimizer = list(variance = list(method="Nelder-Mead"))}} is a useful
alternative for maximization of the marginal log-likelihood with respect to the
variance parameters.
Initial values for the fixed, random, and variance parameters
can be specified in the \code{start} argument.
If the model contains covariates, these have to be provided in the \code{data}
argument. If a covariate does not vary across units, it may be given as a
vector of length $T$. Otherwise, covariate values must be given
in a matrix of size $T \times I$.
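A unit-specific covariate that is constant over time can be expanded to the
required $T \times I$ matrix by row-wise replication. The following base-\R\
sketch uses hypothetical dimensions and covariate values; the object names are
assumptions for illustration only.
<<eval=FALSE>>=
n.time <- 6; n.unit <- 3      # hypothetical T and I
# (a) covariate varying only over time: a vector of length T suffices
x.time <- seq_len(n.time) - 1
# (b) covariate varying only across units: replicate it over T rows
x.unit <- c(A = 0.9, B = 0.8, C = 0.95)
X.unit <- matrix(x.unit, nrow = n.time, ncol = n.unit, byrow = TRUE,
                 dimnames = list(NULL, names(x.unit)))
dim(X.unit)                   # T x I
@
Such objects would then be passed by name, e.g.,
\code{data = list(x.time = x.time, X.unit = X.unit)}, and referenced in the
formulae of the model components.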
In the following, the functionality of \hhh\ is demonstrated using
the data sets introduced in Section~\ref{sec:data}
and previously analyzed in \citet{paul-etal-2008}, \citet{paul-held-2011} and
\citet{herzog-etal-2010}.
Selected results are reproduced. For a thorough discussion
we refer to these papers.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Univariate modelling}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As a first example, consider the univariate time series of meningococcal infections
in Germany, 01/2001--52/2006 \citep[cf.][Table~1]{paul-etal-2008}.
A Poisson model without autoregression and with $S=1$ seasonal term is specified
as follows:
<>=
# specify a formula object for the endemic component
( f_S1 <- addSeason2formula(f = ~ 1, S = 1, period = 52) )
# fit the Poisson model
result0 <- hhh4(meningo, control = list(end = list(f = f_S1),
family = "Poisson"))
summary(result0)
@
To fit the corresponding negative binomial model,
we can use the convenient \code{update} method:
<>=
result1 <- update(result0, family = "NegBin1")
@
Note that the \code{update} method by default uses the parameter estimates from
the original model as start values when fitting the updated model; see
\code{help("update.hhh4")} for details.
We can calculate Akaike's Information Criterion for the two models to check
whether accounting for overdispersion is useful for these data:
<<>>=
AIC(result0, result1)
@
Due to the default control settings with \verb|ar = list(f = ~ -1)|,
the autoregressive component has been omitted in the above models.
It can be included by the following model update:
<>=
# fit an autoregressive model
result2 <- update(result1, ar = list(f = ~ 1))
@
To extract only the ML estimates and standard errors instead of a full model
\code{summary}, the \code{coef} method can be used:
<<>>=
coef(result2, se = TRUE, # also return standard errors
amplitudeShift = TRUE, # transform sine/cosine coefficients
# to amplitude/shift parameters
idx2Exp = TRUE) # exponentiate remaining parameters
@
Here, \code{exp(ar.1)} is the autoregressive coefficient $\lambda$ and can be
interpreted as the epidemic proportion of disease incidence
\citep{held.paul2012}.
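The interpretation as an epidemic proportion can be motivated by a short
calculation. As a simplifying assumption (not required by \hhh), suppose that
the endemic component is constant, $\nu_t = \nu$, and that $0 < \lambda < 1$,
so that the process has a stationary mean $\mu = \operatorname{E}(y_t)$.
Taking expectations on both sides of the conditional mean equation then gives
\begin{align*}
\mu = \lambda \mu + \nu
\quad\Longrightarrow\quad
\mu = \frac{\nu}{1-\lambda},
\qquad
\frac{\lambda \mu}{\mu} = \lambda,
\end{align*}
i.e., a fraction $\lambda$ of the mean incidence is attributable to the
epidemic component and the remaining fraction $1-\lambda$ to the endemic
component.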
Note that the above transformation arguments \code{amplitudeShift} and
\code{idx2Exp} can also be used in the \code{summary} method.
Many other standard methods are implemented for \code{"hhh4"} fits,
see, e.g., \code{help("confint.hhh4")}.
A plot of the fitted model components can be easily obtained:
\begin{center}
<>=
plot(result2)
@
\end{center}
See the comprehensive \code{help("plot.hhh4")} for further options.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Bivariate modelling}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Now, the weekly numbers of both meningococcal disease (\textsc{MEN}) and
influenza (\textsc{FLU}) cases are analyzed to investigate whether influenza
infections predispose to meningococcal disease \citep[cf.][Table~2]{paul-etal-2008}.
This requires disease-specific parameters which are specified in the formula
object with \code{fe(\ldots)}.
In the following, a negative binomial model with mean
\begin{align*}
\binom{\mu_{\text{men},t}} {\mu_{\text{flu},t}}=
\begin{pmatrix}
\lambda_\text{men} & \phi \\
0 & \lambda_\text{flu} \\
\end{pmatrix} \binom{\text{\sc men}_{t-1}}{\text{\sc flu}_{t-1}}
+ \binom{\nu_{\text{men},t}}{\nu_{\text{flu},t}}\,,
\end{align*}
where the endemic component includes $S=3$ seasonal terms for the \textsc{FLU}
data and $S=1$ seasonal term for the \textsc{MEN} data, is considered.
Here, $\phi$ quantifies the influence of past influenza cases on the meningococcal
disease incidence.
This model corresponds to the second model of Table~2 in \citet{paul-etal-2008}
and is fitted as follows:
<>=
# no "transmission" from meningococcus to influenza
neighbourhood(fluMen)["meningococcus","influenza"] <- 0
neighbourhood(fluMen)
@
<>=
# create formula for endemic component
f.end <- addSeason2formula(f = ~ -1 + fe(1, unitSpecific = TRUE),
# disease-specific intercepts
S = c(3, 1), # S = 3 for flu, S = 1 for men
period = 52)
# specify model
m <- list(ar = list(f = ~ -1 + fe(1, unitSpecific = TRUE)),
ne = list(f = ~ 1, # phi, only relevant for meningococcus due to
weights = neighbourhood(fluMen)), # the weight matrix
end = list(f = f.end),
family = "NegBinM") # disease-specific overdispersion
# fit model
result <- hhh4(fluMen, control = m)
summary(result, idx2Exp=1:3)
@
A plot of the estimated mean components can be obtained as follows:
\setkeys{Gin}{width=1\textwidth}
\begin{center}
<>=
plot(result, units = NULL, pch = 20, legend = 2, legend.args = list(
legend = c("influenza-driven", "autoregressive", "endemic")))
@
\end{center}
Alternatively, use the \code{decompose} argument to show the unit-specific
contributions to the fitted mean:
\begin{center}
<>=
plot(result, units = NULL, pch = 20, legend = 2,
decompose = TRUE, col = c(7, 4))
@
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Multivariate modelling}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
For disease counts observed in a large number of regions (i.e.,
highly multivariate time series of counts), the use of region-specific
parameters to account for regional heterogeneity is no longer feasible,
as estimation and identifiability problems may occur.
Here we illustrate two approaches: region-specific random effects and
region-specific covariates.
For a more detailed illustration of areal \code{hhh4} models,
see \verb+vignette("hhh4_spacetime")+, which uses
\verb+data("measlesWeserEms")+ as an example.
\subsubsection*{Influenza, Southern Germany, 2001--2008}
\citet{paul-held-2011} propose a random effects formulation to analyze the weekly
number of influenza cases in \Sexpr{ncol(fluBYBW)} districts of Southern Germany.
For example, consider a model with random intercepts in the endemic component:
$c_i \stackrel{\text{iid}}{\sim} \N(0,\sigma^2_\nu), i=1,\ldots,I$.
Such effects are specified as:
<>=
f.end <- ~ -1 + ri(type = "iid", corr = "all")
@
The alternative \code{type = "car"} would assume spatially correlated random
effects; see \citet{paul-held-2011} for details.
The argument \code{corr = "all"} allows for correlation between region-specific
random effects in different components, e.g., random incidence levels $c_i$
in the endemic component and random effects $b_i$ in the neighbor-driven component.
The following call to \hhh\ fits such a random effects model with
linear trend and $S=3$ seasonal terms in the endemic component,
a fixed autoregressive parameter $\lambda$,
and first-order transmission weights $w_{ji}=\mathbb{I}(j\sim i)$
-- normalized such that $\sum_i w_{ji} = 1$ for all rows $j$ --
to the influenza data \citep[cf.][Table~3, model~B2]{paul-held-2011}.
<>=
# endemic component: iid random effects, linear trend, S=3 seasonal terms
f.end <- addSeason2formula(f = ~ -1 + ri(type="iid", corr="all") +
I((t-208)/100),
S = 3, period = 52)
# model specification
model.B2 <- list(ar = list(f = ~ 1),
ne = list(f = ~ -1 + ri(type="iid", corr="all"),
weights = neighbourhood(fluBYBW),
normalize = TRUE), # all(rowSums(weights) == 1)
end = list(f = f.end, offset = population(fluBYBW)),
family = "NegBin1", verbose = TRUE,
optimizer = list(variance = list(method = "Nelder-Mead")))
# default start values for random effects are sampled from a normal distribution
set.seed(42)
@
<>=
if(compute){
result.B2 <- hhh4(fluBYBW, model.B2)
s.B2 <- summary(result.B2, maxEV = TRUE, idx2Exp = 1:3)
#pred.B2 <- oneStepAhead(result.B2, tp = nrow(fluBYBW) - 2*52)
predfinal.B2 <- oneStepAhead(result.B2, tp = nrow(fluBYBW) - 2*52,
type = "final")
meanSc.B2 <- colMeans(scores(predfinal.B2))
save(s.B2, meanSc.B2, file="hhh4-cache.RData")
}
@
<>=
# fit the model (takes about 35 seconds)
result.B2 <- hhh4(fluBYBW, model.B2)
summary(result.B2, maxEV = TRUE, idx2Exp = 1:3)
@
<>=
s.B2
@
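The effect of \code{normalize = TRUE} in the model specification can be
illustrated on a small example. The sketch below mimics the row normalization
in base \R\ for a made-up adjacency matrix of four units; it is not the
package's internal code.
<<eval=FALSE>>=
# toy binary adjacency matrix for four units (illustrative only)
A.toy <- matrix(c(0, 1, 1, 0,
                  1, 0, 1, 0,
                  1, 1, 0, 1,
                  0, 0, 1, 0), nrow = 4, byrow = TRUE)
W.norm <- A.toy / rowSums(A.toy)  # divide each row j by its number of neighbours
rowSums(W.norm)                   # all equal to 1
@
With normalized weights, the cases of unit $j$ are distributed among its
neighbours, so that units with many neighbours do not export more infections
in total than units with few neighbours. Note that this simple division would
yield \code{NaN} for a unit without any neighbours.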
Model choice based on information criteria such as AIC or BIC is well
explored and understood for models that correspond to fixed-effects likelihoods.
However, in the presence of random effects their use can be problematic.
For model selection in time series models, the comparison of successive
one-step-ahead forecasts with the actually observed data
provides a natural alternative. In this context, \citet{gneiting-raftery-2007}
recommend the use of strictly proper scoring
rules, such as the logarithmic score (logs) or the ranked probability score (rps).
See \citet{czado-etal-2009} and \citet{paul-held-2011} for further details.
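For intuition, both scores can be computed by hand for a single Poisson
predictive distribution. The following base-\R\ sketch uses hypothetical helper
functions \code{logs.pois} and \code{rps.pois} (not part of \surveillance) and
truncates the infinite sum in the ranked probability score at an assumed upper
bound \code{kmax}. Both scores are defined here such that smaller values
indicate better predictions.
<<eval=FALSE>>=
# logarithmic score: negative log predictive probability of the observed count
logs.pois <- function(y, mu) -dpois(y, lambda = mu, log = TRUE)
# ranked probability score: sum_k (F(k) - 1(y <= k))^2, truncated at kmax
rps.pois <- function(y, mu, kmax = 1000) {
  k <- 0:kmax
  sum((ppois(k, lambda = mu) - (y <= k))^2)
}
# illustrative values: observed count 3, predictive means 2.5 and 10
c(logs = logs.pois(3, 2.5), rps = rps.pois(3, 2.5))
c(logs = logs.pois(3, 10),  rps = rps.pois(3, 10))  # poorer prediction
@
In practice, the package function \code{scores} should be used, which
also handles negative binomial predictive distributions.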
One-step-ahead predictions for the last 2 years for model B2 could be obtained
as follows:
<>=
pred.B2 <- oneStepAhead(result.B2, tp = nrow(fluBYBW) - 2*52)
@
However, computing ``rolling'' one-step-ahead predictions from a random effects
model is computationally expensive, since the model needs to be refitted at
every time point. The above call would take approximately 45 minutes!
So for the purpose of this vignette, we use the fitted model based on the whole
time series to compute all (fake) predictions during the last two years:
<>=
predfinal.B2 <- oneStepAhead(result.B2, tp = nrow(fluBYBW) - 2*52,
type = "final")
@
The mean scores (logs and rps) corresponding to this set of predictions can then
be computed as follows:
<>=
colMeans(scores(predfinal.B2, which = c("logs", "rps")))
@
<>=
meanSc.B2[c("logs", "rps")]
@
Using predictive model assessments, \citet{meyer.held2013} found that
power-law transmission weights more appropriately reflect the spread of
influenza than the previously used first-order weights (which actually allow the
epidemic to spread only to directly adjacent districts within one week).
These power-law weights can be constructed by the function \code{W\_powerlaw}
and require the \code{neighbourhood} of the \sts\ object to contain adjacency
orders. The latter can be easily obtained from the binary adjacency matrix
using the function \code{nbOrder}. See the corresponding help pages or
\citet[Section~5]{meyer.etal2014} for illustrations.
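To convey the idea, the following base-\R\ sketch derives adjacency orders
$o_{ji}$ from a binary adjacency matrix and turns them into weights of the
assumed power-law form $w_{ji} = o_{ji}^{-d}$ for $j \neq i$. The toy
neighbourhood structure (four units on a line) and the decay parameter
\code{d = 1.5} are arbitrary illustrative choices; in an actual \hhh\ model,
the decay parameter is estimated, and \code{nbOrder} and \code{W\_powerlaw}
should be used instead of this hand-written code.
<<eval=FALSE>>=
# four units on a line (1 - 2 - 3 - 4)
A.line <- matrix(0, 4, 4)
A.line[cbind(1:3, 2:4)] <- 1
A.line <- A.line + t(A.line)           # symmetric binary adjacency matrix
ord   <- A.line                        # first-order neighbours
reach <- A.line                        # pairs joined by a walk of length k
for (k in 2:(nrow(A.line) - 1)) {
  reach <- (reach %*% A.line > 0) * 1  # walks of length k
  new <- reach == 1 & ord == 0 & !diag(nrow(A.line))
  ord[new] <- k                        # shortest path length = adjacency order
}
d <- 1.5                               # decay parameter (assumed, fixed here)
W.pl <- ord^(-d)                       # w_ji = o_ji^(-d) for j != i
diag(W.pl) <- 0                        # no self-weights
round(W.pl, 3)
@
In contrast to first-order weights, all pairs of connected units receive a
positive weight, which decays with increasing adjacency order.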
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection*{Measles, German federal states, 2005--2007}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
<>=
data("MMRcoverageDE")
cardVac1 <- MMRcoverageDE[1:16,3:4]
adjustVac <- function(cardVac, p=0.5, nrow=1){
card <- cardVac[,1]
vac <- cardVac[,2]
vacAdj <- vac*card + p*vac*(1-card)
return(matrix(vacAdj,nrow=nrow, ncol=length(vacAdj), byrow=TRUE))
}
vac0 <- 1 - adjustVac(cardVac1, p=0.5, nrow=frequency(measles2w)*3)
colnames(vac0) <- colnames(measles2w)
@
As a last example, consider the number of measles cases in the 16 federal states
of Germany, in the years 2005--2007. There is considerable regional variation
in the incidence pattern which is most likely due to differences in vaccination
coverage. In the following, information about vaccination coverage in each
state, namely the log proportion of unvaccinated school starters, is included
as an explanatory variable in a model for the bi-weekly aggregated measles data.
See \citet{herzog-etal-2010} for further details.
Vaccination coverage levels for the year 2006 are available in the dataset
\code{MMRcoverageDE}. This dataset can be used to compute
the $\Sexpr{nrow(vac0)}\times \Sexpr{ncol(vac0)}$ matrix \code{vac0} with adjusted
proportions of unvaccinated school starters in each state $i$ used by
\citet{herzog-etal-2010}.
The first few entries of this matrix are shown below:
<<>>=
vac0[1:2, 1:6]
@
We fit a Poisson model, which links the autoregressive parameter with this covariate
and contains $S=1$ seasonal term in the endemic component
\citep[cf.][Table~3, model~A0]{herzog-etal-2010}:
<>=
# endemic component: Intercept + sine/cosine terms
f.end <- addSeason2formula(f = ~ 1, S = 1, period = 26)
# autoregressive component: Intercept + vaccination coverage information
model.A0 <- list(ar = list(f = ~ 1 + logVac0),
end = list(f = f.end, offset = population(measles2w)),
data = list(t = epoch(measles2w), logVac0 = log(vac0)))
# fit the model
result.A0 <- hhh4(measles2w, model.A0)
summary(result.A0, amplitudeShift = TRUE)
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusion}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As part of the \R~package \surveillance, the function \hhh\ provides
a flexible tool for the modelling of multivariate time series
of infectious disease counts. The presented count data model is able to
account for serial and spatio-temporal correlation, as well as
heterogeneity in incidence levels and disease transmission.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{apalike}
\renewcommand{\bibfont}{\small}
\bibliography{references}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}