%\VignetteIndexEntry{Getting started with outbreak detection}
\documentclass[a4paper,11pt]{article}
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{natbib}
\bibliographystyle{apalike}
\usepackage{lmodern}
\usepackage{amsmath}
\usepackage{amsfonts,amssymb}
\newcommand{\pkg}[1]{{\bfseries #1}}
\newcommand{\surveillance}{\pkg{surveillance}}
\usepackage{hyperref}
\hypersetup{
pdfauthor = {Michael H\"ohle and Andrea Riebler and Michaela Paul},
pdftitle = {Getting started with outbreak detection},
pdfsubject = {R package 'surveillance'}
}
\title{Getting started with outbreak detection}
\author{
Michael H{\"o}hle\thanks{Corresponding author: Department of Statistics, University of Munich, Ludwigstr.\ 33, 80539 M{\"u}nchen, Germany, Email: \texttt{hoehle@stat.uni-muenchen.de}}, Andrea Riebler and Michaela Paul\\
Department of Statistics\\
University of Munich\\
Germany }
\date{17 November 2007}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Sweave
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{Sweave}
%Put all in another directory
\SweaveOpts{prefix.string=plots/surveillance, width=9, height=4.5}
\setkeys{Gin}{width=1\textwidth}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Initial R code
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
<>=
library("surveillance")
options(SweaveHooks=list(fig=function() par(mar=c(4,4,2,0)+.5)))
options(width=70)
## create directory for plots
dir.create("plots", showWarnings=FALSE)
######################################################################
#Do we need to compute or can we just fetch results
######################################################################
CACHEFILE <- "surveillance-cache.RData"
compute <- !file.exists(CACHEFILE)
message("Doing computations: ", compute)
if(!compute) load(CACHEFILE)
@
\begin{document}
\fbox{\vbox{\small
\noindent\textbf{Disclaimer}: This vignette reflects package state at version
0.9-7 and is hence somewhat outdated. New functionality has been added
to the package: this includes various endemic-epidemic modelling frameworks
for surveillance data (\texttt{hhh4}, \texttt{twinSIR}, and
\texttt{twinstim}), as well as more outbreak detection methods
(\texttt{glrnb}, \texttt{boda}, and \texttt{farringtonFlexible}).
These new features are described in detail in \citet{meyer.etal2014} and
\citet{salmon.etal2014}, respectively.
%and corresponding vignettes are included in the package;
%see \texttt{vignette(package = "surveillance")} for an overview.
Note in particular that use of the new \texttt{S4} class \texttt{sts}
instead of \texttt{disProg} is encouraged to encapsulate time series data.
}}
{\let\newpage\relax\maketitle}
\begin{abstract}
\noindent This document gives an introduction to the \textsf{R} package
\surveillance\ containing tools for outbreak detection in routinely
collected surveillance data. The package contains an implementation
of the procedures described by~\citet{stroup89},
\citet{farrington96} and the system used at the Robert Koch
Institute, Germany. For evaluation purposes, the package contains
example data sets and functionality to generate surveillance data by
simulation. To compare the algorithms, benchmark numbers like
sensitivity, specificity, and detection delay can be computed for a
set of time series. Being an open-source package it should be easy
to integrate new algorithms; as an example of this process, a
simple Bayesian surveillance algorithm is described, implemented and evaluated.\\
\noindent{\bf Keywords:} infectious disease, monitoring, aberrations,
outbreak, time series of counts.
\end{abstract}
\newpage
\section{Introduction}\label{sec:intro}
In an attempt to meet the threats of infectious diseases to society,
public health authorities have created comprehensive mechanisms for
the collection of disease data. As a consequence, the abundance of
data has demanded the development of automated algorithms for the
detection of abnormalities. Typically, such an algorithm monitors a
univariate time series of counts using a combination of heuristic
methods and statistical modelling. Prominent examples of surveillance
algorithms are the work by~\citet{stroup89} and~\citet{farrington96}.
A comprehensive survey of outbreak detection methods can be found
in~\citet{farrington2003}.
The R-package \texttt{surveillance} was written with the aim of
providing a test-bench for surveillance algorithms. From the
Comprehensive R Archive Network (CRAN) the package can be downloaded
together with its source code. It allows users to test new algorithms
and compare their results with those of standard surveillance
methods. A few real world outbreak datasets are included together with
mechanisms for simulating surveillance data. With the package at
hand, comparisons like the one described by~\citet{hutwagner2005}
should be easy to conduct.
The purpose of this document is to illustrate the basic functionality
of the package with R-code examples. Section~\ref{sec:data} contains
a description of the data format used to store surveillance data,
mentions the built-in datasets and illustrates how to create new
datasets by simulation. Section~\ref{sec:algo} contains a short
description of how to use the surveillance algorithms and illustrates
the results. Further information on the individual functions can be
found on the corresponding help pages of the package.
\section{Surveillance Data}\label{sec:data}
Denote by $\{y_t\>;t=1,\ldots,n\}$ the time series of counts
representing the surveillance data. Because such data typically are
collected on a weekly basis, we shall also use the alternative
notation $\{y_{i:j}\}$ with $j\in\{1,\ldots,52\}$ being the week number
in year $i\in\{-b,\ldots,-1,0\}$. That way the years are indexed such
that the most current year has index zero. For evaluation of the outbreak
detection algorithms it is also possible for each week to store -- if
known -- whether there was an outbreak that week. The resulting
multivariate series $\{(y_t,x_t)\>; t=1,\ldots,n\}$ is in
\texttt{surveillance} given by an object of class \texttt{disProg}
(disease progress), which is basically a \texttt{list} containing two
vectors: the observed number of counts and a boolean vector
\texttt{state} indicating whether there was an outbreak that week. A
number of time series are contained in the package (see
\texttt{data(package="surveillance")}),
mainly originating from the SurvStat@RKI database at
\url{https://survstat.rki.de/}
maintained by the Robert Koch Institute, Germany~\citep{survstat}.
For example, the object \texttt{k1} describes cryptosporidiosis
surveillance data for the German federal state Baden-W\"{u}rttemberg
2001--2005. The peak in 2001 is due to an outbreak of cryptosporidiosis
among a group of army soldiers in a boot camp~\citep{bulletin3901}.
<>=
data(k1)
plot(k1, main = "Cryptosporidiosis in BW 2001-2005")
@
For evaluation purposes it is also of interest to generate
surveillance data using simulation. The package contains functionality
to generate surveillance data containing point-source like outbreaks,
for example with a Salmonella serovar. The model is a Hidden Markov
Model (HMM) where a binary state $X_t, t=1,\ldots,n$, denotes whether
there was an outbreak and $Y_t$ is the number of observed
counts, see Figure~\ref{fig:hmm}.
\begin{figure}[htb]
\centering
\includegraphics[width=.75\textwidth]{figures/HMM}
\caption{The Hidden Markov Model}
\label{fig:hmm}
\end{figure}
The state $X_t$ is a homogeneous Markov chain with transition matrix
\begin{center}
\begin{tabular}{c|cc}
$X_t\backslash X_{t+1}$ & 0 & 1\\
\hline $0$ & $p$ & $1 - p$ \\
$1$ & $1 - r$ & $r$
\end{tabular}
\end{center}
Hence $1-p$ is the probability to switch to an outbreak state and
$r$ is the probability that $X_t=1$ is followed by $X_{t+1}=1$.
Furthermore, the observation $Y_t$ is Poisson-distributed with
log-link mean depending on a seasonal effect and time trend, i.e.\
\[
\log \mu_t = A \cdot \sin \, (\omega \cdot (t + \varphi)) + \alpha +
\beta t.
\]
In case of an outbreak $(X_t=1)$ the mean increases with a value of
$K$, altogether
\begin{equation}\label{eq:hmm}
Y_t \sim \operatorname{Po}(\mu_t + K \cdot X_t).
\end{equation}
The model in (\ref{eq:hmm}) corresponds to a single-source,
common-vehicle outbreak, where the length of an outbreak is controlled
by the transition probability $r$. The numbers of outbreak cases per
time point are simply independent Poisson variables with mean $K$. A
physiologically better motivated alternative could be to operate with
a stochastic incubation time (e.g.\ log-normal or gamma distributed)
for each individual exposed to the source, which results in a temporal
diffusion of the peak. The advantage of (\ref{eq:hmm}) is that
estimation can be done by a generalized linear model (GLM) using $X_t$
as covariate and that it allows for an easy definition of a correctly
identified outbreak: each $X_t=1$ has to be identified. More advanced
setups would require more involved definitions of an outbreak, e.g.\
as a connected series of time instances, where the number of outbreak
cases is greater than zero. Care is then required in defining what a
correctly identified outbreak for time-wise overlapping outbreaks means.
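As a rough illustration of the generative model in (\ref{eq:hmm}), the
following base-\textsf{R} sketch simulates the hidden chain and the counts
by hand. The function name \texttt{simulate\_hmm}, the initial state
$X_1=0$ and the choice $\omega=2\pi/52$ (one seasonal cycle per 52 weeks)
are assumptions made for this example and are not taken from the package.
<<eval=FALSE>>=
## Hand-rolled sketch of the point-source HMM (illustration only).
## All names are hypothetical; use sim.pointSource() in practice.
simulate_hmm <- function(n, p, r, A, alpha, beta, phi, omega, K) {
  x <- integer(n)                       # hidden state; assumed to start at 0
  for (t in seq_len(n - 1)) {
    ## P(X_{t+1} = 1 | X_t): 1 - p from state 0, r from state 1
    p_outbreak <- if (x[t] == 0) 1 - p else r
    x[t + 1] <- rbinom(1, size = 1, prob = p_outbreak)
  }
  tt <- seq_len(n)
  mu <- exp(A * sin(omega * (tt + phi)) + alpha + beta * tt)
  y <- rpois(n, lambda = mu + K * x)    # Y_t ~ Po(mu_t + K * X_t)
  list(observed = y, state = x)
}
sim <- simulate_hmm(n = 400, p = 0.99, r = 0.5, A = 1, alpha = 1,
                    beta = 0, phi = 0, omega = 2 * pi / 52, K = 1.7)
@
The returned list mimics the two components of a \texttt{disProg} object
but is not itself of that class.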
In \surveillance\ the function \verb+sim.pointSource+ is used to
simulate such a point-source epidemic; the result is an object of class
\verb+disProg+.
\label{ex:sts}
<<>>=
set.seed(1234)
sts <- sim.pointSource(p = 0.99, r = 0.5, length = 400,
A = 1, alpha = 1, beta = 0, phi = 0,
frequency = 1, state = NULL, K = 1.7)
@
<>=
plot(sts)
@
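For the values $p=0.99$ and $r=0.5$ used in this simulation, two summary
quantities follow directly from the transition matrix; this is a worked
illustration, not package output. Once started, an outbreak continues at
each step with probability $r$, so its length $D$ is geometrically
distributed,
\[
P(D=d) = r^{\,d-1}(1-r), \quad d=1,2,\ldots, \qquad
E(D) = \frac{1}{1-r} = 2 .
\]
The stationary probability $\pi_1$ of the outbreak state solves
$\pi_1 = (1-\pi_1)(1-p) + \pi_1 r$, which gives
\[
\pi_1 = \frac{1-p}{(1-p)+(1-r)} = \frac{0.01}{0.51} \approx 0.0196 ,
\]
so that in the long run roughly $8$ out of $400$ time points are expected
to be outbreak time points.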
\section{Surveillance Algorithms}\label{sec:algo}
Surveillance data often exhibit strong seasonality; therefore, most
surveillance algorithms only use a set of so-called \emph{reference
values} as basis for drawing conclusions. Let $y_{0:t}$ be the
number of cases of the current week (denoted week $t$ in year $0$), $b$
the number of years to go back in time and $w$ the number of weeks
around $t$ to include from those previous years. For the year zero we
use $w_0$ as the number of previous weeks to include -- typically
$w_0=w$. Altogether the set of reference values is thus defined to be
\[
R(w,w_0,b) =
\left(\bigcup\limits_{i=1}^b\bigcup\limits_{j=\,-w}^w
y_{-i:t+j}\right) \cup
\left(\bigcup_{k=-w_0}^{-1} y_{0:t+k}\right)
\]
Note that the number of cases of the current week is not part of
$R(w,w_0,b)$.
A surveillance algorithm is a procedure using the reference values
to create a prediction $\hat{y}_{0:t}$ for the current week. This
prediction is then compared with the observed $y_{0:t}$: if the
observed number of cases is much higher than the predicted number, the
current week is flagged for further investigations. In order to do
surveillance for time $0:t$ an important concern is the choice of $b$
and $w$. Values as far back as time $-b:t-w$ contribute to
$R(w,w_0,b)$ and thus have to exist in the observed time series.
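To make the bookkeeping concrete (a worked illustration, assuming 52 weeks
per year and windows that do not overlap), the number of reference values is
\[
|R(w,w_0,b)| = b\,(2w+1) + w_0 ,
\]
so that, for instance, $|R(6,6,0)| = 6$, $|R(6,6,1)| = 13 + 6 = 19$ and
$|R(4,0,2)| = 2\cdot 9 = 18$. For $b\geq 1$ the oldest reference value lies
$52\,b + w$ weeks before the current week; with $b=5$ and $w=3$, at least
$263$ earlier observations are therefore needed.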
Currently, we have implemented four different types of algorithms in
\surveillance: the Centers for Disease Control and Prevention (CDC)
method~\citep{stroup89}, the Communicable Disease Surveillance Centre
(CDSC) method~\citep{farrington96}, the method used at the Robert Koch
Institute (RKI), Germany~\citep{altmann2003}, and a Bayesian approach
documented in~\citet{riebler2004}. A detailed description of each
method is beyond the scope of this note, but to give an idea of the
framework the Bayesian approach developed in~\citet{riebler2004} is
presented: Within a Bayesian framework, quantiles of the predictive
posterior distribution are used as a measure for defining alarm
thresholds.
The model assumes that the reference values are independent and
identically Poisson distributed with parameter $\lambda$, and a
Gamma distribution is used as prior distribution for $\lambda$. The
reference values are defined to be $R_{\text{Bayes}}= R(w,w_0,b) =
\{y_1, \ldots, y_{n}\}$ and $y_{0:t}$ is the value we are trying to
predict. Thus, $\lambda \sim \text{Ga}(\alpha, \beta)$ and
$y_i|\lambda \sim \text{Po}(\lambda)$, $i = 1,\ldots,{n}$. Standard
derivations show that the posterior distribution is
\begin{equation*}
\lambda|y_1, \ldots, y_{n} \sim \text{Ga}(\alpha + \sum_{i=1}^{n} y_i,
\beta + n).
\end{equation*}
Computing the predictive distribution
\begin{equation*}
f(y_{0:t}|y_1,\ldots,y_{n}) =
\int\limits^\infty_0{f(y_{0:t}|\lambda)\,
f(\lambda|y_1,\ldots,y_{n})}\, d\lambda
\end{equation*}
we get the Poisson-Gamma-distribution
\begin{equation*}
y_{0:t}|y_1,\ldots,y_{n} \sim
\text{PoGa}(\alpha + \sum_{i=1}^{n} y_i, \beta + n),
\end{equation*}
which is a generalization of the negative Binomial distribution,
i.e.\
\[
y_{0:t}|y_1,\ldots,y_{n} \sim \text{NegBin}(\alpha + \sum_{i=1}^{n}
y_i, \tfrac{\beta + n}{\beta + n + 1}).
\]
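For completeness, the step from the posterior to the negative binomial form
can be written out as follows. This is the standard conjugate calculation;
the shorthand $\tilde\alpha = \alpha + \sum_{i=1}^{n} y_i$,
$\tilde\beta = \beta + n$ and $y = y_{0:t}$ is introduced here only for
readability.
\begin{align*}
f(y|y_1,\ldots,y_{n})
&= \int\limits^\infty_0 \frac{\lambda^{y} e^{-\lambda}}{y!} \cdot
   \frac{\tilde\beta^{\tilde\alpha}}{\Gamma(\tilde\alpha)}\,
   \lambda^{\tilde\alpha-1} e^{-\tilde\beta\lambda}\, d\lambda \\
&= \frac{\tilde\beta^{\tilde\alpha}}{\Gamma(\tilde\alpha)\, y!} \cdot
   \frac{\Gamma(\tilde\alpha+y)}{(\tilde\beta+1)^{\tilde\alpha+y}} \\
&= \frac{\Gamma(\tilde\alpha+y)}{\Gamma(\tilde\alpha)\, y!}
   \left(\frac{\tilde\beta}{\tilde\beta+1}\right)^{\tilde\alpha}
   \left(\frac{1}{\tilde\beta+1}\right)^{y} ,
\end{align*}
which is the probability mass function of a negative binomial distribution
with size $\tilde\alpha$ and probability $\tilde\beta/(\tilde\beta+1)$.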
Using Jeffreys' prior $\text{Ga}(\tfrac{1}{2}, 0)$ as
non-informative prior distribution for $\lambda$, the parameters of the
negative Binomial distribution are
\begin{align*}
\alpha + \sum_{i=1}^{n} y_i &= \frac{1}{2} + \sum_{y_{i:j} \in R_{\text{Bayes}}}\!\! y_{i:j} \quad
% \intertext{and}
\quad\text{and}\quad
\frac{\beta + n}{\beta + n + 1} = \frac{|R_{\text{Bayes}}|}{|R_{\text{Bayes}}| + 1}.
\end{align*}
Using a quantile parameter $\alpha$ (distinct from the shape parameter of the Gamma prior above), the smallest value $y_\alpha$ is computed such that
\begin{equation*}
P(y \leq y_\alpha) \geq 1-\alpha.
\end{equation*}
Now
\begin{equation*}
A_{0:t} = I(y_{0:t} \geq y_\alpha),
\end{equation*}
i.e.\ if $y_{0:t}\geq y_\alpha$ the current week is flagged as an
alarm. As an example, the \verb+Bayes1+ method uses the last six weeks
as reference values, i.e.\ the set $R(6,6,0)$, and is applied to the
\texttt{k1} dataset with $\alpha=0.01$ as follows.
<>=
k1.b660 <- algo.bayes(k1,
control = list(range = 27:192, b = 0, w = 6, alpha = 0.01))
plot(k1.b660, disease = "k1")
@
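The threshold defined by the formulas above can be sketched directly with
\texttt{qnbinom} from base \textsf{R}, which returns the smallest value
whose cumulative probability reaches the requested level. The helper
\texttt{bayes\_threshold} and the counts below are made up for this
illustration; the sketch follows the formulas in the text and is not
necessarily identical to the internal code of \texttt{algo.bayes}.
<<eval=FALSE>>=
## Sketch of the threshold computation (hypothetical helper).
bayes_threshold <- function(ref, alpha = 0.01) {
  n <- length(ref)
  ## Predictive negative binomial under Jeffreys' prior Ga(1/2, 0)
  qnbinom(1 - alpha, size = 0.5 + sum(ref), prob = n / (n + 1))
}
ref <- c(3, 5, 2, 4, 6, 3)   # made-up counts of the six previous weeks
y_now <- 12                  # made-up count of the current week
y_alpha <- bayes_threshold(ref, alpha = 0.01)
alarm <- y_now >= y_alpha    # A_{0:t} = I(y_{0:t} >= y_alpha)
@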
Several extensions of this simple Bayesian approach are imaginable:
for example, the inherent over-dispersion of the data could be modeled by
using a negative-binomial distribution, and time trends and mechanisms to
correct for past outbreaks could be integrated, but all at the cost of
non-standard inference for the predictive distribution. Here
simulation based methods like Markov Chain Monte Carlo or heuristic
approximations have to be used to obtain the required alarm
thresholds.
In general, the \verb+surveillance+ package makes it easy to add
additional algorithms -- also those not based on reference values --
by using the existing implementations as starting point.
The following call uses the CDC and Farrington procedure on the
simulated time series \verb+sts+ from page~\pageref{ex:sts}. Note that
the CDC procedure operates with four-week aggregated data; to allow a
better comparison with the upper bound, the aggregated counts for
each week are shown as circles in the plot.
<>=
cntrl <- list(range=300:400,m=1,w=3,b=5,alpha=0.01)
sts.cdc <- algo.cdc(sts, control = cntrl)
sts.farrington <- algo.farrington(sts, control = cntrl)
@
<>=
if (compute) {
<>
}
@
<>=
par(mfcol=c(1,2))
plot(sts.cdc, legend.opts=NULL)
plot(sts.farrington, legend.opts=NULL)
@
Typically, one is interested in evaluating the performance of the
various surveillance algorithms. An easy way is to look at the
sensitivity and specificity of the procedure -- a correct
identification of an outbreak is defined as follows: if the algorithm
raises an alarm for time $t$, i.e.\ $A_t=1$, and $X_t=1$, we have a
correct classification; if $A_t=1$ and $X_t=0$, we have a
false positive, etc. In case of more involved outbreak models, where
an outbreak lasts for more than one week, a correct identification
could be if at least one of the outbreak weeks is correctly
identified, see e.g.\ \citet{hutwagner2005}.
To compute various performance scores the function
\verb+algo.quality+ can be used on a \verb+survRes+ object.
<<>>=
print(algo.quality(k1.b660))
@
This computes the number of true positives, false positives, true negatives
and false negatives, as well as the sensitivity and the specificity. Furthermore, \texttt{dist}
is defined as
\[
\sqrt{(Spec-1)^2 + (Sens - 1)^2},
\]
that is the distance to the optimal point $(1,1)$, which serves as a
heuristic way of combining sensitivity and specificity into a single
score. Of course, weighted versions are also imaginable. Finally,
\texttt{lag} is the average number of weeks between the first of a
consecutive number of $X_t=1$'s (i.e.\ an outbreak) and the first
alarm raised by the algorithm.
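The counting behind these scores can be sketched in a few lines of base
\textsf{R}. The helper \texttt{quality\_sketch} and the two vectors are
hypothetical and only mirror the definitions in the text; the \texttt{lag}
score is left out, and \texttt{algo.quality} remains the function to use
on actual \texttt{survRes} objects.
<<eval=FALSE>>=
## Sketch of the classification scores for made-up vectors.
quality_sketch <- function(alarm, state) {
  TP <- sum(alarm == 1 & state == 1)
  FP <- sum(alarm == 1 & state == 0)
  TN <- sum(alarm == 0 & state == 0)
  FN <- sum(alarm == 0 & state == 1)
  sens <- TP / (TP + FN)
  spec <- TN / (TN + FP)
  c(TP = TP, FP = FP, TN = TN, FN = FN, sens = sens, spec = spec,
    dist = sqrt((spec - 1)^2 + (sens - 1)^2))
}
state <- c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0)   # true outbreak indicator X_t
alarm <- c(0, 1, 0, 1, 0, 0, 0, 1, 0, 0)   # alarms A_t
q <- quality_sketch(alarm, state)
@
For these vectors two of the three outbreak weeks are detected and one of
the seven non-outbreak weeks is flagged, i.e.\ the sensitivity is $2/3$ and
the specificity is $6/7$.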
To compare the results of several algorithms on a single time series
we declare a list of control objects -- each containing the name and
settings of the algorithm we want to apply to the data.
<>=
control <- list(
list(funcName = "rki1"), list(funcName = "rki2"),
list(funcName = "rki3"), list(funcName = "bayes1"),
list(funcName = "bayes2"), list(funcName = "bayes3"),
list(funcName = "cdc", alpha=0.05),
list(funcName = "farrington", alpha=0.05)
)
control <- lapply(control, function(ctrl) {
ctrl$range <- 300:400; return(ctrl)
})
@
%
In the above, \texttt{rki1}, \texttt{rki2} and \texttt{rki3} are three
methods with reference values $R_\text{rki1}(6,6,0)$,
$R_\text{rki2}(6,6,1)$ and $R_\text{rki3}(4,0,2)$, all called with
$\alpha=0.05$. The \texttt{bayes*} methods use the
Bayesian algorithm with the same setup of reference values. The CDC
method is special since it operates on aggregated four-week blocks.
To make everything comparable, a common $\alpha=0.05$ level is used for
all algorithms.
All algorithms in \texttt{control} are applied to \texttt{sts} using:
<>=
algo.compare(algo.call(sts, control = control))
@
<>=
if (compute) {
acall <- algo.call(sts, control = control)
}
print(algo.compare(acall), digits = 3)
@
A test on a set of time series can be done as follows. Firstly, a list
containing 10 simulated time series is created. Secondly, all the
algorithms specified in the \texttt{control} object are applied to
each series. Finally the results for the 10 series are combined in one
result matrix.
<>=
#Create 10 series
ten <- lapply(1:10,function(x) {
sim.pointSource(p = 0.975, r = 0.5, length = 400,
A = 1, alpha = 1, beta = 0, phi = 0,
frequency = 1, state = NULL, K = 1.7)})
@
<>=
#Do surveillance on all 10, get results as list
ten.surv <- lapply(ten,function(ts) {
algo.compare(algo.call(ts,control=control))
})
@
<>=
if (compute) {
<>
}
@
<>=
#Average results
algo.summary(ten.surv)
@
<>=
print(algo.summary(ten.surv), digits = 3)
@
A similar procedure can be applied when evaluating the 14 surveillance
series drawn from SurvStat@RKI~\citep{survstat}. A problem, however, is
that the series, after conversion to 52 weeks/year, are of length 209 weeks.
This is insufficient to apply e.g.\ the CDC algorithm. To conduct the
comparison on as large a dataset as possible the following trick is
used: The function \texttt{enlargeData} replicates the requested
\texttt{range} and inserts it before the original data, after which
the evaluation can be done on all 209 values.
<>=
#Update range in each - cyclic continuation
range <- (2*4*52) + 1:length(k1$observed)
control <- lapply(control,function(cntrl) {
cntrl$range <- range; return(cntrl)})
#Auxiliary function to enlarge data
enlargeData <- function(disProgObj, range = 1:156, times = 1){
disProgObj$observed <- c(rep(disProgObj$observed[range], times),
disProgObj$observed)
disProgObj$state <- c(rep(disProgObj$state[range], times),
disProgObj$state)
return(disProgObj)
}
#Outbreaks
outbrks <- c("m1", "m2", "m3", "m4", "m5", "q1_nrwh", "q2",
"s1", "s2", "s3", "k1", "n1", "n2", "h1_nrwrp")
#Load and enlarge data.
outbrks <- lapply(outbrks,function(name) {
data(list=name)
enlargeData(get(name),range=1:(4*52),times=2)
})
#Apply function to one
one.survstat.surv <- function(outbrk) {
algo.compare(algo.call(outbrk,control=control))
}
@
<>=
algo.summary(lapply(outbrks,one.survstat.surv))
@
<>=
if (compute) {
res.survstat <- algo.summary(lapply(outbrks,one.survstat.surv))
}
print(res.survstat, digits=3)
@
In both this study and the earlier simulation study the Bayesian
approach seems to do quite well. However, the limited extent of the
comparisons does not allow for more well-founded statements.
Consult the work of~\citet{riebler2004} for a more thorough
comparison using simulation studies.
<>=
if (compute) { # save computed results
save(list=c("sts.cdc","sts.farrington","acall","res.survstat",
"ten.surv"),
file=CACHEFILE)
tools::resaveRdaFiles(CACHEFILE)
}
@
\section{Discussion and Future Work}
Many extensions and additions are imaginable to improve the package.
For now, the package is intended as an academic tool providing a
test-bench for integrating new surveillance algorithms. Because all
algorithms are implemented in R, computational speed has not been a priority.
In particular, the current implementation of the Farrington procedure is
rather slow and would benefit from optimization, e.g.\ by rewriting
fragments in C.
One important improvement would be to provide more involved mechanisms
for the simulation of epidemics. In particular it would be interesting
to include multi-day outbreaks originating from single-source
exposure, but with delay due to varying incubation
time~\citep{hutwagner2005} or SEIR-like
epidemics~\citep{andersson2000}. However, defining what is meant by a
correct outbreak identification, especially in the case of overlapping
outbreaks, creates new challenges which have to be met.
\section{Acknowledgements}
We are grateful to K.\ Stark and D.\ Altmann, RKI, Germany, for
discussions and information on the surveillance methods used by the
RKI. Our thanks to C.\ Lang, University of Munich, for his work on the
\textsf{R} implementation and M.\ Kobl, T.\ Schuster and M.\ Rossman,
University of Munich, for their initial work on gathering the outbreak
data from SurvStat@RKI. The research was conducted with financial
support from the Collaborative Research Centre SFB 386 funded by the
German research foundation (DFG).
\bibliography{references}
\end{document}