%\VignetteIndexEntry{Examples}
%\VignetteEngine{knitr::knitr}
\documentclass[12pt]{article}
\usepackage[cm]{fullpage}
\usepackage{hyperref}
\usepackage{amssymb,amsmath}
\DeclareMathOperator*{\minimize}{minimize}
\begin{document}
\title{PeakSegDisk usage examples}
\author{Toby Dylan Hocking}
\maketitle
Welcome to PeakSegDisk, an R package for optimal peak
detection in very large count data sequences.
<>=
knitr::opts_chunk$set(
collapse = TRUE,
fig.width = 7,
fig.height = 3,
fig.align = "center",
comment = "#>"
)
@
\section{Related work}
The PeakSeg R packages contain algorithms for inferring optimal
segmentation models subject to the constraint that up changes must be
followed by down changes, and vice versa. This ensures that the model
can be interpreted in terms of peaks (after up changes) and background
(after down changes).
\begin{description}
\item[PeakSegDP] the historically first PeakSeg package,
\url{https://CRAN.R-project.org/package=PeakSegDP} provides a
heuristic quadratic time algorithm for computing models from 1 to S
segments for a single sample. This was the original algorithm
described in our ICML'15 paper,
\url{http://jmlr.org/proceedings/papers/v37/hocking15.html}, but it
is neither fast nor optimal, so in practice we recommend to use our
newer packages below instead.
\item[PeakSegOptimal]
\url{https://CRAN.R-project.org/package=PeakSegOptimal} provides
log-linear time algorithms for computing optimal models with
multiple peaks for a single sample. The algorithms are faster and
more accurate than PeakSegDP. Citation: JMLR'20,
\url{https://jmlr.org/papers/v21/18-843.html}
\item[PeakSegDisk] \url{https://github.com/tdhock/PeakSegDisk}
provides an on-disk implementation of optimal log-linear algorithms
for computing multiple peaks in a single sample. Computes same
models as PeakSegOptimal but works for much larger data sets because
disk is used for storage instead of memory. Citation: JSS'22,
\url{https://www.jstatsoft.org/article/view/v101i10}
\item[PeakSegJoint]
\url{https://CRAN.R-project.org/package=PeakSegJoint} provides a
fast heuristic algorithm for computing models with a single common
peak in $0,...,S$ samples. Citation: PSB'20
\url{http://psb.stanford.edu/psb-online/proceedings/psb20/Hocking.pdf}
\item[PeakSegPipeline]
\url{https://github.com/tdhock/PeakSegPipeline} provides a pipeline
for genome-wide peak calling using the other PeakSeg packages.
\end{description}
The remainder of this vignette is dedicated to an explanation of how
to use PeakSegDisk.
\section{Simulate a noisy integer vector with changes}
The first example we will treat is detecting peaks in a vector of
integer data, with possibly the same values at adjacent
positions. This is an inefficient representation for large genomic
data, but it is the typical output from simulation functions like
\texttt{rpois}:
<<>>=
sim.seg <- function(seg.mean, size.mean=15){
seg.size <- rpois(1, size.mean)
rpois(seg.size, seg.mean)
}
set.seed(1)
seg.mean.vec <- c(1.5, 3.5, 0.5, 4.5, 2.5)
z.list <- lapply(seg.mean.vec, sim.seg)
(z.rep.vec <- unlist(z.list))
@
From the output above it is clear that these simulated data are
integers, with some identical values at adjacent positions.
Below we put these data into a data table in order to plot them along
with the model using ggplot2:
<>=
count.df <- data.frame(
chrom="chrUnknown",
chromStart=0:(length(z.rep.vec)-1),
chromEnd=1:length(z.rep.vec),
count=z.rep.vec)
if(require(ggplot2)){
gg.count <- ggplot()+
xlab("position")+
geom_point(aes(
chromEnd, count),
shape=1,
data=count.df)
gg.count
}
@
The true changepoints in the simulation are shown below.
<<>>=
n.segs <- length(seg.mean.vec)
seg.size.vec <- sapply(z.list, length)
seg.end.vec <- cumsum(seg.size.vec)
change.vec <- seg.end.vec[-n.segs]+0.5
change.df <- data.frame(
changepoint=change.vec)
if(require(ggplot2)){
gg.change <- gg.count+
geom_vline(aes(
xintercept=changepoint, color=model),
data=data.frame(change.df, model="simulation"))+
scale_color_manual(
values=c(
simulation="black",
fitted="green"))
gg.change
}
@
\section{Segment a vector of integers}
Let $z_1, \dots, z_n\in\mathbb Z_+$ be the sequence of $n$
non-negative count data in z.rep.vec, and let $w_1=\cdots=w_n=1$ be
weights which are all 1. The peak detection algorithm computes the
solution to the following optimization problem:
\begin{align*}
\minimize_{
\substack{
\mathbf m\in\mathbb R^n,\ \mathbf s\in\{0, 1\}^n\\
\mathbf c\in\{-1, 0,1\}^{n-1}\\
}
} &\ \
\sum_{i=1}^n w_i \ell(m_i, z_i) + \lambda \sum_{i=1}^{n-1} I(c_i = 1) \\
\text{subject to\ \ } &\ \text{no change: }c_i = 0 \Rightarrow m_i = m_{i+1}\text{ and }s_i=s_{i+1}
\nonumber\\
&\ \text{go up: }c_i = 1 \Rightarrow m_i \leq m_{i+1}\text{ and }(s_i,s_{i+1})=(0,1),
\nonumber\\
&\ \text{go down: } c_i = -1 \Rightarrow m_i \geq m_{i+1}\text{ and }(s_i,s_{i+1})=(1,0),\\
& \ \text{start and end down: } s_1=s_n=0.\nonumber
\end{align*}
where $\ell(m, z)= m - z\log m$ is the Poisson loss. The optimization
variables are $m_i$ for the segment mean, $s_i$ for hidden state, and
$c_i$ for type of changepoint. The penalty term is proportional to the number of
changepoint variables $c_i$ which are equal to 1 (which is the same as
the number of peaks in the resulting model).
To run the peak detection algorithm a numeric penalty parameter
$\lambda\geq 0$ must be specified by the user. The smallest value is 0
which yields max peaks, and the largest value is Inf which yields no
peaks. The code below runs the peak detection algorithm on this count
data vector, using the penalty parameter $\lambda = 10.5$:
<<>>=
fit <- list()
(fit$vec <- PeakSegDisk::PeakSegFPOP_vec(z.rep.vec, 10.5))
@
The model output list above includes \verb|segments|, a data table with
one row for each segment mean, and \verb|loss|, a data table with one
row that reports the model meta-data. Of interest are:
\begin{itemize}
\item \verb|penalty|, the user-provided penalty value,
\item \verb|segments|, the number of segments,
\item \verb|peaks|, the number of peaks (even-numbered segments),
\item \verb|bases|, the number of data points in repetitive form (not run-length encoding),
\item \verb|bedGraph.lines|, the number of data points in run-length encoding form,
\item \verb|mean.pen.cost|, the optimal mean loss plus penalty*peaks,
\item \verb|total.loss|, the optimal total Poisson loss over all data points,
\item \verb|equality.constraints|, the number of adjacent segment
means that are equal in the optimal solution. Note that when this
number is greater than 0, then there are some active equality
constraints, and the optimal model is therefore not feasible for the
strict inequality constraints, which implies that the optimum of the
problem with strict inequality constraints is undefined, i.e. for
any sub-optimal solution that satisfies the strict inequality
constraints, we can find a lower cost solution that satifies the
strict inequality constraints (but is still sub-optimal), by getting
closer to the solution with active equality constraints.
\item \verb|megabytes|, the storage space on disk used by the solver,
\item \verb|seconds|, the amount of time used by the solver,
\item \verb|mean.intervals|, \verb|max.intervals|, statistics over all intervals
(candidate changepoints) computed by the functional pruning
algorithm, useful for analyzing computational complexity, which is
linear in the number of intervals.
\end{itemize}
Note in particular that \verb|PeakSegFPOP_vec| internally uses \verb|rle| to
construct a run-length encoding, which is passed to the solver to save
time/storage. In this case the repetitive integer data vector contains
\Sexpr{fit$vec$loss$bases} elements but the coverage.bedGraph data file
contains only \Sexpr{fit$vec$loss$bedGraph.lines} lines. In real genomic data
sets the difference is typically much larger.
<<>>=
gg.change+
geom_segment(aes(
chromStart+0.5, mean, xend=chromEnd+0.5, yend=mean, color=model),
data=data.frame(fit$vec$segments, model="fitted"))
@
It is clear from the plot above that the first three changepoints are
estimated exactly and the last one is a bit over-estimated.
Also note that a default plot method is defined for these objects:
<<>>=
plot(fit$vec)
@
\section{Segment a data frame}
Another interface that can be used on a data.frame with $n$ rows and
exactly 4 columns (chrom, chromStart, chromEnd, count) is
\verb|PeakSegFPOP_df|. For each row $i\in\{1,\dots, n\}$, let
$z_i\in\mathbb Z_+$ be the non-negative count data (count column), and
let $w_i>0$ be the weight (equal to the number of bases,
chromEnd-chromStart). The optimization problem we solve is the same as
before. Note that this function does not perform run-length encoding
for you:
<<>>=
(fit$df <- PeakSegDisk::PeakSegFPOP_df(count.df, 10.5))
@
Note how \verb|bedGraph.lines| is now the same size as \verb|bases|,
\Sexpr{fit$df$loss$bedGraph.lines}. The time/storage complexity is
log-linear in the number of \verb|bedGraph.lines|, so it is more efficient
to use the run-length encoding. This can be easily done in R:
<<>>=
z.rle.vec <- rle(z.rep.vec)
chromEnd <- cumsum(z.rle.vec$lengths)
rle.df <- data.frame(
chrom="chrUnknown",
chromStart=c(0L, chromEnd[-length(chromEnd)]),
chromEnd,
count=z.rle.vec$values)
if(require(ggplot2)){
gg.rle <- ggplot()+
geom_segment(aes(
chromStart+0.5, count, xend=chromEnd+0.5, yend=count),
data=rle.df)+
geom_point(aes(
chromEnd, count),
shape=1,
data=rle.df)+
geom_vline(aes(
xintercept=changepoint, color=model),
data=data.frame(change.df, model="simulation"))+
scale_color_manual(
values=c(
simulation="black",
fitted="green"))+
xlab("position")
gg.rle
}
@
The plot above shows the run-length encoded data, with a \verb|geom_point|
for the last position in each run, and a \verb|geom_segment| extending left
to the first position. These data can be segmented as above:
<<>>=
(fit$rle <- PeakSegDisk::PeakSegFPOP_df(rle.df, 10.5))
if(require(ggplot2)){
gg.rle+
geom_segment(aes(
chromStart+0.5, mean, xend=chromEnd+0.5, yend=mean, color=model),
data=data.frame(fit$rle$segments, model="fitted"))
}
@
\section{Write the file yourself}
The interfaces discussed in the previous sections are perhaps the most
intuitive for useRs, but they are also the least efficient, so they
are not recommended for large data.
In this section we introduce the most efficient way of using PeakSegDisk, which involves:
\begin{itemize}
\item creating a ``problem'' directory for each segmentation problem
(sample and genome subset),
\item saving the data to \verb|coverage.bedGraph| in that directory,
\item and then running \verb|PeakSegFPOP_dir|.
\end{itemize}
The reason why this method is recommended for large data is because
\verb|PeakSegFPOP_dir| saves its results to the ``problem'' directory. So
if a certain result has already been computed, these result files are
used as a cache, and are read instead of doing computations, which
saves a lot of time.
The file system is used as the interface in
order to support very large data sets with very little memory usage.
To use \verb|PeakSegFPOP_dir| the data should be saved to a
chrXX-start-end/coverage.bedGraph file, where the problem directory
``chrXX-start-end'' should be named using a genome postion string:
\begin{itemize}
\item chrXX is the chromosome (which is irrelevant to the algorithm),
\item start is the 0-based first position of the region to segment (the smallest possible value is 0),
\item end is the 1-based end position (the smallest possible value is 1).
\end{itemize}
<<>>=
data.dir <- file.path(
tempfile(),
with(rle.df, sprintf(
"%s-%d-%d", chrom[1], min(chromStart), max(chromEnd))))
dir.create(data.dir, showWarnings=FALSE, recursive=TRUE)
coverage.bedGraph <- file.path(data.dir, "coverage.bedGraph")
write.table(
rle.df, coverage.bedGraph,
sep="\t", row.names=FALSE, col.names=FALSE)
@
The next step is to run the main solver,
<<>>=
(fit$dir <- PeakSegDisk::PeakSegFPOP_dir(data.dir, 10.5))
@
The
underlying C++ code
creates penalty-specific files such as
\verb|chrXX-start-end/coverage.bedGraph_penalty=0.1_loss.tsv| which
are used to store/cache the results. If the files already exist (and
are consistent) then \verb|PeakSegFPOP_dir| just reads them; otherwise
it runs the dynamic programming C++ code in order to create those
files, which are then read into R.
\section{Computing the model with a given number of peaks}
The \verb|sequentialSearch_dir| function can be used to compute the
optimal model with a certain number of peaks:
<<>>=
if(interactive() && requireNamespace("future"))future::plan("multisession")
(fit$search <- PeakSegDisk::sequentialSearch_dir(data.dir, 2L, verbose=1))
@
The algorithm must evaluate several penalty values to compute the
optimal model with a certain number of peaks. The \verb|others| component of
the model list above shows that
\begin{itemize}
\item the search starts with penalty values 0 and Inf, which result in models
with \Sexpr{fit$search$others[penalty==0, peaks]} and 0 peaks, respectively.
\item the next penalty evaluated is \Sexpr{sprintf("%.2f", fit$search$others[iteration==2, penalty])},
which results in \Sexpr{fit$search$others[iteration==2, peaks]} peaks.
\item the final penalty evaluated is \Sexpr{sprintf("%.2f", fit$search$others[iteration==3, penalty])},
which results in \Sexpr{fit$search$others[iteration==3, peaks]} peaks.
\end{itemize}
At each step (except the first) the new penalties are computed based
on the loss values found in the previous step.
If present with a registered parallel future plan,
the \verb|future.apply| package is used to run the first step
(penalties $0,\infty$) in parallel.
Note how the number of peaks and \verb|total.loss| of this model is
the same as the other models computed above,
<<>>=
lossDF <- function(L)data.frame(L$loss)[, names(fit$dir$loss)]
do.call(rbind, lapply(fit, lossDF))
@
Finally we demonstrate how the filesystem caching is especially useful
for the sequential search. In the code below we ask the sequential
search algorithm to compute the optimal model with four peaks:
<<>>=
four.peaks <- PeakSegDisk::sequentialSearch_dir(data.dir, 4L)
four.peaks$others[, .(iteration, penalty, peaks)]
@
Looking at the output above, we see that the first three iterations of
the sequential search require computing models with 26, 0, 6, 2
peaks. Since all of these have been previously computed (and saved to
disk), the dynamic programming algorithm does not need to be re-run,
and instead the model results are simply read from the files. After
that the dynamic programming is run for the subsequent iterations
4-6. In this particular example the savings in computation time is not
extraordinary, but in real genomic data, this can result in
substantial speed-ups.
\end{document}