To cite MALDIrppa in publications, please use:
Palarea-Albaladejo J., McLean K., Wright F. and Smith (2018). MALDIrppa: quality control and robust analysis for mass spectrometry data. Bioinformatics 34(3):522–523. (http://dx.doi.org/10.1093/bioinformatics/btx628).
Note that the following applies to MALDIrppa version up to 1.0.5. This version is no longer compatible with recent changes in CRAN policies, but if required it can be downloaded and installed manually on your system from the archive of old sources of the package at https://CRAN.R-project.org/package=MALDIrppa:
install.packages("https://cran.r-project.org/src/contrib/Archive/MALDIrppa/MALDIrppa_1.0.5-1.tar.gz",
repo=NULL, type="source")
Mass spectrometry is actively being used to discover proteomic
patterns in complex mixtures of peptides derived from biological
samples. Reproducibility is however a significant challenge in MALDI
protein profiling. Approaches to improve the analytical performance of
MALDI protein profiling include extensive pre-fractionation strategies,
immunocapture, pre-structured target surfaces, standardized matrix
(co)crystallization, improved instrument components, internal standard
peptides, quality-control samples, among others (Albrethsen 2007). Mass spectrometry (MS) data
pre-processing algorithms play a crucial role in rendering the
subsequent data analysis more robust and accurate. The package
MALDIrppa
contributes a number of procedures for robust
pre-processing and analysis, along with a number of functions to
facilitate common data management operations. It is thought to work in
conjunction with the MALDIquant
package (Gibb and Strimmer 2012), using object classes
and methods from this latter. These notes provide basic guidelines for
users to run a complete robust MALDI mass spectra pre-processing
pipeline using their own data with small adaptations. Further specific
details and examples on the full functionality of the
MALDIrppa
package can be found in the associated help
pages.
The data set spectra
used as example here is an
anonymised collection of mass spectra trimmed to the [2500, 13000]
m/z interval. Their m/z resolution is 30 times lower
than the original (achieved using the redResolution
function) so as to make it manageable for sharing and illustration
purposes. It consists of 4 technical replicates of 5 biological
replicates from 20 bacterial isolates. The mass spectra were originally
acquired over the [2000, 20000] Daltons mass range using the Bruker
Daltonics Ultraflex II mass spectrometer. The data set type
contains a simplified version of the associated metadata.
The following sections go through the stages to transform raw mass
spectra into formats suitable for downstream data analysis and modelling
for biological interpretation. These involve data transformation,
smoothing, baseline correction, normalisation, peak detection and peak
alignment and binning. The features of the signals depend on
technological progress and characteristics of the species under study.
Unfortunately there is not a standard methodology for MS data
pre-processing best working in all cases. This document illustrates a
generic pipeline through MALDIrppa
in combination with
MALDIquant
based on robust statistical methods which
contributes to lessen the intrinsic reproducibility issues with MS data
and facilitates obtaining reliable biological conclusions (Mclean et al. 2017). However, the finest
pre-processing in a particular case might require further discussion of
choices for parameter settings and methods.
The data are included in MALDIrppa
in the own R’s
RData
binary file format. They can be loaded into the
workspace using the data
function.
Alternatively, raw data stored in text files can be easily imported
into R and given the convenient MassSpectrum
class format
using the importSpectra
function included in
MALDIrppa
. For importing data from more specialised file
formats we suggest the reader to check out the package
MALDIquantForeign
. The function summarySpectra
computes a table including some basic summary statistics. A look at
these along with some graphical representations are useful to get a
first insight into the data set and identify possible issues. The next
code shows numerical results for the first 10 signals.
## ID No.MZ Min.MZ Max.MZ Min.Int Mean.Int Std.Int Med.Int MAD.Int Max.Int
## 1 160408F21 1857 2500.05 12995.77 19 722.2649 572.9375 664 486.2928 5888
## 2 160408F22 1857 2500.05 12995.77 18 637.1357 531.1853 582 382.5108 6128
## 3 160408F23 1857 2500.05 12995.77 8 542.4976 447.7237 483 401.7846 4713
## 4 160408F24 1857 2500.05 12995.77 34 957.3430 737.8402 903 587.1096 7843
## 5 160408G01 1857 2500.05 12995.77 42 1129.8185 871.5639 1064 652.3440 7860
## 6 160408G02 1857 2500.05 12995.77 24 637.1648 530.9029 595 358.7892 5459
## 7 160408G03 1857 2500.05 12995.77 12 505.9499 442.6567 466 298.0026 5198
## 8 160408G04 1857 2500.05 12995.77 43 984.1179 754.5525 936 530.7708 7333
## 9 160408G05 1857 2500.05 12995.77 27 906.9138 686.9536 854 558.9402 6068
## 10 160408G06 1857 2500.05 12995.77 46 906.9133 689.0228 891 481.8450 7505
Pre-processing methods are not immune to low-quality mass spectra and
inadvertently letting them through can severely distort the results. For
example, when dealing with a number of data sets we observed both
clusters of spurious peaks in some cases and sparse peaks in others
which interfere with the determination of common reference peaks for
peak alignment and binning. Common low-quality mass spectra are
characterised by extremely ripply and indented profiles due to low
signal intensities. The low signal intensity causes poor peak formation
and poorly resolved peaks. Moreover, ion suppression cases show one or a
few abnormally high peaks whereas the other peptide features are pulled
down. The package MALDIrppa
helps to detect exotic signals
by a semi-automatic screening process implemented in the
screenSpectra
function. It is based on robust scale
estimators (Rousseeuw and Croux 1993) of
the derivative mass spectra and median intensities. These components are
given a weight to build an atypicality score (\(A\)) for each signal, which is then
labelled as a suspicious case (failure
) if the score falls
beyond the estimated tolerance limits (see help for
screenSpectra
for options available). The atypicality score
\(A\) is calculated as
\[A = \frac{S^{\lambda}}{(\texttt{med}+1)^{(1-\lambda)/2}},\]
where \(S\) is the robust scale estimator used (either Rousseeuw’s \(Q\) or the well-known median absolute deviance, MAD, estimator are available), \(\texttt{med}\) is the median intensity of the raw signal and \(\lambda\) distributes the weight given to each of these components (\(\lambda = 1/2\) by default).
The screenSpectra
function generates a table
est.table
with individual information for each mass
spectrum. The following code illustrates its use with default settings
and creates a new scSpectra
class object called
sc.results
. The function can take a matrix of metadata
(type
in our case) as argument. This is convenient to
easily filter problematic cases out in both the spectra and metadata
sets for subsequent analysis. Ordinary R methods summary
and plot
can be applied on scSpectra
class
objects to obtain a summary table and graphs of the results.
## (10 first mass spectra)
## ID A score Class
## 1 160408F21 0.1683850 success
## 2 160408F22 0.1716699 success
## 3 160408F23 0.1887859 success
## 4 160408F24 0.1478541 success
## 5 160408G01 0.1479799 success
## 6 160408G02 0.1734508 success
## 7 160408G03 0.1772491 success
## 8 160408G04 0.1534341 success
## 9 160408G05 0.1716697 success
## 10 160408G06 0.1538116 success
##
## ----------------------------
##
## Scale estimator: Q
## Method: adj.boxplot
## Threshold: 1.5
## Limits: [0.1196,0.2942]
## Deriv. order: 1
## Lambda: 0.5
## No. potentially faulty spectra: 19 (4.75 %)
Among other features, the method plot
allows to
visualise the distribution of the atypicality scores
(type = "hist"
), customise the point labels and
interactively display marked spectra for individual visual inspection
(type = "casewise"
). When label = TRUE
it
shows the position index of the mass spectra in the data set. It is
advisable to visually inspect marked mass spectra, particular borderline
cases, and further investigate any pattern. For example, in our
illustrative data set we observe that mass spectra number 25 to 28
correspond to a series of technical replicates within the same
biological replicates (see e.g. type[20:30,]
). Focusing on
some of the most extreme \(A\) scores,
the following figures display mass spectra 28, 29 and 87, which
correspond with low signal intensity, ion suppression and flatline mass
spectra respectively. Flatline mass spectra are produced when the
instrument is unable to collect data that does not meet the minimum
resolution and intensity requirements. These low-quality mass spectra
are clearly identified using the \(A\)
score as seen above.
As an example of the problems caused by faulty, low-quality spectra, the next figure shows the case of two peak profiles originated from the same biological material. It is evident that they are carrying very different information.
For the purpose of this tutorial we move on by updating the original
spectra
and type
objects with the filtered
collection of mass spectra and associated metadata provided by
screenSpectra
through the fspectra
and
fmeta
elements of the output.
A number of parameters are required to be set for the different pre-processing stages. The functions include safe default values, but parameter settings throughout pre-processing stages interact in complex ways. Using simulation, we customised some of the key parameter so as to obtain optimal combinations in terms of sensibility and false discovery rate in peak profiles. They have been slightly modified to accommodate the characteristics of the reduced example data set used here. It is convenient to write an initial section in the pre-processing pipeline script to set the values for the parameters which will be used later on. This facilitates testing results from different settings.
Signal denoising or smoothing is conducted in MALDIrppa
by undecimated discrete wavelet transform (UDWT) (Coombes et al. 2005) using the
wavSmooting
function. Wavelets adapt well to changes in
peak shape and the UDWT produces the same results if the start of the
signal is shifted by a few time points (shift-invariance). This prevents
from undesirable artifacts into the signal near either end of the
spectrum due to large changes in wavelet coefficients as a consequence
of small shifts in location. In this case, smoothing is applied on
square root-transformed (sqrt
function in R) signals using
transfIntensity
for variance stabilisation. Note that in
fact this function allows to apply any sensible user-defined
transformation on the signal intensities (see MALDIrppa
documentation for an example). For baseline correction, the
statistics-sensitive non-linear iterative peak-clipping (SNIP) algorithm
as implemented in MALDIquant
generates positive intensities
and provides better results with high matrix effects (pronounced mode).
Regarding signal normalisation, MALDIquant
offers a number
of sensible methods through the calibrateIntensity
function. In particular, probabilistic quotient normalisation (PQN)
provides a robust method for scaling spectra, commonly originated from
different dilutions, into a common overall concentration (Dieterie et al. 2006).
spectra <- transfIntensity(spectra, fun = sqrt)
spectra <- wavSmoothing(spectra, method = "Wavelet", thresh.scale = thScale)
spectra <- removeBaseline(spectra, method = "SNIP", iterations = ite)
spectra <- calibrateIntensity(spectra, method = "PQN")
Note that in this illustrative pipeline we go on overwriting the
MassSpectrum
object containing the list of signals as we
apply the different operations to produce a clean and corrected
collection of signals. Given the usual large size of the data set, this
is convenient to save computer memory and speed up the process. However
we do not then keep individual backup copies of the spectra at each
pre-processing stage. If this is required, simply using the ordinary
save(objectname, filename)
line in R after each stage will
do. In any case, it is recommended to delete previous versions of the
data from R’s workspace as soon as they are saved in order to prevent
from memory usage problems.
Using detectPeaks
in MALDIquant
peak
profiles are produced by searching for and extracting local maximum
intensities along the corrected MS signals. A noise function is
estimated (see estimateNoise
) and signal-to-noise ratio
(SNR) and mass window size values are set in order to determine peaks
corresponding to genuine peptide-induced signals. In particular, peaks
are extracted if their intensity is greater than SNR times the estimated
noise within a mass \(\pm\) half window
size interval. Extracted peak lists still require alignment and binning
across profiles. This is necessary to correct for slight mass drifts and
arrange all peak profiles onto a common range of masses. A number of
landmark peaks that are matched a minimum number of times across
profiles serve as reference points for alignment, following procedures
implemented in MALDIquant
(the minFreq
argument sets this as a proportion, 80% frequency in our example).
Binning is then used to equalise masses of peaks found within a
tolerable relative mass deviation range and, hence, assumed to be
representatives of the same peptide (the tolerance
argument
sets this as a proportion, 0.3% deviance in our case). The
alignPeaks
function in MALDIrppa
is a wrapper
for a number of MALDIquant
functions involved in this stage
that simplifies the workflow into a single line of code. Moreover,
alignPeaks
implements an additional binning round which we
found useful to deal with misalignment issues that may be still present
after putting the peak profiles through default MALDIquant
alignment and binning procedures. We faced this situation when working
with mass spectra over a particularly high-resolution mass range.
peaks <- detectPeaks(spectra, SNR = SigNoi, halfWindowSize = hws)
peaks <- alignPeaks(peaks, minFreq = 0.8, tolerance = tol)
It is important to keep exploring the data throughout the
pre-processing workflow in order to identify potential issues or
unexpected results. Analogously to summarySpectra
used
above, summaryPeaks
produces a table of summary statistics
from the peak profiles. The following line of code shows for example
details for the first 10 peak profiles.
## ID No.MZ Min.MZ Max.MZ Min.Int Mean.Int Std.Int Med.Int MAD.Int Max.Int Min.SNR Mean.SNR Std.SNR Med.SNR MAD.SNR Max.SNR
## 1 1 17 2834.517 10498.76 1e-04 3e-04 2e-04 4e-04 2e-04 7e-04 2.0658 5.2588 2.5826 5.7045 3.0740 11.0548
## 2 2 19 2834.517 10498.76 1e-04 4e-04 2e-04 3e-04 2e-04 8e-04 2.1924 6.1643 3.1813 5.2446 3.4485 13.6566
## 3 3 16 2834.517 10498.76 2e-04 4e-04 2e-04 4e-04 2e-04 8e-04 2.2235 5.2367 2.4604 5.8425 3.1136 10.8363
## 4 4 19 2834.517 12185.69 1e-04 3e-04 2e-04 3e-04 2e-04 7e-04 2.0122 5.1029 2.7053 4.0903 3.0105 11.5476
## 5 5 17 2834.517 10498.76 1e-04 3e-04 2e-04 3e-04 2e-04 7e-04 2.4147 5.7237 2.7602 4.9986 3.7420 11.1606
## 6 6 22 2834.517 12185.69 1e-04 3e-04 2e-04 2e-04 1e-04 7e-04 2.1804 5.7572 3.5148 4.0762 2.1713 13.1585
## 7 7 18 2834.517 12185.69 1e-04 4e-04 2e-04 3e-04 2e-04 8e-04 2.2183 6.2045 3.3271 5.1182 3.2585 13.2845
## 8 8 18 2834.517 12185.69 1e-04 3e-04 2e-04 3e-04 2e-04 7e-04 2.2711 5.6168 2.9050 4.7738 2.6931 11.6444
## 9 9 20 2834.517 12185.69 1e-04 3e-04 2e-04 2e-04 2e-04 6e-04 2.0212 4.9953 2.7368 3.8578 2.5544 10.7881
## 10 10 26 2834.517 12185.69 1e-04 3e-04 2e-04 2e-04 1e-04 7e-04 2.0199 5.5798 3.6450 4.4211 2.5821 14.6057
The function countPeaks
directly counts the number of
peaks in each profile. Either an obvious excess or scarcity of peaks in
a profile may deserve a closer look. The following code produces the
list of counts and plots them for overall comparison using the profile
number as label.
At this point we have essentially concluded data pre-processing and it is then convenient to save a backup of the peak profiles and associate metadata (note that in this example the file is created in a temporary location, the user should choose their preferred location).
save(peaks, file = file.path(tempdir(), "peaks.pp.Rdata"))
save(type, file = file.path(tempdir(), "type.pp.Rdata"))
The peakPatterns
function is useful to visualise and
explore the resulting peak patterns. It displays peak presence (coloured
cells) and absence (blank cells). The barplot on the top reflects the
frequency of a peak across replicates. The function can be applied
either on a list of MassPeaks
objects, like our
peaks
, or on an intensity matrix. Any remaining data
artifacts or hints about peak features differentiating e.g. strains
should be evident here. This display can also be useful to check for
homogeneity of replicates from an isolate by simply subsetting the
replicates within that same isolate (see ?peakPatterns
for
more details and examples).
Note that due to process disturbances, chemical contamination or
human errors for example, some peak profiles may exhibit patterns beyond
anticipated biological variability at a level at which certain
homogeneity is expected, say at biological and technical replication
level. This is not related to low-quality, data acquisition problems as
before, the signals are correct, it is instead a case of outlying peak
profiles characterised for its large distance from the main mass of data
points. They usually include relevant information about certain
artifacts or substructures in the data, although in practice it is not
always straightforward to find the underlying reasons for their
deviating behaviour. As it is well-known in general data modelling,
outliers can heavily influence e.g. correlations, fitting in regression
models or similarity measures used by classification algorithms.
Outliers in our context are often sought after on a peptide-wise basis
(Erhard and Zimmer 2012), but we adhere to
the idea that peaks of outlying profiles do not necessarily stand out
individually. They can simply show an atypical relative pattern with
combinations of peaks that occur very rarely. Working within a robust
statistics framework we consider the multidimensional and
inter-dependent structure of the data for outlier detection (Rocke and Woodruff 1996; Maronna, Martin, and Yohai
2006). In particular, we adapt the multivariate outlier detection
method proposed by Filzmoser, Garrett, and
Reimann (2005) (adaptive outlier detection, AOD) which allows the
boundaries for a peak profile to be considered as potential outlier to
be adjusted to the number of replicates involved. By using a
multidimensional scaling (MDS) coordinate projections of the data, the
approach can be applied at different levels of replication and on varied
data formats, e.g. either peak intensities or peak presence/absence
binary data. This customised procedure is implemented in the
detectOutliers
function. It generates a logical vector
flagging out potential outlying replicates at a given level of
aggregation, e.g. at isolate level as set using the argument
by
in the following line of code. This can be used to
filter outliers out if that is judged adequate. The argument
binary
facilitates the use of the method on either peak
presence/absence profiles (binary = TRUE
) or raw peak
intensity profiles (binary = FALSE
).
The following figure illustrates the result from a collection of 20 replicates from the same bacterial isolate. Note that the method was applied on peak presence/absence profiles. Quantile ellipsoids based on the robust Mahalanobis distance determine outlierness thresholds. The graph displays ellipsoids at increasing quantiles, including the adjusted quantile ellipsoid produced by the AOD method. The profiles, projected onto a 2-dimensional space by MDS, falling beyond the adjusted quantile boundaries were marked as potential outliers. In this case there is evidence of a set of 4 technical replicates within a biological replicate that were probably contaminated samples (distant points located on the right-hand side of the graph).
The identification of outliers can be sometimes a primary purpose of
data analysis, as they might reveal hidden structures in the data of
biological interest. Other times outliers are a nuisance resulting from
errors or contamination, and it might be then convenient to discard or
downweight them prior to the actual data modelling.
MALDIrppa
provides a exploratory tool to inform of samples
that are considerably different from the majority. The final decision
about their biological meaning and how to deal with them will depend on
the assessment of the researcher. To simplify the exposition we here
proceed by discarding outlying peak profiles.
peaks.clean <- peaks[out[,2] == FALSE] # Discard outlying peak profiles
type.clean <- type[out[,2] == FALSE,] # and corresponding metadata
The inherent nature of the data implies that pre-processed data may
still contain peaks only reflecting noise and probably irrelevant
signals. MALDIquant
facilitates further cleaning by
eliminating rare peaks across replicates using the
filterPeaks
function. This is done by setting a minimum
frequency (minFreq
argument) for a peak at a desired level
of aggregation (isolate level in our case). Note that in
MALDIquant
the argument labels
is used to set
the grouping factor. The following line discards peaks occurring in less
than 25% replicates within isolate.
We can compare these filtered peak profiles with the more noisy raw
peaks profiles above using peakPatterns
.
Finally replicates are typically merged to obtain a single composite
representative peak profile by isolate. In accordance with our robust
approach we propose to do this by computing the median peak profile
across replicates within isolate. This contributes to downgrade the
influence of extreme peak intensities on the pooled estimates. The
associated metadata should be also aggregated in the same way. Here we
use the aggregate
function available in R base, but other
alternatives are available through several R packages. Note that we
specify FUN = length
simply as a trick, as we are
not really interested in any summary of the metadata. Moreover, we only
keep the information about the isolate, which is in the first column of
the resulting table (note that in this illustrative example
type.clean.fm
becomes a vector after this).
peaks.clean.fm <- mergeMassPeaks(peaks.clean.f, labels = type.clean$Isolate, method = "median")
type.clean.fm <- aggregate(type.clean, list(Isolate = type.clean$Isolate), FUN = length)[,1]
Statistical and machine learning methods typically work on data
matrices. The intensityMatrix
function in
MALDIquant
converts a list of MassPeaks
objects into a matrix. The replicates or composite profiles are arranged
by rows, and the columns correspond with the list of transformed
intensities along the common \(m/z\)
range determined after peak alignment and binning. When a peak is not
present at a certain mass position the cell is given a NA
value by default. The final intensity matrix is obtained as follows.
The intensity matrix and associated metadata can be stored in
different formats to facilitate its use as input in, for example,
specialised bioinformatics software packages. The following code uses
simple functions to save them as CSV text files and also in the popular
NEXUS format, which includes options to save either peak intensities or
binary patterns and to add peak weights among others (see
?writeIntensity
help files for more details). Note that in
this example the file is saved to a temporary location, the user should
choose their preferred destination. The argument labels
can
be used to add labels to the pre-processed samples. For this example
data set, we use the spectra IDs stored in type.clean.fm
,
which became a vector containing only the IDs in a previous stage. Thus,
using the $
operator for matrices is not required to select
the corresponding ID column here.
writeMetadata(type.clean.fm, file = file.path(tempdir(),"type.clean.fm"),
format = "csv") # save metadata
writeIntensity(int.clean.fm, file = file.path(tempdir(),"int.clean.fm"),
format = "csv", labels = type.clean.fm)
writeIntensity(int.clean.fm, file = file.path(tempdir(),"int.clean.fm"),
format = "NEXUS", labels = type.clean.fm)
The argument binary
can be used to export data in binary
(peak presence/absence) format. Note that this is internally turned
TRUE
whenever the NEXUS or FASTA format is chosen. As an
example, the next figure shows a phylogenetic network contructed using
the NeighborNet algorithm implemented in the SplitsTree software package
(https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/splitstree/)
directly from the NEXUS file generated above by
MALDIrppa
.