In the introductory mashr
vignettes we assumed that the
data were small enough that it was convenient to read them all in and do
all the analyses on the same data.
In larger applications, particularly eQTL studies, it can be more convenient to do different parts of the analyses on subsets of the tests. Specifically, if you have millions of tests in dozens of conditions, it might be helpful to consider subsets of these millions of tests at any one time. Here we illustrate this idea.
Our suggested workflow is to extract (at least) two subsets of tests from your complete data set:
Results from a subset of “strong” tests corresponding to stronger effects in your study. For example, these tests might have been identified by taking the “top” eQTL in each gene based on univariate test results, or by some other approach such as a simple meta-analysis.
Results from a random subset of all tests. It is
important that these be an unbiased representation of all the tests you
are considering, including null and non-null tests, because
mashr
uses these tests to learn about the amount of signal
in the data, and to “correct” estimates for the fact that many tests are
null (analagous to a kind of multiple testing correction.)
We will call the data from these two sets of tests
strong
and random
respectively.
To give some sense of the potential appropriate sizes of these
datasets: in our eQTL application in Urbut et al,
the strong
data contained about 16k tests (the top eQTL per
gene), and for the random
data we used 20k
randomly-selected tests. (If you suspect true effects are very sparse
then you might want to increase the size of the random subset, say to
200k).
The basic analysis strategy is now:
Learn correlation structure among null tests using
random
test.
Learn data-driven covariance matrices using strong
tests.
Fit the mashr model to the random
tests, to learn
the mixture weights on all the different covariance matrices and scaling
coefficients.
Compute posterior summaries on the strong
tests,
using the model fit from step 2. (At this stage you could actually
compute posterior summaries for any sets of tests you like. For example
you could read in all your tests in small batches and compute posterior
summaries in batches. But for illustration we will just do it on the
strong
tests.)
First we simulate some data to illustrate the ideas. To make this
convenient to run we simulate a small data. And we identify the strong
hits using mash_1by1
. But in practice you may want to use
methods outside of R to extract the matrices of data corresponding to
strong and random tests, and then read them in as you need them. For
example, see here
for scripts we use for processing fastQTL output.
library(ashr)
library(mashr)
set.seed(1)
= simple_sims(10000,5,1) # simulates data on 40k tests
simdata
# identify a subset of strong tests
.1by1 = mash_1by1(mash_set_data(simdata$Bhat,simdata$Shat))
m= get_significant_results(m.1by1,0.05)
strong.subset
# identify a random subset of 5000 tests
= sample(1:nrow(simdata$Bhat),5000) random.subset
We estimate the correlation structure in the null tests from the
random
data (not the strong
data because they
will not necessarily contain any null tests).
To do this we set up a temporary data object data.temp
from the random tests and use
estimate_null_correlation_simple
as in this vignette.
= mash_set_data(simdata$Bhat[random.subset,],simdata$Shat[random.subset,])
data.temp = estimate_null_correlation_simple(data.temp)
Vhat rm(data.temp)
Now we can set up our main data objects with this correlation structure in place:
= mash_set_data(simdata$Bhat[random.subset,],simdata$Shat[random.subset,],V=Vhat)
data.random = mash_set_data(simdata$Bhat[strong.subset,],simdata$Shat[strong.subset,], V=Vhat) data.strong
Now we use the strong tests to set up data-driven covariances.
= cov_pca(data.strong,5)
U.pca = cov_ed(data.strong, U.pca) U.ed
Now we fit mash to the random tests using both data-driven and
canonical covariances. (Remember the Crucial Rule! We have to fit using
a random set of tests, and not a dataset that is enriched for strong
tests.) The outputlevel=1
option means that it will not
compute posterior summaries for these tests (which saves time).
= cov_canonical(data.random)
U.c = mash(data.random, Ulist = c(U.ed,U.c), outputlevel = 1) m
# - Computing 5000 x 241 likelihood matrix.
# - Likelihood calculations took 0.17 seconds.
# - Fitting model with 241 mixture components.
# - Model fitting took 3.49 seconds.
Now we can compute posterior summaries etc for any subset of tests
using the above mash fit. Here we do this for the strong
tests. We do this using the same mash
function as above,
but we specify to use the fit from the previous run of mash by
specifying
g=get_fitted_g(m), fixg=TRUE
. (In mash
the
parameter g
is used to denote the mixture model which we
learned above.)
= mash(data.strong, g=get_fitted_g(m), fixg=TRUE) m2
# - Computing 1428 x 241 likelihood matrix.
# - Likelihood calculations took 0.04 seconds.
# - Computing posterior matrices.
# - Computation allocated took 0.01 seconds.
head(get_lfsr(m2))
# condition_1 condition_2 condition_3 condition_4 condition_5
# effect_13096 9.815945e-06 5.056808e-01 4.229107e-01 3.944224e-01 6.055467e-01
# effect_29826 6.571537e-05 6.637417e-01 5.837333e-01 6.358124e-01 5.768253e-01
# effect_14042 6.994353e-02 6.495479e-03 2.483348e-03 5.562270e-02 6.836385e-06
# effect_12524 1.119195e-01 4.107543e-01 2.985565e-02 2.579205e-05 1.001824e-01
# effect_15456 4.913414e-05 4.380260e-01 2.733414e-01 5.166882e-01 3.610422e-01
# effect_35844 2.623221e-09 4.570036e-09 1.864892e-07 1.013875e-09 4.094924e-11