ondisc_matrix class
This tutorial shows how to use ondisc_matrix, the core class implemented by ondisc. An ondisc_matrix is an R object that represents an expression matrix stored on-disk rather than in-memory. We cover the topics of initialization, querying basic information, subsetting, and pulling submatrices into memory. We begin by loading the ondisc package.
library(ondisc)
ondisc ships with several example datasets, stored in the “extdata” subdirectory of the package.
raw_data_dir <- system.file("extdata", package = "ondisc")
list.files(raw_data_dir)
#> [1] "cell_barcodes.tsv" "gene_expression.mtx" "genes.tsv"
#> [4] "guides.tsv" "perturbation.mtx"
The files “gene_expression.mtx”, “cell_barcodes.tsv,” and “genes.tsv” together define a gene-by-cell expression matrix. We save the full paths to these files in the variables mtx_fp, barcodes_fp, and features_fp.
mtx_fp <- paste0(raw_data_dir, "/gene_expression.mtx")
barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv")
features_fp <- paste0(raw_data_dir, "/genes.tsv")
An ondisc_matrix consists of two parts: an HDF5 (i.e., .h5) file that stores the expression data on-disk in a novel format, and an in-memory object that allows us to interact with the expression data from within R. The easiest way to initialize an ondisc_matrix is by calling the function create_ondisc_matrix_from_mtx. We pass to this function (i) a file path to the .mtx file storing the expression data, (ii) a file path to the .tsv file storing the cell barcodes, and (iii) a file path to the .tsv file storing the feature IDs and human-readable feature names. We optionally can specify the directory in which to store the initialized .h5 file, which in this tutorial we will take to be the temporary directory.
temp_dir <- tempdir()
exp_mat_list <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp,
                                              barcodes_fp = barcodes_fp,
                                              features_fp = features_fp,
                                              on_disk_dir = temp_dir)
#> Writing CSC data.
#> Writing CSR data.
By default, create_ondisc_matrix_from_mtx returns a list of three elements: (i) an ondisc_matrix representing the expression data, (ii) a cell-wise covariate matrix, and (iii) a feature-wise covariate matrix. The exact cell-wise and feature-wise covariate matrices that are computed depend on the inputs to create_ondisc_matrix_from_mtx (see documentation via ?create_ondisc_matrix_from_mtx for full details). The advantage of computing the cell-wise and feature-wise covariates at initialization is that it obviates the need to load the entire dataset into memory a second time.
expression_mat <- exp_mat_list$ondisc_matrix
head(expression_mat)
#> Showing 5 of 300 featuress and 6 of 900 cells:
#> Loading required package: Matrix
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 3 0 0 0 0 5
#> [2,] 0 2 0 0 0 0
#> [3,] 0 8 0 0 0 0
#> [4,] 0 0 0 0 0 0
#> [5,] 0 0 0 0 0 0
cell_covariates <- exp_mat_list$cell_covariates
head(cell_covariates)
#> n_nonzero n_umis p_mito
#> 1 43 214 0.04672897
#> 2 26 169 0.00000000
#> 3 22 116 0.05172414
#> 4 37 258 0.08139535
#> 5 36 224 0.08035714
#> 6 31 147 0.07482993
feature_covariates <- exp_mat_list$feature_covariates
head(feature_covariates)
#> mean_expression coef_of_variation n_nonzero
#> 1 0.7577778 2.981871 114
#> 2 0.5977778 3.302883 96
#> 3 0.5788889 3.539932 85
#> 4 0.6533333 3.341677 91
#> 5 0.5522222 3.578487 82
#> 6 0.5455556 3.541223 84
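To make the covariate columns above concrete, the following base R sketch computes analogous summaries on a small in-memory matrix. The definitions used here are assumptions inferred from the column names, not taken from the package source: n_nonzero is the count of nonzero entries, n_umis is the column sum, mean_expression is the row mean, and coef_of_variation is the (sample) standard deviation divided by the mean. The matrix toy and the names cell_covariates_toy and feature_covariates_toy are hypothetical. p_mito is omitted because it depends on which features are mitochondrial genes.

```r
# Toy 3-feature x 4-cell count matrix (hypothetical data).
toy <- matrix(c(3, 0, 0, 5,
                0, 2, 0, 0,
                1, 8, 0, 4),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("gene_", 1:3), paste0("cell_", 1:4)))

# Cell-wise (column-wise) summaries; definitions are assumed, see above.
cell_covariates_toy <- data.frame(
  n_nonzero = colSums(toy != 0),
  n_umis = colSums(toy)
)

# Feature-wise (row-wise) summaries; definitions are assumed, see above.
feature_means <- rowMeans(toy)
feature_covariates_toy <- data.frame(
  mean_expression = feature_means,
  coef_of_variation = apply(toy, 1, sd) / feature_means,
  n_nonzero = rowSums(toy != 0)
)

cell_covariates_toy
feature_covariates_toy
```

On a real dataset these summaries would require a full pass over the matrix, which is why obtaining them during initialization is convenient.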
The initialized HDF5 file is named ondisc_matrix_1.h5 and is located in the temporary directory.
"ondisc_matrix_1.h5" %in% list.files(temp_dir)
#> [1] TRUE
A strength of create_ondisc_matrix_from_mtx is that it does not assume that the entire expression matrix fits into memory. The optional argument n_lines_per_chunk can be used to specify the number of lines to read from the .mtx file at a time. Additionally, create_ondisc_matrix_from_mtx is fast: the novel algorithm that underlies this function is highly efficient and implemented in C++ for maximum speed. Typically, create_ondisc_matrix_from_mtx takes about 4-8 minutes per GB to run. Finally, for a given dataset, create_ondisc_matrix_from_mtx only needs to be run once, even after closing and opening new R sessions.
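The idea of processing an .mtx file a fixed number of lines at a time can be sketched in base R. This is only an illustration of chunked reading, not the C++ algorithm used by the package: it writes a tiny, hypothetical Matrix Market file, reads it in chunks, and accumulates per-cell totals without ever holding all entries in memory. The variable n_lines_per_chunk here is a local variable that borrows the name of the package argument; mtx_toy_fp and n_umis are likewise illustrative names.

```r
# Write a tiny Matrix Market coordinate file (hypothetical data):
# 3 features, 4 cells, 5 nonzero entries.
mtx_toy_fp <- tempfile(fileext = ".mtx")
writeLines(c("%%MatrixMarket matrix coordinate integer general",
             "3 4 5",
             "1 1 3",
             "3 1 1",
             "2 2 2",
             "3 2 8",
             "1 4 5"),
           mtx_toy_fp)

n_lines_per_chunk <- 2  # deliberately tiny so that several chunks are needed
con <- file(mtx_toy_fp, open = "r")

# Skip comment lines, then parse the size line: n_features n_cells n_entries.
header <- readLines(con, n = 1)
while (startsWith(header, "%")) header <- readLines(con, n = 1)
dims <- as.integer(strsplit(header, " ")[[1]])

n_umis <- numeric(dims[2])  # running per-cell totals
repeat {
  chunk <- readLines(con, n = n_lines_per_chunk)
  if (length(chunk) == 0) break  # end of file
  entries <- read.table(text = chunk, col.names = c("feature", "cell", "count"))
  # Add this chunk's counts to the running totals, cell by cell.
  chunk_sums <- tapply(entries$count, entries$cell, sum)
  idx <- as.integer(names(chunk_sums))
  n_umis[idx] <- n_umis[idx] + as.vector(chunk_sums)
}
close(con)

n_umis
```

Only one chunk of lines is held in memory at any time, so peak memory use is governed by the chunk size rather than by the size of the file.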
We can use the functions get_feature_ids, get_feature_names, and get_cell_barcodes to obtain the feature IDs, feature names (if applicable), and cell barcodes, respectively, of an ondisc_matrix.
feature_ids <- get_feature_ids(expression_mat)
feature_names <- get_feature_names(expression_mat)
cell_barcodes <- get_cell_barcodes(expression_mat)

head(feature_ids)
#> [1] "ENSG00000198060" "ENSG00000237832" "ENSG00000267543" "ENSG00000103460"
#> [5] "ENSG00000229637" "ENSG00000174990"
head(feature_names)
#> [1] "MARCH5" "AL138808.1" "AC015802.3" "TOX3" "PRAC2"
#> [6] "CA5A"
head(cell_barcodes)
#> [1] "GCTTTCGTCTAGACCA-1" "ACGGTCGTCGTTAGAC-1" "TTTACGTTCACCTCGT-1"
#> [4] "TGGATCATCCTTCAGC-1" "ACAGGGAAGACGCCCT-1" "ACCTACCAGTGTTCCA-1"
Additionally, we can use dim, nrow, and ncol to obtain the dimension, number of rows (i.e., number of features), and number of columns (i.e., number of cells) of an ondisc_matrix.
dim(expression_mat)
#> [1] 300 900
nrow(expression_mat)
#> [1] 300
ncol(expression_mat)
#> [1] 900
We can subset an ondisc_matrix to obtain a new ondisc_matrix that is a submatrix of the original. To subset an ondisc_matrix, apply the [ operator and pass a numeric, logical, or character vector indicating the cells or features to keep. Character vectors are assumed to refer to feature IDs (for rows) and cell barcodes (for columns).
# numeric vector examples
# keep genes 100-110
x <- expression_mat[100:110,]
# keep all cells except 10 and 20
x <- expression_mat[,-c(10,20)]
# keep genes 50-100 and 200-250 and cells 300-500
x <- expression_mat[c(50:100, 200:250), 300:500]

# character vector examples
# keep genes ENSG00000107581, ENSG00000286857, and ENSG00000266371
x <- expression_mat[c("ENSG00000107581", "ENSG00000286857", "ENSG00000266371"),]
# keep cells CGTTGGGCATGGCTGC-1 and GTAACCAGTACAGTTC-1
x <- expression_mat[,c("CGTTGGGCATGGCTGC-1", "GTAACCAGTACAGTTC-1")]

# logical vector example
# keep all genes except ENSG00000237832 and ENSG00000229637
x <- expression_mat[!(get_feature_ids(expression_mat) %in%
                        c("ENSG00000237832", "ENSG00000229637")),]
Subsetting an ondisc_matrix leaves the original object unchanged.
expression_mat
#> An ondisc_matrix with 300 features and 900 cells.
This important property, called object persistence, makes programming with ondisc_matrices intuitive. The underlying HDF5 file is not copied upon subset; instead, information is shared across ondisc_matrix objects, making subsets fast.
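One way to picture subsetting without copying is a lightweight "view" that stores only a file path and the indices it exposes. The base R sketch below is a hypothetical illustration of that idea and is not how ondisc_matrix is implemented: an .rds file stands in for the HDF5 file, and new_view, subset_view, and pull_view are made-up helper names.

```r
# A toy "on-disk" matrix: the data live in an .rds file (a stand-in for the HDF5 file).
backing_fp <- tempfile(fileext = ".rds")
saveRDS(matrix(1:12, nrow = 3), backing_fp)

# A view stores only the file path plus the row and column indices it exposes.
new_view <- function(fp, rows, cols) list(fp = fp, rows = rows, cols = cols)

# Subsetting composes indices; the backing file is neither read nor copied.
subset_view <- function(v, rows = seq_along(v$rows), cols = seq_along(v$cols)) {
  new_view(v$fp, v$rows[rows], v$cols[cols])
}

# Pulling into memory is the only step that touches the file.
pull_view <- function(v) readRDS(v$fp)[v$rows, v$cols, drop = FALSE]

full <- new_view(backing_fp, rows = 1:3, cols = 1:4)
sub <- subset_view(full, rows = 2:3, cols = c(1, 4))

dim(pull_view(sub))  # 2 2
length(full$rows)    # 3: the original view is unchanged
```

Because a subset only creates a new small list of indices, the cost of subsetting in this sketch is independent of the size of the data on disk, and every view keeps pointing at the same backing file.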
We can pull a submatrix of an ondisc_matrix into memory, allowing us to perform computations on a subset of the data. To pull a submatrix into memory, use the [[ operator, passing a numeric, character, or logical vector indicating the cells or features to access. The data structure that underlies an ondisc_matrix enables fast access to both rows and columns of the matrix.
# numeric vector examples
# pull gene 6
m <- expression_mat[[6,]]
# pull cells 200 - 250
m <- expression_mat[[,200:250]]
# pull genes 50 - 100 and cells 200 - 250
m <- expression_mat[[50:100, 200:250]]

# character vector examples
# pull genes ENSG00000107581 and ENSG00000286857
m <- expression_mat[[c("ENSG00000107581", "ENSG00000286857"),]]
# pull cells CGTTGGGCATGGCTGC-1 and GTAACCAGTACAGTTC-1
m <- expression_mat[[,c("CGTTGGGCATGGCTGC-1", "GTAACCAGTACAGTTC-1")]]

# logical vector examples
# subset the matrix, keeping genes ENSG00000107581, ENSG00000286857, and ENSG00000266371
x <- expression_mat[c("ENSG00000107581", "ENSG00000286857", "ENSG00000266371"),]
# pull all genes except ENSG00000107581
m <- x[[get_feature_ids(x) != "ENSG00000107581",]]
The last example demonstrates that we can pull a submatrix of an ondisc_matrix into memory after having subset the matrix.
One can remember the difference between [ and [[ by recalling R lists: [ is used to subset a list, and [[ is used to access elements stored within a list. Similarly, [ is used to subset an ondisc_matrix, and [[ is used to access a submatrix stored within an ondisc_matrix.
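For readers who want a reminder of the list behavior that the analogy relies on, here is a short base R example (the list lst is made up for illustration):

```r
lst <- list(a = 1:3, b = "text", c = TRUE)

sub_lst <- lst["a"]   # `[` returns a shorter list containing element "a"
elem <- lst[["a"]]    # `[[` returns the element itself

class(sub_lst)  # "list"
class(elem)     # "integer"
```

In the same way, [ on an ondisc_matrix returns another ondisc_matrix, while [[ returns the underlying data as an in-memory matrix.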
Saving and loading an ondisc_matrix
As discussed previously, there are two components to an ondisc_matrix: the HDF5 file stored on-disk, and the R object stored in memory. The latter contains a file path to the former, allowing us to interact with the expression data from within R.
To save an ondisc_matrix, simply call saveRDS on the ondisc_matrix R object to create an .rds file.
saveRDS(object = expression_mat, file = paste0(temp_dir, "/expression_matrix.rds"))
rm(expression_mat)
We then can load the ondisc_matrix by calling readRDS on the .rds file.
expression_mat <- readRDS(paste0(temp_dir, "/expression_matrix.rds"))
We also can use the constructor of the ondisc_matrix class to create an ondisc_matrix from an already-initialized HDF5 file.
h5_file <- paste0(temp_dir, "/ondisc_matrix_1.h5")
expression_mat <- ondisc_matrix(h5_file)