The DGEobj package implements an S3 class data object, the DGEobj, that conceptually represents an extension of the capabilities of the RangedSummarizedExperiment (RSE) developed by Martin Morgan et al.. Both the DGEobj and the RSE object capture the initial data for a differential gene expression analysis, namely a counts matrix along with associated gene and sample annotation. Additionally, the DGEobj extends this concept to support capture of various downstream data objects, the intermediate steps in an analysis, and thus capture the entire workflow of an analysis project.
The availability of a collection of results in a specific structured data object has multiple advantages. Sharing DGE results with other colleagues is simplified because the entire analysis is encapsulated and documented within the DGEobj. A recipient of data in this format can examine details of the analysis based on the annotation built into the DGE object. Downstream users of a dataset can retrieve the original data or any intermediate data items the original analyst captured. Importantly, assembling a collection of DGE projects in a structured format enables facile automation of integrative multi-project meta-analyses.
The RSE object can capture as many assays as desired. An “assay” is defined as any matrix with n genes (rows) and m samples (columns). A limitation of the RSE object however is that the RSE only allows 1 instance of row data (typically gene annotation) and 1 instance of column data (sample annotation with one row for every column of data in the assay slot). This limits the RSE in terms of its ability to hold downstream data objects because many of those objects meet the definition of row data also (e.g. DGElist, Fit objects, topTable output). Other types of data (e.g. design matrices, sample QC) meet the definition of column data. Thus, the DGEobj was modeled after the RSE object, but extended to accommodate multiple row and column data types. The DGEobj is thus uniquely suited to capturing the entire workflow of a DGE analysis.
While the DGEobj was designed to support a RNA-Seq workflow, extensibility features enable the definition of other data types. Thus the DGEobj data structure may be easily configured for many proteomic or metabolomic applications, or an data domain where the starting data is a 2D matrix of assays and samples.
Multiple instances of a base type are accommodated by defining data “types” and “items”. Each data type is assigned a base type (e.g. geneData, GRanges, and Fit objects are all “types” of “baseType” = rowData).
Importantly, multiple instances of each type are allowed (exceptions described in Unique Items section) as long as each instance of a type is named uniquely. Each instance of a “type” is described as an “item” and each item must have a user-defined item name. The item name must be unique within a DGEobj.
The DGEobj package implements an S3 class data object, the DGEobj, that conceptually represents an extension of the capabilities of the RangedSummarizedExperiment (RSE) originally developed by Martin Morgan et al. Both the DGEobj and the RSE object capture the initial data for a differential gene expression analysis, namely a counts matrix along with associated gene and sample annotation. Additionally, the DGEobj extends this concept to support capture of various downstream data objects, the intermediate steps in an analysis, and thus capture the entire workflow of an analysis project.
The availability of a collection of results in a specific structured data object has multiple advantages. Sharing DGE results with other colleagues is simplified because the entire analysis is encapsulated and documented within the DGEobj. A recipient of data in this format can examine details of the analysis based on the annotation built into the DGE object. Downstream users of a dataset can drill back to the original data or tap any intermediate data items the original analyst captured. Importantly, assembling a collection of DGE projects in a structured format enables facile automation of integrative multi-project metanalyses.
The RSE object, which motivated building the DGEobj, can capture as many assays as desired. An “assay” is defined as any matrix with n genes (rows) and m samples (columns). A limitation of the RSE object however is that the RSE only allows 1 instance of row data (typically gene annotation) and 1 instance of column data (sample annotation with one row for every column of data in the assay slot). This limits the RSE in terms of its ability to hold downstream data objects because many of those objects meet the definition of row data also (e.g. DGElist, Fit objects, topTable output). Other types of data (e.g. design matrices, sample QC) meet the definition of column data. Thus, the DGEobj was modeled after the RSE object, but extended to accommodate multiple row and column data types. The DGEobj is thus uniquely suited to capturing the entire workflow of a DGE analysis.
The DGEobj supports four distinct data types, that we refer to as “base types”:
Fundamentally, the base type defines how an item should be subsetted (see the section on subsetting below).
The design goal of the DGEobj is that it should capture the workflow and analysis results of a single dataset. As such, certain items that constitute the “raw” or starting point data are assigned a “unique” attribute that limits such “types” to one instance per DGEobj.
Three items: counts, design (sample information), and geneData (or isoformData, or exonData) are defined as unique. If complete chromosome location data (Chr, Start, End, Strand) are supplied in the geneData item, then a GRanges item is also created upon initializing a DGEobj.
Importantly, multiple instances of each type are allowed (exceptions described in Unique Items section) as long as each instance of a type is named uniquely. Each instance of a “type” is described as an “item” and each item must have a user-defined item name. The item name must be unique within a DGEobj.
“Levels” are predefined for DGEobj objects to accommodate RNA-Seq and Affymetrix RNA expression applications (allowedLevels = c(“gene”, “isoform”, “exon”, “affy”)). A DGEobj may contain only one of these levels. Thus a user would need to create separate DGEobjs for gene and isoform level data in an RNA transcription analysis mode.
See the “Customizing the DGEobj” section below to define new levels.
Gene level data will be used within this vignette.
Several levels are predefined for DGEobj objects to accommodate RNA and proteomic applications (allowedLevels = c(“gene”, “isoform”, “exon”, “proteingroup”, “peptide”, “ptm”, “protein”)). A DGEobj may contain only one of these levels. Thus a user would need to create separate DGEobjs for gene and isoform level data in an RNA transcription analysis mode. Gene level data will be used Within this vignette.
An analysis can become multi-threaded. For example, multiple models can be fit to one dataset each with its own set of contrasts. Two features of the DGEobj serve to manage multi-threaded analyses. First, only one instance of the starting data types are allowed. Thus all analysis thread map back to the starting data. Secondly, parent/child relationships are captured. Each data item carries a parent attribute that holds the item name of the parent data item. In this way, for example, a topTable item can be linked to the design matrix and fit that produced it.
In the course of a workflow an analyst will often need to subset the DGEobj based on low signal or other characteristics. However, a downstream analyst may need to start their analysis with the original unsubsetted data. For this reason, when the DGEobj is initialized, a copy of the starting data is also stored in a metadata slot. Metadata slots are carried along without subsetting so the original data may always be retrieved from these metadata items. The item names of the original data in the metadata slot have a “_orig” suffix.
A DGEobj is initialized from a set of three data frames containing the primary assay matrix (typically a counts matrix for RNA-Seq applications), a set of row annotation (gene data) and an experiment design table with information about the samples (columns) of the assay matrix. For the example below, a GEO dataset will be downloaded and encapsulated as a DGEobj.
The following dataset was selected to demonstrate building and working with the DGEobj data structure.
Huang X, Cai H, Ammar R, Zhang Y et al. Molecular characterization of a precision-cut rat liver slice model for the evaluation of antifibrotic compounds. Am J Physiol Gastrointest Liver Physiol 2019 Jan 1;316(1):G15-G24. PMID: 30406699
Briefly, livers were removed from rats 4 weeks after bile duct ligation or sham operation. Rat liver slices were incubated in vitro with potential anti-fibrotic compounds. At the end of the incubation whole transcriptome RNA-Seq analysis was performed.
Files containing counts, sample annotations, and QC data associated with this project will be downloaded from the NCBI GEO resource.
The GEO data includes Ensembl gene IDs. Additional gene information such as chromosome positions, type of transcript, etc, will be downloaded from Ensembl using the biomaRt package.
Three properly formatted data frames are required to initialize a new DGEobj:
Counts matrix: For RNA-Seq, this is typically a genes x samples matrix or dataframe of numbers. Rownames must be present and contain a unique identifier (e.g. Ensembl geneid). Colnames should be a unique sample identifier.
Row Data: Additional annotations for the entities represented by each row. For the RNA-Seq case, this dataframe holds associated gene annotation. The rownames in the RowData must match the rownames in the counts matrix.
Col Data: Accordingly the ColData dataframe contains information about the samples (columns) of the counts matrix. We often refer to this as the “design” table because it typically contains the experimental details of the treatment of each sample. The rownames of ColData must match the colnames of the Counts Matrix.
In addition to these three items, the user must specify the “level” of the data (allowedLevels = c(“gene”, “isoform”, “exon”, “affy”)).
See the “Customizing the DGEobj” section below to define new levels.
This GEO project includes a counts table, design table, and QC metrics in the supplemental section. The data is downloaded to temp files and imported into data frames to prepare for building a DGEobj.
# Get the raw counts, sample annotation ("design") and QC data from GEO
# Source: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120804
<- "http://ftp.ncbi.nlm.nih.gov/geo/series/GSE120nnn/GSE120804/suppl"
getLocation <- "GSE120804_counts.txt.gz"
countsFile <- "GSE120804_geo_sample_annotation_edit.csv.gz"
designFile <- "GSE120804_qc_metrics.txt.gz"
qcFile
<- tempfile()
temp if (download.file(glue("{getLocation}/{countsFile}"), destfile = temp, mode = 'wb') > 0) {
stop("Counts Download Failed")
}<- read.delim(temp, stringsAsFactors = FALSE, row.names = 1)
counts
<- tempfile()
temp if (download.file(glue("{getLocation}/{designFile}"), destfile = temp, mode = 'wb')) {
stop("Design Download Failed")
}<- read.csv(temp, stringsAsFactors = FALSE)
design
<- tempfile()
temp if (download.file(glue("{getLocation}/{qcFile}"), destfile = temp, mode = 'wb')) {
stop("Alignment QC Download Failed")
}<- read.delim(temp, stringsAsFactors = FALSE) alignmentQC
Often the design table will need some manipulation or clean up to properly construct the columns required to support a planned analysis and also to meet the design constraints required for the DGEobj. It is recommended to restructure the design table as needed at the onset of the analysis.
In this example, the design file lacks a proper linking column to the counts data. The colnames from the counts data should be the rownames for the design table. Fortunately, the colnames from counts are present as a substring within the fastq filename column (design$raw.file). The samplenames will be parsed from the fastq filenames to create a proper linking column between the counts and design data.
rownames(design) <- str_sub(design$raw.file, start = 1, end = 21)
Additional manipulations to the design data include: * renaming a column * create a design$DiseaseStatus column to indicate whether a liver is derived from a BDL or sham animal * create an animal number column to capture which liver slice samples were derived from the same animal
#correct the desired case/spelling of one column
<- design %>%
design rename(ReplicateGroup = Replicate.group)
# Create a DiseaseStatus column by parsing ReplicateGroup
$DiseaseStatus <- rep("Sham", nrow(design))
design<- str_detect(design$ReplicateGroup, "BDL")
idx $DiseaseStatus[idx] <- "BDL"
design
# Create an animal# column. The animal number is encoded in the sample.name column
$AnimalNum <- str_match(design$Sample.name, "r[0-9]{1,3}") design
The GEO data for this project is supplied Ensembl gene IDs but no other gene information. Additional gene information will be collected by querying Ensembl directly.
# Now get the gene annotation from Ensembl/biomaRt
<- "rnorvegicus_gene_ensembl"
ens.ds <- useMart(biomart = "ensembl", dataset = ens.ds, host = "https://may2021.archive.ensembl.org")
ens.mart <- c("ensembl_gene_id", "rgd_symbol", "chromosome_name", "start_position",
ens.columns "end_position", "strand", "gene_biotype", "description")
<- getBM(attributes = ens.columns,
ens.data values = rownames(counts),
mart = ens.mart,
useCache = F) %>%
distinct(ensembl_gene_id, .keep_all = T)
# Filter the list to the genes used in the test dataset and properly format gene
# information for GenomicRanges use
<- left_join(data.frame(ensembl_gene_id = rownames(counts), stringsAsFactors = F),
gene.data
ens.data,by = "ensembl_gene_id") %>%
::rename(start = start_position, end = end_position) %>%
dplyrmutate(strand = case_when(strand == -1 ~ "-",
== 1 ~ "+",
strand TRUE ~ "*"))
rownames(gene.data) <- gene.data$ensembl_gene_id
# Get transcript level data and keep max for each gene, or alternatively, use the cds length
<- "rnorvegicus_gene_ensembl"
ens.ds <- useMart(biomart = "ensembl", dataset = ens.ds, host = "https://may2021.archive.ensembl.org")
ens.mart <- c("ensembl_gene_id", "ensembl_transcript_id", "transcript_length")
ens.columns <- getBM(attributes = ens.columns,
transcript.data values = rownames(counts),
mart = ens.mart,
useCache = F) %>%
arrange(desc(transcript_length)) %>%
distinct(ensembl_gene_id, .keep_all = T)
#Add a transcript_length column to gene.data
<- left_join(gene.data,
gene.data select(transcript.data, ensembl_gene_id, ExonLength = transcript_length))
rownames(gene.data) <- gene.data$ensembl_gene_id
$ensembl_gene_id <- NULL
gene.data
# enforce same order as counts
<- gene.data[rownames(counts),] gene.data
As mentioned earlier, rownames and column names provide links between the three starting data frames.
These statements perform a reality check on these two constraints.
all(rownames(counts) == rownames(gene.data))
## [1] TRUE
all(colnames(counts) == rownames(design))
## [1] TRUE
With the above constraints met, the three data frames can be used to instantiate a DGEobj. In addition the “level” of the data must be specified in the “level” argument (Allowed levels include “gene”, “isoform”, “exon”, “affy”). Finally, two attributes are added via the customAttr argument to annotate the genome and gene model set used for the alignments used to produce the count data.
<- DGEobj::initDGEobj(primaryAssayData = counts,
dgeObj rowData = gene.data,
colData = design,
level = "gene",
customAttr = list(Genome = "Rat.B6.0",
GeneModel = "Ensembl.R89"))
Although it is possible to add many or all project attributes with the customAttr argument, this can also be tedious when many more attributes are desired. A more convenient way to add project attributes to the DGEobj is provided using the annotateDGEobj function. This function reads key/value pairs from a text file. Thus a text file template can be provided to less technical collaborators to capture important experiment metadata.
The annotation file for this project looks like this:
# Some basic metadata about the project
level=gene source=https://github.com/cb4ds/DGEobj/blob/master/vignettes/DGEobj_Overview.Rmd ID=BDL_Rat_LiverSlice_03Dec2017 BMS_PID=P-20170808-0001 Title=Rat Liver Slices from Bile Duct Ligation animals Organism=Rat GeneModel=Ensembl.R89 PlatformType=RNA-Seq Description=Rat livers slices from sham or BDL +/- efficacious treatments incubated in vitro Keywords=Liver slices; Bile Duct Ligation Disease=Liver Fibrosis Tissue=Liver GEO=GSE120804
# additional descriptive attributes
Technology=Illumina-HiSeq LibraryPrep=TruSeq Stranded Total RNA AlignmentReference=Rat.B6.0 ReadType=PE Pipeline=RNA-Seq_BMS_v2.4.pscript AlignmentAlgorithm=OSA ScriptID=RNA-Seq_BMS_v2.4.pscript
# Institutional attributes
BusinessUnit=Discovery FunctionalArea=Fibrosis Vendor=BMS TBio_Owner=Ron Ammar TA_Owner=John Huang
# the annotation information is stored in a temporary .txt file
<- annotateDGEobj(dgeObj, annotationFile) dgeObj
The DGE object now contains the starting data and metadata and is analysis-ready.
To view a listing of the items in the DGEobj:
kable(inventory(dgeObj))
ItemName | ItemType | BaseType | Parent | Class | Row | Col | DateCreated |
---|---|---|---|---|---|---|---|
counts_orig | counts_orig | meta | matrix | 32883 | 48 | 2022-05-12 10:58:09 | |
counts | counts | assay | counts_orig | matrix | 32883 | 48 | 2022-05-12 10:58:09 |
design_orig | design_orig | meta | data.frame | 48 | 10 | 2022-05-12 10:58:09 | |
design | design | col | design_orig | data.frame | 48 | 10 | 2022-05-12 10:58:09 |
geneData_orig | geneData_orig | meta | data.frame | 32883 | 8 | 2022-05-12 10:58:09 | |
geneData | geneData | row | geneData_orig | data.frame | 32883 | 8 | 2022-05-12 10:58:09 |
granges_orig | granges_orig | meta | geneData_orig | GRanges | 32883 | NA | 2022-05-12 10:58:10 |
granges | granges | row | geneData | GRanges | 32883 | NA | 2022-05-12 10:58:10 |
Use the showMeta
function to list the meta data
associated with a project.
kable(showMeta(dgeObj))
Attribute | Value |
---|---|
class | DGEobj |
DGEobjDef.version | 1.2 |
level | gene |
Genome | Rat.B6.0 |
GeneModel | Ensembl.R89 |
NA | NA |
source | Omicsoft |
ID | BDL_Rat_LiverSlice_03Dec2017 |
Title | Rat Liver Slices from Bile Duct Ligation animals |
Organism | Rat |
PlatformType | RNA-Seq |
Description | Rat livers slices from sham or BDL +/- efficacious treatments incubated in vitro |
Keywords | Liver slices; Bile Duct Ligation |
Disease | Liver Fibrosis |
Tissue | Liver |
GEO | GSE120804 |
Technology | Illumina-HiSeq |
LibraryPrep | TruSeq Stranded Total RNA |
AlignmentReference | Rat.B6.0 |
ReadType | PE |
Pipeline | RNA-Seq_BMS_v2.4.pscript |
AlignmentAlgorithm | OSA |
ScriptID | RNA-Seq_BMS_v2.4.pscript |
BusinessUnit | Discovery |
FunctionalArea | Fibrosis |
Vendor | BMS |
TBio_Owner | Ron Ammar |
TA_Owner | John Huang |
The length of a DGEobj refers to the number of data items in the DGEobj.
length(dgeObj)
## [1] 8
The dimensions reported for a DGEobj are the dimensions of the assays contained in the DGEobj. That is, for the RNA-Seq application, the row dimension is the number of genes contained in the object and the column dimension is the number of samples contained in the object.
dim(dgeObj)
## [1] 32883 48
dim(dgeObj$counts)
## [1] 32883 48
The rownames and colnames of the DGEobj are defined by the primary “assay” matrix in a DGEobj, typically the counts matrix for RNA-Seq data.
Note, there is no facility provided to re-assign rownames or colnames of a DGEobj. The user should make any adjustments to row/colnames before instantiating the DGEobj.
A DGEobj may be square bracket subsetted similar the same way data frames and matrices may be subsetted. The subsetting function uses the base type to define how each data item is handled during subsetting. During subsetting, metadata items are carried along unchanged.
#subset to the first 100 genes
<- dgeObj[1:100,]
First100genes dim(First100genes)
## [1] 100 48
#subset to the first 10 samples
<- dgeObj[,1:10]
First10samples dim(First10samples)
## [1] 32883 10
#susbet genes and samples
<- dgeObj[1:100, 1:10]
AnotherSubset dim(AnotherSubset)
## [1] 100 10
Boolean vectors may also be used to subset dimensions of a DGEobj.
# select a subset of samples
<- dgeObj$design$DiseaseStatus == "BDL"
idx <- dgeObj[,idx]
BDLonly dim(BDLonly)
## [1] 32883 39
The inventory
function prints a table of the data items
present in a DGEobj. The output includes the item name, type, baseType,
parent, class, and date created. If the verbose = TRUE
argument is used, a funArgs
column is also included.
kable(inventory(dgeObj))
ItemName | ItemType | BaseType | Parent | Class | Row | Col | DateCreated |
---|---|---|---|---|---|---|---|
counts_orig | counts_orig | meta | matrix | 32883 | 48 | 2022-05-12 10:58:09 | |
counts | counts | assay | counts_orig | matrix | 32883 | 48 | 2022-05-12 10:58:09 |
design_orig | design_orig | meta | data.frame | 48 | 10 | 2022-05-12 10:58:09 | |
design | design | col | design_orig | data.frame | 48 | 10 | 2022-05-12 10:58:09 |
geneData_orig | geneData_orig | meta | data.frame | 32883 | 8 | 2022-05-12 10:58:09 | |
geneData | geneData | row | geneData_orig | data.frame | 32883 | 8 | 2022-05-12 10:58:09 |
granges_orig | granges_orig | meta | geneData_orig | GRanges | 32883 | NA | 2022-05-12 10:58:10 |
granges | granges | row | geneData | GRanges | 32883 | NA | 2022-05-12 10:58:10 |
If just the item names of data stored in the DGEobj are needed, use
the base R names
function.
names(dgeObj)
## [1] "counts_orig" "counts" "design_orig" "design" "geneData_orig" "geneData" "granges_orig" "granges"
Often other sample-related information is available. In this project, for example, the alignment statistics for the RNA-Seq were deposited in GEO and downloaded with the other data. This example adds the alignment statistics to the DGEobj. The alignment QC data as downloaded needs to be transposed to place samples in rows. Once transposed, this data meets the criteria for the colData type “alignQC” and is added to the DGEobj. As a new data item not derived directly from the existing items, the alignment QC lacks parent or funArgs values in this context.
rownames(alignmentQC) <- alignmentQC$Metric
<- alignmentQC %>%
alignmentQC select(-Metric) %>%
t() %>%
as.data.frame()
<- addItem(dgeObj,
dgeObj item = alignmentQC,
itemName = "AlignmentQC",
itemType = "alignQC")
As an analysis progresses, the data analyst should capture intermediate data objects deemed important to properly documenting the workflow. In the example here, the counts will be normalized and the resulting DGElist data item added to the DGEobj.
# Perform TMM normalization on the raw counts
<- dgeObj$counts %>%
dgelist DGEList() %>%
calcNormFactors(method = "TMM")
# add the resulting edgeR DGEList item to the DGEobj
<- addItem(dgeObj,
newdgeObj item = dgelist,
itemName = "normTMM",
itemType = "DGEList",
parent = "counts",
funArgs = "calcNormFactors; TMM"
)
kable(inventory(newdgeObj))
ItemName | ItemType | BaseType | Parent | Class | Row | Col | DateCreated |
---|---|---|---|---|---|---|---|
counts_orig | counts_orig | meta | matrix | 32883 | 48 | 2022-05-12 10:58:09 | |
counts | counts | assay | counts_orig | matrix | 32883 | 48 | 2022-05-12 10:58:09 |
design_orig | design_orig | meta | data.frame | 48 | 10 | 2022-05-12 10:58:09 | |
design | design | col | design_orig | data.frame | 48 | 10 | 2022-05-12 10:58:09 |
geneData_orig | geneData_orig | meta | data.frame | 32883 | 8 | 2022-05-12 10:58:09 | |
geneData | geneData | row | geneData_orig | data.frame | 32883 | 8 | 2022-05-12 10:58:09 |
granges_orig | granges_orig | meta | geneData_orig | GRanges | 32883 | NA | 2022-05-12 10:58:10 |
granges | granges | row | geneData | GRanges | 32883 | NA | 2022-05-12 10:58:10 |
AlignmentQC | alignQC | col | data.frame | 48 | 172 | 2022-05-12 10:58:10 | |
normTMM | DGEList | assay | counts | DGEList | 32883 | 48 | 2022-05-12 10:58:11 |
item is the actual data object to add.
itemName is the user-defined name for that object.
itemType is the predefined type.
parent is the name of the item that this object is derived from.
funArgs is a text field intended to documents the processing parameters of a given step in the analysis.
The parent argument is analyst-defined and is particularly important for a multi-threaded analysis. For example, if more than one fit is applied to the data, the parent argument maintains the thread and unambiguously identifies the child objects of each branch in the workflow. The value assigned to parent should be the item name of the parent data object.
By default addItem
will refuse to add an
itemname
that already exists. There is an overwrite
argument to the addItem
function although its use is
discouraged. Use the overwrite = TRUE
argument if it is
necessary to make a post-hoc change to an item that already exists in
the DGEobj.
If a whole analysis step, together with the addItem call, is contained within a function, it becomes easy to automate capture of function arguments.
Within a function call, match.call
returns the calling
function name and arguments. Modifying the normalization example
above:
<- function(x, method){
runNorm # Perform TMM normalization on the raw counts
<- x$counts %>%
dgelist DGEList() %>%
calcNormFactors(method = method)
# add the resulting edgeR DGEList item to the DGEobj
<- addItem(x,
x item = dgelist,
itemName = "normTMM",
itemType = "DGEList",
parent = "counts",
funArgs = match.call()
)
}
<- runNorm(dgeObj, method = "TMM")
newdgeObj
#use verbose=TRUE to see the funArgs column
<- inventory(newdgeObj, verbose = TRUE)[ ,c(1:3,9)]
inv kable(inv)
ItemName | ItemType | BaseType | FunArgs |
---|---|---|---|
counts_orig | counts_orig | meta | ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)) |
counts | counts | assay | ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)) |
design_orig | design_orig | meta | ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)) |
design | design | col | ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)) |
geneData_orig | geneData_orig | meta | ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)) |
geneData | geneData | row | ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)) |
granges_orig | granges_orig | meta | ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)) |
granges | granges | row | ::(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); DGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)); initDGEobj(counts, gene.data, design, gene, list(Genome = “Rat.B6.0”, GeneModel = “Ensembl.R89”)) |
AlignmentQC | alignQC | col | addItem(dgeObj, alignmentQC, AlignmentQC, alignQC) |
normTMM | DGEList | assay | runNorm(dgeObj, TMM) |
The addItem
function is typically used to capture each
item during the course of an analysis. Another use case is constructing
a DGEobj in a post-hoc fashion. In this scenario, it may be more
efficient to collect and add multiple items. The addItems
function supports this scenario.
In this example, add a normalized dgelist and a design matrix to the DGEobj:
# Perform TMM normalization on the raw counts
<- dgeObj$counts %>%
dgelist DGEList() %>%
calcNormFactors(method = "TMM")
# Formula must be composed of column names from the design table.
<- '~ 0 + ReplicateGroup'
formula
# User-defined name for the designMatrix
<- "ReplicateGroupDesign"
designMatrixName
# build the designMatrix
<- getItem(dgeObj, "design")
design <- model.matrix (as.formula(formula), design)
designMatrix #capture the formula as an attribute of the designMatrix
attr(designMatrix, "formula") <- formula
# Add both the dgelist and designMatrix to the DGEobj
<- addItems(dgeObj,
NewDgeObj itemList = list(dgelist=dgelist, ReplicateGroupDesign=designMatrix),
itemTypes = list("DGEList", "designMatrix"),
parents = list("counts", "design"))
itemList
is a named list of the data objects to add. The
names become the itemNames of the items in the DGEobj.
itemTypes
is a list of the types for each item in
itemList
. see showTypes(DGEobj)
.
parents
is a list of the parent item names associated
with each item in itemList
.
itemAttr
is an optional named list of attributes that
will be added to every item on the itemList
. Note that the
DGEobj has attributes and each item within the DGEobj can have its own
attributes. Here the attributes are being added to the individual
items.
A number of pre-defined item types have been implemented to support the limma/voom DGE workflow for RNA-Seq and Affymetrix data types.
To see a complete list of pre-defined item types use the showTypes function.
kable(arrange(showTypes(dgeObj), BaseType, Type))
Type | BaseType | |
---|---|---|
AffyRMA | AffyRMA | assay |
assay | assay | assay |
counts | counts | assay |
DGEList | DGEList | assay |
effectiveLength | effectiveLength | assay |
Elist | Elist | assay |
intensities | intensities | assay |
isoformFrac | isoformFrac | assay |
alignQC | alignQC | col |
col | col | col |
design | design | col |
designMatrix | designMatrix | col |
AffyRMA_orig | AffyRMA_orig | meta |
contrast_fit_treat | contrast_fit_treat | meta |
contrastMatrix | contrastMatrix | meta |
corFit | corFit | meta |
counts_orig | counts_orig | meta |
design_orig | design_orig | meta |
effectiveLength_orig | effectiveLength_orig | meta |
exonData_orig | exonData_orig | meta |
geneData_orig | geneData_orig | meta |
geneList | geneList | meta |
granges_orig | granges_orig | meta |
imputationMatrix | imputationMatrix | meta |
intensities_orig | intensities_orig | meta |
isoformData_orig | isoformData_orig | meta |
meta | meta | meta |
pathway | pathway | meta |
proteinData_orig | proteinData_orig | meta |
svobj | svobj | meta |
topTreat | topTreat | meta |
URL | URL | meta |
affyData | affyData | row |
contrast_fit | contrast_fit | row |
exonData | exonData | row |
fit | fit | row |
geneData | geneData | row |
granges | granges | row |
isoformData | isoformData | row |
proteinData | proteinData | row |
row | row | row |
topTable | topTable | row |
The capability to define new data types is core to the extensibility of the DGEobj data structure.
A set of data types has already been defined based on data types
encountered in a typical limma/Voom workflow. The newTypes
function is used to define data types to support a novel workflow.
# Add a new colData datatype called "sampleQC"
<- newType(dgeObj,
newDgeObj itemType = "sampleQC",
baseType = "col",
uniqueItem = FALSE)
After executing this call, the newDgeObj now contains the newly defined type definition and is ready to accept data items of that type.
Typically an analyst will need to access items stored in the DGEobj as the analysis progresses. Moreover, one of the main benefits of a well defined data object is that it facilitates data sharing. As such, there are several ways to extract one or more components from the DGEobj.
The getItem
function allows the user to retrieve any
item in a DGEobj by referencing its item name. Use the inventory
function (described above) to check the contents of the DGEobj including
the list of item names that are available.
<- getItem(dgeObj, "counts")
counts
#peek at the first 4 columns
head(counts[ ,1:4])
## T_20170823MAN1_C05P01 T_20170823MAN1_H05P01 T_20170823MAN1_C04P01 T_20170823MAN1_D03P01
## ENSRNOG00000046319 2 0 0.000 0.0000
## ENSRNOG00000047964 0 0 0.000 0.0000
## ENSRNOG00000050370 0 0 0.000 1.0000
## ENSRNOG00000032365 0 0 0.000 0.0000
## ENSRNOG00000040300 1 6 5.783 6.8135
## ENSRNOG00000058808 0 0 0.000 0.0000
There are several ways to retrieve multiple items as a list or data objects.
The user can supply a list of item names to retrieve specific items. The items requested are returned in a list.
<- getItems(dgeObj, list("counts", "geneData")) MyItems
If all the items the user wants to retrieve are of the same type, the
getType
function may be used to retrieve them as a named
list. For example, an analyst may want to retrieve a set of contrast
results stored as topTable output.
# The below example object is a post-workflow object that has multiple contrasts
<- readRDS(system.file("exampleObj.RDS", package = "DGEobj"))
FullWorkflowDGEobj
<- getType(FullWorkflowDGEobj, "topTable")
MyContrasts names(MyContrasts)
## [1] "BDL_vs_Sham" "EXT1024_vs_BDL" "Nint_vs_BDL" "Sora_vs_BDL"
Similarly, all items of the same baseType may be retrieved as a named list with the getBaseType function.
<- getBaseType(FullWorkflowDGEobj, "col")
MyColData names(MyColData)
## [1] "design" "ReplicateGroupDesign"
Finally, the DGEobj can be recast as a named list.
<- as.list(dgeObj)
MyList class(MyList)
## [1] "list"
names(MyList)
## [1] "counts_orig" "counts" "design_orig" "design" "geneData_orig" "geneData" "granges_orig" "granges" "AlignmentQC"
Returns the baseType for a given Type:
baseType(dgeObj, "DGEList")
## [1] "assay"
Certain metadata about a project can be stored in the attributes of
the DGEobj. Similarly, each item deposited in a DGEobj can have it’s own
set of attributes. Both of these types of attributes can be easily
accessed with the base R attr
function.
The showMeta
function will return a named list of all
attributes of a DGEobj.
To retrieve a specific attribute:
<- attr(dgeObj, "Keywords") keywords
To retrieve an attribute from an item, use dollar sign syntax to reference the item:
<- attr(FullWorkflowDGEobj$ReplicateGroupDesign, "formula") formula
For common workflow usage, the DGEobj should be considered read-only.
Items are added during the workflow and should not be modified further.
Thus, under the normal intended usage, no items should be removed.
Although, its use is discouraged, for completeness, a
rmItem
function exists.
To delete an item from a DGEobj:
<- rmItem(dgeObj, "granges") MyDgeObj
To add or update a single attribute, use the base Rattr
function:
attr(dgeObj, "Tissue") <- "Liver"
The setAttributes
function allows attaching a named list
of attributes. Unlike the base R attributes function, setAttributes will
add or update the named attributes without deleting any existing
attributes.
<- setAttributes(dgeObj, list(date=date(), analyst="JRT")) newDgeObj
This function returns any project annotation attached as attributes
to the DGEobj using function annotateDGEobj
. It returns the
attributes as a two column dataframe of Attribute/Value pairs. It only
returns attributes with character values.
kable(showMeta(dgeObj))
Attribute | Value |
---|---|
class | DGEobj |
DGEobjDef.version | 1.2 |
level | gene |
Genome | Rat.B6.0 |
GeneModel | Ensembl.R89 |
NA | NA |
source | Omicsoft |
ID | BDL_Rat_LiverSlice_03Dec2017 |
Title | Rat Liver Slices from Bile Duct Ligation animals |
Organism | Rat |
PlatformType | RNA-Seq |
Description | Rat livers slices from sham or BDL +/- efficacious treatments incubated in vitro |
Keywords | Liver slices; Bile Duct Ligation |
Disease | Liver Fibrosis |
Tissue | Liver |
GEO | GSE120804 |
Technology | Illumina-HiSeq |
LibraryPrep | TruSeq Stranded Total RNA |
AlignmentReference | Rat.B6.0 |
ReadType | PE |
Pipeline | RNA-Seq_BMS_v2.4.pscript |
AlignmentAlgorithm | OSA |
ScriptID | RNA-Seq_BMS_v2.4.pscript |
BusinessUnit | Discovery |
FunctionalArea | Fibrosis |
Vendor | BMS |
TBio_Owner | Ron Ammar |
TA_Owner | John Huang |
Fundamentally, a DGEobj is a named list of data items. R attributes are used to store information about the whole project as well as individual data items. Attributes are used to store project metadata supplied by the analyst. Attributes are also used to store data types and relationships in order to properly document the workflow.
User supplied meta data in the form of key/value pairs are attached
to the DGEobj as attributes where the attribute name is the key and the
attribute value is a character string. These metaData attributes can be
loaded from a text file using function annotateDGEobj
or
can be manipulated individually with the base R attr
function. The showMeta
function displays these user
supplied attributes.
Several attributes are used to capture information about each data
item (“type”, “basetype”, “dateCreated” and “funArgs.”), as well as
relationships between them (“parent”). The inventory
function provides a convenient table view of these attributes.
Operationally, the value of each of these attributes is a named list
where the names are the item names and the value is the value for that
item. For example, to query just the basetype attribute of the “design”
itemName:
attr(dgeObj, "basetype")[["design"]]
## [1] "col"
A class DGEobjDef has been created to define the data types acceptable to a DGEobj and also allow for the definition of new datatypes as needed. A default DGEobjDef includes data types for gene expression (Affymetrix and RNA-Seq) applications. The user may define additional levels and data types when initializing a DGEobjDef object (initDGEobjDef function).
To return the default DGEobjDef:
<- initDGEobjDef() defaultDGEobjDef
The initDGEobj function uses a default DGEobj definition tailored for the limma/voom workflow. If new data types are needed to support a different workflow, the user should define the new data types during a call to initDGEobjDef, and before calling the DGEobj constructor (initDGEobj). The initDGEobjDef function returns a DGEobjDef class object that can then be passed explicitly to initDGEobj.
There are four optional arguments to initDGEobjDef:
The example below defines new types to accommodate metabolomics data.
# simulate some data
<- nrow(dgeObj)
nr <- ncol(dgeObj)
nc
<- dgeObj$geneData
myMolecules <- dgeObj$design
design
<- rnorm(nr * nc) %>%
myMSquant exp() %>%
matrix(nrow = nr, ncol = nc)
rownames(myMSquant) <- rownames(dgeObj)
colnames(myMSquant) <- colnames(dgeObj)
<- initDGEobjDef(levels = "metabolite",
myDGEobjDef primaryAssayNames = "intensity",
types = c(normInt = "assay"))
<- initDGEobj(primaryAssayData = myMSquant,
MetabolomicObj rowData = myMolecules,
colData = design,
level = "metabolite",
DGEobjDef = myDGEobjDef)
kable(inventory(MetabolomicObj))
ItemName | ItemType | BaseType | Parent | Class | Row | Col | DateCreated |
---|---|---|---|---|---|---|---|
intensity_orig | intensity_orig | meta | matrix | 32883 | 48 | 2022-05-12 10:58:14 | |
intensity | intensity | assay | intensity_orig | matrix | 32883 | 48 | 2022-05-12 10:58:14 |
design_orig | design_orig | meta | data.frame | 48 | 10 | 2022-05-12 10:58:14 | |
design | design | col | design_orig | data.frame | 48 | 10 | 2022-05-12 10:58:14 |
metaboliteData_orig | metaboliteData_orig | meta | data.frame | 32883 | 8 | 2022-05-12 10:58:14 | |
metaboliteData | metaboliteData | row | metaboliteData_orig | data.frame | 32883 | 8 | 2022-05-12 10:58:14 |
There are simple dimension-based criteria to determine the basetype of a new datatype.
The DGEobj definitions created during initialization or added post-hoc by the newType function, are stored the “objDef” attribute of the DGEobj.
str(attr(dgeObj, "objDef"))
## List of 5
## $ basetype : Named chr [1:4] "row" "col" "assay" "meta"
## ..- attr(*, "names")= chr [1:4] "row" "col" "assay" "meta"
## $ type : Named chr [1:42] "row" "col" "assay" "meta" ...
## ..- attr(*, "names")= chr [1:42] "row" "col" "assay" "meta" ...
## $ uniqueType : chr [1:20] "counts" "counts_orig" "design" "design_orig" ...
## $ allowedLevels : chr [1:5] "gene" "exon" "isoform" "protein" ...
## $ primaryAssayNames: Named chr [1:5] "counts" "counts" "intensities" "intensities" ...
## ..- attr(*, "names")= chr [1:5] "gene" "exon" "isoform" "protein" ...
## - attr(*, "class")= chr "DGEobjDef"