% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/mainFunctions.R
\name{sourceSet}
\alias{sourceSet}
\title{Source Set}
\usage{
sourceSet(
  graphs,
  data,
  classes,
  seed = NULL,
  theta = 1,
  permute = TRUE,
  alpha = 0.05,
  shrink = FALSE,
  return.permutations = FALSE
)
}
\arguments{
\item{graphs}{a list of \code{graphNEL} objects  representing the pathways to be analyzed.}

\item{data}{a matrix of expression levels with column names for genes and row names for samples; gene names must be unique.}

\item{classes}{a vector of length equal to the number of rows of \code{data}. It indicates the class (condition) of each statistical unit. Only two classes, labeled as 1 and 2, are allowed.}

\item{seed}{integer value to get a reproducible random result. See \code{\link[base]{Random}}.}

\item{theta}{positive numeric value, greater than or equal to 1, that controls the number of permutations. If \code{permute=TRUE}, (m/\code{alpha} x \code{theta}) permutations are used, where m is the number of unique conditional tests to be performed; otherwise, (1/\code{alpha} x \code{theta}) permutations are used.}

\item{permute}{if \code{TRUE}, permutation p-values are provided; if \code{FALSE}, asymptotic p-values are returned. NOTE: even if \code{permute} is set to \code{FALSE}, the function still permutes the dataset; these permutations are used to calculate the adjusted cut-off for the asymptotic p-values.}

\item{alpha}{the p-value threshold. Denotes the level at which FWER is controlled for each input graph.}

\item{shrink}{if \code{TRUE}, regularized estimation of the covariance matrices is performed; otherwise, maximum likelihood estimation is used.}

\item{return.permutations}{if \code{TRUE}, the function returns the matrix of test statistic values for the supplied (first row) and the permuted datasets.}
}
\value{
The output of the function is an object of the \code{sourceSetList} class. It contains one list per input graph, and each list provides the following elements:
\itemize{
 \item{\code{primarySet}: a character vector containing the names of the variables belonging to the estimated source set (primary dysregulation);}
 \item{\code{secondarySet}: a character vector containing the names of the variables belonging to the estimated secondary set (secondary dysregulation);}
 \item{\code{orderingSet}: a list of character vectors containing the names of the variables belonging to the estimated source set of each ordering; the union of these elements contains all genes affected by some form of perturbation;}
 \item{\code{Components}: a data frame that contains information about the unique tests, including their associated p-values;}
 \item{\code{Decompositions}: a list of data frames, one for each identified ordering. Each data frame is a subset, of size k (i.e., the number of cliques), of the \code{Components} elements;}
 \item{\code{Elements}: cliques and separators of the underlying decomposable graph. See \code{Graph};}
 \item{\code{Thresholds}: a list with information regarding the multiple testing correction:
    \itemize{
      \item{\code{alpha}: the input (nominal) significance level;}
      \item{\code{value}: the corrected threshold that ensures the control of FWER at level \code{alpha};}
      \item{\code{type}: the procedure used (minP or maxT);}
      \item{\code{iterations}: the number of iterations for the step-down procedure;}
      \item{\code{nperms}: the number of permutations.}
    }
 }
 \item{\code{Graph}: the decomposable graph used in the analysis. It may differ from the input graph: if the input graph is not decomposable, the function internally moralizes and triangulates it.}
 }
}
\description{
Identify the sets of variables that are potential sources of differential behavior
(i.e., the primary genes) between two experimental conditions. The analysis is
performed on a set of graphs,
where each graph represents the topology of a biological pathway.
}
\details{
The \code{sourceSet} approach  models the data of the same pathway in two different
experimental conditions as realizations of two Gaussian graphical models sharing the same decomposable
graph G. Here, G = (V,E) is obtained from the pathway topology conversion, where V and E
represent genes and biochemical reactions, respectively.

The user is free to provide any underlying graph G; the only requirement is the
input format (i.e., a \code{graphNEL} object). The user can therefore supply a list of
manually curated pathways or use dedicated software to convert pathway knowledge bases.
To date, the most complete software available for this task is the \code{graphite} R package (Sales et al. 2017).

The source set algorithm infers the set of primary genes (i.e., the source set) through five steps, applied to each graph:
\itemize{
 \item{ decompose graph G into the set of maximal cliques and the set of separators.}
 \item{ identify the clique orderings, and the associated separators, that satisfy the running intersection property, using each clique as root. See \code{\link[SourceSet]{ripAllRootsClique}}.}
 \item{ a) calculate marginal test statistics for the cliques and the separators, for both the original and the permuted datasets;
 b) compute the conditional test statistics for the unique components, calculated as the difference between clique and separator marginal test statistics;
 c) control the FWER, using the test statistics matrix of the previous point.}
 \item{ take the union of the sets of variables belonging to cliques that are associated with a significant test, within each decomposition.}
 \item{ derive the source set, defined as the intersection of the sets of variables obtained in step 4 across decompositions.}
}
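
Steps 4 and 5 can be illustrated with a minimal base R sketch. The clique memberships and the object names below are hypothetical, invented only for illustration; this is not the internal code of the package:
\preformatted{
# Hypothetical input: for each decomposition (ordering), the cliques whose
# conditional test turned out to be significant.
significant_cliques <- list(
  ordering1 = list(c("A", "B"), c("B", "C")),
  ordering2 = list(c("B", "C")),
  ordering3 = list(c("B", "C", "D"))
)

# Step 4: union of the variables of the significant cliques,
# computed separately within each decomposition.
ordering_sets <- lapply(significant_cliques, function(cl) Reduce(union, cl))

# Step 5: intersection of the step 4 sets across decompositions.
source_set <- Reduce(intersect, ordering_sets)
source_set
}
In this toy case only the variables shared by every ordering survive the intersection, while the union of \code{ordering_sets} collects every variable touched by a significant test.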

Although the interpretation of the source set for a single graph is intuitive, the interpretation of the
collection of results associated with a set of pathways might be complex. For this reason,
we propose a guideline for the meta-analysis that provides descriptive statistics and predefined plots. See
\code{\link[SourceSet]{infoSource}}, \code{\link[SourceSet]{easyLookSource}}, \code{\link[SourceSet]{sourceSankeyDiagram}}, \code{\link[SourceSet]{sourceCytoscape}} and \code{\link[SourceSet]{sourceUnionCytoscape}}.
}
\note{
If the \code{permute} and/or \code{shrink} settings violate the conditions required for the
existence of full-rank maximum likelihood estimates, the algorithm may override the user
settings through internal checks.

Specifically, if the user wants to use the MLE of the covariance matrix (\code{shrink=FALSE}), all cliques
in all pathways must satisfy the \eqn{n > p_i} condition, where \eqn{n} is the number of samples in the
smaller class and \eqn{p_i} is the cardinality of the largest clique in the i-th pathway.
If even one clique does not satisfy this requirement, the regularized estimate must be used.
When a regularized estimate is employed (\code{shrink=TRUE}), the analytical null distribution
of the test statistics is no longer available, and we rely on permutation methods to obtain the
associated p-values.
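
The \eqn{n > p_i} condition can be checked by hand before calling the function. The following base R lines are only a sketch: the class labels and the clique cardinalities are hypothetical values, and the actual internal check performed by the package may differ:
\preformatted{
# Hypothetical class labels (4 samples per condition) and clique sizes.
classes <- c(1, 1, 1, 1, 2, 2, 2, 2)
clique_sizes <- c(2, 3, 2)

# n: number of samples in the smaller class.
n_min <- min(table(classes))

# The MLE (shrink = FALSE) requires n to exceed the largest clique size.
mle_ok <- n_min > max(clique_sizes)
mle_ok
}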

To address the multiple testing problem, we use two versions of the method proposed by Westfall and Young (2017),
which uses permutations to obtain the joint distribution of the p-values.
More specifically, when the maximum likelihood estimates of the covariance matrices are used (\code{shrink=FALSE}), asymptotic p-values and the maxT approach are adopted.
If instead the regularized estimates are used (\code{shrink=TRUE}), the asymptotic distribution is no longer valid, so the minP version is adopted,
with per-hypothesis permutation p-values used to obtain the joint distribution of the p-values.
The number of permutations depends on the method, the chosen \code{alpha} level, and the number of hypotheses. A minimum of 500 and a maximum of 10,000 permutations are allowed.
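
As a rough guide, the resulting number of permutations can be worked out from the formulas given for \code{theta}. The helper below is hypothetical (it is not exported by the package) and assumes that the raw count is rounded up and then clamped to the stated bounds; the package may apply the limits differently:
\preformatted{
# Hypothetical helper: m is the number of unique conditional tests.
n_permutations <- function(m, alpha = 0.05, theta = 1, permute = TRUE) {
  raw <- if (permute) m / alpha * theta else 1 / alpha * theta
  # Assumption: round up, then clamp to the allowed range [500, 10000].
  min(max(ceiling(raw), 500), 10000)
}

n_permutations(m = 40)                   # 40 / 0.05 x 1 = 800
n_permutations(m = 40, permute = FALSE)  # 1 / 0.05 = 20, raised to 500
n_permutations(m = 1000)                 # 20000, capped at 10000
}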
}
\references{
Sales, G. et al. (2017). graphite: GRAPH Interaction from pathway Topological Environment. R package version 1.22.0.

Westfall, P. and Young, S. (2017). Resampling-based multiple testing: examples and methods for p-value adjustment. Wiley.

Djordjilovic, V. and Chiogna, M. (2022). Searching for a source of difference in graphical models. Journal of Multivariate Analysis 190, 104973.

Salviato, E. et al. (2019). \code{SourceSet}: a graphical model approach to identify primary genes in perturbed biological pathways. PLoS computational biology 15 (10), e1007357.
}
\seealso{
\code{\link[graphite]{pathways}}, \code{\link[SourceSet]{infoSource}}, \code{\link[SourceSet]{easyLookSource}}, \code{\link[SourceSet]{sourceSankeyDiagram}},  \code{\link[SourceSet]{sourceCytoscape}} and  \code{\link[SourceSet]{sourceUnionCytoscape}}
}
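\examples{
# A minimal usage sketch on simulated data, not taken from the package sources.
# Assumptions: the 'graph' package is installed; the gene names, the toy
# pathway and the simulated expression values are invented for illustration.
\dontrun{
library(SourceSet)

# Toy pathway with five genes, stored as a graphNEL object.
genes <- c("g1", "g2", "g3", "g4", "g5")
toy <- graph::graphNEL(nodes = genes, edgemode = "directed")
toy <- graph::addEdge(from = c("g1", "g2", "g3", "g3"),
                      to = c("g2", "g3", "g4", "g5"),
                      graph = toy)

# Simulated expression matrix: samples in rows, genes in columns.
set.seed(1)
n <- 20
data <- matrix(rnorm(2 * n * length(genes)), nrow = 2 * n,
               dimnames = list(paste0("s", seq_len(2 * n)), genes))
classes <- rep(c(1, 2), each = n)

# Shift one gene in the second condition to mimic a perturbation.
data[classes == 2, "g3"] <- data[classes == 2, "g3"] + 2

res <- sourceSet(graphs = list(toy = toy), data = data, classes = classes,
                 seed = 1, permute = FALSE, shrink = FALSE)

# Estimated source set for the first (and only) input graph.
res[[1]]$primarySet
}
}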
