The csmpv R package provides an extensive set of functions that encompass biomarker confirmation, variable selection, modeling, predictive analysis, and validation. It can handle diverse outcome variable types, including binary, continuous, and time-to-event data. Its key objectives include:
Biomarker Confirmation/Validation: This feature employs both single-variable and multivariable regression techniques to confirm and validate established biomarkers.
Biomarker Discovery: The package streamlines the identification of new biomarkers through variable selection methods like LASSO2, LASSO_plus, and LASSO2plus.
Predictive Model Development: By harnessing a fusion of machine learning and traditional statistical tools, this process facilitates the creation of predictive models focused on specific biomarkers.
Model Prediction: Developed models can predict outcomes when applied to new datasets.
Model Validation: These models validate outcomes when applied to novel datasets, provided an outcome variable is present.
To simplify the modeling process, we’ve designed an all-in-one function capable of managing predictive model development, prediction, and validation for all eight methods within this package across three distinct outcome types. This versatile function streamlines the process, allowing for a concise implementation with just a single function call. It can handle a single method with single or multiple outcome variables. Moreover, if a validation dataset is available, the prediction and validation processes can seamlessly integrate into a unified operation.
In addition to these core functionalities, the csmpv package introduces a unique approach allowing the creation of binary risk classification models based on survival models. This innovative feature enables predicting risk binary outcomes for new datasets using the developed model. Please note, the external validation of this model is limited due to the absence of binary classification variables in new datasets. Despite this limitation, the predicted binary classification can serve as a surrogate biomarker, and its correlation with survival outcomes in new datasets can be tested when survival outcome information is available.
To enhance user experience, the csmpv R package focuses on streamlining coding efforts. Each user-end function acts as a comprehensive wrapper condensing multiple analyses into a single function call. Additionally, result files are conveniently saved locally, further simplifying the analytical process.
The csmpv package is available on CRAN, and it can be directly installed in R using the following command:
install.packages("csmpv")
Alternatively, let’s proceed to install csmpv from GitHub using the devtools or remotes R package.
# Install devtools package if not already installed
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("devtools")
# Install csmpv package from GitHub
devtools::install_github("ajiangsfu/csmpv",force = TRUE)
# Using force = TRUE will ensure the installation, overriding any existing versions
# Install remotes package if not already installed
install.packages("remotes")
# Install csmpv package from GitHub
remotes::install_github("ajiangsfu/csmpv",force = TRUE)
# Using force = TRUE will ensure the installation, overriding any existing versions
Both methods will download and install the csmpv package from the GitHub repository. Please ensure an active internet connection and the necessary dependencies for a successful installation.
In this section, we will show some example code, however, before that, we will introduce example data first.
The example data was extracted from our in-house diffuse large B-cell lymphoma (DLBCL) dataset, specifically utilizing supplemental Table S1 from Alduaij et al. (2023, DOI: 10.1182/blood.2022018248).
To handle missing values, we retained only samples with all the necessary variables and no missing values. Furthermore, to ensure compatibility with all eight modeling methods within csmpv, we transformed all categorical variables into binary format, overcoming limitations in XGBoost (eXtreme Gradient Boosting) and LASSO (Least Absolute Shrinkage and Selection Operator) when dealing with categorical variables with more than two levels.
Following these procedures, an object named datlist was generated and is included in csmpv, accessible straightforwardly after installing and loading csmpv, as demonstrated below.
library(csmpv)
data("datlist", package = "csmpv")
tdat = datlist$training
dim(tdat)
## [1] 216 22
vdat = datlist$validation
dim(vdat)
## [1] 217 22
Subsequently, we defined three outcome variables and their respective independent variables.
To illustrate different types of outcome variables, we’ll define examples for binary, continuous, and time-to-event categories: - Binary: DZsig (dark zone signature) - Continuous: Age - Time-to-event: FFP (freedom from progression)
For binary and time-to-event variables, independent variables are defined as:
Xvars = c("B.Symptoms","MYC.IHC","BCL2.IHC", "CD10.IHC","BCL6.IHC",
"MUM1.IHC","Male","AgeOver60", "stage3_4","PS1","LDH.Ratio1",
"Extranodal1","Bulk10cm","HANS_GCB", "DTI")
For the continuous variable, the corresponding independent variables align with those above, excluding AgeOver60 due to its correlation with the outcome variable Age:
AgeXvars = setdiff(Xvars, "AgeOver60")
To enhance reproducibility and minimize variability from random number generation, we established and set a specific random seed:
set.seed(12345)
Users can define their own temporary directory to save all results. If not, tempdir() can be used to get the system’s temporary directory.
temp_dir = tempdir()
knitr::opts_knit$set(root.dir = temp_dir)
Whether this procedure is labeled as biomarker confirmation, validation, or testing, the fundamental aspect involves regular regression analyses on both single and multiple variables across three distinct outcome categories: binary, continuous, and time-to-event. In this context, our objective is to assess the presence of an association between outcomes and a set of independent variables. It’s important to note that this differs from model validation, which will be covered subsequently.
To confirm biomarkers for binary outcomes:
bconfirm = confirmVars(data = tdat, biomks = Xvars, Y = "DZsig",
outfile = "confirmBinary")
The confirmVars function acts as a wrapper, invoking various functions to perform regression analysis based on different outcome types. By default, the outcome type is binary, requiring no explicit specification when handling binary outcomes.
Upon execution, the bconfirm object comprises a multivariable model and a list of two forest plots. The first plot consolidates individual forest plots for each single variable, while the second represents the forest plot for the multivariable model. These outputs are locally saved, along with a combined table containing models for each single variable.
print(bconfirm$fit)
##
## Call: glm(formula = f1, family = "binomial", data = datain)
##
## Coefficients:
## (Intercept) B.Symptoms MYC.IHC BCL2.IHC CD10.IHC BCL6.IHC
## -25.32435 -1.64452 3.51205 1.05367 4.04662 1.46629
## MUM1.IHC Male AgeOver60 stage3_4 PS1 LDH.Ratio1
## -2.67217 1.23337 0.22958 1.02946 2.44845 0.86299
## Extranodal1 Bulk10cm HANS_GCB DTI
## 1.04828 -1.04460 15.76444 -0.02641
##
## Degrees of Freedom: 215 Total (i.e. Null); 200 Residual
## Null Deviance: 154.8
## Residual Deviance: 66.87 AIC: 98.87
bconfirm$allplot[[2]]
For instance, the above output showcases a multivariable model and it’s corresponding forest plot.
To confirm biomarkers for continuous outcomes:
cconfirm = confirmVars(data = tdat, biomks = AgeXvars, Y = "Age",
outcomeType = "continuous",
outfile = "confirmContinuous")
The same confirmVars function is called; however, this time, we specify the outcome type as continuous.
In a similar fashion, the cconfirm object comprises two elements: a multivariable model and a list of two forest plots. The first plot consolidates all forest plots for each single variable, while the second represents the forest plot for the multivariable model. All these outputs are saved locally, accompanied by a combined table containing models for each single variable.
Below, you’ll find the multivariable model and it’s corresponding forest plot:
print(cconfirm$fit)
##
## Call: glm(formula = f1, data = datain)
##
## Coefficients:
## (Intercept) B.Symptoms MYC.IHC BCL2.IHC CD10.IHC BCL6.IHC
## 62.21516 -4.12244 1.76448 2.10272 -0.73413 1.65380
## MUM1.IHC Male stage3_4 PS1 LDH.Ratio1 Extranodal1
## 1.47318 -1.47252 -1.24508 7.81629 2.41929 -4.91746
## Bulk10cm HANS_GCB DTI
## -1.91890 0.09254 -0.03078
##
## Degrees of Freedom: 215 Total (i.e. Null); 201 Residual
## Null Deviance: 41610
## Residual Deviance: 37060 AIC: 1756
cconfirm$allplot[[2]]