The csmpv R package provides an extensive set of functions that encompass biomarker confirmation, variable selection, modeling, predictive analysis, and validation. It can handle diverse outcome variable types, including binary, continuous, and time-to-event data. Its key objectives include:

To simplify the modeling process, we’ve designed an all-in-one function capable of managing predictive model development, prediction, and validation for all eight methods within this package across three distinct outcome types. This versatile function streamlines the process, allowing for a concise implementation with just a single function call. It can handle a single method with single or multiple outcome variables. Moreover, if a validation dataset is available, the prediction and validation processes can seamlessly integrate into a unified operation.

In addition to these core functionalities, the csmpv package introduces a unique approach allowing the creation of binary risk classification models based on survival models. This innovative feature enables predicting risk binary outcomes for new datasets using the developed model. Please note, the external validation of this model is limited due to the absence of binary classification variables in new datasets. Despite this limitation, the predicted binary classification can serve as a surrogate biomarker, and its correlation with survival outcomes in new datasets can be tested when survival outcome information is available.

To enhance user experience, the csmpv R package focuses on streamlining coding efforts. Each user-end function acts as a comprehensive wrapper condensing multiple analyses into a single function call. Additionally, result files are conveniently saved locally, further simplifying the analytical process.

I Installation

The csmpv package is available on CRAN, and it can be directly installed in R using the following command:

install.packages("csmpv")

Alternatively, let’s proceed to install csmpv from GitHub using the devtools or remotes R package.

# Install devtools package if not already installed
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("devtools")

# Install csmpv package from GitHub
devtools::install_github("ajiangsfu/csmpv",force = TRUE)
# Using force = TRUE will ensure the installation, overriding any existing versions
# Install remotes package if not already installed
install.packages("remotes")
# Install csmpv package from GitHub
remotes::install_github("ajiangsfu/csmpv",force = TRUE)
# Using force = TRUE will ensure the installation, overriding any existing versions

Both methods will download and install the csmpv package from the GitHub repository. Please ensure an active internet connection and the necessary dependencies for a successful installation.

II Example Code

In this section, we will show some example code, however, before that, we will introduce example data first.

1. Example data

The example data was extracted from our in-house diffuse large B-cell lymphoma (DLBCL) dataset, specifically utilizing supplemental Table S1 from Alduaij et al. (2023, DOI: 10.1182/blood.2022018248).

To handle missing values, we retained only samples with all the necessary variables and no missing values. Furthermore, to ensure compatibility with all eight modeling methods within csmpv, we transformed all categorical variables into binary format, overcoming limitations in XGBoost (eXtreme Gradient Boosting) and LASSO (Least Absolute Shrinkage and Selection Operator) when dealing with categorical variables with more than two levels.

Following these procedures, an object named datlist was generated and is included in csmpv, accessible straightforwardly after installing and loading csmpv, as demonstrated below.

library(csmpv)
data("datlist", package = "csmpv")
tdat = datlist$training
dim(tdat)
## [1] 216  22
vdat = datlist$validation
dim(vdat)
## [1] 217  22

Subsequently, we defined three outcome variables and their respective independent variables.

To illustrate different types of outcome variables, we’ll define examples for binary, continuous, and time-to-event categories: - Binary: DZsig (dark zone signature) - Continuous: Age - Time-to-event: FFP (freedom from progression)

For binary and time-to-event variables, independent variables are defined as:

Xvars = c("B.Symptoms","MYC.IHC","BCL2.IHC", "CD10.IHC","BCL6.IHC",
 "MUM1.IHC","Male","AgeOver60", "stage3_4","PS1","LDH.Ratio1",
 "Extranodal1","Bulk10cm","HANS_GCB", "DTI")

For the continuous variable, the corresponding independent variables align with those above, excluding AgeOver60 due to its correlation with the outcome variable Age:

AgeXvars = setdiff(Xvars, "AgeOver60")

To enhance reproducibility and minimize variability from random number generation, we established and set a specific random seed:

set.seed(12345)

Users can define their own temporary directory to save all results. If not, tempdir() can be used to get the system’s temporary directory.

temp_dir = tempdir()
knitr::opts_knit$set(root.dir = temp_dir)

2. Biomarker confirmation/validation

Whether this procedure is labeled as biomarker confirmation, validation, or testing, the fundamental aspect involves regular regression analyses on both single and multiple variables across three distinct outcome categories: binary, continuous, and time-to-event. In this context, our objective is to assess the presence of an association between outcomes and a set of independent variables. It’s important to note that this differs from model validation, which will be covered subsequently.

2.1 Binary outcome

To confirm biomarkers for binary outcomes:

bconfirm = confirmVars(data = tdat, biomks = Xvars, Y = "DZsig",
                       outfile = "confirmBinary")

The confirmVars function acts as a wrapper, invoking various functions to perform regression analysis based on different outcome types. By default, the outcome type is binary, requiring no explicit specification when handling binary outcomes.

Upon execution, the bconfirm object comprises a multivariable model and a list of two forest plots. The first plot consolidates individual forest plots for each single variable, while the second represents the forest plot for the multivariable model. These outputs are locally saved, along with a combined table containing models for each single variable.

print(bconfirm$fit)
## 
## Call:  glm(formula = f1, family = "binomial", data = datain)
## 
## Coefficients:
## (Intercept)   B.Symptoms      MYC.IHC     BCL2.IHC     CD10.IHC     BCL6.IHC  
##   -25.32435     -1.64452      3.51205      1.05367      4.04662      1.46629  
##    MUM1.IHC         Male    AgeOver60     stage3_4          PS1   LDH.Ratio1  
##    -2.67217      1.23337      0.22958      1.02946      2.44845      0.86299  
## Extranodal1     Bulk10cm     HANS_GCB          DTI  
##     1.04828     -1.04460     15.76444     -0.02641  
## 
## Degrees of Freedom: 215 Total (i.e. Null);  200 Residual
## Null Deviance:       154.8 
## Residual Deviance: 66.87     AIC: 98.87
bconfirm$allplot[[2]]

For instance, the above output showcases a multivariable model and it’s corresponding forest plot.

2.2 Continous outcome

To confirm biomarkers for continuous outcomes:

cconfirm = confirmVars(data = tdat, biomks = AgeXvars, Y = "Age",
                       outcomeType = "continuous",
                       outfile = "confirmContinuous")

The same confirmVars function is called; however, this time, we specify the outcome type as continuous.

In a similar fashion, the cconfirm object comprises two elements: a multivariable model and a list of two forest plots. The first plot consolidates all forest plots for each single variable, while the second represents the forest plot for the multivariable model. All these outputs are saved locally, accompanied by a combined table containing models for each single variable.

Below, you’ll find the multivariable model and it’s corresponding forest plot:

print(cconfirm$fit)
## 
## Call:  glm(formula = f1, data = datain)
## 
## Coefficients:
## (Intercept)   B.Symptoms      MYC.IHC     BCL2.IHC     CD10.IHC     BCL6.IHC  
##    62.21516     -4.12244      1.76448      2.10272     -0.73413      1.65380  
##    MUM1.IHC         Male     stage3_4          PS1   LDH.Ratio1  Extranodal1  
##     1.47318     -1.47252     -1.24508      7.81629      2.41929     -4.91746  
##    Bulk10cm     HANS_GCB          DTI  
##    -1.91890      0.09254     -0.03078  
## 
## Degrees of Freedom: 215 Total (i.e. Null);  201 Residual
## Null Deviance:       41610 
## Residual Deviance: 37060     AIC: 1756
cconfirm$allplot[[2]]