1 Introduction

This vignette provides a step-by-step guide to tuning the parameters of the PLSKO' function. PLSKO is a knockoff generator that generates knockoff features for a given dataset. ThePLSKO’ function requires several parameters to be specified, and tuning these parameters can impact the performance of the knockoff generator and make ensure that the generated knockoff features are valid. This vignette provides a detailed explanation of the parameters of the PLSKO function and demonstrates how to tune these parameters using semi-simulation.

2 Parameters of the PLSKO Function

The PLSKO function requires several parameters to be specified. The key parameters of the PLSKO function are as follows:

  • Number of components (ncomp): An integer specifying the number of components to use in the PLS regression. Generally, using more components leads 1) more variance between variables captured, 2) makes the knockoff variables more similar to the original variables, therefore 3) lower FDR and lower power, which means 4) fewer discoveries. Also, more components increase the computational cost.

  • Threshold of neighbourhood definition (threshold.abs or threshold.q): The threshold of neighbourhood definition is used to define the neighbour variables. In default setting of PLSKO and this The neighbourhood information is used to capture the correlation structure between variables. The threshold can be defined as an absolute value (threshold.abs) or as a quantile (threshold.q) of the absolute values of the correlation coefficients between variables. A higher threshold leads to a sparser neighbourhood structure, which can improve the performance of the knockoff generator.

  • Sparsity level (sparsity): A numeric value between 0 (excluded) and 1 specifying the sparsity level in the PLS regression. 1 means no sparsity. Proper sparsity level can improve the performance of the knockoff generator in effectively controlling the false discovery rate (FDR) and increasing the power.

3 General idea

The idea is to use semi-simulation to evaluate the empirical FDR and power of the PLSKO function with different parameter values and select the optimal parameter values which can achieve the desired FDR.

According to Model-X knockoff definition, knockoff variables are generated using only, regardless of the response variable . Therefore, as long as we have a valid knockoff variable set, we can use it for any response variable , and make sure the FDR is controlled at the desired level.

Semi-simulation means that we generate the response variable (i.e. an artificial ) with known true coefficients with using the real data we are going to use (i.e. ) and use the PLSKO function to generate knockoff features. We then compare the selected variables with the true coefficients to evaluate the FDR and power of the PLSKO function. By repeating this process with different parameter values, we can identify the optimal parameter values that achieve the desired FDR.

4 Main function: plsko_tuning

The plsko_tuning function is used to tune the parameters of the PLSKO function, allowing users to test a series of parameter value combinations for their dataset. Apart from the parameter values to be tested, the plsko_tuning function also requires some important input, including: - X: Your real omics data.

  • n_ko: The number of knockoff variables to generate for each parameter configuration. Default is 10. Please note that the larger the number of knockoff variables, the more confident the tuning results, but the more computational cost.

  • q: The false discovery rate (FDR) level to control. Default is 0.05. Usually based on your own research design.

  • fdp.measure: The measure to assess the FDR of the certain parameter configuration over repeated n_ko simulations. Default is median, which means we use the median of the FDPs over n_ko simulations to represent the FDR of the certain parameter configuration. Other options include mean, either (either mean or median under than the target level), both (require both of mean and median of FDP smaller than FDR). This measure does not affect the tuning results, but only affects the selection of the optimal parameter configuration.

  • early.stop: A logical value indicating whether to stop the simulation early if the FDR of a certain parameter configuration is already smaller than the target level. Default is FALSE. If TRUE, the simulation will stop early if the FDR of a certain parameter configuration is already smaller than the target level, which can save computational time. If FALSE, the simulation will continue until all parameter configurations are tested.

  • p_s: The number of assumed true variables in the data. Default is 10% of the number of variables. This value is preferred to be larger in large dataset, e.g. 1/q, to ensure the estimate of FDR is stable and more accurate.

4.1 Tips

  • The plsko_tuning function tests across the combinations of different parameters. For example, when you test 5 different values for ncomp, 5 different values for threshold.abs, and 5 different values for sparsity, the total number of parameter configurations to test is 5 * 5 * 5 = 125. Adding up the number of knockoff variables (n_ko) to generate for each parameter configuration, the total number of simulations is 125 * n_ko, which means 125 * n_ko knockoff variables set will be generated and evaluated. Therefore, the computational cost can be high when testing a large number of parameter configurations. It is recommended to start with a small number of parameter configurations and increase the number of configurations gradually based on the computational resources available.

  • The plsko_tuning function is computationally intensive, especially when the number of knockoff variables (n_ko) is large or the number of parameter configurations to test is large. Therefore, it is recommended to run the function in parallel computing and with sufficient computational resources.

  • A single run of knockoff generation might take long time, especially when the number of variables in X or the components (ncomp) is large. For example, a single run of knockoff generation for X with 100 variables and 100 samples might take about 2 seconds; a single run of knockoff generation for X with 1200 variables, 150 samples and 5 components might take about 25 seconds. It is recommended to test the computational cost of a single run of knockoff generation before running the plsko_tuning function.

  • The plsko_tuning function gives the optimal of absolute threshold threshold.abs only.

5 Case 1: cfRNA data for pre-eclampsia study

Here we show two examples of tuning the parameters of the PLSKO function using semi-simulation on real data.

library(PLSKO)
data(cfRNA_placenta)
str(cfRNA_placenta)
#> List of 2
#>  $ counts  : num [1:71, 1:81] 6.97 5.07 6.92 5.85 6.35 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:71] "501143_1" "505372_1" "512025_1" "515494_1" ...
#>   .. ..$ : chr [1:81] "ENSG00000003436" "ENSG00000004939" "ENSG00000007944" "ENSG00000035862" ...
#>  $ metadata:'data.frame':    71 obs. of  9 variables:
#>   ..$ group       : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ lib.size    : num [1:71] 48955 108046 86493 120683 217499 ...
#>   ..$ norm.factors: num [1:71] 1.196 1.342 0.907 1.286 0.852 ...
#>   ..$ title       : chr [1:71] "501143_1" "505372_1" "512025_1" "515494_1" ...
#>   ..$ disease     : chr [1:71] "control" "control" "control" "pre-eclampsia" ...
#>   ..$ cohort      : chr [1:71] "Validation 1" "Discovery" "Validation 1" "Discovery" ...
#>   ..$ time        : chr [1:71] "≤12 weeks gestation" "≤12 weeks gestation" "≤12 weeks gestation" "≤12 weeks gestation" ...
#>   ..$ patientID   : chr [1:71] "501143" "505372" "512025" "515494" ...
#>   ..$ PE          : Factor w/ 2 levels "control","PE": 1 1 1 2 2 1 1 1 1 1 ...
# Basic input
X <- cfRNA_placenta$counts
n_ko <- 10
q <- 0.05
p_s <- 10

# Parameters to test
ncomp <- c(3, 5, 7)
threshold.abs <- c(0, 0.3, 0.5, 0.8)
sparsity <- 1
# We test on:
expand.grid(ncomp = ncomp, threshold.abs = threshold.abs, sparsity = sparsity)
# Run the tuning
start.time <- Sys.time()

cfRNA_tune <- plsko_tuning(X, n_ko = n_ko, p_s = p_s, q = q, ncomp = ncomp, threshold.abs = threshold.abs, sparsity = sparsity)
#> [1] "Optimal configuration: ncomp = 7, threshold.abs = 0.3, sparsity = 1, median FDP = 0"
end.time <- Sys.time()

# Print the results
plot(cfRNA_tune)

# Print the time
end.time - start.time
#> Time difference of 1.448801 mins

The plsko_tuning function run for 2 minutes to test 12 parameter configurations with 10 repeated simulations for each configuration on a 71 * 81 dataset working on 31 machines parallel. We observed that the neighbourhood threshold threshold.abs has a significant impact on the FDR and power of the PLSKO function, while the number of components ncomp have a moderate impact. The optimal parameter configuration is ncomp = 2, threshold.abs = 0.2, and sparsity = 1, although the mean of FDP is slightly higher than the target level.

Then we apply this optimal parameter configuration to the real data to generate the knockoff variables and select the important variables.

y <- cfRNA_placenta$metadata$PE
ncomp <- cfRNA_tune$optimal$ncomp
threshold.abs <- cfRNA_tune$optimal$threshold.abs
sparsity <- cfRNA_tune$optimal$sparsity

cfRNA_result <- plsAKO(X, y, n_ko = 25, ncomp = ncomp, threshold.abs = threshold.abs, sparsity = sparsity)
#> Warning in plsAKO(X, y, n_ko = 25, ncomp = ncomp, threshold.abs =
#> threshold.abs, : lasso.lcd is not applicable for categorical response variable,
#> change to lasso.logistic
print(cfRNA_result)
#> Call:
#> plsAKO(X = X, y = y, n_ko = 25, ncomp = ncomp, threshold.abs = threshold.abs,
#>     sparsity = sparsity)
#>
#> Selected variables (AKO):
#> ENSG00000076770 ENSG00000112419 ENSG00000112773 ENSG00000131149 ENSG00000136929
#>              11              24              25              33              42
#> ENSG00000172331 ENSG00000198517
#>              63              75
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#>
#>  7 11 12 17 24 25 33 42 62 63 64 75 78 80
#>  1 17  3  6 23 23 15 12  1 12  3 11  4  1
# Or if you like using the configuration with the lowest average FDP
user_optimal <- cfRNA_tune$mean[which.min(cfRNA_tune$mean$mean.fdp), ]
cfRNA_result_alt <- plsAKO(X, y, n_ko = 25, ncomp = user_optimal$ncomp, threshold.abs = user_optimal$threshold.abs, sparsity = user_optimal$sparsity)
#> Warning in plsAKO(X, y, n_ko = 25, ncomp = user_optimal$ncomp, threshold.abs =
#> user_optimal$threshold.abs, : lasso.lcd is not applicable for categorical
#> response variable, change to lasso.logistic
print(cfRNA_result_alt)
#> Call:
#> plsAKO(X = X, y = y, n_ko = 25, ncomp = user_optimal$ncomp, threshold.abs = user_optimal$threshold.abs,
#>     sparsity = user_optimal$sparsity)
#>
#> Selected variables (AKO):
#> ENSG00000112419 ENSG00000112773 ENSG00000131149 ENSG00000136929
#>              24              25              33              42
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#>
#> 11 17 24 25 33 42 63 64 75 78
#>  5  2 19 14 13  9  2  3  4  1

6 Case 2: Proteomics data for late pre-eclampsia study

Now we show another example of tuning the parameters of the PLSKO function using semi-simulation on real data. This is a proteomics study of late on-set pre-eclampsia in placenta tissue, generated from SomaLogic platform. The data is preprocessed and ready to use.

# download preprocessed data from github
url <- "https://raw.github.com/guannan-yang/PLSKO/main/paper%20codes/plos1Prot/SL_prot_time1_data.rds"
path <- "C:/Users/guannany/OneDrive - The University of Melbourne/PhD/random/"
download.file(url, destfile = paste0(path, "SL_prot_time1_data.rds"), mode = "wb")

SL_prot_time1_data <- readRDS(paste0(path, "SL_prot_time1_data.rds"))

str(SL_prot_time1_data, list.len = 5)
#> List of 3
#>  $ metadata:'data.frame':    166 obs. of  3 variables:
#>   ..$ ID    : num [1:166] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ GA    : num [1:166] 12.3 9.7 8 12.9 15.1 12.6 22.9 13.3 16.6 22.9 ...
#>   ..$ LatePE: num [1:166] 2 1 1 1 2 2 2 1 2 2 ...
#>  $ log_expr:'data.frame':    166 obs. of  1125 variables:
#>   ..$ C4b                                     : num [1:166] 10.68 11.18 11.63 9.61 11.37 ...
#>   ..$ Coagulation Factor XI                   : num [1:166] 11.3 10.7 10.8 10.7 10.6 ...
#>   ..$ CTACK                                   : num [1:166] 10.8 10.7 10.9 10.5 10.7 ...
#>   ..$ Endostatin                              : num [1:166] 15.9 15.8 16 15.6 16 ...
#>   ..$ TIMP-1                                  : num [1:166] 8.44 8.63 8.22 7.82 8.34 ...
#>   .. [list output truncated]
#>  $ expr    :'data.frame':    166 obs. of  1125 variables:
#>   ..$ C4b                                     : num [1:166] 1640 2327 3173 779 2654 ...
#>   ..$ Coagulation Factor XI                   : num [1:166] 2514 1718 1751 1674 1606 ...
#>   ..$ CTACK                                   : num [1:166] 1759 1628 1849 1483 1614 ...
#>   ..$ Endostatin                              : num [1:166] 59633 56729 67505 51188 66199 ...
#>   ..$ TIMP-1                                  : num [1:166] 347 395 298 226 324 ...
#>   .. [list output truncated]

The data contains 166 samples and 1125 proteins. We want to identify the proteins that are related to pre-eclampsia. We keep 500 proteins with the largest variance for the analysis.

Let’s tune on this data.

# Keep the most variable 500 proteins
SL_prot_time1_logexpr_top500 <- SL_prot_time1_data$log_expr[, order(apply(SL_prot_time1_data$log_expr, 2, var), decreasing = TRUE)[1:500]]

# Basic input
X <- SL_prot_time1_logexpr_top500
n_ko <- 25
q <- 0.05
p_s <- 25

# Parameters to test
ncomp <- c(4, 6, 8, 10)
threshold.abs <- c(0, 0.1, 0.3, 0.5)
sparsity <- c(0.5, 0.8, 1)

# Run the tuning
start.time <- Sys.time()
prot_tune <- plsko_tuning(as.matrix(X), n_ko = n_ko, p_s = p_s, q = q, ncomp = ncomp, threshold.abs = threshold.abs, sparsity = sparsity)
#> [1] "Optimal configuration: ncomp = 10, threshold.abs = 0.1, sparsity = 0.5, median FDP = 0.0384615384615385"
end.time <- Sys.time()

difftime(end.time, start.time, units = "secs")
#> Time difference of 1203.497 secs
plot(prot_tune)

It took 28 minutes to test 48 parameter configurations with 25 repeated simulations for each configuration on a 166 * 1125 dataset working on 31 machines parallel. We observed that the the neighbourhood threshold threshold.abs and number of components ncomp have a significant impact on the FDR and power of the PLSKO function, while the sparsity level sparsity have a slight impact.

The optimal parameter configuration is ncomp = 10, threshold.abs = 0.1, and sparsity = 0.5, with the mean of FDP is slightly higher than the target level, and median of FDP at 0.038.

Then we apply this optimal parameter configuration to the real data to generate the knockoff variables and select the important variables.

y <- as.factor(SL_prot_time1_data$metadata$LatePE)
ncomp <- prot_tune$optimal$ncomp
threshold.abs <- prot_tune$optimal$threshold.abs
sparsity <- prot_tune$optimal$sparsity

prot_result <- plsAKO(as.matrix(X), y, n_ko = 25, q = 0.05,
                      ncomp = ncomp, threshold.abs = threshold.abs, sparsity = sparsity,
                      w.method = "lasso.logistic")
print(prot_result)
#> Call:
#> plsAKO(X = as.matrix(X), y = y, n_ko = 25, q = 0.05, w.method = "lasso.logistic",
#>     ncomp = ncomp, threshold.abs = threshold.abs, sparsity = sparsity)
#>
#> Selected variables (AKO):
#> Thrombin     PTK6    MMP-7
#>       12      102      226
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#>
#>  12  14  42 102 202 226 407 416
#>  13   2   2  10   1  11   1   5

PLS-AKO selects 3 important proteins that are related to late on-set pre-eclampsia in placenta tissue, including Thrombin, PTK6 and MMP-7, with selection frequency of 12, 10 and 11 of 25 repeated PLSKO runs, respectively.

7 Conclusion

Each dataset is unique, and the performance of the PLSKO function can be influenced by the dataset characteristics and the parameter values used. By tuning the parameter of PLSKO function using semi-simulation, we can optimise parameter values that achieve the desired FDR with more confidence.

For more information on the PLSKO function and the plsko_tuning function, please refer to the package documentation and the function help files.

8 Please cite

Yang, G., Menkhorst, E., Dimitriadis, E., & Le Cao, K. A. (2024). PLSKO: a robust knockoff generator to control false discovery rate in omics variable selection. bioRxiv, 2024-08.

9 Session Info

sessionInfo()
#> R version 4.3.0 (2023-04-21 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=C                       LC_CTYPE=English_Australia.utf8
#> [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
#> [5] LC_TIME=English_Australia.utf8
#>
#> time zone: Australia/Sydney
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] PLSKO_0.2.0
#>
#> loaded via a namespace (and not attached):
#>   [1] gridExtra_2.3       remotes_2.5.0       rlang_1.1.5
#>   [4] magrittr_2.0.3      matrixStats_1.5.0   compiler_4.3.0
#>   [7] roxygen2_7.3.2      vctrs_0.6.5         reshape2_1.4.4
#>  [10] stringr_1.5.1       profvis_0.4.0       pkgconfig_2.0.3
#>  [13] shape_1.4.6.1       crayon_1.5.3        fastmap_1.2.0
#>  [16] backports_1.5.0     ellipsis_0.3.2      promises_1.3.2
#>  [19] rmarkdown_2.29      sessioninfo_1.2.2   purrr_1.0.2
#>  [22] xfun_0.50           glmnet_4.1-8        cachem_1.1.0
#>  [25] jsonlite_1.8.9      knockoff_0.3.6      progress_1.2.3
#>  [28] later_1.4.1         BiocParallel_1.36.0 broom_1.0.7
#>  [31] parallel_4.3.0      prettyunits_1.2.0   R6_2.5.1
#>  [34] bslib_0.9.0         stringi_1.8.4       RColorBrewer_1.1-3
#>  [37] car_3.1-3           pkgload_1.4.0       jquerylib_0.1.4
#>  [40] Rcpp_1.0.14         iterators_1.0.14    knitr_1.49
#>  [43] usethis_3.1.0       httpuv_1.6.15       Matrix_1.6-1.1
#>  [46] splines_4.3.0       igraph_2.1.4        tidyselect_1.2.1
#>  [49] rstudioapi_0.17.1   abind_1.4-8         yaml_2.3.10
#>  [52] doParallel_1.0.17   codetools_0.2-20    miniUI_0.1.1.1
#>  [55] pkgbuild_1.4.6      lattice_0.22-6      tibble_3.2.1
#>  [58] plyr_1.8.9          shiny_1.10.0        withr_3.0.2
#>  [61] rARPACK_0.11-0      evaluate_1.0.3      desc_1.4.3
#>  [64] survival_3.8-3      urlchecker_1.0.1    xml2_1.3.6
#>  [67] pillar_1.10.1       ggpubr_0.6.0        carData_3.0-5
#>  [70] foreach_1.5.2       ellipse_0.5.0       generics_0.1.3
#>  [73] rprojroot_2.0.4     hms_1.1.3           ggplot2_3.5.1
#>  [76] munsell_0.5.1       scales_1.3.0        xtable_1.8-4
#>  [79] glue_1.8.0          tools_4.3.0         RSpectra_0.16-2
#>  [82] ggsignif_0.6.4      fs_1.6.5            cowplot_1.1.3
#>  [85] grid_4.3.0          tidyr_1.3.1         devtools_2.4.5
#>  [88] colorspace_2.1-1    Formula_1.2-5       cli_3.6.3
#>  [91] mixOmics_6.26.0     dplyr_1.1.4         corpcor_1.6.10
#>  [94] gtable_0.3.6        pls_2.8-5           rstatix_0.7.2
#>  [97] sass_0.4.9          digest_0.6.37       ggrepel_0.9.6
#> [100] farver_2.1.2        htmlwidgets_1.6.4   memoise_2.0.1
#> [103] htmltools_0.5.8.1   lifecycle_1.0.4     mime_0.12
#> [106] MASS_7.3-58.4