Introduction
This vignette provides a step-by-step guide to tuning the parameters
of the
PLSKO' function. PLSKO is a knockoff generator that generates knockoff features for a given dataset. ThePLSKO’
function requires several parameters to be specified, and tuning these
parameters can impact the performance of the knockoff generator and make
ensure that the generated knockoff features are valid. This vignette
provides a detailed explanation of the parameters of the PLSKO function
and demonstrates how to tune these parameters using semi-simulation.
Parameters of the PLSKO
Function
The PLSKO function requires several parameters to be specified. The
key parameters of the PLSKO function are as follows:
Number of components (ncomp): An integer specifying
the number of components to use in the PLS regression. Generally, using
more components leads 1) more variance between variables captured, 2)
makes the knockoff variables more similar to the original variables,
therefore 3) lower FDR and lower power, which means 4) fewer
discoveries. Also, more components increase the computational
cost.
Threshold of neighbourhood definition (threshold.abs
or threshold.q): The threshold of neighbourhood definition
is used to define the neighbour variables. In default setting of PLSKO
and this The neighbourhood information is used to capture the
correlation structure between variables. The threshold can be defined as
an absolute value (threshold.abs) or as a quantile
(threshold.q) of the absolute values of the correlation
coefficients between variables. A higher threshold leads to a sparser
neighbourhood structure, which can improve the performance of the
knockoff generator.
Sparsity level (sparsity): A numeric value between 0
(excluded) and 1 specifying the sparsity level in the PLS regression. 1
means no sparsity. Proper sparsity level can improve the performance of
the knockoff generator in effectively controlling the false discovery
rate (FDR) and increasing the power.
General idea
The idea is to use semi-simulation to evaluate the empirical FDR and
power of the PLSKO function with different parameter values and select
the optimal parameter values which can achieve the desired FDR.
According to Model-X knockoff definition, knockoff variables are
generated using only, regardless of the response variable . Therefore,
as long as we have a valid knockoff variable set, we can use it for any
response variable , and make sure the FDR is controlled at the desired
level.
Semi-simulation means that we generate the response variable (i.e. an
artificial ) with known true coefficients with using the real data we
are going to use (i.e. ) and use the PLSKO function to generate knockoff
features. We then compare the selected variables with the true
coefficients to evaluate the FDR and power of the PLSKO function. By
repeating this process with different parameter values, we can identify
the optimal parameter values that achieve the desired FDR.
Main function:
plsko_tuning
The plsko_tuning function is used to tune the parameters
of the PLSKO function, allowing users to test a series of parameter
value combinations for their dataset. Apart from the parameter values to
be tested, the plsko_tuning function also requires some
important input, including: - X: Your real omics data.
n_ko: The number of knockoff variables to generate
for each parameter configuration. Default is 10. Please note that the
larger the number of knockoff variables, the more confident the tuning
results, but the more computational cost.
q: The false discovery rate (FDR) level to control.
Default is 0.05. Usually based on your own research design.
fdp.measure: The measure to assess the FDR of the
certain parameter configuration over repeated n_ko
simulations. Default is median, which means we use the
median of the FDPs over n_ko simulations to represent the
FDR of the certain parameter configuration. Other options include
mean, either (either mean or median under than
the target level), both (require both of mean and median of
FDP smaller than FDR). This measure does not affect the tuning results,
but only affects the selection of the optimal parameter
configuration.
early.stop: A logical value indicating whether to
stop the simulation early if the FDR of a certain parameter
configuration is already smaller than the target level. Default is
FALSE. If TRUE, the simulation will stop early
if the FDR of a certain parameter configuration is already smaller than
the target level, which can save computational time. If
FALSE, the simulation will continue until all parameter
configurations are tested.
p_s: The number of assumed true variables in the
data. Default is 10% of the number of variables. This value is preferred
to be larger in large dataset, e.g. 1/q, to ensure the estimate of FDR
is stable and more accurate.
Tips
The plsko_tuning function tests across the
combinations of different parameters. For example, when you test 5
different values for ncomp, 5 different values for
threshold.abs, and 5 different values for
sparsity, the total number of parameter configurations to
test is 5 * 5 * 5 = 125. Adding up the number of knockoff variables
(n_ko) to generate for each parameter configuration, the
total number of simulations is 125 * n_ko, which means 125
* n_ko knockoff variables set will be generated and
evaluated. Therefore, the computational cost can be high when testing a
large number of parameter configurations. It is recommended to start
with a small number of parameter configurations and increase the number
of configurations gradually based on the computational resources
available.
The plsko_tuning function is computationally
intensive, especially when the number of knockoff variables
(n_ko) is large or the number of parameter configurations
to test is large. Therefore, it is recommended to run the function in
parallel computing and with sufficient computational resources.
A single run of knockoff generation might take long time,
especially when the number of variables in X or the
components (ncomp) is large. For example, a single run of
knockoff generation for X with 100 variables and 100
samples might take about 2 seconds; a single run of knockoff generation
for X with 1200 variables, 150 samples and 5 components
might take about 25 seconds. It is recommended to test the computational
cost of a single run of knockoff generation before running the
plsko_tuning function.
The plsko_tuning function gives the optimal of
absolute threshold threshold.abs only.
Case 1: cfRNA data for
pre-eclampsia study
Here we show two examples of tuning the parameters of the PLSKO
function using semi-simulation on real data.
library(PLSKO)
data(cfRNA_placenta)
str(cfRNA_placenta)
#> List of 2
#> $ counts : num [1:71, 1:81] 6.97 5.07 6.92 5.85 6.35 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:71] "501143_1" "505372_1" "512025_1" "515494_1" ...
#> .. ..$ : chr [1:81] "ENSG00000003436" "ENSG00000004939" "ENSG00000007944" "ENSG00000035862" ...
#> $ metadata:'data.frame': 71 obs. of 9 variables:
#> ..$ group : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ lib.size : num [1:71] 48955 108046 86493 120683 217499 ...
#> ..$ norm.factors: num [1:71] 1.196 1.342 0.907 1.286 0.852 ...
#> ..$ title : chr [1:71] "501143_1" "505372_1" "512025_1" "515494_1" ...
#> ..$ disease : chr [1:71] "control" "control" "control" "pre-eclampsia" ...
#> ..$ cohort : chr [1:71] "Validation 1" "Discovery" "Validation 1" "Discovery" ...
#> ..$ time : chr [1:71] "≤12 weeks gestation" "≤12 weeks gestation" "≤12 weeks gestation" "≤12 weeks gestation" ...
#> ..$ patientID : chr [1:71] "501143" "505372" "512025" "515494" ...
#> ..$ PE : Factor w/ 2 levels "control","PE": 1 1 1 2 2 1 1 1 1 1 ...
# Basic input
X <- cfRNA_placenta$counts
n_ko <- 10
q <- 0.05
p_s <- 10
# Parameters to test
ncomp <- c(3, 5, 7)
threshold.abs <- c(0, 0.3, 0.5, 0.8)
sparsity <- 1
# We test on:
expand.grid(ncomp = ncomp, threshold.abs = threshold.abs, sparsity = sparsity)
# Run the tuning
start.time <- Sys.time()
cfRNA_tune <- plsko_tuning(X, n_ko = n_ko, p_s = p_s, q = q, ncomp = ncomp, threshold.abs = threshold.abs, sparsity = sparsity)
#> [1] "Optimal configuration: ncomp = 7, threshold.abs = 0.3, sparsity = 1, median FDP = 0"
end.time <- Sys.time()
# Print the results
plot(cfRNA_tune)

# Print the time
end.time - start.time
#> Time difference of 1.448801 mins
The plsko_tuning function run for 2 minutes to test 12
parameter configurations with 10 repeated simulations for each
configuration on a 71 * 81 dataset working on 31 machines parallel. We
observed that the neighbourhood threshold threshold.abs has
a significant impact on the FDR and power of the PLSKO function, while
the number of components ncomp have a moderate impact. The
optimal parameter configuration is ncomp = 2,
threshold.abs = 0.2, and sparsity = 1,
although the mean of FDP is slightly higher than the target level.
Then we apply this optimal parameter configuration to the real data
to generate the knockoff variables and select the important
variables.
y <- cfRNA_placenta$metadata$PE
ncomp <- cfRNA_tune$optimal$ncomp
threshold.abs <- cfRNA_tune$optimal$threshold.abs
sparsity <- cfRNA_tune$optimal$sparsity
cfRNA_result <- plsAKO(X, y, n_ko = 25, ncomp = ncomp, threshold.abs = threshold.abs, sparsity = sparsity)
#> Warning in plsAKO(X, y, n_ko = 25, ncomp = ncomp, threshold.abs =
#> threshold.abs, : lasso.lcd is not applicable for categorical response variable,
#> change to lasso.logistic
print(cfRNA_result)
#> Call:
#> plsAKO(X = X, y = y, n_ko = 25, ncomp = ncomp, threshold.abs = threshold.abs,
#> sparsity = sparsity)
#>
#> Selected variables (AKO):
#> ENSG00000076770 ENSG00000112419 ENSG00000112773 ENSG00000131149 ENSG00000136929
#> 11 24 25 33 42
#> ENSG00000172331 ENSG00000198517
#> 63 75
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#>
#> 7 11 12 17 24 25 33 42 62 63 64 75 78 80
#> 1 17 3 6 23 23 15 12 1 12 3 11 4 1
# Or if you like using the configuration with the lowest average FDP
user_optimal <- cfRNA_tune$mean[which.min(cfRNA_tune$mean$mean.fdp), ]
cfRNA_result_alt <- plsAKO(X, y, n_ko = 25, ncomp = user_optimal$ncomp, threshold.abs = user_optimal$threshold.abs, sparsity = user_optimal$sparsity)
#> Warning in plsAKO(X, y, n_ko = 25, ncomp = user_optimal$ncomp, threshold.abs =
#> user_optimal$threshold.abs, : lasso.lcd is not applicable for categorical
#> response variable, change to lasso.logistic
print(cfRNA_result_alt)
#> Call:
#> plsAKO(X = X, y = y, n_ko = 25, ncomp = user_optimal$ncomp, threshold.abs = user_optimal$threshold.abs,
#> sparsity = user_optimal$sparsity)
#>
#> Selected variables (AKO):
#> ENSG00000112419 ENSG00000112773 ENSG00000131149 ENSG00000136929
#> 24 25 33 42
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#>
#> 11 17 24 25 33 42 63 64 75 78
#> 5 2 19 14 13 9 2 3 4 1
Case 2: Proteomics data
for late pre-eclampsia study
Now we show another example of tuning the parameters of the PLSKO
function using semi-simulation on real data. This is a proteomics study
of late on-set pre-eclampsia in placenta tissue, generated from
SomaLogic platform. The data is preprocessed and ready to use.
# download preprocessed data from github
url <- "https://raw.github.com/guannan-yang/PLSKO/main/paper%20codes/plos1Prot/SL_prot_time1_data.rds"
path <- "C:/Users/guannany/OneDrive - The University of Melbourne/PhD/random/"
download.file(url, destfile = paste0(path, "SL_prot_time1_data.rds"), mode = "wb")
SL_prot_time1_data <- readRDS(paste0(path, "SL_prot_time1_data.rds"))
str(SL_prot_time1_data, list.len = 5)
#> List of 3
#> $ metadata:'data.frame': 166 obs. of 3 variables:
#> ..$ ID : num [1:166] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ GA : num [1:166] 12.3 9.7 8 12.9 15.1 12.6 22.9 13.3 16.6 22.9 ...
#> ..$ LatePE: num [1:166] 2 1 1 1 2 2 2 1 2 2 ...
#> $ log_expr:'data.frame': 166 obs. of 1125 variables:
#> ..$ C4b : num [1:166] 10.68 11.18 11.63 9.61 11.37 ...
#> ..$ Coagulation Factor XI : num [1:166] 11.3 10.7 10.8 10.7 10.6 ...
#> ..$ CTACK : num [1:166] 10.8 10.7 10.9 10.5 10.7 ...
#> ..$ Endostatin : num [1:166] 15.9 15.8 16 15.6 16 ...
#> ..$ TIMP-1 : num [1:166] 8.44 8.63 8.22 7.82 8.34 ...
#> .. [list output truncated]
#> $ expr :'data.frame': 166 obs. of 1125 variables:
#> ..$ C4b : num [1:166] 1640 2327 3173 779 2654 ...
#> ..$ Coagulation Factor XI : num [1:166] 2514 1718 1751 1674 1606 ...
#> ..$ CTACK : num [1:166] 1759 1628 1849 1483 1614 ...
#> ..$ Endostatin : num [1:166] 59633 56729 67505 51188 66199 ...
#> ..$ TIMP-1 : num [1:166] 347 395 298 226 324 ...
#> .. [list output truncated]
The data contains 166 samples and 1125 proteins. We want to identify
the proteins that are related to pre-eclampsia. We keep 500 proteins
with the largest variance for the analysis.
Let’s tune on this data.
# Keep the most variable 500 proteins
SL_prot_time1_logexpr_top500 <- SL_prot_time1_data$log_expr[, order(apply(SL_prot_time1_data$log_expr, 2, var), decreasing = TRUE)[1:500]]
# Basic input
X <- SL_prot_time1_logexpr_top500
n_ko <- 25
q <- 0.05
p_s <- 25
# Parameters to test
ncomp <- c(4, 6, 8, 10)
threshold.abs <- c(0, 0.1, 0.3, 0.5)
sparsity <- c(0.5, 0.8, 1)
# Run the tuning
start.time <- Sys.time()
prot_tune <- plsko_tuning(as.matrix(X), n_ko = n_ko, p_s = p_s, q = q, ncomp = ncomp, threshold.abs = threshold.abs, sparsity = sparsity)
#> [1] "Optimal configuration: ncomp = 10, threshold.abs = 0.1, sparsity = 0.5, median FDP = 0.0384615384615385"
end.time <- Sys.time()
difftime(end.time, start.time, units = "secs")
#> Time difference of 1203.497 secs
plot(prot_tune)
It took 28 minutes to test 48 parameter configurations with 25 repeated
simulations for each configuration on a 166 * 1125 dataset working on 31
machines parallel. We observed that the the neighbourhood threshold
threshold.abs and number of components ncomp
have a significant impact on the FDR and power of the PLSKO function,
while the sparsity level sparsity have a slight impact.
The optimal parameter configuration is ncomp = 10,
threshold.abs = 0.1, and sparsity = 0.5, with
the mean of FDP is slightly higher than the target level, and median of
FDP at 0.038.
Then we apply this optimal parameter configuration to the real data
to generate the knockoff variables and select the important
variables.
y <- as.factor(SL_prot_time1_data$metadata$LatePE)
ncomp <- prot_tune$optimal$ncomp
threshold.abs <- prot_tune$optimal$threshold.abs
sparsity <- prot_tune$optimal$sparsity
prot_result <- plsAKO(as.matrix(X), y, n_ko = 25, q = 0.05,
ncomp = ncomp, threshold.abs = threshold.abs, sparsity = sparsity,
w.method = "lasso.logistic")
print(prot_result)
#> Call:
#> plsAKO(X = as.matrix(X), y = y, n_ko = 25, q = 0.05, w.method = "lasso.logistic",
#> ncomp = ncomp, threshold.abs = threshold.abs, sparsity = sparsity)
#>
#> Selected variables (AKO):
#> Thrombin PTK6 MMP-7
#> 12 102 226
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#>
#> 12 14 42 102 202 226 407 416
#> 13 2 2 10 1 11 1 5
PLS-AKO selects 3 important proteins that are related to late on-set
pre-eclampsia in placenta tissue, including Thrombin, PTK6 and MMP-7,
with selection frequency of 12, 10 and 11 of 25 repeated PLSKO runs,
respectively.
Conclusion
Each dataset is unique, and the performance of the PLSKO function can
be influenced by the dataset characteristics and the parameter values
used. By tuning the parameter of PLSKO function using semi-simulation,
we can optimise parameter values that achieve the desired FDR with more
confidence.
For more information on the PLSKO function and the
plsko_tuning function, please refer to the package
documentation and the function help files.
Please cite
Yang, G., Menkhorst, E., Dimitriadis, E., & Le Cao, K. A. (2024).
PLSKO: a robust knockoff generator to control false discovery rate in
omics variable selection. bioRxiv, 2024-08.
Session Info
sessionInfo()
#> R version 4.3.0 (2023-04-21 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=C LC_CTYPE=English_Australia.utf8
#> [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
#> [5] LC_TIME=English_Australia.utf8
#>
#> time zone: Australia/Sydney
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] PLSKO_0.2.0
#>
#> loaded via a namespace (and not attached):
#> [1] gridExtra_2.3 remotes_2.5.0 rlang_1.1.5
#> [4] magrittr_2.0.3 matrixStats_1.5.0 compiler_4.3.0
#> [7] roxygen2_7.3.2 vctrs_0.6.5 reshape2_1.4.4
#> [10] stringr_1.5.1 profvis_0.4.0 pkgconfig_2.0.3
#> [13] shape_1.4.6.1 crayon_1.5.3 fastmap_1.2.0
#> [16] backports_1.5.0 ellipsis_0.3.2 promises_1.3.2
#> [19] rmarkdown_2.29 sessioninfo_1.2.2 purrr_1.0.2
#> [22] xfun_0.50 glmnet_4.1-8 cachem_1.1.0
#> [25] jsonlite_1.8.9 knockoff_0.3.6 progress_1.2.3
#> [28] later_1.4.1 BiocParallel_1.36.0 broom_1.0.7
#> [31] parallel_4.3.0 prettyunits_1.2.0 R6_2.5.1
#> [34] bslib_0.9.0 stringi_1.8.4 RColorBrewer_1.1-3
#> [37] car_3.1-3 pkgload_1.4.0 jquerylib_0.1.4
#> [40] Rcpp_1.0.14 iterators_1.0.14 knitr_1.49
#> [43] usethis_3.1.0 httpuv_1.6.15 Matrix_1.6-1.1
#> [46] splines_4.3.0 igraph_2.1.4 tidyselect_1.2.1
#> [49] rstudioapi_0.17.1 abind_1.4-8 yaml_2.3.10
#> [52] doParallel_1.0.17 codetools_0.2-20 miniUI_0.1.1.1
#> [55] pkgbuild_1.4.6 lattice_0.22-6 tibble_3.2.1
#> [58] plyr_1.8.9 shiny_1.10.0 withr_3.0.2
#> [61] rARPACK_0.11-0 evaluate_1.0.3 desc_1.4.3
#> [64] survival_3.8-3 urlchecker_1.0.1 xml2_1.3.6
#> [67] pillar_1.10.1 ggpubr_0.6.0 carData_3.0-5
#> [70] foreach_1.5.2 ellipse_0.5.0 generics_0.1.3
#> [73] rprojroot_2.0.4 hms_1.1.3 ggplot2_3.5.1
#> [76] munsell_0.5.1 scales_1.3.0 xtable_1.8-4
#> [79] glue_1.8.0 tools_4.3.0 RSpectra_0.16-2
#> [82] ggsignif_0.6.4 fs_1.6.5 cowplot_1.1.3
#> [85] grid_4.3.0 tidyr_1.3.1 devtools_2.4.5
#> [88] colorspace_2.1-1 Formula_1.2-5 cli_3.6.3
#> [91] mixOmics_6.26.0 dplyr_1.1.4 corpcor_1.6.10
#> [94] gtable_0.3.6 pls_2.8-5 rstatix_0.7.2
#> [97] sass_0.4.9 digest_0.6.37 ggrepel_0.9.6
#> [100] farver_2.1.2 htmlwidgets_1.6.4 memoise_2.0.1
#> [103] htmltools_0.5.8.1 lifecycle_1.0.4 mime_0.12
#> [106] MASS_7.3-58.4