This vignette illustrates the basic and advanced usage of the package to select important variables in omics data. PLSKO is a method that combines partial least squares (PLS) regression with knockoff filtering to select important variables in high-dimensional biological data with false discovery rate (FDR) control. The package provides functions to generate the knockoff variables, perform the knockoff filtering and aggregate multiple knockoff results. The package is designed to be user-friendly and flexible, allowing users to easily apply the PLSKO method to their own data.
The functions and pipeline are straightforward to use; in this vignette we demonstrate them through several examples. We show how to apply the PLSKO pipeline to simulated data and to a cell-free RNA-seq dataset, and how to use the separate functions in a more advanced and flexible way to customise the settings of the PLSKO method.
Here is an outline of the functions corresponding to the steps of the knockoff workflow; we also provide two pipeline functions that combine these steps.
| Main Steps of Knockoff Framework | Function | Auxiliary Function | Output (Class) | Pipeline: plsko_filter(), plsAKO() | Pipeline: ko_filter(), AKO_withKO() |
|---|---|---|---|---|---|
| Step 1: Knockoff Variable Generation | plsko() | r_criterion() (for ncomp estimation) | An n x p matrix of knockoff variables | ✓ | (Bring your own knockoff variables) |
| Step 2: Importance Score Calculation (W) | (import from pkg knockoff) | - | A vector of length p or an n_ko x p matrix | ✓ | ✓ |
| Step 3: Knockoff Filtering and Variable Selection | ko_withW(), AKO_withW() | - | A list (class "knockoff.result" or "AKO.result") with components statistic, selected, ako.selected | ✓ | ✓ |
The pipeline functions are designed to be user-friendly and easy to use, while the separate functions provide more flexibility for users to customise the settings of each step in the workflow.
You can install the development version of the package from GitHub using the following code:
# install.packages("devtools")
#devtools::install_github("guannan-yang/PLSKO/PLSKO", quiet = TRUE, upgrade = "never")
library(PLSKO)
#> Registered S3 method overwritten by 'PLSKO':
#> method from
#> print.knockoff.result knockoff
# If warnings about installing dependencies appear, please install the dependencies manually by running the following code:
# install.packages(c("knockoff","progress", "parallel", "doParallel", "foreach"))
#
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("mixOmics")
plsko_filter and plsAKO on simulated data

In this example, we generate a simulated dataset with a response variable and show how to apply the PLSKO method to select important variables with a single function call. We also show how to customise the settings of the PLSKO method to improve the performance of the variable selection.
set.seed(1)
n = 100 # number of samples
p = 150 # number of variables ( p > n, high-dimensional setting )
k = 25 # number of important variables with nonzeros coefficients
a = 5 # coefficient for important variables
# generate the variables from a multivariate normal distribution
mu = rep(0, p)
rho = 0.5
Sigma = toeplitz(rho^(0:(p-1))) # we define a covariance matrix with an AR(1) structure
X = MASS::mvrnorm(n, mu, Sigma)
# generate a continuous response variable from a linear model
nonzero = sample(1:p, k)
beta = a * (1:p %in% nonzero) / sqrt(n)
y = X %*% beta + rnorm(n)
# generate a binary response variable from binomial with sigmoid function as the probability
y_bin = as.factor(rbinom(n, 1, plogis(X %*% beta)))
In the generated X matrix, the variables are correlated with an AR(1) structure (i.e., each entry \(\rho_{ij} = \rho^{|i-j|}\)), so correlations are higher between variables that are closer to each other. (We will use this information to define the neighbourhoods of variables in the PLSKO method in a later example.)
We then generate the response variable y from a linear model with 25 important variables and 125 unimportant variables. The important variables have non-zero coefficients, while the unimportant variables have zero coefficients.
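As a quick check of the covariance structure described above (plain base R, independent of PLSKO), the correlation between variables i and j in the AR(1) matrix is exactly rho^|i-j|:

```r
# The AR(1) covariance used above: correlation decays as rho^|i - j|
rho <- 0.5
Sigma_small <- toeplitz(rho^(0:9))  # a 10 x 10 example
Sigma_small[1, 2]  # rho^1 = 0.5
Sigma_small[1, 4]  # rho^3 = 0.125
# so neighbours within distance 3 still share correlation >= 0.125
```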
Next, we apply the PLSKO method to select important variables in the simulated data.
If you are new to knockoffs, you can use the plsko_filter function to perform the whole knockoff filtering pipeline.
First, let's run it with the default settings. In this case, the function only requires the predictor matrix X and the response vector y as input arguments, and it returns an object of class knockoff.result. You can print the result to see the selected variables.
# run the knockoff filter with default settings
result = plsko_filter(X, y)
print(result)
#> Call:
#> plsko_filter(X = X, y = y)
#>
#> Selected variables:
#> NULL
# compare with the true coefficients
which(beta != 0)
#> [1] 9 20 23 31 44 48 50 53 55 61 62 63 66 71 73 78 82 97 105
#> [20] 107 118 121 125 146 147
# calculate FDP in this run
fdp <- function(result) {
  if (inherits(result, "knockoff.result")) {
    sel <- result$selected
  } else if (inherits(result, "AKO.result")) {
    sel <- result$ako.selected
  }
  sum(beta[sel] == 0) / max(length(sel), 1)
}
fdp(result)
#> [1] 0
With the default settings, neighbourhoods are determined by the 0.8 quantile of the absolute sample correlations, the number of PLS components is estimated empirically by the r_criterion() function (minimum 2), the sparsity level is set to 1 (no sparsity) in the PLS regression, and the target FDR level is 0.05. In this particular run no variables were selected (hence the FDP of 0, by convention), so let us see whether customised settings can improve the power.
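For intuition on the final selection step, here is a hedged base-R sketch of how a knockoff filter turns the W statistics into a data-dependent selection threshold (shown with the knockoff+ offset of 1; PLSKO's internal offset and implementation may differ, so treat this as an illustration of the idea rather than the package's code):

```r
# Sketch of the knockoff(+) threshold: find the smallest t such that the
# estimated FDP, (offset + #{W <= -t}) / #{W >= t}, is at most q.
knockoff_threshold <- function(W, q = 0.05, offset = 1) {
  ts <- sort(c(0, abs(W)))
  fdp_hat <- sapply(ts, function(t) (offset + sum(W <= -t)) / max(1, sum(W >= t)))
  ok <- fdp_hat <= q
  if (any(ok)) min(ts[ok]) else Inf
}

W_toy <- c(3, 2.5, 2, 1.5, -0.5, 1, 0.8, -0.2, 0.1, 0.05)
t_hat <- knockoff_threshold(W_toy, q = 0.3)
which(W_toy >= t_hat)  # variables whose original clearly beats its knockoff
```

Negative W values act as decoys: each one counts against the selections above the candidate threshold, which is what gives the finite-sample FDR guarantee.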
You can also customise the settings of the PLSKO method by specifying the parameters in the function. For example, you can specify the neighbourhood information, the number of components, the sparsity level in PLS regression, and the target FDR level, etc.
First, we define the neighbourhood information based on the AR(1) structure of the variables (which we know is true here; in real data, you might need to estimate the neighbourhood information from the data or make assumptions based on domain knowledge). Specifically, we define the neighbourhood of each variable as the variables within a distance of 3 positions from it.
# define the neighbourhood information based on the AR(1) structure of the variables
# Option 1: define the neighbourhood as a list of length p
nb.list = lapply(1:p, function(i){
c((i-3):(i-1), (i+1):(i+3))
})
# remove the indices that are out of the range
nb.list = lapply(nb.list, function(x) x[x > 0 & x <= p])
# Then, we run the PLSKO method with the customized neighbourhood information.
result = plsko_filter(X, y, nb.list = nb.list)
print(result)
#> Call:
#> plsko_filter(X = X, y = y, nb.list = nb.list)
#>
#> Selected variables:
#> [1] 9 23 73 107 146
fdp(result)
#> [1] 0
# Option 2: define the neighbourhood as an adjacency matrix
nb.mat = matrix(0, p, p)
for(i in 1:p){
# make sure the indices are within the range
nb = (i-3):(i+3)
nb = nb[nb > 0 & nb <= p]
nb.mat[i, nb] = 1
}
isSymmetric(nb.mat) # check if the matrix is symmetric
#> [1] TRUE
result = plsko_filter(X, y, nb.list = nb.mat)
#> Warning in plsko(X, nb.list = nb.list, threshold.abs = threshold.abs,
#> threshold.q = threshold.q, : Input nb.list must have 0 on the diagonal. Setting
#> diagonal to 0.
print(result)
#> Call:
#> plsko_filter(X = X, y = y, nb.list = nb.mat)
#>
#> Selected variables:
#> [1] 23 61 73 146
fdp(result)
#> [1] 0
These two options are equivalent, and you can choose whichever is more convenient. With the neighbourhood information supplied, the PLSKO method selected a handful of true variables (5 and 4, respectively) with an FDP of 0, improving on the default run above, which selected none. Keep in mind that the knockoff framework only guarantees FDR control as an expectation of the FDP, which may fluctuate across runs, especially with a small sample size in a high-dimensional setting and with the randomness in knockoff variable generation; a modified FDR control with offset = 0 is also used in these runs.
You can also specify the number of components and the sparsity level in the PLS regression. For example, you can set the number of components to 3 and the sparsity level to 0.95.
# run the PLSKO method with the number of components set to 3 and the sparsity level set to 0.95, which means 95% of the coefficients in the PLS regression are zero on each component.
result = plsko_filter(X, y, ncomp = 3, sparsity = 0.95)
print(result)
#> Call:
#> plsko_filter(X = X, y = y, ncomp = 3, sparsity = 0.95)
#>
#> Selected variables:
#> [1] 55 61 62 73 82 107 121 146
fdp(result)
#> [1] 0
We observed that the PLSKO method selected 8 variables (out of the 25 important ones) with an FDP of 0.
You can also apply the PLSKO method to a binary response variable. In this case, you can specify the method used to compute the test statistics; for example, set the method to "lasso.logistic" to use the difference of coefficients from a LASSO logistic regression. Without specifying, plsko_filter will automatically adjust the method based on the response type.
# run the knockoff filter with default settings for binary response
result = plsko_filter(X, y_bin)
#> Warning in ko_filter(X, Xk, y, covariates = covariates, q = q, method = method,
#> : lasso.lcd is not applicable for categorical response variable, change to
#> lasso.logistic
print(result)
#> Call:
#> plsko_filter(X = X, y = y_bin)
#>
#> Selected variables:
#> NULL
fdp(result)
#> [1] 0
We observed NULL in the result, which means no variable was selected in this run. This might be due to the small sample size: with a binary response, which carries more randomness, the power of the PLSKO method is generally lower than with a continuous response, and you might need a larger sample size to achieve good power.
Aggregating multiple knockoffs: the PLS-AKO method

Given the randomness in the knockoff variable generation and the PLS regression, the results of the PLSKO method may vary between runs. To improve the stability and power of the variable selection, you can aggregate multiple knockoff results using the PLS-AKO method. We only show the default settings here; you can also customise the settings in plsAKO, such as the number of knockoff iterations (n_ko) and the other parameters available in plsko_filter. The default is to run 25 iterations of the PLSKO method and aggregate the results with the PLS-AKO procedure. The function supports parallel computing (the default), which can significantly reduce the computation time over multiple iterations. If you encounter any issues with parallel computing, set parallel = FALSE to run in serial mode.
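The aggregation step can be sketched as follows: each knockoff run yields one intermediate p-value per variable, and the runs are combined per variable by a quantile before a standard FDR-controlling step. This is a hedged base-R illustration of the quantile-aggregation idea only; the gamma value and the exact steps inside plsAKO are assumptions here, not the package implementation:

```r
# Quantile aggregation of per-run p-values: pval_mat has one row per
# knockoff run and one column per variable; gamma (0.3 assumed here)
# is the aggregation quantile.
aggregate_pvals <- function(pval_mat, gamma = 0.3) {
  apply(pval_mat, 2, function(pk) min(1, quantile(pk, gamma, names = FALSE) / gamma))
}

p_runs <- rbind(c(0.01, 0.80),
                c(0.02, 0.90),
                c(0.03, 0.70))
agg <- aggregate_pvals(p_runs)
agg  # a consistently small p-value survives aggregation; a large one is clipped at 1
```

A variable that looks significant in most runs keeps a small aggregated p-value, while one that is selected only occasionally by chance is penalised, which is why aggregation stabilises the selection.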
# run the PLS-AKO method with default settings
result = plsAKO(X, y)
print(result)
#> Call:
#> plsAKO(X = X, y = y)
#>
#> Selected variables (AKO):
#> [1] 61 62 73 107 146
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#>
#> 6 9 23 27 31 45 48 50 51 53 55 60 61 62 63 73 75 78 82 107
#> 1 3 8 1 1 1 1 1 1 2 5 1 14 11 4 11 1 3 7 18
#> 114 118 119 120 121 125 146
#> 1 4 7 4 7 1 16
fdp(result)
#> [1] 0
The result shows that the PLS-AKO method selected 5 important variables with an FDP of 0.
plsko_filter and plsAKO on a semi-synthetic dataset

In this example, we generate a semi-synthetic dataset with a continuous response and apply the PLSKO method to select important variables in biological data. We generate the response variable y from a linear model with 8 important variables with nonzero coefficients. In this way we can evaluate the performance of the PLSKO method in selecting important variables in real biological data with a known ground truth.
There are two datasets provided in the package: prot_placenta and cfRNA_placenta. The prot_placenta dataset contains relative abundances of 36 proteins from 36 samples. The cfRNA_placenta dataset contains cell-free RNA counts (Moufarrej et al., 2022) from 71 samples for 81 genes with elevated expression in placenta. The proteins are a subset of proteomics data (from a multi-omics pre-eclampsia study; Marić et al., 2022) that were inferred to be released by the placenta, according to Degnes et al. (2022).
The prot_placenta dataset

data("prot_placenta")
X = as.matrix(prot_placenta$abundance)
#generate the response variable y from a linear model
set.seed(1)
n = nrow(X)
p = ncol(X)
k = 8 # number of important variables with nonzeros coefficients
nonzero = sample(1:p, k) # randomly select 8 important variables
beta = as.numeric(1:p %in% nonzero) # assign non-zero coefficients to the important variables
y = X %*% beta
result = plsko_filter(X, y)
print(result)
#> Call:
#> plsko_filter(X = X, y = y)
#>
#> Selected variables:
#> HSPB1 TIMP3 SERPINE1 DIABLO SERPINE2 TFPI DKK1 ANGPTL4
#> 1 4 11 14 15 18 21 23
#> GREM1 IL6 NOG
#> 27 29 34
fdp(result)
#> [1] 0.2727273
The result shows that the PLSKO method selected 11 variables with an FDP of 0.27 (all 8 true positives plus 3 false discoveries), higher than the target FDR level of 0.05. This might be due to the small number of important variables. Another reason is that in real data the true distribution is unknown and the variables are highly correlated; we might want to include more neighbours to account for the correlations among variables, and more components in the PLS regression to capture the structure of the data.
result <- plsko_filter(X, y, threshold.abs = 0, ncomp = 5, sparsity = 0.8) # set the absolute correlation threshold to 0 (every variable is a neighbour of every other), the number of components to 5, and the sparsity level in PLS regression to 0.8
print(result)
#> Call:
#> plsko_filter(X = X, y = y, threshold.abs = 0, ncomp = 5, sparsity = 0.8)
#>
#> Selected variables:
#> HSPB1 TIMP3 SERPINE1 DIABLO TFPI ANGPTL4 GREM1 NOG
#> 1 4 11 14 18 23 27 34
fdp(result)
#> [1] 0
We observed that the PLSKO method selected all 8 important variables with a FDP of 0.
Let's see how aggregating multiple knockoff results with the PLS-AKO method performs.
result = plsAKO(X, y, threshold.abs = 0, ncomp = 5, sparsity = 0.8)
print(result)
#> Call:
#> plsAKO(X = X, y = y, threshold.abs = 0, ncomp = 5, sparsity = 0.8)
#>
#> Selected variables (AKO):
#> HSPB1 TIMP3 SERPINE1 DIABLO TFPI ANGPTL4 GREM1 NOG
#> 1 4 11 14 18 23 27 34
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#>
#> 1 4 11 14 15 18 23 27 29 34
#> 25 25 25 25 4 25 25 25 4 25
fdp(result)
#> [1] 0
# average the FDP over the 25 single-run knockoff results
mean(unlist(lapply(result$s, function(x) fdp(x))))
#> [1] 0.03466667
Similar to the single-run PLSKO method, the PLS-AKO method selected all 8 important variables with an FDP of 0. Although some of the 25 single-run iterations included at least one false discovery, the mean FDP across iterations is about 0.035, consistent with the target FDR level of 0.05. Overall, this shows how the PLS-AKO method can improve the stability of the result.
Now that you know the basic usage of the PLSKO package, we will show you how to use it in a more advanced way. The package also provides separate functions for each step of the knockoff framework, which gives users more flexibility in the workflow. In this example, we apply the PLSKO method to the cfRNA_placenta dataset with a synthetic response variable.
data("cfRNA_placenta")
X = as.matrix(cfRNA_placenta$counts)
#generate the response variable y from a linear model
set.seed(1)
n = nrow(X)
p = ncol(X)
k = 8 # number of important variables with nonzeros coefficients
nonzero = sample(1:p, k) # randomly select 8 important variables
beta = as.numeric(1:p %in% nonzero) # assign non-zero coefficients to the important variables
y = X %*% beta
First, we generate the knockoff variables using the
plsko function. The function requires the predictor matrix
as input and returns a matrix of knockoff variables. There are many
options to customise the knockoff variables, such as the threshold to
define neighbourhoods, the number of components in PLS regression, and
the sparsity level in PLS regression.
# generate the knockoff variables with default settings of the plsko function
plsko_default = plsko(X)
# generate the knockoff variables with customized settings
plsko_custom = plsko(X, threshold.abs = 0, ncomp = 7, sparsity = 0.8)
# generate the knockoff variables with customized settings and neighbourhood information
# Here we estimate the neighbourhood information from the data using the graphical lasso, which provides a sparse estimate of the precision matrix
library(glasso)
cov.mat = cov(X)
glasso_res = glasso(cov.mat, rho = 0.1) # estimate the precision matrix using graphical lasso
nb.list = lapply(1:p, function(i) which(glasso_res$wi[i, ] != 0)) # get the neighbourhood information from the precision matrix
plsko_custom_nb = plsko(X, ncomp = 7, nb.list = nb.list)
# Or you can use other knockoff methods to generate the knockoff variables
# For example, you can use the method provided in the `knockoff` package to generate the knockoff variables
library(knockoff)
#> Warning: package 'knockoff' was built under R version 4.3.1
ko_soa = knockoff::create.second_order(X)
ko_filter

For the next two steps, we can use the ko_filter function in our package to calculate the importance scores and perform the knockoff filtering and variable selection. The function requires the predictor matrix X, the knockoff variables Xk, and the response vector y as input arguments. It returns an object of class knockoff.result with the selected variables, the same as the output of the plsko_filter function.
# calculate the importance scores and perform the knockoff filtering and variable selection with the default settings
result_default = ko_filter(X, plsko_default, y)
print(result_default)
#> Call:
#> ko_filter(X = X, Xk = plsko_default, y = y)
#>
#> Selected variables:
#> ENSG00000003436 ENSG00000085382 ENSG00000131873 ENSG00000136231 ENSG00000137460
#> 1 14 34 39 43
#> ENSG00000155966 ENSG00000169860 ENSG00000182158
#> 51 59 68
fdp(result_default)
#> [1] 0
# calculate the importance scores and perform the knockoff filtering and variable selection with the customized settings
result_custom = ko_filter(X, plsko_custom, y)
print(result_custom)
#> Call:
#> ko_filter(X = X, Xk = plsko_custom, y = y)
#>
#> Selected variables:
#> ENSG00000003436 ENSG00000085382 ENSG00000088726 ENSG00000131873 ENSG00000136231
#> 1 14 15 34 39
#> ENSG00000137460 ENSG00000155966 ENSG00000164292 ENSG00000169860 ENSG00000173626
#> 43 51 54 59 66
#> ENSG00000182158 ENSG00000261371
#> 68 81
fdp(result_custom)
#> [1] 0.3333333
# calculate the importance scores and perform the knockoff filtering and variable selection with the customized settings and neighbourhood information
result_custom_nb = ko_filter(X, plsko_custom_nb, y)
print(result_custom_nb)
#> Call:
#> ko_filter(X = X, Xk = plsko_custom_nb, y = y)
#>
#> Selected variables:
#> ENSG00000003436 ENSG00000085382 ENSG00000131873 ENSG00000136231 ENSG00000137460
#> 1 14 34 39 43
#> ENSG00000155966 ENSG00000169860 ENSG00000182158
#> 51 59 68
fdp(result_custom_nb)
#> [1] 0
# calculate the importance scores and perform the knockoff filtering and variable selection with the knockoff variables generated by the `knockoff` package
result_ko_soa = ko_filter(X, ko_soa, y)
print(result_ko_soa)
#> Call:
#> ko_filter(X = X, Xk = ko_soa, y = y)
#>
#> Selected variables:
#> ENSG00000003436 ENSG00000085382 ENSG00000088726 ENSG00000131873 ENSG00000134215
#> 1 14 15 34 36
#> ENSG00000136231 ENSG00000136261 ENSG00000137460 ENSG00000154429 ENSG00000155966
#> 39 40 43 49 51
#> ENSG00000164292 ENSG00000169860 ENSG00000173626 ENSG00000182158 ENSG00000261371
#> 54 59 66 68 81
fdp(result_ko_soa)
#> [1] 0.4666667
# Or you might have different options to calculate the importance scores
result_default_rf <- ko_filter(X, plsko_default, y, method = "RF") # e.g. use random forest to calculate the importance scores
print(result_default_rf)
#> Call:
#> ko_filter(X = X, Xk = plsko_default, y = y, method = "RF")
#>
#> Selected variables:
#> ENSG00000003436 ENSG00000085382 ENSG00000131873 ENSG00000136231 ENSG00000137460
#> 1 14 34 39 43
#> ENSG00000155966 ENSG00000169860 ENSG00000182158
#> 51 59 68
fdp(result_default_rf)
#> [1] 0
ko_withW

The importance score is calculated based on the assumed relationship between X and y. As long as the importance score is valid and satisfies the flip-sign property (defined in the original knockoff paper), it will not affect the FDR control, although it might affect the power of the method. If you have your own importance score, you can use the ko_withW function to perform the knockoff filtering and variable selection. The function requires the importance scores and the target FDR level as input arguments.
# calculate the importance scores using a self-defined method, e.g. the difference of the absolute marginal correlations with y between the original and knockoff variables
my_knockoff_stat = function(X, X_k, y) {
abs(t(X) %*% y) - abs(t(X_k) %*% y)
}
W = my_knockoff_stat(X, plsko_custom, y)
result_my = ko_withW(W, q = 0.05)
print(result_my)
#> Call:
#> ko_withW(W = W, q = 0.05)
#>
#> Selected variables:
#> NULL
Unfortunately, no variable is selected in this run, suggesting low power of the marginal correlation as the importance score in this case.
In this example, we show how to apply the PLS-AKO method to the
cfRNA_placenta dataset with the real binary response
variable. We will use the separate functions in the PLS-AKO pipeline to
aggregate multiple knockoff results and improve the stability and power
of the variable selection.
X = as.matrix(cfRNA_placenta$counts)
y = cfRNA_placenta$metadata$PE # the real binary response variable indicates pre-eclampsia or control
# generate multiple knockoff variables by PLSKO with customised setting
n_ko = 15
plsko_list = lapply(1:n_ko, function(i) plsko(X, seed = i)) # generate 15 knockoffs independently. Note that the seed needs to be set explicitly: the default seed is 1, so without it the same knockoff variables would be generated in every iteration.
AKO_withKO

Once we have multiple sets of knockoff variables, we can use the AKO_withKO function to calculate the importance scores and perform the knockoff filtering and variable selection with the PLS-AKO method. The function requires the predictor matrix X, the list of knockoff variables, and the response vector y as input arguments, and it returns an object of class AKO.result with the selected variables.
# calculate the importance scores and perform the knockoff filtering and variable selection with the PLS-AKO method
result_ako = AKO_withKO(X, plsko_list, y)
print(result_ako)
#> Call:
#> AKO_withKO(X = X, Xko.list = plsko_list, y = y)
#>
#> Selected variables (AKO):
#> ENSG00000076770 ENSG00000112419 ENSG00000112773 ENSG00000131149 ENSG00000136929
#> 11 24 25 33 42
#> ENSG00000198517
#> 75
#> Frequency of selected variables from 15 iterations of single-run knockoffs:
#>
#> 7 11 12 17 24 25 33 42 63 64 75 78
#> 1 10 1 4 13 13 11 7 5 3 6 1
The result shows that the PLS-AKO method selected 6 important genes related to pre-eclampsia.
This function also allows you to bring your own knockoff variables, which is useful when you have your own method to generate the knockoff variables.
soa_list = lapply(1:n_ko, function(i) knockoff::create.second_order(X))
result_ako_soa = AKO_withKO(X, soa_list, y)
print(result_ako_soa)
#> Call:
#> AKO_withKO(X = X, Xko.list = soa_list, y = y)
#>
#> Selected variables (AKO):
#> ENSG00000076770 ENSG00000112419 ENSG00000112773 ENSG00000131149 ENSG00000198517
#> 11 24 25 33 75
#> Frequency of selected variables from 15 iterations of single-run knockoffs:
#>
#> 11 12 17 24 25 33 42 63 64 75 78
#> 10 1 3 13 13 12 4 4 1 6 2
As with the ko_withW function, you can also use the AKO_withW function to perform the knockoff filtering and variable selection with customised importance scores.
# calculate the importance scores from multiple knockoff variables using a self-defined method
# Here we use the difference of the absolute marginal OLS slopes (regressing each variable on the PE/control indicator) between the original and knockoff variables as the importance score
my_knockoff_stat_Z <- function(X, Xk, y){
  X_new <- cbind(X, Xk)
  beta <- coef(lm(X_new ~ y))[2, ] # regress each original and knockoff variable on y and extract the slope coefficients
  abs(beta[1:ncol(X)]) - abs(beta[(ncol(X)+1):ncol(X_new)])
}
# calculate the importance scores from multiple knockoff variables
W_list = lapply(1:n_ko, function(i) my_knockoff_stat_Z(X, plsko_list[[i]], y))
# perform the knockoff filtering and variable selection with AKO method
result_ako_W = AKO_withW(W_list, q = 0.05)
print(result_ako_W)
#> Call:
#> AKO_withW(W.list = W_list, q = 0.05)
#>
#> Selected variables (AKO):
#> ENSG00000112773 ENSG00000198517
#> 25 75
#> Frequency of selected variables from 15 iterations of single-run knockoffs:
#>
#> 3 10 17 25 29 54 55 75 78 80
#> 1 2 4 6 1 1 4 5 1 2
In this example, we show how to include additional covariates in the knockoff filtering model. For example, when looking for important genes related to a disease, we might want to include the age, BMI, smoking status and other potential confounders as covariates in the model to make the conclusion more robust.
In knockoff filtering, this can be done by fitting the model on the augmented matrix of the predictors, the predictors’ knockoffs and the covariates.
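The augmented-matrix idea can be sketched with glmnet. This is a hedged illustration of a covariate-adjusted lasso coefficient difference (LCD) statistic, not the package's exact implementation; lcd_with_covariates and the demo variables are hypothetical names introduced here:

```r
library(glmnet)

# Fit the lasso on the augmented design [X, X_knockoff, covariates] and take
# the difference of absolute coefficients between each original variable and
# its knockoff; the covariates' coefficients are fitted but not compared.
lcd_with_covariates <- function(X, Xk, covariates, y) {
  p <- ncol(X)
  Z <- cbind(X, Xk, covariates)                    # augmented design matrix
  fit <- cv.glmnet(Z, y)
  b <- as.vector(coef(fit, s = "lambda.min"))[-1]  # drop the intercept
  abs(b[1:p]) - abs(b[(p + 1):(2 * p)])
}

# toy usage (random matrices stand in for real knockoffs here)
set.seed(1)
X_demo   <- matrix(rnorm(50 * 5), 50, 5)
Xk_demo  <- matrix(rnorm(50 * 5), 50, 5)
cov_demo <- matrix(rnorm(50 * 2), 50, 2)
y_demo   <- X_demo[, 1] + rnorm(50)
W_demo   <- lcd_with_covariates(X_demo, Xk_demo, cov_demo, y_demo)
```

Because the covariates enter the fit but not the W statistic, they can absorb confounded variation in y without competing with the original/knockoff comparison.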
To include covariates in the knockoff filtering model, you can use the plsko_filter, plsAKO, ko_filter or AKO_withKO functions with the covariates argument. These functions will automatically adjust the importance scores based on the covariates. Currently our package implements covariate-adjusted importance scores based on the coefficient differences from lasso regression (LCD) or other glmnet regressions.
Please note that the covariate variables should be continuous or binary. If you have a categorical covariate, you need to convert it to dummy binary variables (e.g. using one-hot encoding, yielding one fewer column than the number of levels) before using these functions.
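For example, a three-level categorical covariate (a hypothetical smoking status) can be expanded into two dummy columns with base R's model.matrix():

```r
# A hypothetical 3-level categorical covariate
smoking <- factor(c("never", "former", "current", "never", "current"))
# model.matrix() creates (levels - 1) dummy columns once the intercept is dropped
dummies <- model.matrix(~ smoking)[, -1, drop = FALSE]
dummies
# cbind() these dummy columns with any continuous covariates before
# passing the combined matrix via the `covariates` argument
```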
As we don't have a real dataset with measured covariates, we generate a synthetic dataset with covariates and show how to use the PLSKO package to include them in the knockoff filtering model.
# generate a synthetic dataset with covariates
set.seed(1)
n = 100 # number of samples
p = 100
k = 10 # number of important variables with nonzeros coefficients
a = 5 # coefficient for important variables
p_conf = 5 # number of covariates
X <- MASS::mvrnorm(n, mu = rep(0, p), Sigma = toeplitz(0.5^(0:(p-1))))
covariates <- MASS::mvrnorm(n, mu = rep(0, p_conf), Sigma = diag(1, p_conf)) # generate covariates
# generate a continuous response variable from a linear model of X and covariate variables
nonzero = sample(1:p, k) # true signal: 19 70 100 51 59 81 52 91 3 55
beta = a * (1:p %in% nonzero) / sqrt(n)
y = X %*% beta + covariates %*% rnorm(p_conf) + rnorm(n)
Now we have a synthetic dataset with 100 samples, 100 variables, and 5 covariates (correlated with y).
Let’s see the selected variables using the plsAKO
function with or without the covariates included in the model.
# run the PLSKO method without covariates
result_no_cov = plsAKO(X, y, q = 0.05, n_ko = 25, ncomp = 3, threshold.q = 0.95)
print(result_no_cov)
#> Call:
#> plsAKO(X = X, y = y, n_ko = 25, q = 0.05, ncomp = 3, threshold.q = 0.95)
#>
#> Selected variables (AKO):
#> [1] 52
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#>
#> 3 19 36 38 51 52 59 91
#> 1 1 3 6 5 7 1 3
# run the PLSKO method with covariates
result_with_cov = plsAKO(X, y, q = 0.05, n_ko = 25, ncomp = 3, threshold.q = 0.95, covariates = covariates)
print(result_with_cov)
#> Call:
#> plsAKO(X = X, y = y, n_ko = 25, q = 0.05, covariates = covariates,
#> ncomp = 3, threshold.q = 0.95)
#>
#> Selected variables (AKO):
#> [1] 3 19 51 52 55 59 70 91 100
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#>
#> 3 4 5 18 19 24 28 34 51 52 53 55 59 70 81 86 91 100
#> 24 3 1 6 24 1 4 1 22 20 1 20 21 20 3 1 24 19
We observed a dramatic power increase when the covariates are included in the model, as the model is then closer to the 'true' underlying relationship between X and y (i.e., y ~ X + covariates); in other words, y is 'denoised' by adjusting for the covariates.
This vignette provides a comprehensive guide on how to use the
PLSKO package to select important variables in
high-dimensional biological data. The package provides functions to
generate the knockoff variables, calculate the importance scores, and
perform the knockoff filtering and variable selection. The package is
designed to be user-friendly and flexible, allowing users to easily
apply the PLSKO method to their own data. More details about the
functions and parameters can be found in the package documentation.
We hope this vignette helps you to get started with the
PLSKO package and apply the PLSKO method to your own data.
If you have any questions or feedback, please feel free to contact us or
open an issue on the GitHub repository.
sessionInfo()
#> R version 4.3.0 (2023-04-21 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8
#> [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
#> [5] LC_TIME=English_Australia.utf8
#>
#> time zone: Australia/Sydney
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] knockoff_0.3.6 glasso_1.11 PLSKO_0.2.0
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 ellipse_0.5.0 shape_1.4.6.1
#> [4] xfun_0.50 bslib_0.9.0 ggplot2_3.5.1
#> [7] ggrepel_0.9.6 rstatix_0.7.2 lattice_0.22-6
#> [10] vctrs_0.6.5 tools_4.3.0 generics_0.1.3
#> [13] parallel_4.3.0 tibble_3.2.1 rARPACK_0.11-0
#> [16] pkgconfig_2.0.3 Matrix_1.6-1.1 RColorBrewer_1.1-3
#> [19] mixOmics_6.26.0 lifecycle_1.0.4 stringr_1.5.1
#> [22] compiler_4.3.0 progress_1.2.3 Rdsdp_1.0.5.2.1
#> [25] munsell_0.5.1 codetools_0.2-20 carData_3.0-5
#> [28] htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10
#> [31] glmnet_4.1-8 Formula_1.2-5 pillar_1.10.1
#> [34] car_3.1-3 ggpubr_0.6.0 crayon_1.5.3
#> [37] jquerylib_0.1.4 tidyr_1.3.1 MASS_7.3-58.4
#> [40] BiocParallel_1.36.0 cachem_1.1.0 iterators_1.0.14
#> [43] abind_1.4-8 foreach_1.5.2 RSpectra_0.16-2
#> [46] pls_2.8-5 tidyselect_1.2.1 digest_0.6.37
#> [49] stringi_1.8.4 reshape2_1.4.4 dplyr_1.1.4
#> [52] purrr_1.0.2 splines_4.3.0 fastmap_1.2.0
#> [55] grid_4.3.0 colorspace_2.1-1 cli_3.6.3
#> [58] magrittr_2.0.3 survival_3.8-3 broom_1.0.7
#> [61] corpcor_1.6.10 prettyunits_1.2.0 scales_1.3.0
#> [64] backports_1.5.0 rmarkdown_2.29 matrixStats_1.5.0
#> [67] igraph_2.1.4 gridExtra_2.3 ggsignif_0.6.4
#> [70] hms_1.1.3 evaluate_1.0.3 knitr_1.49
#> [73] doParallel_1.0.17 rlang_1.1.5 Rcpp_1.0.14
#> [76] glue_1.8.0 rstudioapi_0.17.1 jsonlite_1.8.9
#> [79] plyr_1.8.9 R6_2.5.1