1 Introduction

This vignette illustrates the basic and advanced usage of the package to select important variables in omics data. PLSKO is a method that combines partial least squares (PLS) regression with knockoff filtering to select important variables in high-dimensional biological data with false discovery rate (FDR) control. The package provides functions to generate the knockoff variables, perform the knockoff filtering and aggregate multiple knockoff results. The package is designed to be user-friendly and flexible, allowing users to easily apply the PLSKO method to their own data.

The functions and pipeline are straightforward to use. In this vignette, we demonstrate the package on several examples: we show how to apply the PLSKO pipeline to simulated data and to a cell-free RNA-seq dataset, and how to use the separate functions in an advanced, flexible way to customise the settings of the PLSKO method.

1.1 Overview of the knockoff framework

Here is an outline of the functions corresponding to the steps of the knockoff framework workflow; we also provide pipeline functions that combine the steps.

Pipeline functions provided: plsko_filter() and plsAKO() run the full workflow; ko_filter() and AKO_withKO() combine Steps 2 and 3.

Step 1: Knockoff variable generation
- Function: plsko() (or bring your own knockoff variables); auxiliary: r_criterion() (for ncomp estimation)
- Output: an n x p matrix of knockoff variables

Step 2: Importance score calculation (W)
- Function: imported from the knockoff package
- Output: a vector of length p, or an n_ko x p matrix

Step 3: Knockoff filtering and variable selection
- Function: ko_withW() or AKO_withW()
- Output: a list (class "knockoff.result" or "AKO.result") with components statistic, selected, ako.selected

The pipeline functions are designed to be user-friendly and easy to use, while the separate functions provide more flexibility for users to customise the settings of each step in the workflow.

2 Installation

You can install the development version of the package from GitHub using the following code:

# install.packages("devtools")
#devtools::install_github("guannan-yang/PLSKO/PLSKO", quiet = TRUE, upgrade = "never")

library(PLSKO)
#> Registered S3 method overwritten by 'PLSKO':
#>   method                from    
#>   print.knockoff.result knockoff
# If warnings about missing dependencies appear, please install the dependencies manually by running the following code:
# install.packages(c("knockoff","progress", "parallel", "doParallel", "foreach"))
#
# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("mixOmics")

3 Example 1: Apply knockoff pipeline functions plsko_filter and plsAKO on simulated data

In this example, we generate a simulated dataset and response variable and then show how to apply the PLSKO method to select important variables with one single function call. We will also show how to customise the settings of the PLSKO method to improve the performance of the variable selection.

set.seed(1)
n = 100 # number of samples
p = 150 # number of variables ( p > n, high-dimensional setting )
k = 25 # number of important variables with nonzeros coefficients
a = 5 # coefficient for important variables

# generate the variables from a multivariate normal distribution
mu = rep(0, p)
rho = 0.5
Sigma = toeplitz(rho^(0:(p-1))) # we define a covariance matrix with an AR(1) structure
X = MASS::mvrnorm(n, mu, Sigma)

# generate a continuous response variable from a linear model
nonzero = sample(1:p, k)
beta = a * (1:p %in% nonzero) / sqrt(n)
y = X %*% beta + rnorm(n)

# generate a binary response variable from binomial with sigmoid function as the probability
y_bin = as.factor(rbinom(n, 1, plogis(X %*% beta)))

In the generated X matrix, the variables are correlated with an AR(1) structure (i.e., each entry \(\Sigma_{ij} = \rho^{|i-j|}\)), which means correlations are higher between variables that are closer to each other. (We will use this information to define the neighbourhoods of variables in the PLSKO method in a later example.)

We then generate the response variable y from a linear model with 25 important variables and 125 unimportant variables. The important variables have non-zero coefficients, while the unimportant variables have zero coefficients.

Next, we apply the PLSKO method to select important variables in the simulated data.

3.1 PLSKO pipeline: apply knockoff framework with PLSKO-generated knockoff in a single function

If you are new to knockoffs, you can use the pipeline function plsko_filter to perform the whole knockoff filtering procedure in a single call.

3.2 Default settings

First let's run with the default settings. With the default settings, the function requires only the predictor matrix X and the response vector y as input arguments, and it returns an object of class "knockoff.result". You can print the result to see the selected variables.

# run the knockoff filter with default settings
result = plsko_filter(X, y) 
print(result)
#> Call:
#> plsko_filter(X = X, y = y)
#> 
#> Selected variables:
#> NULL
# compare with the true coefficients
which(beta != 0)
#>  [1]   9  20  23  31  44  48  50  53  55  61  62  63  66  71  73  78  82  97 105
#> [20] 107 118 121 125 146 147
# calculate FDP in this run
fdp <- function(result) {
  if (inherits(result, "knockoff.result")) {
    sum(beta[result$selected] == 0) / max(length(result$selected), 1)
  } else if (inherits(result, "AKO.result")) {
    sum(beta[result$ako.s] == 0) / max(length(result$ako.s), 1)
  }
}
fdp(result)
#> [1] 0

The default settings are: neighbourhoods determined by the 80%-quantile of the sample correlations, the number of components estimated empirically by the r_criterion function (minimum 2), sparsity level set to 1 (no sparsity) in the PLS regression, and a target FDR level of 0.05. In this run, no variables were selected (the printed selection is NULL), so the false discovery proportion (FDP) is trivially 0; below we show how customised settings can improve the power.

3.3 Customised settings option 1: neighbourhood information

You can also customise the PLSKO method by specifying parameters of the plsko_filter function, for example the neighbourhood information, the number of components, the sparsity level in the PLS regression, and the target FDR level.

First, we define the neighbourhood information based on the AR(1) structure of the variables (which we know is true in this case; in real data, you might need to estimate the neighbourhood information from the data or make assumptions based on domain knowledge). Specifically, we define the neighbourhood of each variable as the variables within a distance of 3 positions of it.

# define the neighbourhood information based on the AR(1) structure of the variables
# Option 1: define the neighbourhood as a list of length p
nb.list = lapply(1:p, function(i){
  c((i-3):(i-1), (i+1):(i+3))
})
# remove the indices that are out of the range
nb.list = lapply(nb.list, function(x) x[x > 0 & x <= p])

# Then, we run the PLSKO method with the customized neighbourhood information. 
result = plsko_filter(X, y, nb.list = nb.list)
print(result)
#> Call:
#> plsko_filter(X = X, y = y, nb.list = nb.list)
#> 
#> Selected variables:
#> [1]   9  23  73 107 146
fdp(result)
#> [1] 0
# Option 2: define the neighbourhood as an adjacency matrix
nb.mat = matrix(0, p, p)
for(i in 1:p){
  # make sure the indices are within the range
  nb = (i-3):(i+3)
  nb = nb[nb > 0 & nb <= p]
  nb.mat[i, nb] = 1
}
isSymmetric(nb.mat) # check if the matrix is symmetric
#> [1] TRUE
result = plsko_filter(X, y, nb.list = nb.mat)
#> Warning in plsko(X, nb.list = nb.list, threshold.abs = threshold.abs,
#> threshold.q = threshold.q, : Input nb.list must have 0 on the diagonal. Setting
#> diagonal to 0.
print(result)
#> Call:
#> plsko_filter(X = X, y = y, nb.list = nb.mat)
#> 
#> Selected variables:
#> [1]  23  61  73 146
fdp(result)
#> [1] 0

These two options are equivalent, and you can choose whichever is more convenient for you. With the neighbourhood information supplied, the filter selected 5 and 4 variables respectively in these runs, all true positives (FDP of 0), whereas the default run selected none. Keep in mind that the selection varies between runs because knockoff generation is random: the knockoff framework only guarantees FDR control in expectation, so the realised FDP can fluctuate from run to run. Overall, supplying neighbourhood information improved the power of the PLSKO method over the default run.

3.4 Customised settings option 2: number of components and sparsity level

You can also specify the number of components and the sparsity level in the PLS regression. For example, you can set the number of components to 3 and the sparsity level to 0.95.

# run the PLSKO method with the number of components set to 3 and the sparsity level set to 0.95, i.e. 95% of the coefficients in the PLS regression are zero on each component
result = plsko_filter(X, y, ncomp = 3, sparsity = 0.95)
print(result)
#> Call:
#> plsko_filter(X = X, y = y, ncomp = 3, sparsity = 0.95)
#> 
#> Selected variables:
#> [1]  55  61  62  73  82 107 121 146
fdp(result)
#> [1] 0

We observed that with these settings the PLSKO method selected 8 (out of 25) important variables with a FDP of 0.

3.5 Customised settings option 3: Binary response

You can also apply the PLSKO method to a binary response variable. In this case, you need to specify the method used to compute the test statistics; for example, set the method to "lasso.logistic" to use the difference of coefficients from a LASSO logistic regression. If you do not specify it, plsko_filter automatically adjusts the method based on the response type.

# run the knockoff filter with default settings for binary response
result = plsko_filter(X, y_bin)
#> Warning in ko_filter(X, Xk, y, covariates = covariates, q = q, method = method,
#> : lasso.lcd is not applicable for categorical response variable, change to
#> lasso.logistic
print(result)
#> Call:
#> plsko_filter(X = X, y = y_bin)
#> 
#> Selected variables:
#> NULL
fdp(result)
#> [1] 0

We observed NULL in the result, meaning no variable was selected in this run. This might be due to the small sample size. With a binary response, the extra randomness in the response generally lowers the power of the PLSKO method compared with a continuous response, and a larger sample size may be needed to achieve good power.

3.6 PLS-AKO pipeline: aggregate multiple knockoff results from PLSKO method

Given the randomness in the knockoff variable generation and the PLS regression, the results of the PLSKO method might vary between runs. To improve the stability and power of the variable selection, you can aggregate multiple knockoff results using the PLS-AKO method. We only show the default settings here; you can also customise the settings of the plsAKO function, such as the number of knockoff iterations (n_ko) and other parameters similar to plsko_filter. By default, plsAKO runs 25 iterations of the PLSKO method and aggregates the results. Parallel computing is enabled by default, which can significantly reduce the computation time when running multiple iterations; if you encounter any issues with parallel computing, set parallel = FALSE to run in serial mode.
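For intuition, the aggregation idea can be sketched with the quantile aggregation used in the aggregated-knockoffs (AKO) literature: each knockoff draw yields a p-value per variable, and the per-variable p-values are combined across draws. This is an illustration only; plsAKO() handles all of this internally, and the helper name aggregate_pvalues below is our own.

```r
# Quantile aggregation sketch: combine per-run p-values column-wise.
# pval_mat is an n_ko x p matrix (one row per knockoff draw).
aggregate_pvalues <- function(pval_mat, gamma = 0.5) {
  apply(pval_mat, 2, function(pv) min(1, quantile(pv, gamma) / gamma))
}

# 3 knockoff draws, 2 variables: the first looks consistently significant
pvals <- rbind(c(0.01, 0.40),
               c(0.03, 0.60),
               c(0.02, 0.80))
aggregate_pvalues(pvals)  # first variable: 0.02 / 0.5 = 0.04; second capped at 1
```

The aggregated p-values can then be thresholded (e.g. via Benjamini-Hochberg) to obtain the final selection.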

# run the PLS-AKO method with default settings
result = plsAKO(X, y)
print(result)
#> Call:
#> plsAKO(X = X, y = y)
#> 
#> Selected variables (AKO):
#> [1]  61  62  73 107 146
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#> 
#>   6   9  23  27  31  45  48  50  51  53  55  60  61  62  63  73  75  78  82 107 
#>   1   3   8   1   1   1   1   1   1   2   5   1  14  11   4  11   1   3   7  18 
#> 114 118 119 120 121 125 146 
#>   1   4   7   4   7   1  16
fdp(result)
#> [1] 0

The result shows that the PLS-AKO method selected 5 important variables with a FDP of 0.

4 Example 2: Apply pipeline functions plsko_filter and plsAKO on semi-synthetic data

In this example, we generate a semi-synthetic dataset with a continuous response and apply the PLSKO method to select important variables in biological data. We generate the response variable y from a linear model with 8 important variables that have non-zero coefficients. In this way, we can evaluate the performance of the PLSKO method at selecting important variables in real biological data with a known ground truth.

There are two datasets provided in the package: prot_placenta and cfRNA_placenta. The prot_placenta dataset contains relative abundances of proteins from 36 samples and 36 genes. The cfRNA_placenta dataset contains cell-free RNA counts (Moufarrej et al., 2022) from 71 samples and 81 genes with elevated expression in placenta. The proteins are a subset of the proteomics data from a multi-omics pre-eclampsia study (Marić et al., 2022) that were inferred to be released by the placenta, according to Degnes et al. (2022).

4.1 Semi-synthetic data based on the prot_placenta dataset

data("prot_placenta")
X = as.matrix(prot_placenta$abundance)

#generate the response variable y from a linear model
set.seed(1)
n = nrow(X)
p = ncol(X)
k = 8 # number of important variables with nonzeros coefficients
nonzero = sample(1:p, k) # randomly select 8 important variables

beta = as.numeric(1:p %in% nonzero) # assign non-zero coefficients to the important variables
y = X %*% beta 
result = plsko_filter(X, y) 
print(result)
#> Call:
#> plsko_filter(X = X, y = y)
#> 
#> Selected variables:
#>    HSPB1    TIMP3 SERPINE1   DIABLO SERPINE2     TFPI     DKK1  ANGPTL4 
#>        1        4       11       14       15       18       21       23 
#>    GREM1      IL6      NOG 
#>       27       29       34
fdp(result)
#> [1] 0.2727273

The result shows that the PLSKO method selected 11 variables with a FDP of about 0.27 (all 8 true positives plus 3 false discoveries), higher than the target FDR level of 0.05. This might be due to the small number of important variables.

Another reason is that in real data the true distribution is unknown and the variables are highly correlated. We might therefore include more neighbours to account for the correlations among variables, and more components in the PLS regression to capture the structure of the data.

result <- plsko_filter(X, y, threshold.abs = 0, ncomp = 5, sparsity = 0.8) # set the absolute correlation threshold to 0 (every variable is a neighbour of every other), the number of components to 5, and the sparsity level in PLS regression to 0.8
print(result)
#> Call:
#> plsko_filter(X = X, y = y, threshold.abs = 0, ncomp = 5, sparsity = 0.8)
#> 
#> Selected variables:
#>    HSPB1    TIMP3 SERPINE1   DIABLO     TFPI  ANGPTL4    GREM1      NOG 
#>        1        4       11       14       18       23       27       34
fdp(result)
#> [1] 0

We observed that the PLSKO method selected all 8 important variables with a FDP of 0.

Let's see how aggregating multiple knockoff results with the PLS-AKO method performs.

result = plsAKO(X, y, threshold.abs = 0, ncomp = 5, sparsity = 0.8)
print(result)
#> Call:
#> plsAKO(X = X, y = y, threshold.abs = 0, ncomp = 5, sparsity = 0.8)
#> 
#> Selected variables (AKO):
#>    HSPB1    TIMP3 SERPINE1   DIABLO     TFPI  ANGPTL4    GREM1      NOG 
#>        1        4       11       14       18       23       27       34 
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#> 
#>  1  4 11 14 15 18 23 27 29 34 
#> 25 25 25 25  4 25 25 25  4 25
fdp(result)
#> [1] 0
# average the results from 25 iterations
mean(unlist(lapply(result$s, function(x) fdp(x))))
#> [1] 0.03466667

Similar to the PLSKO method, the PLS-AKO method selected all 8 important variables with a FDP of 0. Although some of the 25 individual iterations include at least one false discovery, the average FDP across iterations is about 0.035, consistent with the target FDR level of 0.05. Overall, this shows how the PLS-AKO method can improve the stability of the result.

5 Example 3: Advanced usage part 1 –apply knockoff framework with PLSKO-generated knockoff in a single run

Now that you know the basic usage of the PLSKO package, we will show how to use it in a more advanced way. Our package also provides separate functions for each step of the knockoff framework, giving users more flexibility in the workflow. In this example, we apply the PLSKO method to the cfRNA_placenta dataset with a synthetic response variable.

data("cfRNA_placenta")
X = as.matrix(cfRNA_placenta$counts)
#generate the response variable y from a linear model
set.seed(1)
n = nrow(X)
p = ncol(X)
k = 8 # number of important variables with nonzeros coefficients
nonzero = sample(1:p, k) # randomly select 8 important variables

beta = as.numeric(1:p %in% nonzero) # assign non-zero coefficients to the important variables
y = X %*% beta 

5.1 Step 1: Knockoff Variable Generation

First, we generate the knockoff variables using the plsko function. The function requires the predictor matrix as input and returns a matrix of knockoff variables. There are many options to customise the knockoff variables, such as the threshold to define neighbourhoods, the number of components in PLS regression, and the sparsity level in PLS regression.

# generate the knockoff variables with default settings of the plsko function
plsko_default = plsko(X) 

# generate the knockoff variables with customized settings
plsko_custom = plsko(X, threshold.abs = 0, ncomp = 7, sparsity = 0.8)

# generate the knockoff variables with customized settings and neighbourhood information
# Here we estimate the neighbourhood information from the data using the graphical lasso, which provides a sparse estimate of the precision matrix
library(glasso)
cov.mat = cov(X)
glasso_res = glasso(cov.mat, rho = 0.1) # estimate the precision matrix using graphical lasso
nb.list = lapply(1:p, function(i) which(glasso_res$wi[i, ] != 0)) # get the neighbourhood information from the precision matrix
plsko_custom_nb = plsko(X, ncomp = 7, nb.list = nb.list)


# Or you can use other knockoff methods to generate the knockoff variables
# For example, you can use the method provided in the `knockoff` package to generate the knockoff variables
library(knockoff)
#> Warning: package 'knockoff' was built under R version 4.3.1
ko_soa = knockoff::create.second_order(X)

5.2 Step 2 and 3 combined: Importance Score Calculation and Variable Selection with function ko_filter

For the next two steps, we can use the function ko_filter in our package to calculate the importance scores and perform the knockoff filtering and variable selection. The function requires the predictor matrix X, the knockoff variable matrix, and the response vector y as input arguments. It returns an object of class "knockoff.result" with the selected variables, the same as the output of the plsko_filter function.

# calculate the importance scores and perform the knockoff filtering and variable selection with the default settings
result_default = ko_filter(X, plsko_default, y)
print(result_default)
#> Call:
#> ko_filter(X = X, Xk = plsko_default, y = y)
#> 
#> Selected variables:
#> ENSG00000003436 ENSG00000085382 ENSG00000131873 ENSG00000136231 ENSG00000137460 
#>               1              14              34              39              43 
#> ENSG00000155966 ENSG00000169860 ENSG00000182158 
#>              51              59              68
fdp(result_default)
#> [1] 0
# calculate the importance scores and perform the knockoff filtering and variable selection with the customized settings
result_custom = ko_filter(X, plsko_custom, y)
print(result_custom)
#> Call:
#> ko_filter(X = X, Xk = plsko_custom, y = y)
#> 
#> Selected variables:
#> ENSG00000003436 ENSG00000085382 ENSG00000088726 ENSG00000131873 ENSG00000136231 
#>               1              14              15              34              39 
#> ENSG00000137460 ENSG00000155966 ENSG00000164292 ENSG00000169860 ENSG00000173626 
#>              43              51              54              59              66 
#> ENSG00000182158 ENSG00000261371 
#>              68              81
fdp(result_custom)
#> [1] 0.3333333
# calculate the importance scores and perform the knockoff filtering and variable selection with the customized settings and neighbourhood information
result_custom_nb = ko_filter(X, plsko_custom_nb, y)
print(result_custom_nb)
#> Call:
#> ko_filter(X = X, Xk = plsko_custom_nb, y = y)
#> 
#> Selected variables:
#> ENSG00000003436 ENSG00000085382 ENSG00000131873 ENSG00000136231 ENSG00000137460 
#>               1              14              34              39              43 
#> ENSG00000155966 ENSG00000169860 ENSG00000182158 
#>              51              59              68
fdp(result_custom_nb)
#> [1] 0
# calculate the importance scores and perform the knockoff filtering and variable selection with the knockoff variables generated by the `knockoff` package
result_ko_soa = ko_filter(X, ko_soa, y)
print(result_ko_soa)
#> Call:
#> ko_filter(X = X, Xk = ko_soa, y = y)
#> 
#> Selected variables:
#> ENSG00000003436 ENSG00000085382 ENSG00000088726 ENSG00000131873 ENSG00000134215 
#>               1              14              15              34              36 
#> ENSG00000136231 ENSG00000136261 ENSG00000137460 ENSG00000154429 ENSG00000155966 
#>              39              40              43              49              51 
#> ENSG00000164292 ENSG00000169860 ENSG00000173626 ENSG00000182158 ENSG00000261371 
#>              54              59              66              68              81
fdp(result_ko_soa)
#> [1] 0.4666667
# Or you might have different options to calculate the importance scores
result_default_rf <- ko_filter(X, plsko_default, y, method = "RF") # e.g. use random forest to calculate the importance scores
print(result_default_rf)
#> Call:
#> ko_filter(X = X, Xk = plsko_default, y = y, method = "RF")
#> 
#> Selected variables:
#> ENSG00000003436 ENSG00000085382 ENSG00000131873 ENSG00000136231 ENSG00000137460 
#>               1              14              34              39              43 
#> ENSG00000155966 ENSG00000169860 ENSG00000182158 
#>              51              59              68
fdp(result_default_rf)
#> [1] 0

5.3 Step 3: Select important variables by function ko_withW

The importance score is calculated based on an assumed relationship between X and y. As long as the importance score is valid and satisfies the flip-sign property (defined in the original knockoff paper), the choice will not affect FDR control, though it might affect the power of the method. If you have your own importance scores, you can use the function ko_withW to perform the knockoff filtering and variable selection. The function requires the importance scores and the target FDR level as input arguments.

# calculate the importance scores using a self-defined method, e.g. the difference in absolute marginal correlation with y between the original and knockoff variables

my_knockoff_stat = function(X, X_k, y) {
  abs(t(X) %*% y) - abs(t(X_k) %*% y)
}
W = my_knockoff_stat(X, plsko_custom, y)
result_my = ko_withW(W, q = 0.05)
print(result_my)
#> Call:
#> ko_withW(W = W, q = 0.05)
#> 
#> Selected variables:
#> NULL

Unfortunately, no variable is selected in this run, suggesting low power of the marginal correlation as the importance score in this case.
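For intuition about what happens inside the filtering step, here is a minimal sketch of the knockoff+ thresholding rule from the original knockoff paper, which functions such as ko_withW apply to a score vector W. This is an illustration of the selection rule only, not the package's actual implementation.

```r
# Knockoff+ selection: pick the smallest threshold t such that the estimated
# FDP, (1 + #{W_j <= -t}) / max(#{W_j >= t}, 1), is at most the target level q.
knockoff_plus_select <- function(W, q = 0.05) {
  ts <- sort(unique(abs(W[W != 0])))          # candidate thresholds
  ok <- vapply(ts, function(t) {
    (1 + sum(W <= -t)) / max(sum(W >= t), 1) <= q
  }, logical(1))
  if (!any(ok)) return(integer(0))            # no threshold meets the target FDR
  which(W >= min(ts[ok]))                     # indices of selected variables
}

# toy scores: 20 strong positives, 80 slightly negative (knockoff wins)
W_demo <- c(rep(10, 20), rep(-0.1, 80))
knockoff_plus_select(W_demo, q = 0.1)  # selects variables 1 to 20
```

Negative scores act as a "control group": the more variables whose knockoff beats the original above a threshold, the higher the estimated FDP at that threshold.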

6 Example 4: Advanced usage part 2 –apply knockoff filtering on real data for multiple knockoff aggregation

In this example, we show how to apply the PLS-AKO method to the cfRNA_placenta dataset with the real binary response variable. We will use the separate functions in the PLS-AKO pipeline to aggregate multiple knockoff results and improve the stability and power of the variable selection.

6.1 Step 1: Multiple knockoff generation

X = as.matrix(cfRNA_placenta$counts)
y = cfRNA_placenta$metadata$PE # the real binary response variable indicates pre-eclampsia or control

# generate multiple knockoff variables by PLSKO with customised setting 
n_ko = 15
plsko_list = lapply(1:n_ko, function(i) plsko(X, seed = i)) # generate 15 knockoff matrices independently. Note: a different seed must be set in each call because the function's default seed is 1; without this, the same knockoff variables would be generated in every iteration.

6.2 Step 2 and 3 combined: Importance Score Calculation and Variable Selection with function AKO_withKO

Once we have the multiple knockoff matrices, we can use the AKO_withKO function to calculate the importance scores and perform the knockoff filtering and variable selection with the PLS-AKO method. The function requires the predictor matrix X, the list of knockoff variable matrices, and the response vector y as input arguments. It returns an object of class "AKO.result" with the selected variables.

# calculate the importance scores and perform the knockoff filtering and variable selection with the PLS-AKO method
result_ako = AKO_withKO(X, plsko_list, y)
print(result_ako)
#> Call:
#> AKO_withKO(X = X, Xko.list = plsko_list, y = y)
#> 
#> Selected variables (AKO):
#> ENSG00000076770 ENSG00000112419 ENSG00000112773 ENSG00000131149 ENSG00000136929 
#>              11              24              25              33              42 
#> ENSG00000198517 
#>              75 
#> Frequency of selected variables from 15 iterations of single-run knockoffs:
#> 
#>  7 11 12 17 24 25 33 42 63 64 75 78 
#>  1 10  1  4 13 13 11  7  5  3  6  1

The result shows that the PLS-AKO method selected 6 important genes related to pre-eclampsia.

This function also allows you to bring your own knockoff variables, which is useful when you have your own method to generate the knockoff variables.

soa_list = lapply(1:n_ko, function(i) knockoff::create.second_order(X))
result_ako_soa = AKO_withKO(X, soa_list, y)
print(result_ako_soa)
#> Call:
#> AKO_withKO(X = X, Xko.list = soa_list, y = y)
#> 
#> Selected variables (AKO):
#> ENSG00000076770 ENSG00000112419 ENSG00000112773 ENSG00000131149 ENSG00000198517 
#>              11              24              25              33              75 
#> Frequency of selected variables from 15 iterations of single-run knockoffs:
#> 
#> 11 12 17 24 25 33 42 63 64 75 78 
#> 10  1  3 13 13 12  4  4  1  6  2

6.3 Step 3: Multiple knockoff aggregation with customised importance scores

Same with the ko_withW function, you can also use the AKO_withW function to perform the knockoff filtering and variable selection with customised importance scores.

# calculate the importance scores from multiple knockoff matrices using a self-defined method
# We use the difference in absolute value of the group-contrast (PE vs control) coefficient between the original and knockoff variables as the importance score
my_knockoff_stat_Z <- function(X, Xk, y){
  X_new <- cbind(X, Xk)
  beta <- coef(lm(X_new ~ y))[2, ] # regress the original and knockoff variables together on the group indicator and extract the contrast coefficient for each variable
  abs(beta[1:ncol(X)]) - abs(beta[(ncol(X) + 1):ncol(X_new)])
}

# calculate the importance scores from multiple knockoff variables
W_list = lapply(1:n_ko, function(i) my_knockoff_stat_Z(X, plsko_list[[i]], y))

# perform the knockoff filtering and variable selection with AKO method
result_ako_W = AKO_withW(W_list, q = 0.05)
print(result_ako_W)
#> Call:
#> AKO_withW(W.list = W_list, q = 0.05)
#> 
#> Selected variables (AKO):
#> ENSG00000112773 ENSG00000198517 
#>              25              75 
#> Frequency of selected variables from 15 iterations of single-run knockoffs:
#> 
#>  3 10 17 25 29 54 55 75 78 80 
#>  1  2  4  6  1  1  4  5  1  2

7 Example 5: Include additional covariates in your knockoff filtering model

In this example, we show how to include additional covariates in the knockoff filtering model. For example, when looking for important genes related to a disease, we might want to include the age, BMI, smoking status and other potential confounders as covariates in the model to make the conclusion more robust.

In knockoff filtering, this can be done by fitting the model on the augmented matrix of the predictors, the predictors’ knockoffs and the covariates.
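This idea can be sketched with a plain least-squares statistic on simulated data. The sketch below is illustrative only: the variable names and the simple OLS-based score are ours, whereas the package functions use lasso-based scores internally.

```r
set.seed(1)
n <- 30; p <- 4
X  <- matrix(rnorm(n * p), n, p)
Xk <- matrix(rnorm(n * p), n, p)   # stand-in "knockoffs", for illustration only
Z  <- matrix(rnorm(n * 2), n, 2)   # two covariates (e.g. age, BMI)
y  <- X[, 1] + Z[, 1] + rnorm(n)

# fit on the augmented matrix [X, Xk, covariates]; covariates get no knockoffs
b <- coef(lm(y ~ cbind(X, Xk, Z)))[-1]        # drop the intercept
W <- abs(b[1:p]) - abs(b[(p + 1):(2 * p)])    # covariate coefficients are not scored
length(W)  # one importance score per original variable
```

The covariates absorb their share of the variance in y, so the X-vs-knockoff comparison is made on the covariate-adjusted signal.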

To include covariates in the knockoff filtering model, pass the covariates argument to the plsko_filter, plsAKO, ko_filter or AKO_withKO functions. These functions automatically adjust the importance scores for the covariates. Currently, our package implements covariate-adjusted importance scores based on coefficient differences from lasso regression (LCD) or other glmnet regressions.

Please note that the covariate variables should be continuous or binary. If you have a categorical covariate, convert it to dummy binary variables (one fewer than its number of levels, e.g. using one-hot encoding) before using the functions.
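For example, a three-level categorical covariate can be converted into two dummy columns with base R's model.matrix (the smoking variable here is made up for illustration):

```r
# a 3-level categorical covariate -> 2 reference-coded dummy columns
smoking <- factor(c("never", "former", "current", "never"))
smoking_dummies <- model.matrix(~ smoking)[, -1, drop = FALSE]  # drop the intercept column
smoking_dummies  # a 4 x 2 numeric matrix, suitable for the covariates argument
```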

As we don’t have a real dataset with covariates, we will generate a synthetic datasets with covariates and show how to use the PLSKO package to include covariates in the knockoff filtering model.

# generate a synthetic dataset with covariates
set.seed(1)
n = 100 # number of samples
p = 100
k = 10 # number of important variables with nonzeros coefficients
a = 5 # coefficient for important variables

p_conf = 5 # number of covariates

X <- MASS::mvrnorm(n, mu = rep(0, p), Sigma = toeplitz(0.5^(0:(p-1))))

covariates <- MASS::mvrnorm(n, mu = rep(0, p_conf), Sigma = diag(1, p_conf)) # generate covariates

# generate a continuous response variable from a linear model of X and covariate variables
nonzero = sample(1:p, k) # true signal: 19  70 100  51  59  81  52  91   3  55
beta = a * (1:p %in% nonzero) / sqrt(n)
y = X %*% beta + covariates %*% rnorm(p_conf) + rnorm(n)

Now we have a synthetic dataset with 100 samples, 100 variables, and 5 covariates (correlated with y).

Let’s see the selected variables using the plsAKO function with or without the covariates included in the model.

# run the PLSKO method without covariates
result_no_cov = plsAKO(X, y, q = 0.05, n_ko = 25, ncomp = 3, threshold.q = 0.95)
print(result_no_cov)
#> Call:
#> plsAKO(X = X, y = y, n_ko = 25, q = 0.05, ncomp = 3, threshold.q = 0.95)
#> 
#> Selected variables (AKO):
#> [1] 52
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#> 
#>  3 19 36 38 51 52 59 91 
#>  1  1  3  6  5  7  1  3
# run the PLSKO method with covariates
result_with_cov = plsAKO(X, y, q = 0.05, n_ko = 25, ncomp = 3, threshold.q = 0.95, covariates = covariates)
print(result_with_cov)
#> Call:
#> plsAKO(X = X, y = y, n_ko = 25, q = 0.05, covariates = covariates, 
#>     ncomp = 3, threshold.q = 0.95)
#> 
#> Selected variables (AKO):
#> [1]   3  19  51  52  55  59  70  91 100
#> Frequency of selected variables from 25 iterations of single-run knockoffs:
#> 
#>   3   4   5  18  19  24  28  34  51  52  53  55  59  70  81  86  91 100 
#>  24   3   1   6  24   1   4   1  22  20   1  20  21  20   3   1  24  19

We observed a dramatic increase in power with the covariates included in the model, as the fitted model is closer to the 'true' underlying relationship between X and y (i.e., y ~ X + covariates); in other words, y is 'denoised' by adjusting for the covariates.

8 Conclusion

This vignette provides a comprehensive guide on how to use the PLSKO package to select important variables in high-dimensional biological data. The package provides functions to generate the knockoff variables, calculate the importance scores, and perform the knockoff filtering and variable selection. The package is designed to be user-friendly and flexible, allowing users to easily apply the PLSKO method to their own data. More details about the functions and parameters can be found in the package documentation.

We hope this vignette helps you to get started with the PLSKO package and apply the PLSKO method to your own data. If you have any questions or feedback, please feel free to contact us or open an issue on the GitHub repository.

9 Session Info

sessionInfo()
#> R version 4.3.0 (2023-04-21 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
#> [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
#> [5] LC_TIME=English_Australia.utf8    
#> 
#> time zone: Australia/Sydney
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] knockoff_0.3.6 glasso_1.11    PLSKO_0.2.0   
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6        ellipse_0.5.0       shape_1.4.6.1      
#>  [4] xfun_0.50           bslib_0.9.0         ggplot2_3.5.1      
#>  [7] ggrepel_0.9.6       rstatix_0.7.2       lattice_0.22-6     
#> [10] vctrs_0.6.5         tools_4.3.0         generics_0.1.3     
#> [13] parallel_4.3.0      tibble_3.2.1        rARPACK_0.11-0     
#> [16] pkgconfig_2.0.3     Matrix_1.6-1.1      RColorBrewer_1.1-3 
#> [19] mixOmics_6.26.0     lifecycle_1.0.4     stringr_1.5.1      
#> [22] compiler_4.3.0      progress_1.2.3      Rdsdp_1.0.5.2.1    
#> [25] munsell_0.5.1       codetools_0.2-20    carData_3.0-5      
#> [28] htmltools_0.5.8.1   sass_0.4.9          yaml_2.3.10        
#> [31] glmnet_4.1-8        Formula_1.2-5       pillar_1.10.1      
#> [34] car_3.1-3           ggpubr_0.6.0        crayon_1.5.3       
#> [37] jquerylib_0.1.4     tidyr_1.3.1         MASS_7.3-58.4      
#> [40] BiocParallel_1.36.0 cachem_1.1.0        iterators_1.0.14   
#> [43] abind_1.4-8         foreach_1.5.2       RSpectra_0.16-2    
#> [46] pls_2.8-5           tidyselect_1.2.1    digest_0.6.37      
#> [49] stringi_1.8.4       reshape2_1.4.4      dplyr_1.1.4        
#> [52] purrr_1.0.2         splines_4.3.0       fastmap_1.2.0      
#> [55] grid_4.3.0          colorspace_2.1-1    cli_3.6.3          
#> [58] magrittr_2.0.3      survival_3.8-3      broom_1.0.7        
#> [61] corpcor_1.6.10      prettyunits_1.2.0   scales_1.3.0       
#> [64] backports_1.5.0     rmarkdown_2.29      matrixStats_1.5.0  
#> [67] igraph_2.1.4        gridExtra_2.3       ggsignif_0.6.4     
#> [70] hms_1.1.3           evaluate_1.0.3      knitr_1.49         
#> [73] doParallel_1.0.17   rlang_1.1.5         Rcpp_1.0.14        
#> [76] glue_1.8.0          rstudioapi_0.17.1   jsonlite_1.8.9     
#> [79] plyr_1.8.9          R6_2.5.1