Chapter 4 coda-lasso
Coda-lasso implements penalised regression on a log-contrast model (a regression model on log-transformed covariates and a zero-sum constraint on the regression coefficients, except the intercept) (Lu, Shi, and Li 2019; Lin et al. 2014).
As we mentioned before, coda-lasso is not yet available as an R package. Our R code for implementing the algorithm includes two main functions that are coda_logistic_lasso() and coda_logistic_elasticNet() that implement LASSO or elastic-net penalisation on a logistic regression model for a binary outcome.
Details of function coda_logistic_lasso(y,X,lambda):
y is the binary outcome, can be numerical (values 0 and 1), factor (2 levels) or categorical (2 categories).
X is the matrix of microbiome abundances, either absolute abundances (counts) or relative abundances (proportions), the rows of X are individuals/samples, the columns are taxa.
lambda is the penalisation parameter, the larger the value of lambda the smaller the number of variables selected.
Notes:
- Imputation of zeros: The user should provide a matrix X of positive values without zeros. Zeros should be previously added with an offset of 1 by the user.
- Log-transformation: X should not be the matrix of log(counts) or log(proportions). The method itself performs the log-transformation of the abundances.
- Selection of \(\lambda\): Functions lambdaRange_codalasso() and lambdaRange_elasticnet() are useful to find the optimum value of the penalisation parameter \(\lambda\). Function lambdaRange_codalasso(y,X) provides the number of selected variables and the proportion of explained deviance for a sequence of values for the penalised parameter \(\lambda\).
Bellow, we illustrate the use of coda_logistic_lasso() on the Crohn and HFHS-Day1 case studies. We also generated a wrapper function called coda_lasso_wrapper() that will help us to handle the output of coda-lasso. The coda_lasso_wrapper() function is uploaded via functions.R.
4.1 Crohn case study
To run coda-lasso, we must specify a value of \(\lambda\), the penalisation parameter: the larger the value of \(\lambda\), the less variables will be selected. In previous analysis of this dataset with selbal, a balance with 12 variables was determined optimal to discriminate the CD status (Rivera-Pinto et al. 2018). For ease of comparison, we will specify the penalised parameter that results in the selection of 12 variables for coda-lasso. Function lambdaRange_codalasso() helps us to identify the value of \(\lambda\) that corresponds to 12 variables selected. To save space, we only consider a sequence of \(\lambda\) from 0.1 to 0.2 with an increment of 0.01, but you can start from a broader range:
lambdaRange_codalasso(X = x_Crohn, y = y_Crohn, lambdaSeq = seq(0.1, 0.2, 0.01))
## [1] "lambda" "num.selected" "prop.explained.dev"
## 0.1000 23 0.2473
## 0.1100 18 0.2398
## 0.1200 17 0.2355
## 0.1300 25 0.2288
## 0.1400 13 0.2185
## 0.1500 12 0.2130
## 0.1600 12 0.2043
## 0.1700 14 0.2010
## 0.1800 11 0.1908
## 0.1900 10 0.1903
## 0.2000 8 0.1775
In this output, the first column is the value of \(\lambda\), the second column is the number of selected variables and the third column is the proportion of deviance explained. According to this results, we will take \(\lambda = 0.15\).
codalasso_Crohn <- coda_logistic_lasso(X = x_Crohn, y = y_Crohn, lambda = 0.15)
The same as the previous two methods, we also use a wrapper function called coda_lasso_wrapper() with the output from coda_logistic_lasso().
Crohn.results_codalasso <- coda_lasso_wrapper(result = codalasso_Crohn, X = x_Crohn)
Then we get the number of selected variables:
Crohn.results_codalasso$numVarSelect
## [1] 12
and the names of selected variables:
Crohn.results_codalasso$varSelect
## [1] "g__Roseburia" "g__Streptococcus"
## [3] "g__Dialister" "f__Peptostreptococcaceae_g__"
## [5] "g__Aggregatibacter" "g__Eggerthella"
## [7] "g__Prevotella" "g__Dorea"
## [9] "g__Bilophila" "o__Lactobacillales_g__"
## [11] "g__Lachnospira" "g__Clostridium"
4.2 HFHS-Day1 case study
The analysis on HFHSday1 data is similar to Crohn data.
Using function lambdaRange_codalasso(), we explore the number of selected OTUs and the proportion of explained variance for different values of \(\lambda\). To save space, we only consider a sequence of \(\lambda\) from 1.3 to 1.8 with an increment of 0.01, but you can start from a broader range:
lambdaRange_codalasso(X = x_HFHSday1, y = y_HFHSday1, lambdaSeq = seq(1.3, 1.8, 0.01))
## [1] "lambda" "num.selected" "prop.explained.dev"
## 1.3000 8 1.0000
## 1.3100 8 1.0000
## 1.3200 8 1.0000
## 1.3300 8 1.0000
## 1.3400 8 1.0000
## 1.3500 8 1.0000
## 1.3600 8 1.0000
## 1.3700 8 1.0000
## 1.3800 9 1.0000
## 1.3900 9 1.0000
## 1.4000 9 1.0000
## 1.4100 9 1.0000
## 1.4200 9 1.0000
## 1.4300 9 1.0000
## 1.4400 7 1.0000
## 1.4500 7 1.0000
## 1.4600 7 1.0000
## 1.4700 7 1.0000
## 1.4800 6 0.6344
## 1.4900 6 0.6344
## 1.5000 6 0.6344
## 1.5100 6 0.6343
## 1.5200 6 0.6343
## 1.5300 6 0.6343
## 1.5400 6 0.6343
## 1.5500 6 0.6343
## 1.5600 6 0.6342
## 1.5700 3 0.0108
## 1.5800 3 0.0089
## 1.5900 3 0.0130
## 1.6000 3 0.0241
## 1.6100 3 0.0290
## 1.6200 3 0.0441
## 1.6300 3 0.0481
## 1.6400 3 0.0644
## 1.6500 3 0.0667
## 1.6600 3 0.0688
## 1.6700 3 0.0700
## 1.6800 3 0.0822
## 1.6900 3 0.0829
## 1.7000 3 0.0835
## 1.7100 3 0.0841
## 1.7200 3 0.0847
## 1.7300 3 0.0854
## 1.7400 3 0.0860
## 1.7500 3 0.0866
## 1.7600 3 0.0872
## 1.7700 3 0.0878
## 1.7800 3 0.0884
## 1.7900 3 0.0890
## 1.8000 3 0.0895
The largest \(\lambda\) that provides maximum proportion of explained deviance is \(\lambda = 1.47\). Thus, we implement coda-lasso with this value of \(\lambda\).
codalasso_HFHSday1 <- coda_logistic_lasso(X = x_HFHSday1, y = y_HFHSday1, lambda = 1.47)
We then use coda_lasso_wrapper() to handle the results.
HFHS.results_codalasso <- coda_lasso_wrapper(result = codalasso_HFHSday1, X = x_HFHSday1)
The number of selected variables is 7:
HFHS.results_codalasso$numVarSelect
## [1] 7
The proportion of explained deviance by the selected OTUs is 1 meaning that they completely discriminate the two diet groups:
HFHS.results_codalasso$explained_deviance_proportion
## [1] 1
The names of the selected OTUs are:
HFHS.results_codalasso$varSelect
## [1] "348038" "400599" "192222" "198339" "265322" "263479" "175272"
By using the taxomonic table, we extract the taxonomic information of these selected OTUs.
HFHS.tax_codalasso <- taxonomy_HFHS[which(rownames(taxonomy_HFHS) %in%
HFHS.results_codalasso$varSelect), ]
kable(HFHS.tax_codalasso[ ,2:6], booktabs = T)
Phylum | Class | Order | Family | Genus | |
---|---|---|---|---|---|
192222 | Bacteroidetes | Bacteroidia | Bacteroidales | Prevotellaceae | |
265322 | Bacteroidetes | Bacteroidia | Bacteroidales | S24-7 | |
263479 | Bacteroidetes | Bacteroidia | Bacteroidales | S24-7 | |
400599 | Firmicutes | Clostridia | Clostridiales | ||
348038 | Bacteroidetes | Bacteroidia | Bacteroidales | S24-7 | |
198339 | Bacteroidetes | Bacteroidia | Bacteroidales | S24-7 | |
175272 | Bacteroidetes | Bacteroidia | Bacteroidales | S24-7 |
References
Lin, Wei, Pixu Shi, Rui Feng, and Hongzhe Li. 2014. “Variable Selection in Regression with Compositional Covariates.” Biometrika 101 (4). Oxford University Press: 785–97.
Lu, Jiarui, Pixu Shi, and Hongzhe Li. 2019. “Generalized Linear Models with Linear Constraints for Microbiome Compositional Data.” Biometrics 75 (1). Wiley Online Library: 235–44.
Rivera-Pinto, J, JJ Egozcue, Vera Pawlowsky-Glahn, Raul Paredes, Marc Noguera-Julian, and ML Calle. 2018. “Balances: A New Perspective for Microbiome Analysis.” MSystems 3 (4). Am Soc Microbiol: e00053–18.