In this document we describe the use of the coda_glmnet()
function for the identification of microbial signatures in cross-sectional studies.
Zero imputation: The algorithm checks whether the data contains zeros and, if so, it implements a simple zero imputation (see the help of coda4microbiome::impute_zeros()
function). However, the users can impute zeros before using coda4microbiome functions with more advanced methods such as those implemented in the R package zCompositions (Palarea-Albaladejo and MartinFernandez, 2015)
We illustrate the coda_glmnet
algorithm with the Crohn dataset (included in the package).
Crohn dataset includes a matrix, x_Crohn
, of microbiome compositions at genus level (48 taxa) for 975 individuals from a Crohn’s disease study (662 patients with Crohn’s disease and 313 controls) and an factor variable, y_Crohn
, indicating if the sample corresponds to a case (CD) or a control (no).
We call function coda_glmnet()
with the default parameters:
library(coda4microbiome)
set.seed(123) # to reproduce the results
data(Crohn, package = "coda4microbiome") # load the data
coda_glmnet_Crohn<-coda_glmnet(x=x_Crohn,y=y_Crohn)
The algorithm identifies that the outcome is binary and implements a penalized logistic regression.
The first plot shows the cross-validation accuracy (AUC) curve from cv.glmnet()
. We can see that the default lambda (“lambda.1se”) selects 27 pairwise log-ratios that, as we see below, correspond to 24 different taxa.
The results of coda_glmnet include the number, the name and the coefficients of the selected taxa:
coda_glmnet_Crohn$taxa.num
#> [1] 1 4 5 8 9 10 15 18 19 20 26 27 28 31 32 33 34 37 38 39 40 44 45 48
length(coda_glmnet_Crohn$taxa.num) # number of selected taxa
#> [1] 24
coda_glmnet_Crohn$taxa.name
#> [1] "g__Turicibacter" "g__Sutterella"
#> [3] "f__Peptostreptococcaceae_g__" "g__Bacteroides"
#> [5] "g__Eggerthella" "g__Faecalibacterium"
#> [7] "g__Blautia" "g__Dorea"
#> [9] "g__Dialister" "g__Anaerostipes"
#> [11] "g__Oscillospira" "o__Lactobacillales_g__"
#> [13] "g__Adlercreutzia" "g__Prevotella"
#> [15] "g__Roseburia" "g__Lachnospira"
#> [17] "o__Clostridiales_g__" "f__Ruminococcaceae_g__"
#> [19] "g__Clostridium" "g__Streptococcus"
#> [21] "g__Aggregatibacter" "g__Collinsella"
#> [23] "g__Holdemania" "g__Bilophila"
coda_glmnet_Crohn$`log-contrast coefficients`
#> [1] 0.013010872 -0.020016941 0.164754951 0.155094293 -0.138182163
#> [6] 0.011063385 -0.002597898 -0.096701551 -0.109790538 0.064335733
#> [11] -0.109765138 -0.051361152 -0.164948478 0.031393421 0.334531688
#> [16] 0.037824859 0.109558776 -0.030427617 -0.024800066 -0.114461353
#> [21] -0.117553698 -0.019393408 0.041706398 0.036725623
By construction, the sum of the coefficients is equal to zero:
sum(coda_glmnet_Crohn$`log-contrast coefficients`)
#> [1] -2.775558e-17
This constraint means that the model is a log-contrast function and this ensures the invariance principle for compositional data analysis.
The microbiome signature is defined by the relative abundances of two groups of taxa, G1 (those taxa with positive coefficient) and G2 (those with negative coefficient).
Taxa in G1 (ordered by importance):
coef<-coda_glmnet_Crohn$`log-contrast coefficients`
positives<-which(coef>=0)
op<-order(coef[positives], decreasing = TRUE)
coda_glmnet_Crohn$taxa.name[positives[op]]
#> [1] "g__Roseburia" "f__Peptostreptococcaceae_g__"
#> [3] "g__Bacteroides" "o__Clostridiales_g__"
#> [5] "g__Anaerostipes" "g__Holdemania"
#> [7] "g__Lachnospira" "g__Bilophila"
#> [9] "g__Prevotella" "g__Turicibacter"
#> [11] "g__Faecalibacterium"
Taxa in G2 (ordered by importance):
negatives<-which(coef<0)
on<-order(abs(coef[coef<0]), decreasing = TRUE)
coda_glmnet_Crohn$taxa.name[negatives[on]]
#> [1] "g__Adlercreutzia" "g__Eggerthella" "g__Aggregatibacter"
#> [4] "g__Streptococcus" "g__Dialister" "g__Oscillospira"
#> [7] "g__Dorea" "o__Lactobacillales_g__" "f__Ruminococcaceae_g__"
#> [10] "g__Clostridium" "g__Sutterella" "g__Collinsella"
#> [13] "g__Blautia"
A graphical representation of the signature can obtained with the signature plot that represents in a bar plot the selected taxa and their estimated regression coefficients.
coda_glmnet_Crohn$`signature plot`
The output of the coda_glmnet()
function also includes the predictions of the signature for each of the individuals in the sample:
head(coda_glmnet_Crohn$`predictions`)
#> [,1]
#> 1939.SKBTI.0175 -0.30698303
#> 1939.SKBTI.1068 1.94796161
#> 1939.SKBTI047 1.94584281
#> 1939.SKBTI051 0.95388884
#> 1939.SKBTI063 -0.04902361
#> 1939.SKBTI072 0.52612178
The algorithm provides three classification accuracy measures: the apparent AUC (the AUC of the signature applied to the same data that was used to generate the model) and the mean and sd of the cross-validation AUC (obtained from the output of cv.glmnet()
):
coda_glmnet_Crohn$`apparent AUC`
#> [1] 0.8479774
coda_glmnet_Crohn$`mean cv-AUC`
#> [1] 0.8240901
coda_glmnet_Crohn$`sd cv-AUC`
#> [1] 0.00811983
Finally, we can obtain the predictions plot that provides a graphical assessment of the classification accuracy of the microbiome signature:
coda_glmnet_Crohn$`predictions plot`
We can explore the significance of the Crohn analysis by performing a permutational test. This is specially advised for datasets with more variables than individuals, where over-fitting is very likely.
Function coda_glmnet_null()
implements this permutational test. It provides the distribution of mean cv-AUC under the null hypothesis by implementing the coda_glmnet()
on different rearrangements of the response variable y.
null_acc<-coda_glmnet_null(x=x_Crohn,y=y_Crohn, niter=100)
summary(null_acc$"accuracy")
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.4735 0.4883 0.4916 0.5035 0.5173 0.5537
null_acc$"confidence interval"
#> 2.5% 97.5%
#> 0.4746088 0.5495805
We can see that the mean cv-AUC is far from the region of expected values under the null.
We illustrate the analysis of a continuous outcome with the sCD14 dataset (included in the package).
sCD14 dataset contains information on the microbiome composition and the inflammatory parameter sCD14 from an HIV study (Noguera-Julian et al. 2016): x_sCD14
is the microbiome abundance matrix for 151 individuals (rows) and 60 genera (columns) and y_sCD14
is a numeric variable with the value of the inflammation parameter sCD14 for each individual.
In this case we specify the penalized parameter as lambda = "lambda.min"
because with the default parameter (“lambda.1se”) very few variables were selected:
library(coda4microbiome)
set.seed(123) # to reproduce the results
data(sCD14, package = "coda4microbiome")
coda_glmnet_SCD14<-coda_glmnet(x=x_sCD14,y=y_sCD14, lambda = "lambda.min")
In the cross-validation accuracy (MSE) curve from cv.glmnet()
, we see that 20 pairwise log-ratios are selected for lambda= “lambda.min”. This correspond to 22 different taxa (see below).
As in the binary case, the results of coda_glmnet provide the column number, the name and the coefficients of the selected taxa:
coda_glmnet_SCD14$taxa.num
#> [1] 2 3 4 9 10 12 15 16 17 21 22 24 26 27 29 30 36 44 49 55 57 59
length(coda_glmnet_SCD14$taxa.num) # number of selected taxa
#> [1] 22
head(coda_glmnet_SCD14$taxa.name)
#> [1] "g_Faecalibacterium" "g_Bacteroides"
#> [3] "f_Lachnospiraceae_g_unclassified" "g_Parabacteroides"
#> [5] "g_Lachnospira" "f_Lachnospiraceae_g_Incertae_Sedis"
head(coda_glmnet_SCD14$`log-contrast coefficients`)
#> [1] 0.09665921 0.06410267 -0.39100093 -0.06410267 -0.03464772 0.26437739
The two groups G1 and G2 that define the signature are:
Taxa in G1 (ordered by importance):
coef<-coda_glmnet_SCD14$`log-contrast coefficients`
positives<-which(coef>=0)
op<-order(coef[positives], decreasing = TRUE)
coda_glmnet_SCD14$taxa.name[positives[op]]
#> [1] "f_Lachnospiraceae_g_Incertae_Sedis"
#> [2] "g_Subdoligranulum"
#> [3] "g_Thalassospira"
#> [4] "g_Dialister"
#> [5] "g_Faecalibacterium"
#> [6] "g_Bacteroides"
#> [7] "g_Dorea"
#> [8] "g_Desulfovibrio"
#> [9] "f_Peptostreptococcaceae_g_Incertae_Sedis"
#> [10] "g_Alistipes"
Taxa in G2 (ordered by importance):
negatives<-which(coef<0)
on<-order(abs(coef[coef<0]), decreasing = TRUE)
coda_glmnet_SCD14$taxa.name[negatives[on]]
#> [1] "f_Lachnospiraceae_g_unclassified" "g_Collinsella"
#> [3] "g_Bifidobacterium" "g_Mitsuokella"
#> [5] "g_Parabacteroides" "g_Lachnospira"
#> [7] "o_Clostridiales_g_unclassified" "g_Catenibacterium"
#> [9] "g_Ruminococcus" "g_Odoribacter"
#> [11] "f_Porphyromonadaceae_g_unclassified" "g_Coprococcus"
For a continuous outcome, the prediction accuracy measures are: the apparent R-squared (the squared of the Pearson correlation between the outcome and the predictions by the signature) and the mean and sd of the cross-validation MSE (obtained from the output of cv.glmnet()
):
coda_glmnet_SCD14$Rsq
#> NULL
coda_glmnet_SCD14$`mean cv-MSE`
#> [1] 7041310
coda_glmnet_SCD14$`sd cv-MSE`
#> [1] 1828685
We obtain the signature plot with the selected taxa and their estimated regression coefficients:
coda_glmnet_SCD14$`signature plot`
The predictions plot in this case is a scatter plot with the corresponding regression line that represents the linear association between the microbiome signature and the outcome:
coda_glmnet_SCD14$`predictions plot`