In this document we describe the use of the coda_glmnet_longitudinal() function for the identification of microbial signatures in longitudinal studies. A detailed description of the methodology can be found in Calle M.L. and Susin A. (2022) Identification of dynamic microbial signatures in longitudinal studies. BioRxiv. https://www.biorxiv.org/content/10.1101/2022.04.25.489415v1

Zero imputation: The algorithm checks whether the data contains zeros and, if so, it implements a simple zero imputation (see the help of coda4microbiome::impute_zeros() function). However, the users can impute zeros before using coda4microbiome functions with more advanced methods such as those implemented in the R package zCompositions (Palarea-Albaladejo and MartinFernandez, 2015)

1 coda_glmnet_longitudinal for longitudinal data

We illustrate the coda_glmnet_longitudinal algorithm with the ecam_filtered dataset (included in the package).

ecam_filtered dataset contains microbiome compositions at genus level from the Early childhood and the microbiome (ECAM) study (Bokulich et al. 2016). Metadata and microbiome data were downloaded from https://github.com/caporaso-lab/longitudinal-notebooks.

We used function coda4microbiome::filter_longitudinal() to filter those individuals and taxa with enough information for time-course profiling. The filtered data contains information on 42 children and 37 taxa.

The dataset contains three objects:

We focus on the effects of the diet on the early modulation of the microbiome by comparing microbiome profiles between children with breastmilk diet (bd) vs. formula milk diet (fd) in their first 3 months of life. Among the 42 children, 30 were bd and 12 fd.

library(coda4microbiome)

data(ecam_filtered, package = "coda4microbiome")

x=x_ecam # microbiome abundance
x_time = metadata$day_of_life;    # observation times
subject_id = metadata$studyid;   # subject id
y= metadata$diet;           # diet ("bd"= breast diet, "fd"=formula diet)
ini_time = 0;
end_time = 90;

We implement the analysis with function coda_glmnet_longitudinal() with the default parameters except for lambda and nfolds. Because of the small sample size in “fd” group we specified nfolds=4 for the cross-validation penalized regression and set lambda="lambda.min" after observing the cross-validation accuracy (AUC) curve.

set.seed(123)

ecam <-coda_glmnet_longitudinal (x,y, x_time, subject_id, ini_time, end_time, lambda="lambda.min",nfolds=4)

The first plot shows the cross-validation accuracy (AUC) curve from cv.glmnet(). We can see that the default for lambda=“lambda.min” the algorithm selects 8 pairwise log-ratios that correspond to 8 different taxa (see below).

The output of coda_glmnet_longitudinal provides the number, the name and the coefficients of the selected taxa:

ecam$taxa.num
#> [1]  1  7 16 18 25 26 30 35

ecam$taxa.name
#> [1] "23"  "166" "205" "208" "223" "224" "240" "419"

ecam$`log-contrast coefficients`
#> [1]  0.21593417 -0.56720510  0.00556157  0.27307317  0.33590267  0.13581618
#> [7]  0.03371223 -0.43279490

The algorithm identified a microbiome signature with maximum discrimination accuracy between the two diet groups. The signature is defined by the relative abundances of two groups of taxa, G1 and G2, where G1 is composed of 6 taxa (those with a positive coefficient in the regression model) and G2 is composed of 2 taxa (those with a negative coefficient):

Taxa in G1 (ordered by importance):

coef<-ecam$`log-contrast coefficients`
positives<-which(coef>=0)
op<-order(coef[positives], decreasing = TRUE)
taxanames[positives[op]]
#> [1] "k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Porphyromonadaceae;g__Parabacteroides"  
#> [2] "k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides"          
#> [3] "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinomycetaceae;g__Actinomyces"  
#> [4] "k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Rikenellaceae;g__"                      
#> [5] "k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus"              
#> [6] "k__Bacteria;p__Actinobacteria;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Eggerthella"

Taxa in G2 (ordered by importance):

negatives<-which(coef<0)
on<-order(abs(coef[coef<0]), decreasing = TRUE)
taxanames[negatives[on]]
#> [1] "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Bifidobacteriales;f__Bifidobacteriaceae;g__Bifidobacterium"
#> [2] "k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Enterococcaceae;g__Enterococcus"

This is represented in the signature plot, a bar plot that shows the selected taxa and their estimated regression coefficients.

ecam$`signature plot`

The function also provides a plot of the signature trajectories for the different children according to their diet:

We can see a clear separation between the two groups. Those children with breastmilk diet have more relative mean abundance of g_Haemophilus and g_Staphylococcus with respect to the relative abundance of taxa in group G1, while children with formula milk diet have more relative abundance of taxa in group G1 relative to G2.

The output of the coda_glmnet() function also includes the predictions of the signature for each of the individuals in the sample:

head(ecam$predictions)
#>             [,1]
#> [1,] -0.62530837
#> [2,] -0.01926814
#> [3,] -0.70777170
#> [4,]  0.93713640
#> [5,] -1.79162762
#> [6,] -0.62548766

The algorithm provides three classification accuracy measures: the apparent AUC (the AUC of the signature applied to the same data that was used to generate the model) and the mean and sd of the cross-validation AUC (obtained from the output of cv.glmnet()):

ecam$`apparent AUC`
#> [1] 0.9611111
ecam$`mean cv-AUC`
#> [1] 0.7470238
ecam$`sd cv-AUC`
#> [1] 0.1054345

The classification accuracy of the signature is displayed in the predictions plot:

ecam$`predictions plot`