1 Introduction

Understanding the role of the microbiome in human health and how it can be modulated is becoming increasingly relevant for preventive medicine and for the medical management of chronic diseases (Calle 2019). High-throughput sequencing technologies has boosted microbiome research but the compositional nature of microbiome data is a major challenge for their analysis.

Microbiome count data is compositional since their total are constrained by the sequencing depth. Relative abundances (proportions) are obviously constraint by a sum equal to one. This total constraint induces strong dependencies among the observed abundances of the different taxa. In fact, nor the absolute abundance (read counts) nor the relative abundance (proportion) of one taxon alone are informative of the real abundance of the taxon in the environment. Instead, they provide information on the relative measure of abundance when compared to the abundance of other taxa in the same sample.

We introduce a new package, coda4microbiome, that aims to bridge the gap between microbiome research and compositional data analysis (CoDA).

2 Package functionality

Our package provides a set of functions to explore and study microbiome data within the CoDA framework, with a special focus on identification of microbial signatures that can serve as biomarkers of disease risk and prognostic. Their prediction accuracy relies on the selection of the taxa that constitute the signature, which is challenging given the sparsity, multivariate and compositional inherent characteristics of microbiome data (Susin et al. 2020).

coda4microbiome performs variable selection through penalized regression in cross-sectional studies, with both binary and continuous outcome. In addition, the package incorporates a new approach for the analysis of longitudinal microbiome studies with a binary outcome.

Penalized regression implementation relies on the function cv.glmnet() from the R package glmnet adapted to CoDA by using all pairwise log-ratios of the variables (Bates and Tibshirani, 2018). The results are expressed as the (weighted) balance between two groups of taxa, those that contribute positively to the microbial signature and those that contribute negatively (Susin et al. 2020).

The interpretability of results is of major importance in this context. The package provides several graphical representations for a better interpretation of the analysis and the identified microbial signatures.

2.1 Functions for microbial signature identification

  • coda_glmnet: Identification of microbial signatures in cross-sectional studies.

    The algorithm performs variable selection through penalized regression on the set of all pairwise log-ratios for both, binary outcome (logistic regression) and continuous outcome (linear regression). The result is expressed as the (weighted) balance between two groups of taxa. It allows the use of non-compositional covariates.

    A graphical representation of the taxa that constitutes the signature and their coefficients is provided. The output also includes a plot of the prediction or classification accuracy.

  • coda_glmnet_longitudinal: Identification of microbial signatures in longitudinal studies.

    Identification of a set of microbial taxa whose joint dynamics is associated with the phenotype of interest (binary). The algorithm performs variable selection through penalized regression over the summary of the log-ratio trajectories (AUC). The result is expressed as the (weighted) balance between two groups of taxa.

    The output provides three plots: the taxa that constitutes the signature and their coefficients, the classification accuracy of the signature and the plot of the signature trajectories of the individuals.

2.2 Functions for log-ratio exploratory analysis

Previously or independently of variable selection for microbial signature identification, one may be interested in the exploratory analysis of pairwise log-ratios.

The interpretation of results of log-ratio analysis is challenging because when one taxon A is highly associated with the outcome, any log-ratio involving taxon A is likely to be associated with Y, no matter which is the second taxon involved in the log-ratio. Here we summarize the importance of each taxon A by aggregating the prediction accuracy of all log-ratios that involve taxon A.

  • explore_logratios: Explores the association of each log-ratio with the outcome.

  • explore_lr_longitudinal: Explores the association of a summary (integral) of each log-ratio trajectory with the outcome.

    Both functions summarize the importance of each taxon as the aggregation of the association measures of those log-ratios involving the taxon. The output includes a plot of the association of the log-ratio with the outcome where the taxa are ranked by importance

2.3 Suplementary functions

  • explore_zeros:

    Provides the proportion of zeros for a pair of variables (taxa) in table x and the proportion of samples with zero in both variables. A bar plot with this information is also provided. Results can be stratified by a categorical variable.

  • impute_zeros:

    Simple imputation: When the abundance table contains zeros, a positive value is added to all the values in the table. It adds 1 when the minimum of table is larger than 1 (i.e. tables of counts) or it adds half of the minimum value of the table, otherwise.

  • logratios_matrix:

    Computes the matrix with of all pairwise log-ratios between taxa

  • plot_prediction:

    Plot of the predictions of a fitted model (microbial signature): Multiple box-plot and density plots for binary outcomes and Regression plot for continuous outcome

  • plot_signature:

    Graphical representation of the variables selected and their coefficients

  • coda_glmnet_null: Performs a permutational test for the coda_glmnet() algorithm

    It provides the distribution of results under the null hypothesis by implementing the coda_glmnet() on different rearrangements of the response variable.

  • filter_longitudinal: Filters those individuals and taxa with enough longitudinal information

  • coda_glmnet_longitudinal_null: Performs a permutational test for the coda_glmnet_longitudinal() algorithm

    It provides the distribution of results under the null hypothesis by implementing the coda_glmnet_longitudinal() on different rearrangements of the response variable.

  • shannon: Shannon information

  • shannon_effnum: Shannon effective number of variables in a composition

  • shannon_sim: Shannon similarity between two compositions