Chapter 1 Introduction

This vignette supports the paper “Variable selection in microbiome compositional data analysis” by Susin et al. (2020), which assesses three compositional data analysis (CoDA) algorithms for microbiome variable selection:

  • selbal: a forward selection method for the identification of two groups of taxa whose balance is most associated with the response variable (Rivera-Pinto et al. 2018);
  • clr-lasso: penalized regression after the centered log-ratio (clr) transformation (Zou and Hastie 2005; Tibshirani 1996; Le Cessie and Van Houwelingen 1992);
  • coda-lasso: penalized log-contrast regression (log-transformed abundances and a zero-sum constraint on the regression coefficients) (Lu, Shi, and Li 2019; Lin et al. 2014).
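To make the transformations behind these methods concrete, here is a minimal sketch on a toy count matrix. The object `toy` and the functions `clr_transform` and `balance` are hypothetical names introduced for illustration; they are not part of selbal or of the coda-lasso code. The balance is written in its usual form as a scaled difference of mean log-abundances between two groups of taxa.

```r
# Toy composition: 3 samples x 4 taxa, strictly positive counts (hypothetical data)
toy <- matrix(c(10, 20, 30, 40,
                 5,  5, 10, 80,
                 1,  2,  3,  4),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("s", 1:3), paste0("taxon", 1:4)))

# Centered log-ratio: log of each part minus the row mean of the logs
clr_transform <- function(x) {
  logx <- log(x)
  logx - rowMeans(logx)
}

# Balance between two groups of taxa (numerator vs. denominator):
# sqrt(k1 * k2 / (k1 + k2)) * (mean log numerator - mean log denominator)
balance <- function(x, num, den) {
  k1 <- length(num)
  k2 <- length(den)
  logx <- log(x)
  sqrt(k1 * k2 / (k1 + k2)) *
    (rowMeans(logx[, num, drop = FALSE]) - rowMeans(logx[, den, drop = FALSE]))
}

toy_clr <- clr_transform(toy)
rowSums(toy_clr)                       # numerically zero for every sample

toy_bal <- balance(toy, num = 1:2, den = 3:4)

# Log-contrast with zero-sum coefficients: unchanged when samples are rescaled
beta <- c(1, -0.5, -0.5, 0)            # illustrative coefficients summing to zero
lc_counts <- log(toy) %*% beta
lc_props  <- log(toy / rowSums(toy)) %*% beta
all.equal(lc_counts, lc_props)
```

Samples `s1` and `s3` carry the same relative information (one is a rescaled copy of the other), so their clr values and balances coincide. This scale invariance is the reason the zero-sum constraint is imposed in log-contrast regression.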

Among them, coda-lasso is not yet available as an R package, but the R code implementing the algorithm is available on GitHub: https://github.com/UVic-omics/CoDA-Penalized-Regression. We therefore clone the repository first. Cloning is only needed once; afterwards, the local copy can be updated by pulling the latest version.

# clone the repository from https://github.com/UVic-omics/CoDA-Penalized-Regression
system('git clone https://github.com/UVic-omics/CoDA-Penalized-Regression')

# once the repository has been cloned, pull the latest version from
# https://github.com/UVic-omics/CoDA-Penalized-Regression
# (-C runs git inside the cloned directory)
# system('git -C CoDA-Penalized-Regression pull')
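The clone-once, pull-afterwards logic can also be expressed as a small guard. This is only a sketch: `git_command` is a hypothetical helper, it assumes the repository is cloned into the working directory under its default name, and it only builds the command string so that nothing is downloaded until `system()` is called explicitly.

```r
repo_url <- 'https://github.com/UVic-omics/CoDA-Penalized-Regression'
repo_dir <- 'CoDA-Penalized-Regression'   # assumed location of the local clone

# Hypothetical helper: choose between clone and pull without running git
git_command <- function(url, dir) {
  if (dir.exists(dir)) {
    paste('git -C', shQuote(dir), 'pull')   # update an existing clone
  } else {
    paste('git clone', url, shQuote(dir))   # first-time download
  }
}

cmd <- git_command(repo_url, repo_dir)
print(cmd)
# system(cmd)  # uncomment to actually run the command
```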

This vignette only shows the application of the methods to the case studies. The code and datasets related to the paper, including the simulations, are all available on GitHub: https://github.com/UVic-omics/Microbiome-Variable-Selection

1.1 Package installation and loading

Install then load the following packages:

# cran.packages <- c('knitr', 'glmnet', 'ggplot2', 'gridExtra',
#                    'UpSetR', 'ggforce')
# install.packages(cran.packages)
# devtools::install_github(repo = 'UVic-omics/selbal')

library(knitr) # bookdown, kable
library(glmnet) # glmnet
library(selbal) # selbal
library(ggplot2) # draw selbal
library(gridExtra) # grid.arrange
library(UpSetR) # upset
library(ggforce) # selbal-like plot
library(grid) # grid.draw
# source coda-lasso functions
source(file = './CoDA-Penalized-Regression/R/functions_coda_penalized_regression.R')

# helper functions provided with the vignette
source(file = 'functions.R')
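Before loading, it can help to check which of the packages listed above are not yet installed. The following is a minimal sketch; `missing_packages` is a hypothetical helper, not a function from any of these packages.

```r
# Packages used in this vignette (selbal is installed from GitHub, not CRAN)
needed <- c('knitr', 'glmnet', 'selbal', 'ggplot2', 'gridExtra',
            'UpSetR', 'ggforce', 'grid')

# Hypothetical helper: names of the packages that cannot be found
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_packages(needed)   # character(0) when everything is installed
```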

1.2 Example datasets

1.2.1 Crohn’s disease

Crohn’s disease (CD) is an inflammatory bowel disease that has been linked to microbial alterations in the gut. The pediatric CD study (Gevers et al. 2014) includes 975 individuals: 662 patients with Crohn’s disease and 313 without any symptoms. The processed data, from 16S rRNA gene sequencing after QIIME 1.7.0, were downloaded from Qiita (Gonzalez et al. 2018), study ID 1939. The OTU table was agglomerated to the genus level, resulting in a matrix with 48 genera and 975 samples (see Table 1.1).

Load the data as follows:

load('./datasets/Crohn_data.RData')

File “Crohn_data.RData” contains three objects:

x_Crohn: the abundance table, a data frame of counts with 975 rows (individuals) and 48 columns (genera)

class(x_Crohn)
## [1] "data.frame"
dim(x_Crohn)
## [1] 975  48

y_Crohn: a factor variable, indicator of disease status (CD vs. not CD)

class(y_Crohn)
## [1] "factor"
summary(y_Crohn)
##  CD  no 
## 662 313

y_Crohn_numeric: a numerical variable with values 1 (CD) and 0 (not CD)

class(y_Crohn_numeric)
## [1] "numeric"
table(y_Crohn_numeric)
## y_Crohn_numeric
##   0   1 
## 313 662
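The numeric response is provided ready-made in the data file. As a sketch of how such a coding could be derived from a factor, the example below uses a toy factor `y_toy` (hypothetical values) with the same two levels as y_Crohn.

```r
# Toy factor mimicking the levels of y_Crohn (hypothetical values)
y_toy <- factor(c('CD', 'no', 'CD', 'CD', 'no'))

# 1 for CD, 0 otherwise
y_toy_numeric <- as.numeric(y_toy == 'CD')

# Cross-check the coding against the factor
table(y_toy, y_toy_numeric)
```

Note that `as.numeric(y_toy)` would instead return the internal level codes (1 and 2), which is why the comparison with `'CD'` is used here.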

Note: x_Crohn contains no zeros. The original matrix of counts (X) was transformed by adding one count to each matrix cell: x_Crohn = X + 1. The original counts can therefore be recovered by subtracting one from each cell, and other zero imputation methods can then be applied.
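A minimal sketch of this recovery, using a small toy data frame `x_toy` as a hypothetical stand-in for x_Crohn. The 0.5 pseudo-count shown afterwards is only an illustrative alternative and is an assumption of this example, not the choice made in the paper.

```r
# Toy stand-in for x_Crohn (hypothetical): counts after adding one to each cell
x_toy <- data.frame(g1 = c(1, 6, 3), g2 = c(11, 1, 8), g3 = c(4, 2, 1))

# Recover the original counts by undoing the +1 offset
X_toy <- as.matrix(x_toy) - 1
colSums(X_toy == 0)                       # number of zeros per genus

# Illustrative alternative (assumption): add a 0.5 pseudo-count instead,
# then close each sample to proportions
x_half <- X_toy + 0.5
x_half_prop <- x_half / rowSums(x_half)
rowSums(x_half_prop)                      # each sample sums to one
```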

1.2.2 High fat high sugar diet in mice

This study, conducted by Dr Lê Cao at the University of Queensland Diamantina Institute, investigated the effect of diet in mice. C57/B6 female black mice were housed in cages (3 animals per cage) and fed either a high fat high sugar (HFHS) diet or a normal diet. Stool samples were collected on Days 0, 1, 4 and 7. The 16S rRNA sequencing data were obtained with Illumina MiSeq and then processed with QIIME 1.9.0. For our analysis, we considered Day 1 only (HFHSday1 data). After OTU filtering, the OTU (Operational Taxonomic Unit) table included 558 taxa and 47 samples (24 HFHS diet and 23 normal diet) (see Table 1.1). Taxonomy information is also available and reported here.

load('./datasets/HFHSday1.RData')

File “HFHSday1.RData” contains three objects:

x_HFHSday1: the abundance table, a matrix of proportions with 47 rows (samples) and 558 columns (OTUs)

class(x_HFHSday1)
## [1] "matrix"
dim(x_HFHSday1)
## [1]  47 558

y_HFHSday1: a factor variable, indicator of diet (HFHS vs. normal)

class(y_HFHSday1)
## [1] "factor"
summary(y_HFHSday1)
##   HFHS Normal 
##     24     23

taxonomy_HFHS: taxonomy table

Note: x_HFHSday1 contains no zeros. Zero imputation was performed on the original abundance matrix.

Table 1.1: A summary of the number of samples and number of taxa in each case study
Crohn data                          HFHSday1 data
No. of genera                 48    No. of OTUs                     558
No. of samples               975    No. of samples                   47
No. of patients with CD      662    No. of mice with HFHS diet       24
No. of healthy patients      313    No. of mice with normal diet     23

References

Gevers, Dirk, Subra Kugathasan, Lee A Denson, Yoshiki Vázquez-Baeza, Will Van Treuren, Boyu Ren, Emma Schwager, et al. 2014. “The Treatment-Naive Microbiome in New-Onset Crohn’s Disease.” Cell Host & Microbe 15 (3). Elsevier: 382–92.

Gonzalez, Antonio, Jose A Navas-Molina, Tomasz Kosciolek, Daniel McDonald, Yoshiki Vázquez-Baeza, Gail Ackermann, Jeff DeReus, et al. 2018. “Qiita: Rapid, Web-Enabled Microbiome Meta-Analysis.” Nature Methods 15 (10). Nature Publishing Group: 796–98.

Le Cessie, Saskia, and Johannes C Van Houwelingen. 1992. “Ridge Estimators in Logistic Regression.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 41 (1). Wiley Online Library: 191–201.

Lin, Wei, Pixu Shi, Rui Feng, and Hongzhe Li. 2014. “Variable Selection in Regression with Compositional Covariates.” Biometrika 101 (4). Oxford University Press: 785–97.

Lu, Jiarui, Pixu Shi, and Hongzhe Li. 2019. “Generalized Linear Models with Linear Constraints for Microbiome Compositional Data.” Biometrics 75 (1). Wiley Online Library: 235–44.

Rivera-Pinto, J, JJ Egozcue, Vera Pawlowsky-Glahn, Raul Paredes, Marc Noguera-Julian, and ML Calle. 2018. “Balances: A New Perspective for Microbiome Analysis.” mSystems 3 (4). Am Soc Microbiol: e00053–18.

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1). Wiley Online Library: 267–88.

Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2). Wiley Online Library: 301–20.