coda4microbiome package provides some tools for the exploratory analysis of pairwise log-ratios. The interpretation of results of log-ratio analysis is challenging because when one taxon A is highly associated with the outcome in the environment, any log-ratio involving taxon A is likely to be associated with Y, no matter which is the second taxon involved in the log-ratio. Here we summarize the importance of each taxon A by aggregating the prediction accuracy of all log-ratios that involve taxon A.
In this document we describe the use of the explore_logratios()
and explore_lr_longitudinal()
.
Function explore_logratios()
explores the association of each pairwise log-ratio with a dependent variable (binary or continuous).
Function explore_lr_longitudinal()
is similar to explore_logratios()
, but adapted to longitudinal studies. For each pairwise log-ratio, it explores the association between the dependent variable (binary or continuous) and a summary of the log-ratio trajectories within a specified interval of time.
Both functions summarize the importance of each variable (taxa) as the aggregation of the association measures of those log-ratios involving the variable. The output includes a plot of the association of the log-ratio with the outcome where the variables (taxa) are ranked by importance.
Zero imputation: The algorithm checks whether the data contains zeros and, if so, it implements a simple zero imputation (see the help of coda4microbiome::impute_zeros()
function). However, the users can impute zeros before using coda4microbiome functions with more advanced methods such as those implemented in the R package zCompositions (Palarea-Albaladejo and MartinFernandez, 2015)
We illustrate the explore_logratios
algorithm with the Crohn dataset (included in the package).
Crohn dataset includes a matrix, x_Crohn
, of microbiome compositions at genus level (48 taxa) for 975 individuals from a Crohn’s disease study (662 patients with Crohn’s disease and 313 controls) and an factor variable, y_Crohn
, indicating if the sample corresponds to a case (CD) or a control (no).
We call function explore_logratios()
with the default parameters:
library(coda4microbiome)
set.seed(123) # to reproduce the results
data(Crohn, package = "coda4microbiome") # load the data
Crohn_logratios<-explore_logratios(x=x_Crohn,y=y_Crohn, measure = "glm")
The algorithm provides a correlation-like plot that provides the association of each pairwise log-ratios with the outcome. The variables composing the log-ratio are indicated and are ranked according to their aggregated association with the outcome:
We can see in the plot that the most important variables are 32, 34, 40, 19, ect. This and the name of the variables can be obtained from the output of the function:
Crohn_logratios$`order of importance`
#> [1] 32 34 40 19 5 33 11 7 24 27 39 9 2 23 8 28 13 45 35 10 17 42 47 12 20
#> [26] 15 48 37 22 1 14 16 44 41 30 36 38 25 6 3 18 26 31 46 29 21 43 4
Crohn_logratios$`name of most important variables`
#> [1] "g__Roseburia" "o__Clostridiales_g__"
#> [3] "g__Aggregatibacter" "g__Dialister"
#> [5] "f__Peptostreptococcaceae_g__" "g__Lachnospira"
#> [7] "g__Coprococcus" "g__Veillonella"
#> [9] "f__Gemellaceae_g__" "o__Lactobacillales_g__"
#> [11] "g__Streptococcus" "g__Eggerthella"
#> [13] "g__Parabacteroides" "g__Actinomyces"
#> [15] "g__Bacteroides"
The output also reports the pair of taxa whose log-ratio is more associated with the outcome:
Crohn_logratios$`max log-ratio`
#> [1] "32" "9"
Crohn_logratios$`names max log-ratio`
#> [1] "g__Roseburia" "g__Eggerthella"
The algorithm also provide the numerical values behind the correlation plot:
Crohn_logratios$`association log-ratio with y`[1:6,1:6]
#> 32 34 40 19 5 33
#> 32 0.0000000 0.6348947 0.7387817 0.7369719 0.6115991 0.6145044
#> 34 0.6348947 0.0000000 0.6746257 0.6773356 0.5067180 0.4956855
#> 40 0.7387817 0.6746257 0.0000000 0.5043483 0.6623288 0.6774900
#> 19 0.7369719 0.6773356 0.5043483 0.0000000 0.6603042 0.6619548
#> 5 0.6115991 0.5067180 0.6623288 0.6603042 0.0000000 0.4972757
#> 33 0.6145044 0.4956855 0.6774900 0.6619548 0.4972757 0.0000000
We consider the sCD14 dataset (included in the package).
sCD14 dataset contains information on the microbiome composition and the inflammatory parameter sCD14 from an HIV study (Noguera-Julian et al. 2016): x_sCD14 is the microbiome abundance matrix for 151 individuals (rows) and 60 genera (columns) and y_sCD14 is a numeric variable with the value of the inflammation parameter sCD14 for each individual.
We call function explore_logratios()
with the default parameters:
set.seed(123) # to reproduce the results
data(sCD14, package = "coda4microbiome") # load the data
sCD14_logratios<-explore_logratios(x=x_sCD14,y=y_sCD14, measure = "glm")
The algorithm provides a correlation-like plot that provides the association of each pairwise log-ratios with the outcome. The variables composing the log-ratio are indicated and are ranked according to their aggregated association with the outcome:
According to the plot and the output of the functions, the most important taxa related to sCD14 are:
sCD14_logratios$`order of importance`
#> [1] 57 36 22 4 52 30 3 16 2 28 49 17 8 42 48 51 41 35 29 12 47 60 58 7 31
#> [26] 40 20 59 21 14 23 25 55 45 44 15 53 39 37 56 46 18 32 38 43 27 10 11 54 50
#> [51] 24 9 13 26 33 1 19 34 5 6
sCD14_logratios$`name of most important variables`
#> [1] "g_Collinsella"
#> [2] "g_Bifidobacterium"
#> [3] "g_Catenibacterium"
#> [4] "f_Lachnospiraceae_g_unclassified"
#> [5] "f_Defluviitaleaceae_g_Incertae_Sedis"
#> [6] "g_Mitsuokella"
#> [7] "g_Bacteroides"
#> [8] "g_Alistipes"
#> [9] "g_Faecalibacterium"
#> [10] "g_Megasphaera"
#> [11] "g_Thalassospira"
#> [12] "g_Subdoligranulum"
#> [13] "g_Blautia"
#> [14] "g_Streptococcus"
#> [15] "g_Brachyspira"
The pair of taxa whose log-ratio is more associated with the sCD14 are:
sCD14_logratios$`max log-ratio`
#> [1] "57" "49"
sCD14_logratios$`names max log-ratio`
#> [1] "g_Collinsella" "g_Thalassospira"
The correlation values between the log-ratios and sCD14 are:
sCD14_logratios$`association log-ratio with y`[1:6,1:6]
#> 57 36 22 4 52 30
#> 57 0.00000000 -0.01638891 -0.04805360 -0.22131762 -0.4218543 -0.06132098
#> 36 0.01638891 0.00000000 -0.02449991 -0.11191744 -0.2886977 -0.03776775
#> 22 0.04805360 0.02449991 0.00000000 -0.07662439 -0.2594952 -0.02134678
#> 4 0.22131762 0.11191744 0.07662439 0.00000000 -0.2849224 0.01634477
#> 52 0.42185434 0.28869770 0.25949522 0.28492243 0.0000000 0.25924450
#> 30 0.06132098 0.03776775 0.02134678 -0.01634477 -0.2592445 0.00000000
The algorithm checks if the outcome is binary or continuous. According to this, for a pair of variables \(A\) and \(B\), different measures of association between the outcome and the pairwise log-ratio \(log(A/B)\) are available:
For binary outcomes:
“AUC”: Discrimination accuracy (AUC) of the log-ratio \(log(A/B)\) for the outcome \(Y\) (default)
“glm”: Discrimination accuracy (AUC) of the predictions of the logistic regression model \(glm(y\)~\(log(A/B)+covar)\) for the outcome \(Y\)
For continuous outcomes:
“Pearson”: Pearson’s correlation coefficient between the outcome \(Y\) and the log-ratio \(log(A/B)\)
“Spearman”: Spearman’s correlation coefficient between the outcome \(Y\) and the log-ratio \(log(A/B)\)
“glm”: Spearman’s correlation coefficient between the outcome \(Y\) and the predictions of the linear regression model \(lm(y\)~\(log(A/B)+covar)\)
Here we illustrate `explore_lr_longitudinal()
with ecam_filtered dataset (included in the package).
ecam_filtered dataset contains microbiome compositions at genus level from the Early childhood and the microbiome (ECAM) study (Bokulich et al. 2016). Metadata and microbiome data were downloaded from https://github.com/caporaso-lab/longitudinal-notebooks.
We used function coda4microbiome::filter_longitudinal()
to filter those individuals and taxa with enough information for time-course profiling. The filtered data contains information on 42 children and 37 taxa.
The dataset contains three objects:
x_ecam
: a microbiome abundance matrix in long format (873 rows) and 37 genera (columns)
taxanames
: a vector containing the taxonomy of the 37 taxa
metadata
: a matrix with information on the individuals at the observation time
We focus on the effects of the diet on the early modulation of the microbiome by comparing microbiome profiles between children with breastmilk diet (bd) vs. formula milk diet (fd) in their first 3 months of life. Among the 42 children, 30 were bd and 12 fd.
We call function explore_lr_longitudinal()
with the default parameters:
set.seed(123) # to reproduce the results
data(ecam_filtered, package = "coda4microbiome") # load the data
x=x_ecam # microbiome abundance
x_time = metadata$day_of_life; # observation times
subject_id = metadata$studyid; # subject id
y= metadata$diet; # diet ("bd"= breast diet, "fd"=formula diet)
ini_time = 0;
end_time = 90;
ecam_logratios<-explore_lr_longitudinal(x,y, x_time, subject_id, ini_time, end_time)
The correlation-like plot provides the association between the outcome (diet) and the summary of each pairwise log-ratios trajectories. The variables composing the log-ratio are indicated and are ranked according to their aggregated association with the outcome:
According to the plot and the output of the functions, the most important taxa related to diet are:
ecam_logratios$`order of importance`
#> [1] 35 7 30 18 26 11 14 2 16 4 9 32 19 20 8 6 37 25 21 12 36 24 10 29 3
#> [26] 27 1 22 23 28 34 31 33 5 15 17 13
ecam_logratios$`name of most important variables`
#> [1] "419" "166" "240" "208" "224" "192" "197" "77" "205" "98" "182" "259"
#> [13] "210" "211" "179"
The pair of taxa whose log-ratio is more associated with the diet are:
ecam_logratios$`max log-ratio`
#> [1] "35" "30"
taxanames[as.numeric(ecam_logratios$`max log-ratio`)]
#> [1] "k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pasteurellales;f__Pasteurellaceae;g__Haemophilus"
#> [2] "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Veillonellaceae;g__Veillonella"
We can use function plot_signature_curves()
to see the log-ratio trajectories for the pair of variables with maximum discrimination.
plot_signature_curves(c(30,35), c(1,-1),x,y, x_time, subject_id, ini_time, end_time)
We observe that the relative abundance of g_Veillonella to g_Haemophilus quickly increased during their first month of life for those children under formula-milk diet (blue), while for breast-milk fed children, this relative abundance is quite stable.
The algorithm checks if the outcome is binary or continuous and, for each pair of variables \(A\) and \(B\), it provides the following measures of association between the outcome and the summary of the pairwise log-ratio trajectories, \(S(A/B)\):
For binary outcomes:
For continuous outcomes: