Detecting significant components of microbiomes by random forest with forward variable selection and phylogenetics

Published: Oct. 30, 2020, 2:02 p.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.29.361360v1?rss=1

Authors: Dang, T., Kishino, H.

Abstract: A central focus of microbiome studies is the characterization of differences in microbiome composition across groups of samples. A major challenge is the high dimensionality of microbiome datasets, which significantly reduces the power of current approaches for identifying true differences and increases the chance of false discoveries. We have developed a new framework to address these issues by combining three steps: (i) identifying a few significant features with a massively parallel forward variable selection procedure, (ii) mapping the selected species onto a phylogenetic tree, and (iii) predicting functional profiles by functional gene enrichment analysis from metagenomic 16S rRNA data. We demonstrated the performance of the proposed approach by analyzing two published datasets from large-scale case-control studies: (i) 16S rRNA gene amplicon data for Clostridioides difficile infection (CDI) and (ii) shotgun metagenomics data for human colorectal cancer (CRC). The proposed approach improved the accuracy from 81% to 99.01% for CDI and from 75.14% to 90.17% for CRC. We identified a core set of 96 species that were significantly enriched in CDI and a core set of 75 species that were enriched in CRC. Moreover, although the quality of the data differed between the functional profiles predicted from the 16S rRNA dataset and those from functional metagenome profiling, our approach performed well for both databases and detected the main functions that can be used to diagnose diseases and to further study their growth stages.

Copyright belongs to the original authors. Visit the link for more info.
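To make step (i) of the abstract more concrete, here is a minimal sketch of greedy forward variable selection. It is not the authors' implementation. The function names (`forward_select`, `loo_centroid_accuracy`), the stopping rule (`min_gain`), and the toy data are all assumptions made for illustration. In particular, the scorer below is a simple leave-one-out nearest-centroid accuracy that stands in for the random forest accuracy the paper would use; any classifier-based score could be plugged in instead.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def loo_centroid_accuracy(X: Matrix, y: Sequence[str], features: List[int]) -> float:
    """Toy stand-in scorer (assumption): leave-one-out nearest-centroid accuracy
    computed on the chosen feature columns only. A random forest accuracy
    estimate would take this role in the framework described above."""
    correct = 0
    for i in range(len(X)):
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue  # hold out sample i
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1

        def sq_dist(label: str) -> float:
            # Squared distance from the held-out sample to a class centroid.
            return sum(
                (X[i][f] - sums[label][k] / counts[label]) ** 2
                for k, f in enumerate(features)
            )

        predicted = min(sorted(sums), key=sq_dist)
        correct += predicted == y[i]
    return correct / len(X)


def forward_select(
    X: Matrix,
    y: Sequence[str],
    score: Callable[[Matrix, Sequence[str], List[int]], float],
    max_features: int,
    min_gain: float = 1e-9,
) -> Tuple[List[int], float]:
    """Greedily add the feature that most improves the score; stop when the
    improvement falls below min_gain (hypothetical stopping rule)."""
    selected: List[int] = []
    best_score = 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        # Candidates are scored independently of each other, so this is the
        # step that could be distributed across many workers.
        trials = [(score(X, y, selected + [f]), -f) for f in remaining]
        trial_score, neg_f = max(trials)  # ties go to the lowest feature index
        if trial_score - best_score < min_gain:
            break
        selected.append(-neg_f)
        remaining.remove(-neg_f)
        best_score = trial_score
    return selected, best_score


# Made-up toy data: column 1 separates the groups, columns 0 and 2 are noise.
X = [
    [0.5, 1.0, 0.3],
    [0.1, 1.2, 0.9],
    [0.9, 0.9, 0.5],
    [0.4, 3.0, 0.4],
    [0.2, 3.1, 0.8],
    [0.8, 2.9, 0.2],
]
y = ["control", "control", "control", "case", "case", "case"]

selected, best_score = forward_select(X, y, loo_centroid_accuracy, max_features=2)
print(selected, best_score)
```

On this toy input the procedure keeps only the informative column and stops, since adding a noise column cannot raise the score further. Passing the scorer as a callable keeps the selection loop separate from the choice of classifier, which is one plausible way to swap in a random forest.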