Optimizing the use of gene expression data to predict metabolic pathway memberships with unsupervised and supervised machine learning

Published: July 16, 2020, 8:21 p.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.07.15.204222v1?rss=1

Authors: Wang, P., Moore, B. M., Uygun, S., Lehti-Shiu, M. D., Barry, C., Shiu, S.-H.

Abstract: Plants produce diverse metabolites via enzymes in metabolic pathways important for plant survival, human nutrition and medicine. However, most plant enzyme genes are of unknown pathway membership. While some genes in the same pathways can be identified based on correlated expression, such correlation may exist only under specific spatiotemporal and conditional contexts. By considering 656 combinations of tomato gene expression datasets calculated with eight co-expression measures, we evaluated the performance of naive prediction (based on expression similarities to pathways), unsupervised and supervised learning methods in predicting memberships in 85 metabolic pathways. We found that optimal predictions for different pathways require different datasets, which tend to be associated with the biological processes related to the pathway functions. In addition, naive prediction performs significantly worse than the machine learning methods. Interestingly, the unsupervised learning approach performs better than the supervised approach in 52 pathways, which may be attributed to the need for more data with supervised learning. Furthermore, machine learning clustering/models using gene-to-pathway expression similarities outperform those using gene expression profiles. Altogether, our study highlights the need to extensively explore expression-based features to maximize the utility of expression data for pinpointing pathway membership. Through this detailed exploration, novel connections between pathways and biological processes can also be identified based on the optimal expression dataset used, improving our mechanistic understanding of the metabolic network.

Copyright belongs to the original authors. Visit the link for more information.