Fertility-LightGBM: A fertility-related protein prediction model by multi-information fusion and light gradient boosting machine

Published: Aug. 24, 2020, 9:01 p.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.24.264325v1?rss=1

Authors: Yue, L., Wang, M., Yang, X., Han, Y., Song, L., Yu, B.

Abstract: The identification of fertility-related proteins plays an essential part in understanding the embryogenesis of germ cell development. Because traditional experimental methods for identifying fertility-related proteins are expensive and time-consuming, computational methods that predict protein function from amino acid sequences have emerged. In this paper, we propose a fertility-related protein prediction model. First, the model combines protein physicochemical property information, evolutionary information, and sequence information to construct the initial feature space 'ALL'. Then, the least absolute shrinkage and selection operator (LASSO) is used to remove redundant features. Finally, a light gradient boosting machine (LightGBM) is used as the classifier. The 5-fold cross-validation accuracy of the training dataset is 88.5%, and the independent accuracy of the training dataset is 91.5%. The results show that our model is competitive for the prediction of fertility-related proteins, which is helpful for the study of fertility diseases and related drug targets.

Copyright belongs to the original authors. Visit the link for more info.
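To make the evaluation protocol in the abstract concrete, here is a minimal, standard-library-only sketch of two of its ideas: fusing several feature groups into one vector per protein, and estimating accuracy with 5-fold cross-validation. It is not the authors' code. The toy data is synthetic, the helper names (`fuse`, `k_fold_indices`, `cross_val_accuracy`) are hypothetical, the LASSO selection step is omitted, and a nearest-centroid classifier stands in for LightGBM purely to keep the example dependency-free.

```python
import random
from statistics import mean

def fuse(*feature_groups):
    """Concatenate per-protein feature vectors from several descriptor groups."""
    return [sum((list(v) for v in vectors), []) for vectors in zip(*feature_groups)]

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(X, y):
    """Stand-in classifier (NOT LightGBM): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean)."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )

def cross_val_accuracy(X, y, k=5, seed=0):
    """Mean held-out accuracy over k folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return mean(scores)

# Synthetic toy data (assumption): 40 "proteins", two classes, two feature groups.
rng = random.Random(42)
labels = [i % 2 for i in range(40)]
physchem = [[rng.gauss(3.0 * t, 1.0), rng.gauss(0.0, 1.0)] for t in labels]
evolution = [[rng.gauss(-3.0 * t, 1.0)] for t in labels]

X_all = fuse(physchem, evolution)  # fused feature space, 3 values per protein
accuracy = cross_val_accuracy(X_all, labels, k=5)
print(f"5-fold CV accuracy on toy data: {accuracy:.3f}")
```

In a real setting, the stand-in classifier would be replaced by a gradient boosting model and a feature selection step would be fitted inside each training fold, so that the held-out fold never influences which features are kept.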