DIMA: Data-driven selection of a suitable imputation algorithm

Published: Oct. 14, 2020, 9:02 p.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.13.323618v1?rss=1

Authors: Egert, J., Warscheid, B., Kreutz, C.

Abstract:

Motivation: Imputation is a prominent strategy when dealing with missing values (MVs) in proteomics data analysis pipelines. However, the performance of different imputation methods is difficult to assess and varies strongly depending on data characteristics. To overcome this issue, we present the concept of a data-driven selection of a suitable imputation algorithm (DIMA).

Results: The performance and broad applicability of DIMA are demonstrated on 121 quantitative proteomics data sets from the PRIDE database and on simulated data consisting of 5-50% MVs with different proportions of missing not at random and missing completely at random values. DIMA reliably suggests a high-performing imputation algorithm, which is always among the three best algorithms and results in a root mean square error difference (ΔRMSE) <10% in 84% of the cases.

Availability and Implementation: Source code is freely available for download at https://github.com/clemenskreutz/OmicsData.

Copyright belongs to the original authors. Visit the link for more information.
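
To make the idea of a data-driven selection concrete, here is a minimal Python sketch. It is not the DIMA implementation from the OmicsData repository, and the abstract does not spell out the procedure; the sketch assumes one plausible reading: hide a fraction of the observed values, impute them with each candidate method, and keep the method with the lowest RMSE on the hidden entries. The function names, the three simple candidate imputers, and the purely random masking are all illustrative assumptions (the paper additionally considers missing not at random values, which this sketch does not model).

```python
import numpy as np

# Illustrative candidate imputers (assumptions, not the algorithms compared in the paper).
def impute_mean(x):
    """Replace each missing value with the mean of its column."""
    return np.where(np.isnan(x), np.nanmean(x, axis=0), x)

def impute_median(x):
    """Replace each missing value with the median of its column."""
    return np.where(np.isnan(x), np.nanmedian(x, axis=0), x)

def impute_min(x):
    """Replace each missing value with the minimum of its column."""
    return np.where(np.isnan(x), np.nanmin(x, axis=0), x)

CANDIDATES = {"mean": impute_mean, "median": impute_median, "min": impute_min}

def rmse(truth, estimate, mask):
    """Root mean square error restricted to the entries selected by mask."""
    diff = truth[mask] - estimate[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def select_imputation(data, candidates=CANDIDATES, frac=0.1, seed=0):
    """Hide a random fraction of observed values, impute, and rank by RMSE.

    Returns the name of the best candidate and a dict of all RMSE scores.
    The masking here is completely at random, which is a simplification.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(data)
    hide = observed & (rng.random(data.shape) < frac)
    masked = data.copy()
    masked[hide] = np.nan
    scores = {name: rmse(data, fn(masked), hide) for name, fn in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Usage on synthetic data (hypothetical; real input would be a protein intensity matrix).
demo_rng = np.random.default_rng(1)
demo = demo_rng.normal(loc=20.0, scale=1.0, size=(200, 6))
demo[demo_rng.random(demo.shape) < 0.1] = np.nan  # 10% values missing at random
best_name, rmse_scores = select_imputation(demo)
print(best_name, rmse_scores)
```

Because the hidden entries have known true values, the RMSE comparison needs no external ground truth, which is what makes the selection data-driven. A fuller version would mask values in a way that mimics the missingness pattern of the data set at hand.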