Probability-based methods for outlier detection in replicated high-throughput biological data

Published: Aug. 7, 2020, 12:03 a.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.07.240473v1?rss=1

Authors: Smith, M. S., Devarajan, K.

Abstract: A problem that arises frequently in high-throughput biological studies is the assessment of technical reproducibility of data obtained under homogeneous experimental conditions. This is an important problem considering the significant growth in the number of high-throughput technologies that have become available to the researcher in the past decade. Although certain ad hoc and graphical methods for determining data quality have been in existence, these methods lack statistical rigor and are not broadly applicable across different technologies. There is an inherent need for the quantitative evaluation of the reproducibility of technical replicates from high-throughput compound and siRNA screening, next-generation sequencing and other modern "omics" studies. To this end, we have developed an approach that accounts for technical variability and potential asymmetry that arise naturally in the distribution of replicate data, and aids in the identification of outliers. Our methods for outlier detection rely on flexible statistical models, employ maximum likelihood methods for estimation and are broadly applicable to a variety of high-throughput biological studies. We discuss an adaptation of these methods when there are multiple replicates as well as current limitations of these techniques. We illustrate our methods using experimental data from high-throughput compound screening and protein expression studies as well as simulated data. Our methods are implemented in the R package replicateOutliers and are currently available at github.com/matthew-seth-smith/replicateOutliers.

Copyright belongs to the original authors. Visit the link for more info.
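
To make the abstract's idea of likelihood-based outlier detection concrete, here is a minimal Python sketch. It is not the authors' method and does not use the replicateOutliers package. It assumes, purely for illustration, that the difference between two replicates follows a zero-centred asymmetric Laplace (two-sided exponential) distribution, with separate scales for the right and left tails. With the location fixed at zero, the maximum likelihood estimates have a closed form. The function names, the choice of model and the threshold alpha are all hypothetical.

```python
import math


def fit_asymmetric_laplace(diffs):
    """Closed-form ML fit of a zero-centred asymmetric Laplace model.

    Assumed density (illustrative, not taken from the paper):
        f(d) = exp(-d / a) / (a + b)   for d >= 0
        f(d) = exp( d / b) / (a + b)   for d <  0
    Returns the right-tail scale a and the left-tail scale b.
    """
    n = len(diffs)
    s_pos = sum(d for d in diffs if d > 0)    # total size of positive differences
    s_neg = sum(-d for d in diffs if d < 0)   # total size of negative differences
    if s_pos == 0 or s_neg == 0:
        raise ValueError("need both positive and negative differences")
    root_pos, root_neg = math.sqrt(s_pos), math.sqrt(s_neg)
    total = root_pos + root_neg
    # Setting both partial derivatives of the log-likelihood to zero gives:
    return root_pos * total / n, root_neg * total / n


def tail_probability(d, a, b):
    """Probability of a difference at least as extreme as d, on the same side of zero."""
    if d >= 0:
        return a / (a + b) * math.exp(-d / a)
    return b / (a + b) * math.exp(d / b)


def flag_outliers(rep1, rep2, alpha=0.01):
    """Return indices whose replicate difference has fitted tail probability below alpha."""
    diffs = [x - y for x, y in zip(rep1, rep2)]
    a, b = fit_asymmetric_laplace(diffs)
    return [i for i, d in enumerate(diffs) if tail_probability(d, a, b) < alpha]


if __name__ == "__main__":
    # Made-up replicate measurements, for demonstration only.
    rep1 = [1.1, 0.8, 1.3, 0.9, 1.2, 0.7, 1.1, 0.9, 6.0]
    rep2 = [1.0] * 9
    print(flag_outliers(rep1, rep2))
```

Because the scales are estimated from all of the data, large outliers inflate the fitted tails and can mask one another. The sketch is a starting point only, not a substitute for the models described in the paper.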