Information-theory-based benchmarking and feature selection algorithm improve cell type annotation and reproducibility of single cell RNA-seq data analysis pipelines

Published: Nov. 4, 2020, 10:01 p.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.02.365510v1?rss=1

Authors: Ren, Z., Gerlach, M., Shi, H., Misharin, A. V., Budinger, G. S., Amaral, L. A. N.

Abstract: Single cell RNA sequencing (scRNA-seq), which promises to enable the quantitative study of biological processes at the single cell level, is now a routine part of experimental practice. One key computational challenge in analyzing scRNA-seq data is cell type annotation. A source of concern, however, is that the data analysis protocols for clustering cells suffer from low reproducibility and poorly quantified accuracy. Here, we propose a new benchmark for determining clustering accuracy that uses a dataset where independent reference annotations are generated from surface protein measurements. We then systematically investigate the impact on labelling accuracy of different approaches to feature selection, of different clustering algorithms, and of different sets of parameter values. We demonstrate quantitatively the impact of feature selection and the poor performance of a widely used approach. Additionally, we show that an approach grounded in information theory can provide a generalizable, reliable, and accurate process for discarding uninformative features.

Copyright belongs to the original authors. Visit the link for more info.
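
To make the benchmarking idea in the abstract concrete, here is a minimal sketch of scoring a clustering against independent reference annotations. The abstract does not say which accuracy measure the authors use, so normalized mutual information (NMI, arithmetic-mean normalization) is an assumption chosen only for illustration. The function names and the toy labels are hypothetical and do not come from the paper.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same cells."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    count_ab = Counter(zip(a, b))
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )


def normalized_mutual_information(reference, predicted):
    """NMI = 2 * I(ref; pred) / (H(ref) + H(pred)); 1 means identical partitions.

    Assumption: this is an illustrative score, not necessarily the measure
    used in the paper.
    """
    if len(reference) != len(predicted):
        raise ValueError("both labelings must cover the same cells")
    h_ref, h_pred = entropy(reference), entropy(predicted)
    if h_ref == 0 and h_pred == 0:
        # Both labelings put every cell in a single group.
        return 1.0
    return 2 * mutual_information(reference, predicted) / (h_ref + h_pred)


# Hypothetical toy data: reference labels (e.g. from surface proteins)
# versus cluster ids produced by some clustering pipeline.
reference = ["T cell", "T cell", "B cell", "B cell"]
matching_clusters = [1, 1, 0, 0]    # same partition, different names
unrelated_clusters = [0, 1, 0, 1]   # carries no information about reference

score_matching = normalized_mutual_information(reference, matching_clusters)
score_unrelated = normalized_mutual_information(reference, unrelated_clusters)
```

Because the score depends only on how cells are grouped, not on the label names, cluster ids never need to be matched to cell type names before scoring. A chance-corrected variant such as adjusted mutual information may be preferable when the number of clusters varies between pipelines.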