PCQC: Selecting optimal principal components for identifying clusters with highly imbalanced class sizes in single-cell RNA-seq data

Published: Nov. 20, 2020, 8:02 p.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.19.390542v1?rss=1

Authors: Burstein, D., Fullard, J., Roussos, P.

Abstract: Prior to identifying clusters in single-cell gene expression experiments, selecting the top principal components is a critical step for filtering out noise in the data set. Identifying these top principal components typically focuses on the total variance explained, and principal components that explain small clusters from rare populations will not necessarily capture a large percentage of variance in the data. We present a computationally efficient alternative for identifying the optimal principal components based on the tails of the distribution of variance explained for each observation. We then evaluate the efficacy of our approach in three different single-cell RNA-sequencing data sets and find that our method matches, or outperforms, other selection criteria that are typically employed in the literature.

Availability and implementation: pcqc is written in Python and available at github.com/RoussosLab/pcqc

Copyright belongs to the original authors. Visit the link for more information.
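To make the idea in the abstract concrete, the following is a minimal sketch in Python, not the pcqc implementation and not its API. It assumes one plausible reading of "the tails of the distribution of variance explained for each observation": for each cell, compute the fraction of its squared distance from the mean that each principal component accounts for, then summarize each component by an upper quantile of those per-cell fractions. The function name `per_cell_variance_tails`, the 0.99 quantile, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def per_cell_variance_tails(X, n_components=10, quantile=0.99):
    """Illustrative only: per-observation variance explained by each PC.

    X is a (cells x genes) matrix. Returns
      frac         : (cells x k) fraction of each cell's squared norm per PC
      tails        : (k,) upper quantile of frac for each PC
      global_ratio : (k,) usual fraction of total variance explained per PC
    """
    Xc = X - X.mean(axis=0)                      # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, len(S))
    scores = U[:, :k] * S[:k]                    # PC scores for each cell
    total = (Xc ** 2).sum(axis=1)                # squared norm of each cell
    total = np.where(total > 0, total, 1.0)      # guard against division by zero
    frac = scores ** 2 / total[:, None]
    tails = np.quantile(frac, quantile, axis=0)  # tail summary per component
    global_ratio = S[:k] ** 2 / (S ** 2).sum()
    return frac, tails, global_ratio


# Synthetic example (assumed data): 1000 cells, 50 genes, with a rare
# population of 10 cells shifted along the first 5 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
X[:10, :5] += 6.0

frac, tails, global_ratio = per_cell_variance_tails(X, n_components=5)
for i, (g, t) in enumerate(zip(global_ratio, tails), start=1):
    print(f"PC{i}: global variance ratio = {g:.3f}, tail of per-cell fraction = {t:.3f}")
```

Comparing the two printed columns shows the contrast the abstract draws: the global ratio averages over all cells, whereas the tail statistic looks only at the cells for which a component matters most. How pcqc actually defines and thresholds these quantities should be taken from the paper and repository rather than from this sketch.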