Symbiont-Screener: a reference-free filter to automatically separate host sequences and contaminants for long reads or co-barcoded reads by unsupervised clustering

Published: Oct. 26, 2020, 9:02 p.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.26.354621v1?rss=1

Authors: Xu, M., Guo, L., Shi, C., Liu, X., Chen, J., Liu, X., Fan, G.

Abstract: Decontamination is necessary to eliminate the effect of foreign genomes on symbiont studies and biomedical discoveries. However, directly extracting host sequencing reads without references remains challenging. Here, we present a trio-based method that classifies host reads among error-prone long reads or sparse co-barcoded reads prior to assembly, free of any alignment against DNA or protein references. The method first identifies high-confidence host reads using haplotype-specific k-mers inherited from the parents, and then groups the remaining host reads by unsupervised clustering. In experiments on an artificially mixed dataset, this approach classified up to 97.38% of the host human long reads with a precision of 99.9999%, and 79.95% of the host co-barcoded reads with a precision of 98.36%. The tool also performed well in decontaminating real algae data. The purified reads reconstructed two haplotypes and improved the assembly, giving a larger contig NGA50 and fewer misassemblies. Symbiont-Screener can be freely downloaded at https://github.com/BGI-Qingdao/Symbiont-Screener.

Copyright belongs to the original authors. Visit the link for more info.
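
To make the first step of the abstract concrete, here is a minimal Python sketch of seeding host reads with parent-specific k-mers. It is an illustration under simplifying assumptions, not Symbiont-Screener's actual implementation: the function names, the `min_density` threshold, and the tiny value of k are all hypothetical, and a real pipeline would typically use a dedicated k-mer counter, canonical k-mers, and frequency filtering to cope with sequencing errors. The subsequent unsupervised clustering of the remaining reads is not shown.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(paternal_reads, maternal_reads, k):
    """Return (paternal-only, maternal-only) k-mer sets.

    Toy stand-in for haplotype-specific k-mers: a k-mer counts as
    specific if it occurs in one parent's reads and not the other's.
    """
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    return pat - mat, mat - pat


def marker_density(read, markers, k):
    """Fraction of k-mer positions in the read that hit a marker k-mer."""
    total = len(read) - k + 1
    if total <= 0:
        return 0.0  # read shorter than k carries no evidence
    hits = sum(1 for i in range(total) if read[i:i + k] in markers)
    return hits / total


def seed_host_reads(reads, pat_only, mat_only, k, min_density=0.05):
    """Keep reads whose marker density reaches the (assumed) threshold."""
    markers = pat_only | mat_only
    return [r for r in reads if marker_density(r, markers, k) >= min_density]
```

For example, with `k=4`, paternal reads `["ACGTACGTTT"]`, and maternal reads `["ACGTACGAAA"]`, the read `"TACGTTTC"` carries two paternal-only k-mers and is kept as a seed, while an unrelated read such as `"GGGGCCCCGG"` carries none and is left for the clustering step.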