A Python-based optimization framework for high-performance genomics

Published: Oct. 30, 2020, 3:03 p.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.29.361402v1?rss=1

Authors: Shajii, A. R., Numanagic, I., Leighton, A. T., Greenyer, H., Amarasinghe, S., Berger, B.

Abstract: Exponentially-growing next-generation sequencing data requires high-performance tools and algorithms. Nevertheless, the implementation of high-performance computational genomics software is inaccessible to many scientists because it requires extensive knowledge of low-level software optimization techniques, forcing scientists to resort to high-level software alternatives that are less efficient. Here, we introduce Seq, a Python-based optimization framework that combines the power and usability of high-level languages like Python with the performance of low-level languages like C or C++. Seq allows for shorter, simpler code, is readily usable by a novice programmer, and obtains significant performance improvements over existing languages and frameworks. We showcase and evaluate Seq by implementing seven standard, widely-used applications from all stages of the genomics analysis pipeline, including genome index construction, finding maximal exact matches, long-read alignment and haplotype phasing, and demonstrate its implementations are up to an order of magnitude faster than existing hand-optimized implementations, with just a fraction of the code. By enabling researchers of all backgrounds to easily implement high-performance analysis tools, Seq further opens the door to the democratization and scalability of computational genomics.

Copyright belongs to the original authors. Visit the link for more info.