ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses

Published: Oct. 14, 2020, 12:03 a.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.13.336479v1?rss=1

Authors: Pavlovikj, N., Gomes-Neto, J. C., Deogun, J. S., Benson, A. K.

Abstract: Whole Genome Sequence (WGS) data from bacterial species is used for a variety of applications ranging from basic microbiological research to diagnostics and epidemiological surveillance. The availability of WGS data from hundreds of thousands of individual isolates of a given microbial species presents a tremendous opportunity for discovery and hypothesis-generating research, but that opportunity is limited by the scalability and user-friendliness of existing pipelines for population-scale inquiry. Here, we present ProkEvo, an automated, scalable, and open-source framework for bacterial population genomics analyses using WGS data. ProkEvo was specifically developed to achieve the following goals: 1) Automating and scaling the computational analysis of many thousands of bacterial genomes starting from raw Illumina paired-end reads; 2) Using workflow management systems (WMS) such as Pegasus WMS to ensure reproducibility, scalability, modularity, fault-tolerance, and robust file management throughout the process; 3) Utilizing high-performance and high-throughput computational platforms; 4) Generating population-based genotypic analysis at different levels of resolution using the core-genome as an input, and allelic-based or Bayesian statistical tools as classification methods; and 5) Detecting antimicrobial resistance (AMR) genes using various databases, as well as putative virulence factors and plasmids, and producing pan-genome annotations and data compilations that can be used for further analysis. The scalability of ProkEvo is shown using two datasets that differ greatly in size: one with ~2,400 genomes, and a second, an order of magnitude larger, with ~23,000 genomes.
Because of its modularity, the running time of ProkEvo ranged from ~3 to 26 days depending on the dataset and the computational platform used. However, if all ProkEvo steps were run sequentially, the running time would have ranged from ~3 months to 13 years. While the running time depends on multiple factors, there is a significant advantage to using such a scalable, parallelizable, and automated pipeline. ProkEvo can be used with virtually any bacterial species, and the Pegasus WMS enables easy addition or removal of programs from the workflow, or modification of options within them. To show this, we used ProkEvo with three important serovars of the foodborne pathogen Salmonella enterica, as well as Campylobacter jejuni and Staphylococcus aureus. Each of these three pathogens uses a different MLST scheme, and the program SISTR, which among its many functions makes cgMLST calls, was applied only to the S. enterica serovars. All the dependencies of ProkEvo can be distributed via a conda environment or a Docker image. To demonstrate ProkEvo's applicability, we carried out a population-based analysis along with an analysis of the distribution of antimicrobial-associated resistance loci across datasets, and showed how to combine phylogenies with metadata using reproducible Python and R scripts. Collectively, our study shows that ProkEvo presents a viable option for scaling and automating analyses of bacterial populations, with direct applications for basic microbiology research, clinical microbiological diagnostics, and epidemiological surveillance.

Copyright belongs to the original authors. Visit the link for more information.
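
As a rough illustration of the running times quoted in the abstract, the Python sketch below converts the reported figures to days and computes the implied speedup of the parallel workflow over a purely sequential run. It rests on two assumptions that are not stated in the abstract: that the lower ends of the two ranges (~3 days and ~3 months) describe the same run, and likewise the upper ends (~26 days and ~13 years); and that a month is 30 days and a year 365 days. The function name `speedup` is made up for this example.

```python
# Rough conversions (assumption: 30-day months, 365-day years).
DAYS_PER_MONTH = 30
DAYS_PER_YEAR = 365

# Running times quoted in the abstract, in days.
parallel_days = (3, 26)  # "~3-26 days" with the workflow
sequential_days = (3 * DAYS_PER_MONTH, 13 * DAYS_PER_YEAR)  # "~3 months to 13 years"


def speedup(sequential: float, parallel: float) -> float:
    """Return how many times faster the parallel run is than the sequential one."""
    return sequential / parallel


# Assumption: low ends pair with each other, and high ends pair with each other.
low = speedup(sequential_days[0], parallel_days[0])
high = speedup(sequential_days[1], parallel_days[1])

print(f"Implied speedup: roughly {low:.0f}x to {high:.0f}x")
```

Under these assumptions the implied speedup is on the order of tens to hundreds of times. The actual figure for any given run depends on the dataset and the computational platform, as the abstract notes.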