MQF and buffered MQF: Quotient filters for efficient storage of k-mers with their counts and metadata

Published: Aug. 24, 2020, 6:03 p.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.23.263061v1?rss=1 Authors: Shokrof, M., Brown, C. T., Mansour, T. A. Abstract: Background: Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. The counting quotient filter (CQF), a compact hashtable, can efficiently store k-mers with a skewed distribution. Result: Here, we present the mixed-counters quotient filter (MQF) as a new variant of the CQF with novel counting and labeling systems. The new counting system adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios. A buffered version of the MQF can offload storage to disk, trading speed of insertions and queries for a significant memory reduction. The labeling system provides a flexible framework for assigning labels to member items while maintaining good data locality and a concise memory representation. These labels serve as a minimal perfect hash function but are ~10 fold faster than BBhash, with no need to re-analyze the original data for further insertions or deletions. Conclusion: The MQF is a flexible and efficient data structure that extends our ability to work with high throughput sequencing data. Copy rights belong to original authors. Visit the link for more info