Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships

Published: Aug. 12, 2020, 1:01 a.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.11.245928v1?rss=1

Authors: Huber, F., Ridder, L., Rogers, S., van der Hooft, J. J.

Abstract: Spectral similarity is used as a proxy for structural similarity in many tandem mass spectrometry (MS/MS) based metabolomics analyses, such as library matching and molecular networking, on the assumption that similar spectra indicate similar structures. Although weaknesses in the relationship between common spectral similarity scores and true structural similarity have been pointed out, little development of alternative scores has been undertaken. Here, we introduce Spec2Vec, a novel spectral similarity score inspired by a natural language processing algorithm, Word2Vec. Where Word2Vec learns relationships between words in sentences, Spec2Vec does so for mass fragments and neutral losses in MS/MS spectra. The spectral similarity score is based on spectral embeddings learnt from the fragmental relationships within a large set of spectral data. Using a dataset derived from GNPS MS/MS libraries, including spectra for nearly 13,000 unique molecules, we show that Spec2Vec scores correlate better with the structural similarity of molecules than the commonly used cosine score and its derivative, the modified cosine score. We also demonstrate the advantages of Spec2Vec in library searching, for both exact matches and analogues, as well as in molecular networking. Finally, Spec2Vec is computationally more scalable, allowing us to search for structural analogues in a large database within seconds.

Copyright belongs to the original authors. Visit the link for more information.
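The idea of treating mass fragments and neutral losses as "words" can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the token format (`peak@...`, `loss@...`), the helper names, the intensity-weighted sum used to build a spectrum embedding, and the tiny hand-written `TOY_VECTORS` table are all hypothetical. In practice the word vectors would be learnt by a Word2Vec model trained on a large set of spectra.

```python
import math


def spectrum_to_document(peaks, precursor_mz, decimals=2):
    """Turn (m/z, intensity) peaks into weighted 'words'.

    Each peak yields a fragment word and, if the peak lies below the
    precursor m/z, a neutral-loss word (precursor m/z minus fragment m/z).
    """
    words = []
    for mz, intensity in peaks:
        words.append((f"peak@{mz:.{decimals}f}", intensity))
        loss = precursor_mz - mz
        if loss > 0:
            words.append((f"loss@{loss:.{decimals}f}", intensity))
    return words


def embed(document, word_vectors):
    """Intensity-weighted sum of word vectors (assumed weighting scheme).

    Words missing from the vocabulary are skipped.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word, weight in document:
        vector = word_vectors.get(word)
        if vector is None:
            continue
        for i, value in enumerate(vector):
            total[i] += weight * value
    return total


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written vectors standing in for a trained model.
TOY_VECTORS = {
    "peak@182.00": [1.0, 0.0, 0.0],
    "peak@232.00": [0.0, 1.0, 0.0],
    "loss@18.00": [0.0, 0.0, 1.0],
    "loss@100.00": [0.5, 0.5, 0.0],
}

# Two made-up spectra with different precursor masses.
doc_a = spectrum_to_document([(182.0, 1.0), (100.0, 0.5)], precursor_mz=200.0)
doc_b = spectrum_to_document([(232.0, 1.0), (150.0, 0.4)], precursor_mz=250.0)

similarity = cosine(embed(doc_a, TOY_VECTORS), embed(doc_b, TOY_VECTORS))
print(f"embedding similarity: {similarity:.3f}")
```

The two made-up spectra share no fragment m/z values, so a score that only compares peaks at matching m/z finds nothing in common. Their embeddings still overlap because both spectra contain the same neutral-loss words.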