Exploring protein sequence and functional spaces using adversarial autoencoder

Published: Nov. 10, 2020, 8:02 p.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.09.375311v1?rss=1

Authors: Bitard-Feildel, T.

Abstract: Shedding light on the relationship between protein sequences and their functions is a challenging task with implications for our understanding of protein evolution, diseases, and protein design. However, due to its complexity, protein sequence space is hard to comprehend and its study is prone to numerous potential human biases. Generative models able to learn and recreate the specificity of the data can help to decipher complex systems. Applied to protein sequences, they can help point out relationships between protein positions and functions, or capture the different sequence patterns associated with functions. In this study, an unsupervised generative approach based on autoencoders (AE) is proposed to generate and explore new protein sequences with respect to their functions. AEs are tested on three protein families known for their multiple functions, one of which has manually curated annotations. Functional labelling of the encoded sequences on the two-dimensional latent space computed by the AE for each family shows that the latent space captures the functional organization and specificity of the sequences. Furthermore, arithmetic between latent spaces and latent space interpolations between encoded sequences are tested as ways to generate new intermediate protein sequences that share sequence and functional properties with sequences drawn from families with different sequences and functions. Using structural homology modelling and assessment, it can be observed that the new protein sequences generated using latent space arithmetic display intermediate physico-chemical properties and energies relative to the sequences of the families used to generate them.
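As a rough illustration of what latent space arithmetic can look like, the sketch below shifts one encoded sequence by the offset between the centroids of two groups of encoded sequences. This is only one plausible reading of "arithmetic between latent spaces" and is not taken from the paper or its repository. The function name `latent_shift` and the synthetic two-dimensional points are hypothetical; in practice the points would come from a trained encoder, and the shifted point would be passed to the decoder to obtain a sequence.

```python
import numpy as np


def latent_shift(z_query, z_source_group, z_target_group):
    """Shift z_query by the offset between two group centroids.

    Hypothetical helper: each group is an array of shape (n_sequences, latent_dim)
    holding latent coordinates of sequences sharing a function.
    """
    offset = z_target_group.mean(axis=0) - z_source_group.mean(axis=0)
    return z_query + offset


# Synthetic stand-ins for encoded sequences (assumption: 2-D latent space).
rng = np.random.default_rng(0)
z_source = rng.normal(loc=[-2.0, 0.0], scale=0.1, size=(50, 2))
z_target = rng.normal(loc=[2.0, 1.0], scale=0.1, size=(50, 2))

# Move the first source-group point towards the target group.
z_query = z_source[0]
z_new = latent_shift(z_query, z_source, z_target)
```

The shifted point keeps its position relative to its own group centroid, which is one way to aim for a sequence that retains features of its origin while acquiring features of the target group.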
Finally, protein sequences interpolated between data points of the input data set show the ability of the AE to generalize smoothly and to produce biologically meaningful sequences from uncharted areas of the latent space. Code and data used for this study are freely available at https://github.com/T-B-F/aae4seq.

Copyright belongs to the original authors. Visit the link for more info.
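The latent space interpolation mentioned in the abstract can be sketched as follows, assuming simple linear interpolation between two latent points and a decoder that outputs per-position residue probabilities. The `toy_decoder` below is a hypothetical stand-in (a fixed random linear map followed by a softmax), not the model from the paper. The alphabet, sequence length, and function names are likewise assumptions made for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"  # 20 residues plus a gap symbol (assumed alphabet)


def interpolate(z_a, z_b, n_steps):
    """Return n_steps points evenly spaced on the segment from z_a to z_b."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * z_a + t * z_b


def toy_decoder(z, length=8, seed=0):
    """Hypothetical decoder: fixed random linear map, then a per-position softmax."""
    rng = np.random.default_rng(seed)  # same seed, so the same weights on every call
    w = rng.normal(size=(z.shape[0], length * len(AMINO_ACIDS)))
    logits = (z @ w).reshape(length, len(AMINO_ACIDS))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def probs_to_sequence(probs):
    """Pick the most probable symbol at each position."""
    return "".join(AMINO_ACIDS[i] for i in probs.argmax(axis=1))


# Two latent points standing in for two encoded input sequences.
z_a = np.array([-1.0, 0.5])
z_b = np.array([1.5, -0.5])

path = interpolate(z_a, z_b, n_steps=5)
sequences = [probs_to_sequence(toy_decoder(z)) for z in path]
```

With a trained decoder in place of `toy_decoder`, each point on `path` would decode to a candidate intermediate sequence that could then be assessed, for example by homology modelling as described in the abstract.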