Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.09.04.282814v1?rss=1

Authors: Littmann, M., Heinzinger, M., Dallago, C., Olenyi, T., Rost, B.

Abstract: Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods bridge this sequence-annotation gap, typically through homology-based annotation transfer (identifying sequence-similar proteins with known function) or through prediction methods using evolutionary information. Here, we proposed predicting GO terms through annotation transfer based on proximity of proteins in the SeqVec embedding space rather than in sequence space. These embeddings originated from deep learned language models (LMs) for protein sequences (SeqVec), transferring the knowledge gained from predicting the next amino acid in 250 million protein sequences. Replicating the conditions of CAFA3, our method reached an Fmax of 37 ± 2%, 50 ± 3%, and 57 ± 2% for BPO, MFO, and CCO, respectively. This was numerically close to the top ten methods that had participated in CAFA3. When annotation transfer was restricted to proteins with <20% pairwise sequence identity to the query, performance dropped (Fmax BPO 33 ± 2%, MFO 43 ± 3%, CCO 53 ± 2%); this still outperformed naive sequence-based transfer. Preliminary results from CAFA4 appeared to confirm these findings. Overall, this new method may help, in particular, to annotate novel proteins from smaller families or proteins with intrinsically disordered regions.

Copyright belongs to the original authors. Visit the link for more info.
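
The embedding-based annotation transfer described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `transfer_annotations`, the Euclidean distance, the `1 / (1 + distance)` score, the value of `k`, and the 4-dimensional toy vectors are all assumptions made for illustration. In practice the vectors would come from a protein language model such as SeqVec.

```python
import numpy as np


def transfer_annotations(query_emb, lookup_embs, lookup_go_terms, k=1):
    """Copy GO terms from the k lookup proteins nearest to the query embedding.

    Hypothetical helper: the distance metric and scoring are assumptions,
    not taken from the paper.
    """
    # Euclidean distance from the query to every annotated (lookup) protein.
    dists = np.linalg.norm(lookup_embs - query_emb, axis=1)
    # Indices of the k closest lookup proteins.
    nearest = np.argsort(dists)[:k]

    scores = {}
    for idx in nearest:
        # Assumed score: 1 at distance 0, decreasing toward 0 as distance grows.
        sim = 1.0 / (1.0 + dists[idx])
        for term in lookup_go_terms[idx]:
            # Keep the best score if several neighbours share a term.
            scores[term] = max(scores.get(term, 0.0), sim)
    return scores


# Toy lookup set: three annotated proteins with made-up 4-d embeddings.
lookup_embs = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0, 5.0],
])
lookup_go_terms = [
    {"GO:0003677"},  # toy annotation for protein 0
    {"GO:0005524"},  # toy annotation for protein 1
    {"GO:0016020"},  # toy annotation for protein 2
]

# Query protein whose embedding lies close to protein 1.
query_emb = np.array([0.9, 1.0, 1.1, 1.0])

predictions = transfer_annotations(query_emb, lookup_embs, lookup_go_terms, k=1)
predictions_k2 = transfer_annotations(query_emb, lookup_embs, lookup_go_terms, k=2)
print(predictions)
```

With `k=1` the query inherits only the terms of its single nearest neighbour; a larger `k` pulls in terms from more distant proteins with lower scores, which is where a score threshold would trade precision against recall.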