Proteinext: Protein Function Prediction with Sequence Embeddings and Natural Language Processing
DOI:
https://doi.org/10.47852/bonviewMEDIN52025721
Keywords:
protein function prediction, machine learning, natural language processing
Abstract
Proteins are fundamental to life: they support vital processes such as muscle development, cell growth, tissue repair, and immune defense. However, their complex structures and diverse functions make them difficult to characterize fully. While recent advances enable efficient and accurate protein structure prediction, predicting protein function remains an open challenge. Current function prediction methods, although promising, are slow, computationally demanding, and struggle with highly specific proteins. Because protein sequence databases are expanding rapidly, a computational method that predicts function directly from sequence is critical. Our solution is Proteinext, a method for protein function prediction that combines learned sequence representations with natural language processing (NLP) techniques. Proteinext uses Meta's 15B-parameter Evolutionary Scale Modeling (ESM) model to generate protein sequence embeddings, which are then refined by a fine-tuned BigBird transformer-based NLP model. This combination yields a model that significantly improves protein function prediction. The model was trained on 372,683 protein sequences from a combined dataset of Gene Ontology and Universal Protein Knowledgebase annotations. Proteinext achieves an Fmax score of 0.74 and an Smin score of 0.39, a step toward comprehensively understanding and predicting protein functions. This work underscores the potential of combining computational biology with NLP to address critical challenges in proteomics. Proteinext is available at https://github.com/Cao-Labs/AlphaAnalyzer.
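For readers unfamiliar with the Fmax metric reported above, the following is a minimal sketch of how a protein-centric Fmax is commonly computed in function prediction benchmarks. It assumes the widely used definition (precision averaged over proteins with at least one prediction at a threshold, recall averaged over all annotated proteins, and the maximum F-measure taken over thresholds). The function name `fmax`, the array layout, and the threshold grid are illustrative assumptions and are not taken from the Proteinext code.

```python
import numpy as np


def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax (illustrative sketch, not the authors' code).

    y_true  : (n_proteins, n_terms) binary matrix of true annotations.
    y_score : (n_proteins, n_terms) predicted scores in [0, 1].
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 1.00
        thresholds = np.linspace(0.01, 1.0, 100)

    n_true = y_true.sum(axis=1)      # true terms per protein
    has_truth = n_true > 0           # recall is averaged over these proteins
    if not has_truth.any():
        return 0.0

    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        n_pred = pred.sum(axis=1)    # predicted terms per protein
        tp = (pred & y_true).sum(axis=1)

        covered = n_pred > 0         # proteins with at least one prediction
        if not covered.any():
            continue

        precision = (tp[covered] / n_pred[covered]).mean()
        recall = (tp[has_truth] / n_true[has_truth]).mean()
        if precision + recall > 0:
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)
    return best


# Toy usage: two proteins, three hypothetical GO terms.
toy_truth = [[1, 0, 1], [0, 1, 0]]
toy_scores = [[0.9, 0.2, 0.8], [0.1, 0.7, 0.3]]
toy_fmax = fmax(toy_truth, toy_scores)
```

In this toy case a threshold of 0.5 separates true from false terms for both proteins, so the scores admit a perfect F-measure. Smin, the companion metric, additionally weights each term by its information content and is not covered by this sketch.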
Received: 18 March 2025 | Revised: 30 September 2025 | Accepted: 18 November 2025
Conflicts of Interest
The authors declare that they have no conflicts of interest to this work.
Data Availability Statement
The data that support the findings of this study are openly available in GitHub at https://github.com/Cao-Labs/AlphaAnalyzer.
Author Contribution Statement
Hailey Ledenko: Conceptualization, Methodology, Software, Writing – original draft, Writing – review & editing, Visualization. Luke Coleman: Conceptualization, Methodology, Software, Writing – original draft, Writing – review & editing. G Alvarado: Writing – original draft, Writing – review & editing, Visualization. Tyler Stratton: Writing – original draft, Writing – review & editing. Boen Liu: Writing – original draft, Writing – review & editing. Jie Hou: Writing – original draft, Writing – review & editing. Dong Si: Writing – original draft, Writing – review & editing. Lei Zhang: Writing – original draft, Writing – review & editing. Rui Ding: Writing – original draft, Writing – review & editing. Yang Wang: Writing – original draft, Writing – review & editing. Renzhi Cao: Conceptualization, Writing – original draft, Writing – review & editing, Supervision, Project administration.
License
Copyright (c) 2025 Authors

This work is licensed under a Creative Commons Attribution 4.0 International License.