Proteinext: Protein Function Prediction with Sequence Embeddings and Natural Language Processing

Hailey Ledenko; Luke Coleman; G. Alvarado; Tyler Stratton; Boen Liu; Jie Hou; Dong Si; Lei Zhang; Rui Ding; Yang Wang; Renzhi Cao

doi:10.47852/bonviewMEDIN52025721

Authors

Hailey Ledenko, Department of Computer Science, Pacific Lutheran University, USA
Luke Coleman, Department of Computer Science, Pacific Lutheran University, USA
G. Alvarado, Department of Computer Science, Pacific Lutheran University, USA
Tyler Stratton, Department of Computer Science, Pacific Lutheran University, USA
Boen Liu, Annie Wright Schools, USA
Jie Hou, Department of Computer Science, St. Louis University, USA
Dong Si, School of Science, Technology, Engineering & Mathematics, University of Washington, USA
Lei Zhang, Information Material and Intelligent Sensing Laboratory of Anhui Province, Anhui University, China
Rui Ding, Information Material and Intelligent Sensing Laboratory of Anhui Province, Anhui University, China
Yang Wang, Information Material and Intelligent Sensing Laboratory of Anhui Province, Anhui University, China
Renzhi Cao, Department of Computer Science, Pacific Lutheran University, USA

DOI:

https://doi.org/10.47852/bonviewMEDIN52025721

Keywords:

protein function prediction, machine learning, natural language processing

Abstract

Proteins are fundamental to life: they support vital processes in the body such as muscle development, cell growth, tissue repair, and immune defense. However, their complex structures and diverse functions make them difficult to understand fully. While recent advances enable efficient and accurate protein structure prediction, predicting protein function remains a challenge. Current function prediction methods, although promising, are slow, computationally demanding, and struggle with highly specific proteins. Given the rapid expansion of protein sequence databases, a computational method that predicts function directly from sequence is critical. Our solution is Proteinext, a method for protein function prediction that combines advanced sequence representations with natural language processing (NLP) techniques. Proteinext uses Meta's 15B-parameter Evolutionary Scale Modeling (ESM) model to generate protein sequence embeddings, which are then refined by a fine-tuned BigBird transformer-based NLP model. This combination significantly improves protein function prediction. The model was trained on 372,683 protein sequences from a combined dataset of Gene Ontology and Universal Protein Knowledgebase (UniProtKB) annotations. Proteinext achieves an F_max score of 0.74 and an S_min score of 0.39, a step toward comprehensively understanding and predicting protein functions. This work underscores the potential of combining computational biology with NLP to address critical challenges in proteomics. Proteinext is available at https://github.com/Cao-Labs/AlphaAnalyzer.
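For readers unfamiliar with the F_max metric reported in the abstract, the following is a minimal sketch of the protein-centric F-measure as it is commonly defined in CAFA-style evaluations. It is not taken from the Proteinext code base; the function name `f_max`, the input layout, and the threshold grid are assumptions made for illustration, and the authors' exact evaluation procedure may differ.

```python
def f_max(predictions, truths, thresholds=None):
    """Protein-centric F_max (illustrative sketch, not the authors' script).

    predictions: list of dicts mapping a term ID to a score in [0, 1]
    truths:      list of sets of true term IDs, aligned with predictions
    """
    if thresholds is None:
        # Assumed threshold grid: 0.01, 0.02, ..., 1.00
        thresholds = [i / 100 for i in range(1, 101)]

    best = 0.0
    for t in thresholds:
        precisions = []   # one entry per protein with >= 1 prediction at t
        recall_sum = 0.0  # summed over all proteins
        for scores, true_terms in zip(predictions, truths):
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            if true_terms:
                recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(truths)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy example with made-up term IDs and scores.
toy_predictions = [{"GO:1": 0.9, "GO:2": 0.4}, {"GO:3": 0.8}]
toy_truths = [{"GO:1"}, {"GO:3", "GO:4"}]
print(round(f_max(toy_predictions, toy_truths), 3))
```

Precision is averaged only over proteins that receive at least one prediction at a given threshold, while recall is averaged over all proteins; F_max is the best harmonic mean of the two across thresholds. Higher F_max is better, whereas for S_min lower values are better.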

Received: 18 March 2025 | Revised: 30 September 2025 | Accepted: 18 November 2025

Conflicts of Interest

The authors declare that they have no conflicts of interest to this work.

Data Availability Statement

The data that support the findings of this study are openly available on GitHub at https://github.com/Cao-Labs/AlphaAnalyzer.

Author Contribution Statement

Hailey Ledenko: Conceptualization, Methodology, Software, Writing – original draft, Writing – review & editing, Visualization. Luke Coleman: Conceptualization, Methodology, Software, Writing – original draft, Writing – review & editing. G. Alvarado: Writing – original draft, Writing – review & editing, Visualization. Tyler Stratton: Writing – original draft, Writing – review & editing. Boen Liu: Writing – original draft, Writing – review & editing. Jie Hou: Writing – original draft, Writing – review & editing. Dong Si: Writing – original draft, Writing – review & editing. Lei Zhang: Writing – original draft, Writing – review & editing. Rui Ding: Writing – original draft, Writing – review & editing. Yang Wang: Writing – original draft, Writing – review & editing. Renzhi Cao: Conceptualization, Writing – original draft, Writing – review & editing, Supervision, Project administration.
