Proteinext: Protein Function Prediction with Sequence Embeddings and Natural Language Processing

Authors

  • Hailey Ledenko, Department of Computer Science, Pacific Lutheran University, USA
  • Luke Coleman, Department of Computer Science, Pacific Lutheran University, USA
  • G. Alvarado, Department of Computer Science, Pacific Lutheran University, USA
  • Tyler Stratton, Department of Computer Science, Pacific Lutheran University, USA
  • Boen Liu, Annie Wright Schools, USA
  • Jie Hou, Department of Computer Science, St. Louis University, USA
  • Dong Si, School of Science, Technology, Engineering & Mathematics, University of Washington, USA
  • Lei Zhang, Information Material and Intelligent Sensing Laboratory of Anhui Province, Anhui University, China
  • Rui Ding, Information Material and Intelligent Sensing Laboratory of Anhui Province, Anhui University, China
  • Yang Wang, Information Material and Intelligent Sensing Laboratory of Anhui Province, Anhui University, China
  • Renzhi Cao, Department of Computer Science, Pacific Lutheran University, USA

DOI:

https://doi.org/10.47852/bonviewMEDIN52025721

Keywords:

protein function prediction, machine learning, natural language processing

Abstract

Proteins are fundamental to life: they support vital processes such as muscle development, cell growth, tissue repair, and immune defense. However, their complex structures and diverse functions make them difficult to fully understand. While recent advances enable efficient and accurate protein structure prediction, predicting protein function remains a challenge. Current function prediction methods, although promising, are slow, computationally demanding, and struggle with highly specific proteins. Because protein sequence databases are expanding rapidly, a computational method that predicts function directly from sequence is critical. Our solution is Proteinext, a method for protein function prediction that combines advanced sequence representations with natural language processing (NLP) techniques. Proteinext uses Meta's 15B-parameter Evolutionary Scale Modeling (ESM) model to generate protein sequence embeddings, which are then refined by a fine-tuned BigBird transformer-based NLP model. This combination significantly improves protein function prediction. The model was trained on 372,683 protein sequences from a combined dataset of Gene Ontology (GO) and Universal Protein Knowledgebase (UniProtKB) annotations. Proteinext achieves an Fmax score of 0.74 and an Smin score of 0.39, a major step toward comprehensively understanding and predicting protein functions. This work underscores the potential of combining computational biology with NLP to address critical challenges in proteomics. Proteinext is available at https://github.com/Cao-Labs/AlphaAnalyzer.
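
For readers unfamiliar with the Fmax score reported in the abstract, the following is a minimal Python sketch of the protein-centric Fmax as it is commonly defined in CAFA-style evaluations: precision is averaged over proteins with at least one prediction at a given threshold, recall is averaged over all proteins, and the best F-measure over all thresholds is kept. This is not the authors' evaluation code. The function name `fmax`, the dictionary-based inputs, the threshold grid, and the toy term identifiers are all assumptions made for illustration, and the sketch assumes every protein has at least one true term.

```python
def fmax(true_terms, pred_scores, thresholds=None):
    """Protein-centric Fmax (CAFA-style), illustrative sketch.

    true_terms:  dict mapping protein id -> non-empty set of true term ids
    pred_scores: dict mapping protein id -> dict of term id -> score in [0, 1]
    """
    if thresholds is None:
        # Hypothetical grid: 0.01, 0.02, ..., 1.00
        thresholds = [i / 100 for i in range(1, 101)]

    best = 0.0
    for t in thresholds:
        precisions = []   # one entry per protein with >= 1 prediction at t
        recall_sum = 0.0  # summed over ALL proteins
        for protein, truth in true_terms.items():
            scores = pred_scores.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(truth)

        if not precisions:
            continue  # no protein has a prediction at this threshold

        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(true_terms)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy example with made-up term identifiers (not real annotations).
toy_truth = {"P1": {"GO:1", "GO:2"}, "P2": {"GO:3"}}
toy_preds = {
    "P1": {"GO:1": 0.9, "GO:2": 0.4, "GO:4": 0.6},
    "P2": {"GO:3": 0.8, "GO:1": 0.3},
}
print(round(fmax(toy_truth, toy_preds), 4))
```

In this toy case the best threshold lies just above 0.3, where the spurious low-scoring term for P2 is dropped while all true terms are still recovered. Smin, the other reported metric, is instead a distance-like score that weights missed and wrongly predicted terms by information content, so lower values are better.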

 

Received: 18 March 2025 | Revised: 30 September 2025 | Accepted: 18 November 2025

 

Conflicts of Interest

The authors declare that they have no conflicts of interest to this work.

 

Data Availability Statement

The data that support the findings of this study are openly available on GitHub at https://github.com/Cao-Labs/AlphaAnalyzer.

 

Author Contribution Statement

Hailey Ledenko: Conceptualization, Methodology, Software, Writing – original draft, Writing – review & editing, Visualization. Luke Coleman: Conceptualization, Methodology, Software, Writing – original draft, Writing – review & editing. G. Alvarado: Writing – original draft, Writing – review & editing, Visualization. Tyler Stratton: Writing – original draft, Writing – review & editing. Boen Liu: Writing – original draft, Writing – review & editing. Jie Hou: Writing – original draft, Writing – review & editing. Dong Si: Writing – original draft, Writing – review & editing. Lei Zhang: Writing – original draft, Writing – review & editing. Rui Ding: Writing – original draft, Writing – review & editing. Yang Wang: Writing – original draft, Writing – review & editing. Renzhi Cao: Conceptualization, Writing – original draft, Writing – review & editing, Supervision, Project administration.


Published

2025-12-23

Section

Research Articles

How to Cite

Ledenko, H., Coleman, L., Alvarado, G., Stratton, T., Liu, B., Hou, J., Si, D., Zhang, L., Ding, R., Wang, Y., & Cao, R. (2025). Proteinext: Protein Function Prediction with Sequence Embeddings and Natural Language Processing. Medinformatics. https://doi.org/10.47852/bonviewMEDIN52025721