TY - JOUR
T1 - LMPhosSite: A Deep Learning-Based Approach for General Protein Phosphorylation Site Prediction Using Embeddings from the Local Window Sequence and Pretrained Protein Language Model
AU - Pakhrin, Subash C.
AU - Pokharel, Suresh
AU - Pratyush, Pawel
AU - Chaudhari, Meenal
AU - Ismail, Hamid D
AU - Kc, Dukka B.
PY - 2023/8/4
Y1 - 2023/8/4
N2 - Phosphorylation is one of the most important post-translational modifications and plays a pivotal role in various cellular processes. Although there exist several computational tools to predict phosphorylation sites, existing tools have not yet harnessed the knowledge distilled by pretrained protein language models. Herein, we present a novel deep learning-based approach called LMPhosSite for the general phosphorylation site prediction that integrates embeddings from the local window sequence and the contextualized embedding obtained using global (overall) protein sequence from a pretrained protein language model to improve the prediction performance. Thus, the LMPhosSite consists of two base-models: one for capturing effective local representation and the other for capturing global per-residue contextualized embedding from a pretrained protein language model. The output of these base-models is integrated using a score-level fusion approach. LMPhosSite achieves a precision, recall, Matthew's correlation coefficient, and F1-score of 38.78%, 67.12%, 0.390, and 49.15%, for the combined serine and threonine independent test data set and 34.90%, 62.03%, 0.298, and 44.67%, respectively, for the tyrosine independent test data set, which is better than the compared approaches. These results demonstrate that LMPhosSite is a robust computational tool for the prediction of the general phosphorylation sites in proteins.
AB - Phosphorylation is one of the most important post-translational modifications and plays a pivotal role in various cellular processes. Although there exist several computational tools to predict phosphorylation sites, existing tools have not yet harnessed the knowledge distilled by pretrained protein language models. Herein, we present a novel deep learning-based approach called LMPhosSite for the general phosphorylation site prediction that integrates embeddings from the local window sequence and the contextualized embedding obtained using global (overall) protein sequence from a pretrained protein language model to improve the prediction performance. Thus, the LMPhosSite consists of two base-models: one for capturing effective local representation and the other for capturing global per-residue contextualized embedding from a pretrained protein language model. The output of these base-models is integrated using a score-level fusion approach. LMPhosSite achieves a precision, recall, Matthew's correlation coefficient, and F1-score of 38.78%, 67.12%, 0.390, and 49.15%, for the combined serine and threonine independent test data set and 34.90%, 62.03%, 0.298, and 44.67%, respectively, for the tyrosine independent test data set, which is better than the compared approaches. These results demonstrate that LMPhosSite is a robust computational tool for the prediction of the general phosphorylation sites in proteins.
KW - deep learning
KW - embedding
KW - phosphorylation
KW - post-translational modification
KW - protein language model
KW - score-level fusion
KW - stack generalization
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85166482947&origin=inward
UR - https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=85166482947&origin=inward
U2 - 10.1021/acs.jproteome.2c00667
DO - 10.1021/acs.jproteome.2c00667
M3 - Article
C2 - 37459437
SN - 1535-3893
VL - 22
SP - 2548
EP - 2557
JO - Journal of Proteome Research
JF - Journal of Proteome Research
IS - 8
ER -