TY - JOUR
T1 - Probabilistic and Machine Learning Models for the Protein Scaffold Gap Filling Problem
AU - Badal, Kushal
AU - Qingge, Letu
AU - Liu, Xiaowen
AU - Zhu, Binhai
PY - 2024/1/1
Y1 - 2024/1/1
N2 - In de novo protein sequencing, top-down and bottom-up tandem mass spectrometry often yields only an incomplete protein sequence, called a scaffold. While most sections of a protein can be inferred from its homologous sequences, some specific sections are always missing, and the amino acids in these scaffold gaps are hard to predict. In this paper, we therefore focus on predicting only the gaps, using a probabilistic algorithm and machine learning models, instead of predicting the complete protein sequence with generative AI models. We study two versions of the protein scaffold filling problem: one with known-size gaps and one with known-mass gaps. For the known-size gap version, we develop several machine learning models based on random forest, k-nearest neighbors, decision tree, and fully connected neural network. For the known-mass gap version, we design a probabilistic algorithm to predict the missing amino acids in the gaps. Experimental results on both real and simulated data show that the proposed algorithms achieve accuracies of 100% and close to 100%.
AB - In de novo protein sequencing, top-down and bottom-up tandem mass spectrometry often yields only an incomplete protein sequence, called a scaffold. While most sections of a protein can be inferred from its homologous sequences, some specific sections are always missing, and the amino acids in these scaffold gaps are hard to predict. In this paper, we therefore focus on predicting only the gaps, using a probabilistic algorithm and machine learning models, instead of predicting the complete protein sequence with generative AI models. We study two versions of the protein scaffold filling problem: one with known-size gaps and one with known-mass gaps. For the known-size gap version, we develop several machine learning models based on random forest, k-nearest neighbors, decision tree, and fully connected neural network. For the known-mass gap version, we design a probabilistic algorithm to predict the missing amino acids in the gaps. Experimental results on both real and simulated data show that the proposed algorithms achieve accuracies of 100% and close to 100%.
KW - Heuristic algorithms
KW - Machine learning
KW - Probabilistic model
KW - Protein scaffold filling
KW - Protein sequencing
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85200513367&origin=inward
UR - https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=85200513367&origin=inward
U2 - 10.1007/978-981-97-5087-0_3
DO - 10.1007/978-981-97-5087-0_3
M3 - Conference article
SN - 0302-9743
VL - 14956 LNBI
SP - 28
EP - 39
JO - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
JF - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
T2 - 20th International Symposium on Bioinformatics Research and Applications, ISBRA 2024
Y2 - 19 July 2024 through 21 July 2024
ER -