Abstract
In de novo protein sequencing, top-down and bottom-up tandem mass spectrometry often yield only an incomplete protein sequence, called a scaffold. While most sections of a protein can be inferred from its homologous sequences, some specific sections are always missing, and it is hard to predict the missing amino acids in the gaps of the scaffold. In this paper, we therefore focus only on predicting the gaps, using a probabilistic algorithm and machine learning models, instead of predicting the complete protein sequence with generative AI models. We study two versions of the protein scaffold filling problem: one with gaps of known size and one with gaps of known mass. For the known-size version, we develop several machine learning models based on random forest, k-nearest neighbors, decision tree, and fully connected neural network. For the known-mass version, we design a probabilistic algorithm to predict the missing amino acids in the gaps. Experimental results on both real and simulated data show that our proposed algorithms achieve promising accuracy of 100% and close to 100%.
| Original language | English |
|---|---|
| Pages (from-to) | 28-39 |
| Number of pages | 12 |
| Journal | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
| Volume | 14956 LNBI |
| State | Published - Jan 1 2024 |
| Event | 20th International Symposium on Bioinformatics Research and Applications, ISBRA 2024 - Kunming, China; Duration: Jul 19 2024 → Jul 21 2024 |
Keywords
- Heuristic algorithms
- Machine learning
- Probabilistic model
- Protein Scaffold filling
- Protein sequencing
Fingerprint
Dive into the research topics of 'Probabilistic and Machine Learning Models for the Protein Scaffold Gap Filling Problem'. Together they form a unique fingerprint.