TY - JOUR
T1 - deepFEPS: Deep Learning-Oriented Feature Extraction for Biological Sequences
AU - Ismail, Hamid Dafalla
AU - Bikdash, Marwan
PY - 2025
Y1 - 2025
N2 - Machine- and deep-learning approaches for biological sequences depend critically on transforming raw DNA, RNA, and protein FASTA files into informative numerical representations. However, this process is often fragmented across multiple libraries and preprocessing steps, which creates a barrier for researchers without extensive computational expertise. To address this gap, we developed deepFEPS, an open-source toolkit that unifies state-of-the-art feature extraction methods for sequence data within a single, reproducible workflow. deepFEPS integrates five families of modern feature extractors - k-mer embeddings (Word2Vec, FastText), document-level embeddings (Doc2Vec), transformer-based encoders (DNABERT, ProtBERT, and ESM2), autoencoder-derived latent features, and graph-based embeddings - into one consistent platform. The system accepts FASTA input via a web interface or command-line tool, exposes key model parameters, and outputs analysis-ready feature matrices (CSV). Each run is accompanied by an automatic quality-control report including sequence counts, dimensionality, sparsity, variance distributions, class balance, and diagnostic visualizations. By consolidating advanced sequence embeddings into one environment, deepFEPS reduces preprocessing overhead, improves reproducibility, and shortens the path from raw sequences to downstream machine- and deep-learning applications. deepFEPS lowers the practical barrier to modern representation learning for bioinformatics, enabling both novice and expert users to generate advanced embeddings for classification, clustering, and predictive modeling. Its unified framework supports exploratory analyses, high-throughput studies, and integration into institutional workflows, while remaining extensible to emerging models and methods. The webserver is accessible at this https URL.
UR - https://arxiv.org/abs/2511.22821
U2 - 10.48550/arXiv.2511.22821
DO - 10.48550/arXiv.2511.22821
M3 - Article
JO - ARXIV
JF - ARXIV
ER -