Abstract
Diabetes develops through a mix of clinical, metabolic, lifestyle, demographic, and environmental factors. Most current classification models focus on traditional biomedical indicators and do not include environmental exposure biomarkers. In this study, we develop and evaluate a supervised machine learning classification framework that integrates heterogeneous demographic, anthropometric, clinical, behavioral, and environmental exposure features to classify physician-diagnosed diabetes using data from the National Health and Nutrition Examination Survey (NHANES). We analyzed NHANES 2017–2018 data for adults aged ≥18 years, addressed missingness using Multiple Imputation by Chained Equations, and corrected class imbalance via the Synthetic Minority Oversampling Technique. Model performance was evaluated using stratified ten-fold cross-validation across eight supervised classifiers: logistic regression, random forest, XGBoost, support vector machine, multilayer perceptron neural network (artificial neural network), k-nearest neighbors, naïve Bayes, and classification tree. Random forest and XGBoost performed best on the balanced dataset, with ROC AUC values of 0.891 and 0.885, respectively, after imputation and oversampling. Feature importance analysis indicated that age, household income, and waist circumference contributed most strongly to diabetes classification. To assess out-of-sample generalization, we conducted an independent 80/20 hold-out evaluation. XGBoost achieved the highest overall accuracy and F1-score, whereas random forest attained the greatest sensitivity, demonstrating stable performance beyond cross-validation. These results indicate that incorporating environmental exposure biomarkers alongside clinical and metabolic features yields improved classification performance for physician-diagnosed diabetes.
The findings support the inclusion of chemical exposure variables in population-level diabetes classification and underscore the value of integrating heterogeneous feature sets in machine learning-based risk stratification.
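Two of the steps named in the abstract, SMOTE oversampling and ROC AUC scoring, can be illustrated with a minimal standard-library Python sketch. This is not the authors' code: the function names `smote` and `roc_auc`, the parameter defaults, and the toy values are assumptions chosen for illustration, and a real analysis would more likely rely on an established library implementation.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating toward near neighbors.

    Hypothetical helper for illustration; needs at least two minority rows.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base row (the row itself excluded)
        others = [row for row in minority if row is not base]
        neighbors = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


def roc_auc(labels, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values only, not NHANES data.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote(minority, n_new=6, k=2)
print(len(new_rows))                                   # 6 synthetic rows
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))    # 0.75
```

Because each synthetic row is a convex combination of two existing minority rows, it stays inside the range the minority class already spans. Oversampling of this kind is usually fitted on training folds only, so that synthetic rows do not leak into the data used for evaluation.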
| Original language | English |
|---|---|
| Article number | 76 |
| Journal | Toxics |
| Volume | 14 |
| Issue number | 1 |
| DOIs | |
| State | Published - Jan 1 2026 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 3 Good Health and Well-being
Keywords
- ROC–AUC
- SMOTE
- environmental exposure
- exposome
- machine learning
- multiple imputation (MICE)
- predictive modeling
- random forest
Title
Evaluating Machine Learning Models for Classifying Diabetes Using Demographic, Clinical, Lifestyle, Anthropometric, and Environmental Exposure Factors