
Evaluating Machine Learning Models for Classifying Diabetes Using Demographic, Clinical, Lifestyle, Anthropometric, and Environmental Exposure Factors

  • North Carolina Agricultural and Technical State University
  • North Carolina State University

Research output: Contribution to journal › Article › peer-review

Abstract

Diabetes develops through a mix of clinical, metabolic, lifestyle, demographic, and environmental factors. Most current classification models focus on traditional biomedical indicators and do not include environmental exposure biomarkers. In this study, we develop and evaluate a supervised machine learning classification framework that integrates heterogeneous demographic, anthropometric, clinical, behavioral, and environmental exposure features to classify physician-diagnosed diabetes using data from the National Health and Nutrition Examination Survey (NHANES). We analyzed NHANES 2017–2018 data for adults aged ≥18 years, addressed missingness using Multiple Imputation by Chained Equations, and corrected class imbalance via the Synthetic Minority Oversampling Technique. Model performance was evaluated using stratified ten-fold cross-validation across eight supervised classifiers: logistic regression, random forest, XGBoost, support vector machine, multilayer perceptron neural network (artificial neural network), k-nearest neighbors, naïve Bayes, and classification tree. Random forest and XGBoost performed best on the balanced dataset, with ROC AUC values of 0.891 and 0.885, respectively, after imputation and oversampling. Feature importance analysis indicated that age, household income, and waist circumference contributed most strongly to diabetes classification. To assess out-of-sample generalization, we conducted an independent 80/20 hold-out evaluation. XGBoost achieved the highest overall accuracy and F1-score, whereas random forest attained the greatest sensitivity, demonstrating stable performance beyond cross-validation. These results indicate that incorporating environmental exposure biomarkers alongside clinical and metabolic features yields improved classification performance for physician-diagnosed diabetes. The findings support the inclusion of chemical exposure variables in population-level diabetes classification and underscore the value of integrating heterogeneous feature sets in machine learning-based risk stratification.
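
Two of the building blocks named in the abstract, minority oversampling and ROC AUC, can be illustrated with a minimal standard-library sketch. This is not the authors' code. The function names `smote_oversample` and `roc_auc`, the toy data, and the parameter values are all hypothetical. The oversampler follows the usual SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours, and the AUC is computed as the fraction of positive/negative pairs that are ranked correctly.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outscores a
    random negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four minority-class points with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
synthetic = smote_oversample(minority, n_new=6, k=2, seed=42)

# Hypothetical labels and classifier scores for four held-out samples.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(len(synthetic), auc)
```

In practice a library implementation would normally be used for both steps. When oversampling is combined with cross-validation, it is commonly applied to the training folds only, so that synthetic points do not leak into the data used for scoring.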
Original language: English
Article number: 76
Journal: Toxics
Volume: 14
Issue number: 1
State: Published - Jan 1 2026

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Keywords

  • ROC–AUC
  • SMOTE
  • environmental exposure
  • exposome
  • machine learning
  • multiple imputation (MICE)
  • predictive modeling
  • random forest
