Machine Learning Analysis of Diabetes-Related Health Outcomes (University of London Coursework)
September 2025 - April 2026
About This Project
A machine learning coursework project for ST3189 (Machine Learning) at the University of London, applying unsupervised learning, classification, and regression to the 2024 CDC Behavioral Risk Factor Surveillance System (BRFSS) - a telephone survey of 457,670 US adults. PCA and K-means clustering reveal interpretable health dimensions and identify a high-risk subgroup with 55% diabetic prevalence. Seven classifiers achieve AUC scores between 0.746 and 0.815 for predicting diabetes status without clinical tests, and gradient boosting predicts physical health burden among confirmed diabetics with R-squared = 0.445.
Project Details
This project was completed as coursework for ST3189 (Machine Learning) at the University of London, contributing 30% to the final course grade. To maintain academic integrity, I am not sharing links to my submitted work or code. However, I can describe the full methodology and results.

Personal Motivation
This project holds particular personal significance for me, as my family has a history of diabetes. The BRFSS dataset - the world's largest continuously conducted telephone health survey - offered a rare opportunity to explore diabetes risk at population scale using the machine learning techniques I was learning in the course.

Dataset
The 2024 BRFSS contains 457,670 respondents across 345 variables. From the full extract, 60 variables were retained spanning demographics, health status, chronic conditions, healthcare access, health behaviours, and social determinants. The binary classification target (confirmed diabetic vs. non-diabetic) had a 5.9:1 imbalance, addressed using ROSE synthetic oversampling applied strictly within training folds to prevent data leakage.

Part 1 - Unsupervised Learning: Population Structure

Principal Component Analysis
After one-hot encoding categorical variables, the dataset expanded to 113 features. The variance explained by each component decays gradually - the first two account for 4.8% and 3.9% respectively, with no single dominant factor. PC1 separates college-educated, physically active respondents from those with poor mental health and low activity. PC2 captures a healthcare access and age gradient. This diffuse structure directly explains why compressing the data before classification does not improve performance.

K-Means Clustering
K-means applied to the first ten PCs (K=6, silhouette score = 0.20, confirmed by Ward's hierarchical clustering) identifies six interpretable population subgroups.
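To make the silhouette score quoted for the K=6 solution more concrete, here is a minimal sketch that computes the mean silhouette width from scratch. It is not the coursework code (which is not shared): Python, the toy 2-D points, and the function name `mean_silhouette` are all assumptions for illustration, and the real analysis clustered respondents on the first ten PCs.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette width over all points; assumes at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # a singleton cluster scores 0 by convention
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two visually obvious groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Labels that match the visible groups give a score close to 1.
tight = mean_silhouette(points, [0, 0, 0, 1, 1, 1])

# Labels that cut across the groups give a much lower score.
loose = mean_silhouette(points, [0, 1, 0, 1, 0, 1])

print(f"well-separated labelling: {tight:.2f}")
print(f"mixed-up labelling:       {loose:.2f}")
```

Scores near 1 indicate compact, well-separated clusters, while scores near 0 indicate overlapping ones. Read this way, a value of 0.20 suggests the six subgroups overlap considerably, which fits the diffuse variance structure found by PCA.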
The highest-risk cluster - characterised by lower education, physical inactivity, and high comorbidity burden - has a diabetic prevalence of 55.34%, nearly three times that of the lowest-risk cluster (19.99%).

Part 2 - Classification: Predicting Diabetes Status
Seven classifiers were trained on 50,000 respondents with an 80/20 train-test split:
Logistic Regression: AUC = 0.815, Sensitivity = 0.773 (best overall AUC)
LDA: AUC = 0.814, Sensitivity = 0.773
SVM with RBF kernel: AUC = 0.809, Sensitivity = 0.767
Random Forest (500 trees): AUC = 0.807, Sensitivity = 0.778 (best sensitivity of the seven)
Naive Bayes: AUC = 0.786, Sensitivity = 0.708
Decision Tree (pruned): AUC = 0.781, Sensitivity = 0.758
Neural Network (12 hidden units): AUC = 0.746, Sensitivity = 0.663

Results are broadly consistent with Xie et al.'s 2014 BRFSS benchmarks (AUC 0.718-0.795), suggesting that the relationship between survey responses and diabetes status has remained stable over a decade. The sensitivity improvement over the prior study (0.66-0.78 vs. 0.38-0.52) is methodological: training on ROSE-balanced data effectively shifts the decision threshold toward detecting the minority class, while AUC scores remain comparable.

Training classifiers on PCA-compressed inputs (61 PCs, 80% variance threshold) consistently reduces sensitivity by 3-8 percentage points - a meaningful negative result. However, compression enables QDA (Quadratic Discriminant Analysis), which cannot be estimated stably in the full 113-feature space. QDA achieves the highest sensitivity of any model tested (0.787) and the lowest missed-diagnosis rate (0.213), making it the preferred approach for population screening where missing a diabetic case carries far higher cost than a false referral.

Five-fold cross-validation on the Random Forest confirms stable out-of-sample performance: mean AUC = 0.804, mean sensitivity = 0.749.
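The distinction between sensitivity and AUC in the classification results can be illustrated with a small sketch. It is not the coursework code: Python, the toy labels and scores, and the helper names `sensitivity` and `auc` are assumptions for illustration. It shows that moving the decision threshold changes sensitivity (and hence the missed-diagnosis rate, which is 1 minus sensitivity) while AUC, which depends only on how cases are ranked, stays the same.

```python
def sensitivity(y_true, y_pred):
    """Share of actual positives (1 = diabetic) that the model flags."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 4 diabetic and 6 non-diabetic respondents
# with made-up predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

def classify(threshold):
    """Flag a respondent when the score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

sens_default = sensitivity(y_true, classify(0.5))
sens_lowered = sensitivity(y_true, classify(0.25))
ranking_quality = auc(y_true, scores)

print(f"sensitivity at 0.50: {sens_default:.2f}, missed: {1 - sens_default:.2f}")
print(f"sensitivity at 0.25: {sens_lowered:.2f}, missed: {1 - sens_lowered:.2f}")
print(f"AUC (threshold-free): {ranking_quality:.3f}")
```

Lowering the threshold catches the one diabetic case that was missed at 0.5, at the cost of more false referrals, yet the AUC is unchanged. This mirrors how rebalanced training can raise sensitivity relative to an earlier study without AUC moving much.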
Part 3 - Regression: Physical Health Burden Among Diabetics
Among 63,454 confirmed diabetics, the regression target is the number of days in the past 30 that physical health was not good. The distribution is strongly zero-inflated (47% report zero bad days; 16% report all 30 days), so a square-root transformation was applied before modelling. Eight regression approaches were compared on an 80/20 split:
Gradient Boosting (GBM, 246 trees selected by CV): RMSE = 1.581, R-squared = 0.445 (best overall)
OLS Linear Regression: RMSE = 1.590, R-squared = 0.439 (nearly identical to GBM)
Random Forest (300 trees): RMSE = 1.598, R-squared = 0.433
Lasso (51 predictors retained): RMSE = 1.623, R-squared = 0.415
Elastic Net: RMSE = 1.623, R-squared = 0.415
Ridge: RMSE = 1.631, R-squared = 0.410
Principal Components Regression (63 PCs): RMSE = 1.666, R-squared = 0.384
Weighted Least Squares: RMSE = 1.936, R-squared = 0.169 (poor out-of-sample fit)

GBM and OLS perform nearly identically, suggesting the non-linear patterns are modest and that OLS coefficients tell an equally complete interpretive story. The two dominant predictors - consistent across OLS and Random Forest variable importance - are poor self-rated general health (coefficient = 2.74, p < 0.001) and difficulty walking or climbing stairs (coefficient = 0.735, p < 0.001). Loneliness and inability to work due to illness are also significant positive predictors. Age groups above 65 show negative associations, plausibly because those surviving to older age with diabetes represent a healthier-than-average subset.

Five-fold CV on the Random Forest gives mean RMSE = 1.62 and mean R-squared = 0.43.

Key Takeaways
Survey-based diabetes screening is feasible without clinical tests, achieving AUC above 0.80 across multiple models.
PCA compression does not improve accuracy for existing classifiers but enables QDA, which achieves the highest sensitivity of any model tested.
Physical health burden among diabetics is predictable at R-squared of approximately 0.44 from survey responses alone; the remaining variance likely requires clinical or longitudinal data to capture.
The near-equivalence of GBM and OLS reinforces that interpretable models can match complex ones when non-linear patterns are modest.
A two-stage hurdle model is recommended for future work to better handle the high proportion of zero responses in the physical health target.

The complete BRFSS dataset and documentation can be accessed through the CDC link provided.
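The two-stage hurdle idea recommended for future work can be sketched as follows. This is an illustration only, not part of the coursework: Python, the toy data, the group names, and the functions `fit_hurdle` and `predict_hurdle` are all hypothetical. A practical hurdle model would typically use a logistic regression for stage one and a regression fitted only on the non-zero cases for stage two. Here both stages are replaced by simple per-group averages so the structure stays visible.

```python
from collections import defaultdict

def fit_hurdle(groups, days):
    """Fit a deliberately simple two-stage hurdle model per group.

    Stage 1: P(days > 0), the chance of reporting any bad days.
    Stage 2: mean(days | days > 0), the burden among those who report any.
    """
    by_group = defaultdict(list)
    for g, d in zip(groups, days):
        by_group[g].append(d)

    model = {}
    for g, values in by_group.items():
        positives = [v for v in values if v > 0]
        p_any = len(positives) / len(values)
        mean_pos = sum(positives) / len(positives) if positives else 0.0
        model[g] = (p_any, mean_pos)
    return model

def predict_hurdle(model, group):
    """Expected bad days = P(any bad days) * mean bad days given any."""
    p_any, mean_pos = model[group]
    return p_any * mean_pos

# Hypothetical toy data: bad physical health days in the past 30,
# split by a made-up walking-difficulty indicator.
groups = ["no_difficulty"] * 4 + ["difficulty"] * 4
days = [0, 0, 0, 4, 0, 10, 20, 30]

model = fit_hurdle(groups, days)
for g in ("no_difficulty", "difficulty"):
    p_any, mean_pos = model[g]
    print(f"{g}: P(any)={p_any:.2f}, mean if any={mean_pos:.1f}, "
          f"expected={predict_hurdle(model, g):.1f}")
```

Separating the two stages lets the zeros be modelled as their own process instead of being smoothed over by a single transformation of the target, which is the motivation given for the hurdle approach.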
Project Info
Category
Machine Learning