
In Progress: Machine Learning Analysis of Diabetes-Related Health Outcomes (University of London Coursework)

September 2025 - Present

About This Project

An ongoing machine learning coursework project for ST3189 (Machine Learning) at the University of London, analyzing diabetes-related health outcomes using the CDC Behavioral Risk Factor Surveillance System (BRFSS) dataset. The project requires implementing three core machine learning tasks: unsupervised learning for population segmentation, regression for continuous target variables, and classification for categorical outcomes. The analysis will compare multiple techniques for regression and classification tasks, with a focus on presenting results in an accessible format for audiences with quantitative backgrounds but no prior machine learning knowledge.

Project Details

This project is an ongoing coursework assignment for ST3189 (Machine Learning) at the University of London and contributes 30% to the final course grade. It is currently in its early stages, and I am working with the CDC Behavioral Risk Factor Surveillance System (BRFSS) dataset to analyze diabetes-related health outcomes. To maintain academic integrity, I am not sharing links to my work in progress or the full assignment instructions, but I can describe the project's scope and the dataset being used.

Personal Motivation

This project holds personal significance for me because my family has a history of diabetes. That connection drives my interest in using machine learning to derive meaningful insights from a leading source of health surveillance data. The BRFSS, the world's largest continuously conducted health survey system, provides an exceptional opportunity to explore patterns, risk factors, and outcomes related to diabetes at a population level. By applying unsupervised learning, regression, and classification techniques to this dataset, I hope to uncover insights that contribute to understanding diabetes-related health outcomes and could potentially inform preventive strategies.

Project Requirements

The coursework requires completing three core machine learning tasks, which can be implemented on one or more real-world datasets:

1. Unsupervised Learning: Identifying homogeneous population groups or applying dimension reduction techniques that can be used in the context of the empirical application.
2. Regression: Addressing problems with continuous target variable(s), using multiple regression techniques and comparing their results.
3. Classification: Addressing problems with categorical target variable(s), using multiple classification techniques and comparing their results.
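
Requirements 2 and 3 above call for comparing multiple techniques. One common way to make such a comparison fair is k-fold cross-validation, where every technique is scored on the same held-out folds. The sketch below is only an illustration of that idea, not the coursework's actual method or code: it is written in Python using only the standard library, it runs on synthetic data rather than BRFSS, and the function names (`fit_mean`, `fit_ols`, `cv_rmse`) and the numbers used to generate the data are hypothetical.

```python
import math
import random


def fit_mean(xs, ys):
    """Baseline technique: always predict the training mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_ols(xs, ys):
    """Simple least-squares line with one predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cv_rmse(fit, xs, ys, k=5, seed=0):
    """Pooled root-mean-squared error over k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # same seed -> same folds for every technique
    folds = [idx[i::k] for i in range(k)]
    squared_error = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        squared_error += sum((ys[i] - model(xs[i])) ** 2 for i in fold)
    return math.sqrt(squared_error / len(xs))


# Synthetic data (assumption): one predictor with a linear effect plus noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

baseline_rmse = cv_rmse(fit_mean, xs, ys)
ols_rmse = cv_rmse(fit_ols, xs, ys)
print(f"mean-only baseline RMSE: {baseline_rmse:.3f}")
print(f"least-squares line RMSE: {ols_rmse:.3f}")
```

Because both techniques are evaluated on identical folds, the difference in error reflects the techniques rather than a lucky split. The same pattern would apply to classification by swapping the error measure for one suited to categorical outcomes.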

The project requires presenting each dataset, identifying research questions that the analysis can address, and ideally presenting relevant existing literature to contrast results against. The ability to present and interpret results in accessible language is regarded as equally important as the technical implementation.

Dataset: CDC Behavioral Risk Factor Surveillance System (BRFSS)

The BRFSS is a collaborative surveillance project between US states, participating territories, and the CDC’s National Center for Chronic Disease Prevention and Health Promotion. It is the world’s largest continuously conducted telephone health survey system, designed to collect uniform state-specific data on health risk behaviors, chronic diseases and conditions, access to care, and use of preventive health services related to the leading causes of death and disability in the United States. Since its launch in 1984, it has expanded to all 50 states, the District of Columbia, and several US territories. It uses a dual-frame design that combines landline and cellular telephone interviews, and applies iterative proportional fitting (“raking”) to produce representative, weighted estimates.

For this project, I am focusing on diabetes-related aspects, using variables from the core questionnaire and the optional modules on prediabetes and diabetes available in recent BRFSS cycles (for example, 2024).
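
The raking step mentioned above (iterative proportional fitting) can be illustrated with a small sketch. This is not the CDC's production weighting procedure: it is a minimal Python version, using only the standard library, that rescales respondent weights one dimension at a time until the weighted totals match known population margins. The function names `rake` and `weighted_totals`, the six toy respondents, and the target margins are all hypothetical, and the sketch assumes every category in the sample has a target and that the targets for each dimension sum to the same population total.

```python
def weighted_totals(weights, rows, dim):
    """Sum of weights per category of one dimension (e.g. 'sex')."""
    totals = {}
    for w, row in zip(weights, rows):
        totals[row[dim]] = totals.get(row[dim], 0.0) + w
    return totals


def rake(weights, rows, margins, max_iter=100, tol=1e-9):
    """Iterative proportional fitting of weights to target margins.

    rows:    one dict per respondent, mapping dimension -> category
    margins: dimension -> {category -> target population total}
    """
    w = list(weights)
    for _ in range(max_iter):
        largest_adjustment = 0.0
        for dim, targets in margins.items():
            totals = weighted_totals(w, rows, dim)
            for i, row in enumerate(rows):
                # Scale each weight so this dimension's totals hit the targets.
                factor = targets[row[dim]] / totals[row[dim]]
                largest_adjustment = max(largest_adjustment, abs(factor - 1.0))
                w[i] *= factor
        # Stop once a full pass over all dimensions barely changes any weight.
        if largest_adjustment < tol:
            break
    return w


# Toy sample (assumption): six respondents, two raking dimensions.
respondents = [
    {"sex": "F", "age": "18-44"},
    {"sex": "F", "age": "45+"},
    {"sex": "F", "age": "45+"},
    {"sex": "M", "age": "18-44"},
    {"sex": "M", "age": "18-44"},
    {"sex": "M", "age": "45+"},
]
design_weights = [1.0] * len(respondents)
targets = {
    "sex": {"F": 52.0, "M": 48.0},
    "age": {"18-44": 45.0, "45+": 55.0},
}

raked = rake(design_weights, respondents, targets)
for dim in targets:
    print(dim, weighted_totals(raked, respondents, dim))
```

After raking, the weighted totals for each dimension agree with the targets to within numerical tolerance, which is what lets weighted survey estimates stand in for the population. Any analysis of BRFSS that aims at population-level statements would need to use the survey's supplied weights rather than treating respondents as equally representative.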

The BRFSS data provides rich opportunities for all three required tasks:

- Unsupervised learning can identify distinct health behavior or risk-factor clusters, or reduce dimensionality across hundreds of survey variables
- Regression can model continuous outcomes such as health-related quality-of-life indices or number of days with poor physical or mental health
- Classification can predict categorical outcomes such as diabetes diagnosis status, prediabetes status, or self-reported diabetes management behaviors

The final deliverable will be a 10-page article in A4 format (excluding title page, table of contents, and references), along with well-commented code in R or Python (RMarkdown or Jupyter notebook format). The analysis will be presented in a paper-like format that avoids highly technical language where possible, written for readers with quantitative backgrounds but no prior machine learning knowledge.

The complete BRFSS dataset and documentation can be accessed through the CDC link provided, which offers annual survey data from 1990 to present, along with technical documentation, questionnaires, and supplementary information.

Technologies

R, Machine Learning, Statistical Analysis

Project Information

Category

Machine Learning

Timeframe

September 2025 - Present