Programming for Data Science Coursework: MCMC Algorithms & Flight Data Analysis (University of London)
September 2024 - April 2025
About This Project
This statistical computing project was completed for ST2195 (Programming for Data Science) at the University of London. It consists of two parts: (1) implementation and analysis of the Metropolis-Hastings MCMC algorithm for simulating random numbers from a Laplace distribution, and (2) analysis of commercial flight data from the 2009 ASA Statistical Computing and Graphics Data Expo. The project uses both R and Python and covers topics ranging from MCMC sampling and convergence diagnostics to logistic regression modeling and large-scale data analysis.
Project Details
This project was completed as coursework for ST2195 (Programming for Data Science) at the University of London. To maintain academic integrity, I am not sharing links to my submitted work or the full assignment instructions. However, I can provide the dataset link and describe the project's scope and methodology.
Part 1: Metropolis-Hastings MCMC Algorithm
The first part of the project focused on implementing and analyzing the Metropolis-Hastings algorithm, specifically the random walk Metropolis variant. The goal was to simulate random numbers from a distribution with probability density function f(x) = (1/2)exp(-|x|), which is a Laplace distribution. Key tasks included:
- Implementing the random walk Metropolis algorithm with N = 10,000 iterations and step size s = 1
- Constructing histograms and kernel density plots to visualize the generated samples
- Overlaying the theoretical density function to assess the quality of the estimates
- Computing Monte Carlo estimates of the mean and standard deviation
- Implementing convergence diagnostics using the R̂ (R-hat) statistic
- Analyzing convergence behavior across different step sizes (s values from 0.001 to 1) with multiple chains
The implementation required careful attention to numerical stability, using log-space calculations (log u < log r) to avoid underflow errors when computing acceptance ratios.
Part 2: Flight Data Analysis
The second part involved analyzing the 2009 ASA Statistical Computing and Graphics Data Expo dataset, which contains flight arrival and departure details for all commercial flights on major carriers within the USA from October 1987 to April 2008. This is a massive dataset with nearly 120 million records (12 GB uncompressed).
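Looking back at Part 1, the random walk Metropolis procedure and the R̂ check can be sketched in a few lines of Python. This is a minimal illustration, not the submitted coursework code. It assumes a Normal(current value, s) proposal, and it uses one simplified form of R̂, sqrt((B + W) / W), where W is the average within-chain variance and B is the variance of the chain means. The function names, the seeds, the four starting points, and the chain length of 2,000 are hypothetical choices made for this example.

```python
import math
import random
import statistics


def log_target(x):
    """Log-density of f(x) = (1/2) exp(-|x|)."""
    return math.log(0.5) - abs(x)


def random_walk_metropolis(n, s, x0=0.0, seed=None):
    """Return n samples; proposals are Normal(current, s) (an assumption)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n):
        proposal = rng.gauss(x, s)
        # Log acceptance ratio: log r = log f(proposal) - log f(x).
        log_r = log_target(proposal) - log_target(x)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        log_u = math.log(1.0 - rng.random())
        if log_u < log_r:  # accept in log space to avoid underflow
            x = proposal
        samples.append(x)
    return samples


def r_hat(chains):
    """Simplified R-hat = sqrt((B + W) / W) over several chains (assumed form)."""
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average within-chain variance; B: variance of the chain means.
    within = [statistics.pvariance(c, m) for c, m in zip(chains, chain_means)]
    w = statistics.fmean(within)
    b = statistics.pvariance(chain_means)
    return math.sqrt((b + w) / w)


if __name__ == "__main__":
    samples = random_walk_metropolis(n=10_000, s=1.0, seed=1)
    print("mean:", statistics.fmean(samples))
    print("sd:  ", statistics.stdev(samples))

    # Hypothetical starting points; one chain per starting point.
    starts = (-2.0, -1.0, 1.0, 2.0)
    for s in (0.001, 0.01, 0.1, 1.0):
        chains = [random_walk_metropolis(2_000, s, x0=x0, seed=j)
                  for j, x0 in enumerate(starts)]
        print("s =", s, "R-hat =", r_hat(chains))
```

For this density the true mean is 0 and the true standard deviation is sqrt(2), roughly 1.414, so the Monte Carlo estimates with s = 1 should land near those values. With very small step sizes the chains barely move away from their starting points, which shows up as an R̂ value far above 1.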
I selected a subset of five consecutive years and addressed three main research questions:
(a) Best times and days to minimize delays: Analyzed patterns in flight delays across different times of day and days of the week to identify optimal travel windows for minimizing delay risk.
(b) Aircraft age and delays: Evaluated whether older aircraft experience more delays on a year-to-year basis, requiring careful data wrangling to match aircraft information with flight records.
(c) Logistic regression for flight diversions: Built logistic regression models to predict the probability of flight diversions using features including:
- Departure date attributes (day of week, month, season)
- Scheduled departure and arrival times
- Geographic coordinates and distance between departure and arrival airports
- Carrier information
The models were fitted separately for each year, and coefficient visualizations were created to show how relationships between predictors and diversion probability evolved over time.
Technical Approach
The project required proficiency in both R and Python, with code organized in RMarkdown and Jupyter notebooks respectively. The workflow involved extensive data cleaning, feature engineering, statistical modeling, and visualization. For the flight data analysis, I implemented efficient data processing techniques to handle the large dataset size, including strategic subsetting, aggregation, and database-like operations.
The complete dataset and supplementary information can be accessed through the Harvard Dataverse link provided.
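As a small illustration of the aggregation behind question (a), the sketch below groups flights by day of week and by scheduled departure hour and compares mean arrival delays. It is not the coursework code. The rows are made up purely for demonstration, the column names (DayOfWeek, CRSDepTime, ArrDelay) are assumed to follow the Data Expo CSV layout, and the helper name mean_delay_by is hypothetical. A real analysis on millions of rows would more likely use a data frame library or a database than plain Python dictionaries.

```python
from collections import defaultdict
from statistics import fmean

# Made-up rows for illustration only (not real flight records).
# Assumed layout: DayOfWeek 1 = Monday, CRSDepTime as hhmm, ArrDelay in minutes.
flights = [
    {"DayOfWeek": 1, "CRSDepTime": 615, "ArrDelay": -3.0},
    {"DayOfWeek": 1, "CRSDepTime": 1745, "ArrDelay": 25.0},
    {"DayOfWeek": 3, "CRSDepTime": 630, "ArrDelay": 2.0},
    {"DayOfWeek": 3, "CRSDepTime": 1710, "ArrDelay": 12.0},
    {"DayOfWeek": 5, "CRSDepTime": 1750, "ArrDelay": 41.0},
    {"DayOfWeek": 5, "CRSDepTime": 655, "ArrDelay": None},  # no recorded delay
]


def mean_delay_by(rows, key_fn):
    """Mean arrival delay per group, skipping rows without a recorded delay."""
    groups = defaultdict(list)
    for row in rows:
        delay = row["ArrDelay"]
        if delay is None:
            continue
        groups[key_fn(row)].append(delay)
    return {key: fmean(values) for key, values in sorted(groups.items())}


by_day = mean_delay_by(flights, lambda r: r["DayOfWeek"])
# Integer division by 100 turns an hhmm time such as 1745 into the hour 17.
by_hour = mean_delay_by(flights, lambda r: r["CRSDepTime"] // 100)

best_day = min(by_day, key=by_day.get)
best_hour = min(by_hour, key=by_hour.get)
print("mean delay by day of week:", by_day)
print("mean delay by departure hour:", by_hour)
print("lowest mean delay: day", best_day, "hour", best_hour)
```

Grouping on the scheduled departure time rather than the actual one keeps the comparison tied to what a traveler can choose when booking, which is one reasonable design choice among several.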
Technologies
Project Information
Category
Statistical Computing & Data Analysis
Timeframe
September 2024 - April 2025