Big Data Pipelines for Severe Weather Analysis
The goal of this ongoing project is to use machine learning methods to analyze the biases in severe weather prediction models. The primary obstacle is formatting the data so that the most robust predictors are accessible for analysis. These prediction models combine weather balloon, radar, and satellite data, and each source has its own limitations and formatting conventions. For my analysis, which you can check out on my GitHub site, I wrote several programs for compiling and analyzing the massive historical datasets publicly available from the NOAA website. The dataframes that I built use raw variables such as temperature and pressure to compute time series of high-level variables such as potential energy and wind helicity. These high-level variables are then stored in a simple HDF5 format. I am currently scaling this procedure with Apache Spark.
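As an illustration of how a high-level variable can be derived from raw fields, here is a minimal sketch of a storm-relative helicity calculation from a wind profile. It is not the code from the project repository: the function name, its parameters, and the choice to drop any layer that extends above the integration top (rather than interpolating to it) are all assumptions made for the example.

```python
def storm_relative_helicity(heights_m, u_ms, v_ms, storm_u, storm_v, top_m=3000.0):
    """Approximate storm-relative helicity (m^2/s^2) from a wind profile.

    Hypothetical helper for illustration only.
    heights_m : heights above ground, increasing, in metres
    u_ms, v_ms: eastward / northward wind components at those heights (m/s)
    storm_u, storm_v: assumed storm motion components (m/s)
    top_m     : top of the integration layer (0-3 km is a common choice)
    """
    srh = 0.0
    for n in range(len(heights_m) - 1):
        # Simplification: skip layers that extend above the top
        # instead of interpolating the wind to exactly top_m.
        if heights_m[n + 1] > top_m:
            break
        # Storm-relative wind at the bottom and top of the layer.
        a0, b0 = u_ms[n] - storm_u, v_ms[n] - storm_v
        a1, b1 = u_ms[n + 1] - storm_u, v_ms[n + 1] - storm_v
        # Discrete form of the helicity integral for one layer.
        srh += a1 * b0 - a0 * b1
    return srh


# Wind veering from southerly at the surface to westerly at 1 km,
# with a stationary storm, gives positive helicity.
example = storm_relative_helicity([0, 1000], [0, 10], [10, 0], 0.0, 0.0)
```

A hodograph that turns clockwise with height yields a positive value, while a straight hodograph passing through the storm motion yields zero, which is a quick sanity check on the sign convention.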
The most comprehensive weather data comes from balloon soundings, which record a full vertical profile of the atmosphere as they travel skyward. Every robust predictor used today could in principle be calculated from this raw data. The problem is that balloon data is sparse, and sounding stations cannot always take data on schedule. The resulting gaps introduce high variability, which can create local minima for a machine learning algorithm. To validate the denser gridded data used in my analysis, I compared the gridded interpolation generated by the prediction models to real physical balloon data. This figure illustrates one such comparison, using a model called "Rapid Refresh" (RAP for short) and a sounding from Dallas on 6/16/2014, and shows very good agreement.
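One way to put a number on "very good agreement" is to interpolate the model profile onto the balloon's pressure levels and summarize the differences. The sketch below is an assumption about how such a check could look, not the validation code used here: the function names are hypothetical, interpolation is linear in log-pressure, and the summary statistics are mean bias and root-mean-square error.

```python
import math


def interp_log_pressure(p_target, p_levels, values):
    """Interpolate `values` linearly in ln(pressure).

    p_levels must be strictly decreasing (surface first), in hPa.
    Returns None if p_target lies outside the model's pressure range.
    """
    for i in range(len(p_levels) - 1):
        p_hi, p_lo = p_levels[i], p_levels[i + 1]
        if p_hi >= p_target >= p_lo:
            w = (math.log(p_hi) - math.log(p_target)) / (math.log(p_hi) - math.log(p_lo))
            return values[i] + w * (values[i + 1] - values[i])
    return None


def compare_profiles(sonde_p, sonde_t, model_p, model_t):
    """Return (mean bias, RMSE) of model minus sounding temperature."""
    diffs = []
    for p, t in zip(sonde_p, sonde_t):
        m = interp_log_pressure(p, model_p, model_t)
        if m is not None:  # skip balloon levels the model does not cover
            diffs.append(m - t)
    if not diffs:
        raise ValueError("no overlapping pressure levels")
    bias = sum(diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return bias, rmse


# Made-up numbers, purely to show the call shape.
bias, rmse = compare_profiles([1000, 500], [19.0, -11.0], [1000, 500], [20.0, -10.0])
```

Interpolating in log-pressure rather than pressure is a common choice because temperature varies more nearly linearly with height, and height is roughly proportional to the logarithm of pressure.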
With the gridded data validated, we can use it to derive high-level variables relevant to severe weather prediction over the entire United States as a time series. One of the most basic variables is the amount of potential energy in the atmosphere; tornadoes are one of nature's ways of dissipating that energy when it builds up to extreme levels. The figure below shows the (surface-based) potential energy during the famous El Reno tornado of 5/31/2013. This tornado was the widest ever recorded, about 2.6 miles across, with wind speeds clocked at over 295 miles per hour, among the highest ever observed. In this plot we can see the build-up of energy before touchdown (left), the initial dissipation just after touchdown (middle), and the environment just after the event (right).
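Assuming the potential energy here is convective available potential energy (CAPE), it can be approximated by integrating the buoyancy of a lifted parcel over height. The sketch below is a simplified illustration rather than the method used for the plot: the function name is hypothetical, the parcel and environment virtual temperature profiles are taken as given (computing the lifted parcel's profile is the harder part and is omitted), and all positive buoyancy is summed instead of locating the level of free convection and the equilibrium level.

```python
G = 9.81  # gravitational acceleration, m/s^2


def cape_from_profiles(heights_m, parcel_tv_k, env_tv_k):
    """Approximate CAPE (J/kg) by trapezoidal integration of positive buoyancy.

    Hypothetical helper for illustration only.
    heights_m  : heights, increasing, in metres
    parcel_tv_k: virtual temperature of the lifted parcel at each height (K)
    env_tv_k   : virtual temperature of the environment at each height (K)
    """
    # Buoyancy at each level, clipped at zero so that only levels where
    # the parcel is warmer than its surroundings contribute.
    buoyancy = [
        max(0.0, G * (tp - te) / te)
        for tp, te in zip(parcel_tv_k, env_tv_k)
    ]
    cape = 0.0
    for n in range(len(heights_m) - 1):
        dz = heights_m[n + 1] - heights_m[n]
        cape += 0.5 * (buoyancy[n] + buoyancy[n + 1]) * dz
    return cape


# A parcel 1 K warmer than a 300 K environment through a 1 km layer.
example = cape_from_profiles([0.0, 1000.0], [301.0, 301.0], [300.0, 300.0])
```

Applying a function like this to every grid point at every model time step is what would turn the gridded temperature and pressure fields into a map-based time series of the kind described above.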