Python
Georgia Tech Master’s in Analytics Practicum Project
For the Practicum capstone of my Master’s in Analytics degree at Georgia Tech, I worked with Sandia National Laboratories to study analytical methods for distinguishing and classifying healthy and degraded signals. The project was motivated by the need to facilitate proactive maintenance and reduce downtime in mechanical and manufacturing equipment that had, over time, deviated from normal output patterns and begun to exhibit behavioral anomalies.
This project evaluates the ability of four neural network architectures to detect anomalies in financial data and distinguish between healthy and degraded signals of stock index price and trading activity. Each architecture comprises an autoencoder, suited to detecting patterns in time series data, paired with a dense binary classifier for discriminating between healthy and degraded signals. The autoencoder types examined are a simple recurrent neural network (RNN), long short-term memory (LSTM), gated recurrent unit (GRU), and transformer.
Each architecture is trained and evaluated on daily closing price data of the Dow Jones Industrial Average, as well as a number of engineered statistical and technical financial features incorporating the Dow’s historical price and trading volume. Starting with the Dow’s daily closing price, tracked from its inception on 26 May 1896, trading activity is partitioned into 413 60-day periods. Each period constitutes a signal, some of which exhibit anomalous, degraded price behavior due to market volatility or other factors. Degraded signals are labeled by determining whether each 60-day period contains any anomalous data points or trends in price activity. Anomalies are identified using two distinct detection measures: median absolute deviation (MAD) and rolling Z-score analysis.
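To illustrate this labeling logic, here is a minimal Python sketch of the two measures; the window size and cutoff thresholds are placeholder assumptions, not the project's tuned values.

import pandas as pd

def mad_anomalies(prices: pd.Series, threshold: float = 3.5) -> pd.Series:
    # Modified Z-score based on median absolute deviation; 0.6745 scales
    # MAD to be consistent with the standard deviation for normal data.
    median = prices.median()
    mad = (prices - median).abs().median()
    modified_z = 0.6745 * (prices - median) / mad
    return modified_z.abs() > threshold

def rolling_z_anomalies(prices: pd.Series, window: int = 20,
                        threshold: float = 3.0) -> pd.Series:
    # Flag points more than `threshold` standard deviations from a
    # `window`-day rolling mean.
    z = (prices - prices.rolling(window).mean()) / prices.rolling(window).std()
    return z.abs() > threshold

def label_period(prices: pd.Series) -> int:
    # A 60-day period is labeled degraded if either measure flags any point.
    return int((mad_anomalies(prices) | rolling_z_anomalies(prices)).any())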
For each period, a number of features are engineered to capture long- and short-term trends in price and volume, as well as technical financial measures of volatility, momentum, and the interplay between price and volume. For each of the four architectures, the autoencoder component is trained on healthy signals only, allowing it to learn expected patterns. The autoencoder’s performance is assessed on a validation set of healthy signals, and the reconstruction errors from this validation step are used to set a threshold for flagging anomalous data points.
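A minimal sketch of this training-and-thresholding step, assuming a Keras-style sequence autoencoder; the layer size, epoch count, and percentile cutoff are illustrative assumptions rather than the project's documented choices.

import numpy as np
from tensorflow.keras import layers, models

TIMESTEPS, N_FEATURES = 60, 1

# Toy sequence autoencoder: encode each 60-step window to a latent
# vector, then reconstruct the sequence from it.
inputs = layers.Input(shape=(TIMESTEPS, N_FEATURES))
latent = layers.GRU(32)(inputs)                   # encoder -> latent vector
decoded = layers.RepeatVector(TIMESTEPS)(latent)  # expand latent across time
decoded = layers.GRU(32, return_sequences=True)(decoded)
outputs = layers.TimeDistributed(layers.Dense(N_FEATURES))(decoded)

autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mae")

# Stand-in data; in the project these are healthy 60-day signals.
healthy_train = np.random.rand(100, TIMESTEPS, N_FEATURES)
healthy_val = np.random.rand(20, TIMESTEPS, N_FEATURES)
autoencoder.fit(healthy_train, healthy_train, epochs=5, verbose=0)

# Threshold from healthy-validation reconstruction errors; the 99th
# percentile is an illustrative cutoff, not the project's tuned rule.
val_recon = autoencoder.predict(healthy_val, verbose=0)
val_errors = np.mean(np.abs(healthy_val - val_recon), axis=(1, 2))
threshold = np.percentile(val_errors, 99)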
Each architecture’s trained autoencoder component then predicts on a holdout data set of labeled healthy and degraded signal samples. The latent representation produced by the autoencoder in this step serves as training data for the binary classifier component, which learns to distinguish between healthy and degraded signals based on the established anomaly threshold. After training on these latent representations, the classifier outputs a binary prediction of healthy or degraded for each signal; its performance is evaluated on a final test data set of labeled healthy and degraded signals.
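Continuing the sketch above, the trained encoder half can be reused to produce latent vectors for a small dense classifier; the classifier's layer widths here are assumptions.

from tensorflow.keras import layers, models
import numpy as np

# Reuse the trained encoder half of the autoencoder sketched above.
encoder = models.Model(inputs, latent)

classifier = models.Sequential([
    layers.Input(shape=(32,)),               # latent size from the encoder
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # P(signal is degraded)
])
classifier.compile(optimizer="adam", loss="binary_crossentropy",
                   metrics=["accuracy"])

# Stand-in holdout data: 1 = degraded, 0 = healthy.
holdout_signals = np.random.rand(40, 60, 1)
holdout_labels = np.random.randint(0, 2, size=40)

latents = encoder.predict(holdout_signals, verbose=0)
classifier.fit(latents, holdout_labels, epochs=5, verbose=0)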
The four architectures are evaluated using accuracy, precision, recall, ROC-AUC score, and F1 score. The GRU is selected to form the autoencoder component of a final hyperparameter-tuned model, which consists of three GRU layers with neuron sizes 512, 256, and 128, along with three dense binary classifier layers. The optimized model achieves outstanding performance across all metrics, including accuracy of 0.964, precision of 1.0, and Fβ score of 0.994, using tuned parameter β = 0.1 and tuned binary decision threshold θ = 0.879. The results suggest this framework (a multi-layer autoencoder to detect sequential patterns, coupled with robust anomaly detection logic and a multi-layer dense binary classifier to distinguish between healthy and degraded signal behavior) can be used to process signals and flag anomalous behavior across a variety of contexts.
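The shape of the final model, sketched in Keras: the GRU sizes (512, 256, 128) and decision threshold θ = 0.879 come from the results above, while the mirrored decoder and the dense layer widths are assumptions.

from tensorflow.keras import layers, models

def build_tuned_autoencoder(timesteps: int = 60, n_features: int = 1):
    # Three stacked GRU encoder layers; the mirrored decoder is assumed.
    return models.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        layers.GRU(512, return_sequences=True),
        layers.GRU(256, return_sequences=True),
        layers.GRU(128),                          # latent representation
        layers.RepeatVector(timesteps),
        layers.GRU(128, return_sequences=True),
        layers.GRU(256, return_sequences=True),
        layers.GRU(512, return_sequences=True),
        layers.TimeDistributed(layers.Dense(n_features)),
    ])

def build_classifier(latent_dim: int = 128):
    # Three dense layers ending in a sigmoid; widths are placeholders.
    return models.Sequential([
        layers.Input(shape=(latent_dim,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

THETA = 0.879  # tuned binary decision threshold from the results above

def predict_degraded(classifier, latents):
    # Apply the tuned threshold instead of the default 0.5 cutoff.
    return (classifier.predict(latents, verbose=0) >= THETA).astype(int).ravel()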
Code and data here.
R
Election Fraud in Russia
Benford’s law is a statistical phenomenon by which the leading digits of naturally occurring collections of numbers skew toward lower values: a leading 1 occurs nearly seven times as frequently as a leading 9. It is also a common basis for detecting fraud in accounting, scientific publications, and elections. This project extends principles of Benford’s law to investigate allegations of tampering in the 2011 Russian election for state legislature. Code and data here.
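That ratio follows from the law's expected frequencies, under which digit d leads with probability log10(1 + 1/d); a quick check (in Python here, though the project itself is in R):

import math

# Expected leading-digit frequencies under Benford's law: log10(1 + 1/d).
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
print(round(benford[1], 3), round(benford[9], 3))  # 0.301 0.046
print(round(benford[1] / benford[9], 1))           # 6.6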
Using NLP to Analyze Constitutional Preambles
The United States Constitution outlines the systems and principles upholding the world’s oldest democracy. This project explores the academic debate over whether the linguistic and ideological influence of the U.S. Constitution has waned over time. Using fundamental tools of natural language processing, we can analyze the textual similarity of constitutional preambles to evaluate the U.S. Constitution’s relationship to other founding documents and its importance within the global network of constitutional inspiration. Code and data here.
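One standard way to quantify such similarity is to compare TF-IDF vectors by cosine similarity; a minimal sketch along those lines, shown in Python (the project's actual pipeline may differ):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Truncated stand-ins; the real analysis uses full preamble texts.
preambles = [
    "We the People of the United States, in Order to form a more perfect Union...",
    "We, the people of India, having solemnly resolved to constitute India...",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(preambles)
print(cosine_similarity(tfidf))  # pairwise preamble similarity matrix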
Predicting the United States Presidential Election
The efficient market hypothesis of economics states that market prices reflect all available information, shared equally among market participants. For the stock market, this implies that it is inherently impossible to “beat the market” consistently over the long run. But what about other markets? Can betting market data, for example, accurately predict the outcome of an election? And how does it fare against polls, the commonplace method of pre-election forecasting? Code and data here.
Voting Patterns in the United Nations General Assembly
Over the last half-century, the U.S.-Soviet rivalry and its aftermath have shaped culture, trade, and the global geopolitical balance of power. Voting data from the United Nations General Assembly provides a faithful, straightforward proxy for scrutinizing nations’ attitudes toward these key international developments, and can serve as a starting point for analyzing, or even foretelling, continued realignment of political power structures. Code and data here.
Efficacy of Small Class Size in Improving Educational Outcomes
Small classes have long been assumed to benefit students, providing them more targeted, individualized instruction conducive to greater engagement and the development of stronger mental models. But how effective are small classes in the early years of education at establishing foundations for continued growth? This project analyzes data from the STAR educational study on the efficacy of small kindergarten classes in improving standardized test scores and high school graduation rates. Code and data here.
Understanding World Population Dynamics
Demography is the study of the dynamics of human population, captured by the quantification and interplay of births, deaths, income, and disease incidence. This project explores fundamental demographic statistics of the populations of Kenya, Sweden, and the world at large. Such study engenders a more thorough understanding of the underlying socioeconomic factors responsible for changing population structures. Code and data here.
U.S. COVID-19 Death Projections
Using simple linear regression, we predict COVID-19 deaths across all 50 states in the first week of November 2020 based on pandemic history, population density, hospital and healthcare access, mask usage, geographic region, reopening protocols and compliance, and political sentiment. Code and data here.
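A minimal sketch of this kind of model, shown in Python with hypothetical stand-in columns and toy values (the project itself is in R, and its feature set is richer):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy rows; column names stand in for the real predictors.
df = pd.DataFrame({
    "population_density": [105.2, 251.3, 39.9, 871.0],
    "mask_usage_rate": [0.74, 0.88, 0.61, 0.92],
    "hospital_beds_per_1k": [2.4, 3.1, 2.0, 2.7],
    "deaths_first_week_nov": [310, 512, 128, 689],
})
X = df.drop(columns="deaths_first_week_nov")
y = df["deaths_first_week_nov"]

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_.round(2))))  # fitted coefficients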
California Wildfires, 1992-2015
Using simple linear regression, as required by the project specifications, I attempt to predict the size of California wildfires between 1992 and 2015 based on meteorological conditions, in an effort to better inform policy-making around the likelihood of a fire spreading to devastating proportions.
The attempt is largely unsuccessful, indicating that the project should be refined to use a non-linear machine learning technique to derive the relative importance of particular features and achieve stronger model performance. Code and data here.
NBA Master Scrape
This script automates the scraping, cleaning, and formatting of 24 seasons of NBA game and gambling results from Goldsheet.com, an exercise in organizing messy, real-world data and addressing its idiosyncrasies. Code here.
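A heavily simplified sketch of the scraping pattern, in Python for consistency with the examples above; the URL path and parsing logic here are purely illustrative, and the real pages have quirks the script handles.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Illustrative only: the actual Goldsheet URL scheme and page layout differ.
url = "https://www.goldsheet.com/historic/nbalog.html"  # hypothetical path
text = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser").get_text()

# Split raw lines into fields, then normalize into a DataFrame for cleaning.
rows = [line.split() for line in text.splitlines() if line.strip()]
games = pd.DataFrame(rows).dropna(how="all").reset_index(drop=True)
print(games.head())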