Data Science Projects

R

Election Fraud in Russia
Benford’s law is a statistical phenomenon in which the leading digits of naturally occurring collections of numbers are skewed toward lower values: 1 occurs as the leading digit roughly six and a half times as often as 9. It is also a common tool for detecting fraud in accounting, scientific publications, and elections. This project extends the principles of Benford’s law to investigate allegations of tampering in the 2011 Russian legislative election. Code and data here.
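
As a rough illustration of the idea (not the project's actual code, which is written in R and may examine different digits), the sketch below computes the leading-digit shares that Benford's law predicts and compares them with the shares observed in a list of counts. The function names and the sample counts are invented for this example.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit_shares(values):
    """Observed share of each leading digit among the positive integers given."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


expected = benford_expected()
ratio = expected[1] / expected[9]  # how much more often 1 leads than 9

# Hypothetical precinct vote counts, invented purely for illustration
sample_counts = [1204, 187, 2311, 142, 967, 1530, 318, 1099, 45, 176]
observed = leading_digit_shares(sample_counts)

for d in range(1, 10):
    print(f"{d}: expected {expected[d]:.3f}, observed {observed[d]:.3f}")
```

A sample this small says nothing about fraud on its own; a real analysis would need many more observations and a formal test of the gap between observed and expected shares.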

Using NLP to Analyze Constitutional Preambles
The United States Constitution outlines the systems and principles upholding the world’s oldest democracy. This project explores the academic debate over whether the linguistic and ideological influence of the U.S. Constitution has waned over time. Utilizing fundamental tools of natural language processing, we can analyze the textual similarity of constitutional preambles in order to evaluate the U.S. Constitution’s relationships with other founding documents and its importance within the global network of constitutional inspiration. Code and data here.
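
One common way to measure textual similarity is cosine similarity over word counts. The sketch below is a minimal version under that assumption; the project itself may use a different weighting or method, and the second preamble here is an invented placeholder rather than a real document.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


us_opening = "We the People of the United States, in Order to form a more perfect Union"
# Invented placeholder text, not taken from any real constitution
other_opening = "We the people of this nation, in order to secure a lasting union"

similarity = cosine_similarity(us_opening, other_opening)
print(f"similarity: {similarity:.3f}")
```

Scores range from 0 (no shared words) to 1 (identical word distributions), so pairwise scores across many preambles can be arranged into a similarity matrix or network.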

Predicting the United States Presidential Election
The efficient market hypothesis holds that market prices reflect all available information. For the stock market, this implies that it is impossible to “beat the market” consistently over the long run. But what about other markets? Can betting market data, for example, accurately predict the outcome of an election? And how do they fare against a commonplace method of pre-election forecasting: polls? Code and data here.

Voting Patterns in the United Nations General Assembly
Over the last half-century the U.S.-Soviet rivalry and its aftermath have shaped culture, trade, and the global geopolitical power balance. Voting data from the United Nations General Assembly provides a faithful, straightforward proxy for scrutinizing world nations’ attitudes with regard to these key international developments, and can serve as a starting point to foretell or analyze continued realignment of political power structures. Code and data here.

Efficacy of Small Class Size in Improving Educational Outcomes
Small classes have long been assumed to benefit students, providing them more targeted, individualized instruction conducive to greater engagement and the development of stronger mental models. But how effective are small classes in the early years of education at establishing foundations for continued growth? This project analyzes data from the STAR educational study on the efficacy of small kindergarten classes in improving standardized test scores and high school graduation rates. Code and data here.

Understanding World Population Dynamics
Demography is the study of the dynamics of human population, captured by the quantification and interplay of births, deaths, income, and disease incidence. This project explores fundamental demographic statistics of the populations of Kenya, Sweden, and the world at large. Such study engenders a more thorough understanding of the underlying socioeconomic factors responsible for changing population structures. Code and data here.

U.S. COVID-19 Death Projections
Using linear regression, we predict COVID-19 deaths across all 50 states in the first week of November 2020 based on pandemic history, population density, hospital and healthcare access, mask usage, geographic region, reopening protocols and compliance, and political sentiment. Code and data here.

California Wildfires, 1992-2015
Using linear regression, as the project required, I attempt to predict the size of California wildfires between 1992 and 2015 from meteorological conditions, in an effort to better inform policy-making based on the likelihood that a fire will spread to devastating proportions.

The attempt is largely unsuccessful, suggesting that the project should be refined to use a non-linear machine learning technique to derive the relative importance of particular features and achieve stronger model performance. Code and data here.

NBA Master Scrape
This script automates the scraping, cleaning, and formatting of 24 seasons of NBA game and gambling results from Goldsheet.com, showcasing skills in organizing and addressing the idiosyncrasies of messy, real-world data. Code here.


Python

Predicting Titanic Passengers’ Survival
Identify the best machine learning model and parameters for successfully predicting whether passengers survived the wreck. Code and data here.

Neural Networks Digits Classifier
Build deep, feedforward neural networks to classify images of handwritten digits. Code and data here.

Predicting Car Prices
Use k-nearest neighbors (KNN) regression to predict price from a number of physical automobile features, comparing the performance of univariate, multivariate, and hyperparameter-optimized models. Code and data here.
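
The core of KNN regression can be sketched in a few lines: find the k training cars closest to the query in feature space and average their prices. This is an illustrative sketch only; the feature choice (horsepower, curb weight), the data, and the function name are assumptions, and the project itself presumably relies on a library implementation.

```python
import math


def knn_predict(train_rows, train_prices, query, k=3):
    """Predict a price as the mean price of the k nearest training rows."""
    distances = sorted(
        (math.dist(row, query), price)
        for row, price in zip(train_rows, train_prices)
    )
    nearest = distances[:k]
    return sum(price for _, price in nearest) / len(nearest)


# Hypothetical (horsepower, curb weight) rows and prices, invented for illustration
train_rows = [(100, 2000), (110, 2100), (200, 3000), (210, 3100)]
train_prices = [10000, 12000, 30000, 32000]

estimate = knn_predict(train_rows, train_prices, query=(105, 2050), k=2)
print(estimate)
```

In practice the features should be rescaled before computing distances, since a column with a large numeric range (such as weight) would otherwise dominate the result. The value of k is the hyperparameter that a tuning step would vary.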

Naive Bayes Spam Filter
Create a spam filter, built on simple natural language processing principles, that correctly classifies over 98% of new SMS messages. Code and data here.
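
A multinomial Naive Bayes classifier with Laplace smoothing is one standard way to build such a filter. The sketch below assumes that approach; the four training messages are invented, and nothing here reproduces the project's data or its reported accuracy.

```python
import math
import re
from collections import Counter


def tokenize(message):
    """Lowercase a message and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", message.lower())


def train(messages):
    """Count words per label from a list of (label, text) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for label, text in messages:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary


def classify(text, model, alpha=1.0):
    """Return the label with the higher smoothed log-probability."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            numerator = word_counts[label][word] + alpha
            denominator = total_words + alpha * len(vocabulary)
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training messages, for illustration only
training = [
    ("spam", "win a free prize now"),
    ("spam", "free cash claim your prize"),
    ("ham", "are we still meeting for lunch"),
    ("ham", "see you at lunch tomorrow"),
]
model = train(training)
print(classify("claim your free prize", model))
print(classify("lunch tomorrow", model))
```

Summing log-probabilities rather than multiplying raw probabilities avoids numeric underflow on longer messages, and the smoothing term alpha keeps a word seen under only one label from zeroing out the other label's score.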

Predicting Washington D.C. Bike Rentals
Understand how predictive success changes with the tuning of the parameters of the random forest algorithm. Code and data here.

Analyzing SAT Performance in NYC Schools
Understand how race, geography and school safety relate to performance on the SAT in New York City high schools. Code and data here.

Popular Data Science Questions
Provide a recommendation for the specific type of deep learning content a company should create for maximal online engagement. Code and data here.

Employee Exit Surveys
Plot and analyze which experience brackets of workers are the most likely to resign over job dissatisfaction. Code and data here.

Car Sales Data
Analyze a data set of over 50,000 eBay classified ads for German automobiles. Code and data here.

‘Hacker News’ Posts
Provide a recommendation for the optimal time to post on the Hacker News web forum in order to maximize post engagement. Code and data here.

App Profiles
Provide a recommendation for which genres of free apps are best-suited for high engagement across the App Store and Google Play. Code and data here.


SQL

Music Store Data and Business Questions
Perform complex queries that draw qualitative and quantitative conclusions from a range of sales and customer data for a digital music store. Code and data here.

CIA Factbook
Use basic querying of the CIA Factbook to understand which nations are densely populated. Code and data here.
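
A population-density query of this kind might look like the sketch below, run here through Python's built-in sqlite3 module against an in-memory table. The table name, column names, and rows are assumptions made for illustration and are not taken from the actual Factbook database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the real database may name these columns differently
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER, area_land REAL)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Examplia", 1_000_000, 500.0),      # invented rows, not real countries
        ("Samplestan", 2_000_000, 40_000.0),
        ("Testland", 500_000, 0.0),          # zero land area, filtered out below
    ],
)

query = """
SELECT name, CAST(population AS REAL) / area_land AS density
FROM facts
WHERE area_land > 0
ORDER BY density DESC
"""
rows = conn.execute(query).fetchall()
for name, density in rows:
    print(name, density)
conn.close()
```

Filtering out rows with zero land area avoids division by zero, and casting the population to a real number keeps the division from truncating to an integer.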