Election Doodle Dandy
Original image replaced by one that auto-updates
The original image can be found here
Election Polling Data and Models
For several election cycles, RealClearPolitics has maintained a running average of election polls on a state-by-state basis. This provides a quick snapshot of who is leading in each state and by how much. It doesn’t provide estimates of the probability that a candidate would carry a state if the election were held today.
On the other hand, the highly respected Nate Silver at FiveThirtyEight runs an election model based on state-by-state probabilities. But his model weights the polling data and uses several inputs in addition to the polls, such as the state’s prior voting patterns, pollster bias, economic trends and national polling trends.
I wanted something in between these two approaches that would provide the probability that a presidential candidate would be elected today based solely on the current polling data. The first step is simple enough: gather the current polls for each state. For this exercise, I am using only the states listed as “toss-up” or “battleground” states at RCP: CO, FL, IA, MI, NV, NH, OH, PA, VA and WI.
I wrote a small Python script to read the current polling data from the RealClearPolitics battleground state web pages and to calculate the initial statistics. One item to be careful of is that the order in which the numbers for Romney and Obama are displayed changes depending upon which candidate is leading.
Simple Mean: The first stat to calculate is the simple mean of the split in support (‘+2.5 O’ or ‘+1.8 R’). Interestingly, this isn’t what RCP displays as the rolling average. RCP appears to average the support percentages for each candidate separately, and then calculate the split from those two averages. In most cases the two methods give the same average, but in some cases rounding errors lead to different means even though both are based on the same data.
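To make the two averaging methods concrete, here is a minimal sketch. The function name `signed_margin`, the sign convention (positive favors Obama), the handling of a "Tie" entry, and the three polls are all made up for illustration. They are not taken from my script or from RCP.

```python
from statistics import mean

def signed_margin(spread: str) -> float:
    """Turn a spread such as '+2.5 O' or '+1.8 R' into a signed number.

    Positive values favor Obama, negative values favor Romney
    (an arbitrary convention chosen for this sketch).
    """
    text = spread.strip()
    if text.lower() == "tie":  # assumed spelling of a tied poll
        return 0.0
    value, leader = text.split()
    sign = 1.0 if leader.upper().startswith("O") else -1.0
    return sign * float(value)

# Hypothetical polls: (Obama %, Romney %, displayed spread)
polls = [(49, 47, "+2 O"), (46, 48, "+2 R"), (50, 45, "+5 O")]

# Method 1: average the spreads directly.
mean_of_spreads = mean(signed_margin(s) for _, _, s in polls)

# Method 2: average each candidate separately, round to one decimal
# as a displayed average would be, then take the difference.
obama_avg = round(mean(o for o, _, _ in polls), 1)
romney_avg = round(mean(r for _, r, _ in polls), 1)
spread_of_means = obama_avg - romney_avg

print(round(mean_of_spreads, 1), round(spread_of_means, 1))
```

With these made-up numbers the first method rounds to 1.7 and the second to 1.6, which is the kind of small rounding gap described above.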
Variance: Two variances are calculated: the variance between the polls and the variance within the polls. The latter is weighted by the sample size of the polls. The total variance for each state is the sum of the between-poll and within-poll variances. The s-statistic is the square root of this summed variance.
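One way the variance step could be sketched is shown below. The exact formulas are assumptions. The between-poll variance is taken as the ordinary sample variance of the margins. Each poll's within-poll variance is taken as (MoE / 2)^2, following the assumption listed further down that the MoE is roughly 2*s. The within-poll average is weighted by sample size. The function name `state_s_statistic` is made up for this example.

```python
from math import sqrt
from statistics import variance

def state_s_statistic(margins, moes, sample_sizes):
    """Combine between-poll and within-poll variance into one s-statistic.

    margins      -- signed poll margins for one state (needs at least two polls)
    moes         -- reported margin of error for each poll, in points
    sample_sizes -- number of respondents in each poll
    """
    # Variance between polls: sample variance of the margins.
    between = variance(margins)

    # Variance within each poll, assuming MoE is about 2 * s.
    within_each = [(moe / 2.0) ** 2 for moe in moes]

    # Average the within-poll variances, weighted by sample size.
    within = sum(n * v for n, v in zip(sample_sizes, within_each)) / sum(sample_sizes)

    # Total variance is the sum; the s-statistic is its square root.
    return sqrt(between + within)

# Hypothetical inputs, for illustration only.
s = state_s_statistic([2.0, 4.0], [4.0, 4.0], [500, 1000])
print(round(s, 2))
```

Here the between-poll variance is 2 and the weighted within-poll variance is 4, so the s-statistic is the square root of 6.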
Polling Bias: While Nate Silver titled his blog post Poll Averages Have No History of Consistent Partisan Bias, he actually showed that national polls in Presidential races have a historical mean bias of 0.9 and median bias of 0.3 in favor of Democrats. Similarly, he shows that the state polls for Presidential races have a mean bias of 0.5 and median bias of 0.7 in favor of Democrats. Granted, these vary broadly over the years. Nevertheless, I have adjusted the RCP polling average for each state 0.6 points in favor of the Republicans to account for this historical bias.
Z-score: Z-scores are used to determine the probability that a given candidate will win the state. The z-score is calculated as the mean of the polls / s-statistic. For instance, a state with a mean of 2 and s-statistic of 2 would have a z-score of 1, which gives the leading candidate an 84% chance of carrying the state. If the s-statistic were 1, the z-score would be 2, which gives the leading candidate about a 98% chance of carrying the state. The larger the variance, the less certain it is that the candidate will turn his lead into a win. The tighter the variance, the more likely it is that the poll average accurately reflects the preference of the ‘likely voter’, and the greater the certainty that the lead shown by the polling average is correct.
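The conversion from z-score to win probability is the one-sided standard normal CDF. A minimal sketch using only the standard library follows. The function name `win_probability` and the optional `bias_shift` parameter are made up for this example. The shift is simply added to the mean before dividing, which is one plausible way to apply a fixed polling-bias adjustment.

```python
from math import erf, sqrt

def win_probability(mean_margin, s_stat, bias_shift=0.0):
    """Probability that the candidate with the given mean margin wins.

    mean_margin -- mean poll margin from that candidate's point of view
    s_stat      -- s-statistic for the state (must be positive)
    bias_shift  -- optional adjustment in points, added to the margin
    """
    z = (mean_margin + bias_shift) / s_stat
    # Standard normal CDF written in terms of the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(round(win_probability(2.0, 2.0), 3))  # z = 1
print(round(win_probability(2.0, 1.0), 3))  # z = 2
print(round(win_probability(0.0, 3.0), 3))  # dead heat
```

A z-score of 1 maps to roughly 0.841 and a z-score of 2 to roughly 0.977, while a dead heat gives exactly 0.5 whatever the variance.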
Monte Carlo: I feed the state-by-state probabilities into a matrix of 10,000 possible elections, and then sum the electoral votes for each simulated election. The results are displayed in a graph of the aggregate simulated Electoral Vote distribution, as seen above.
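The simulation step can be sketched roughly as follows. This is an illustration rather than my production script. It draws each state independently, which is an assumption, and the `base_ev` parameter (electoral votes from states not being simulated) and the example probabilities are placeholders. The electoral vote counts are the 2012 allocations for the ten battleground states.

```python
import random

# 2012 electoral votes for the ten battleground states.
EV = {"CO": 9, "FL": 29, "IA": 6, "MI": 16, "NV": 6,
      "NH": 4, "OH": 18, "PA": 20, "VA": 13, "WI": 10}

def simulate(win_prob, n_runs=10_000, base_ev=0, seed=None):
    """Return a list of electoral vote totals, one per simulated election.

    win_prob -- dict mapping state code to the candidate's win probability
    n_runs   -- number of simulated elections
    base_ev  -- electoral votes assumed safe for the candidate (placeholder)
    seed     -- optional seed so a run can be reproduced
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_runs):
        ev = base_ev
        for state, p in win_prob.items():
            # The candidate carries the state with probability p,
            # drawn independently of every other state.
            if rng.random() < p:
                ev += EV[state]
        totals.append(ev)
    return totals

# Placeholder probabilities, for illustration only.
example_probs = {state: 0.5 for state in EV}
totals = simulate(example_probs, seed=1)
print(len(totals), min(totals), max(totals))
```

Counting how often each total occurs in `totals` gives a histogram like the Electoral Vote distribution chart, and the share of runs reaching 270 (after adding a realistic `base_ev`) is the overall win probability.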
This method accepts the RCP recent polls uncritically.
Those polls have unique, possibly incompatible Likely Voter models.
This method invokes no trend analysis.
The only historical information drawn upon is that used to determine the historical polling bias adjustment.
No information other than election polls is used.
It is assumed that the RCP “MoE” represents a 95% CI and is ~ 2*s.
I was surprised at first that this purely statistical model favored the President so strongly. After chewing on the data for a day, I realized that what the model is telling me is that the race has basically become a two-state race. In most scenarios, Romney needs to win both Ohio and Virginia to win the election. Virginia is currently rated as a dead heat with a 50% probability of going either way, or a 56% chance for Romney with the polling bias adjustment. Ohio leans in favor of Obama, with RCP giving him a 1.9 pt lead, which translates into only a 31% chance for Romney. Using only current RCP polls, Romney has only a 17% chance of winning both (0.56 × 0.31 ≈ 0.17, treating the two states as independent). The remaining 5-10% chance of his winning comes from unlikely combinations of the other states.
At the National Review, Josh Jordan recently criticized Silver’s model in his article Nate Silver’s Flawed Model, partly because of Silver’s use of weighted polls. Something tells me that he would be even less satisfied with this unweighted poll model.
I may not have used the optimal method of calculating variance and choosing an appropriate s-statistic. Similarly, it may be more appropriate to use a t-test rather than the z-score with its standard normal distribution.
I have automated the production of the EV Distribution chart. It will be updated periodically at the following location: