Forecasting Congressional Elections Using Election-Specific Features

Ben Albert
8 min read · Jul 12, 2021


To kick off my Congressional election project, I designed a simple model to predict congressional election outcomes using election-specific features. Two of the most widely known phenomena in American elections are the midterm loss and the incumbent advantage. The midterm loss is the tendency for the President's party to lose seats in Congress during midterm elections. The incumbent advantage is the empirical observation that incumbents tend to win congressional elections more often than challengers.

With those two ideas alone, we have enough information to build a simple model of congressional elections with a high degree of predictive power.

Data Description

The primary source of data is the MIT Election Lab. They provide data on US elections for the House of Representatives (HOR) and the Senate. The HOR data covers elections from 1976 to 2018, and the Senate data covers elections from 1976 to 2020. To add the 2020 election to the HOR sample, I gathered data from The Daily Kos and The Guardian.

rm(list = ls())
suppressWarnings(library(tidyverse))
suppressWarnings(library(verification))
suppressWarnings(library(car))
set.seed(1234)
HOR <- readRDS('ElectionSpecificVarsHOR.rds')
Senate <- readRDS('ElectionSpecificVarsSenate.rds')

To construct my incumbent variable, I matched the candidate that received the most votes in the last election with the list of candidates in the current election. Therefore, I had to exclude 1976 from my analysis for the HOR. So the HOR sample consists of all HOR elections from 1978 to 2020; this amounts to 9,570 elections.
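
The matching step described above can be sketched roughly as follows. This is only an illustration on made-up data: the column names (year, district, candidate, party, votes), the two-year gap between elections, and the toy candidates are all assumptions of mine, not the layout of the MIT Election Lab files.

```r
# Hypothetical toy data: one row per candidate per district-year.
races <- data.frame(
  year      = c(2016, 2016, 2018, 2018, 2020, 2020),
  district  = "XX-01",
  candidate = c("A. Smith", "B. Jones", "A. Smith", "C. Lee", "D. Kim", "C. Lee"),
  party     = c("DEM", "REP", "DEM", "REP", "DEM", "REP"),
  votes     = c(120, 100, 90, 110, 95, 105),
  stringsAsFactors = FALSE
)

# Winner of each district-year: the candidate with the most votes.
winners <- do.call(rbind, lapply(
  split(races, list(races$district, races$year), drop = TRUE),
  function(d) d[which.max(d$votes), c("year", "district", "candidate", "party")]
))

# Shift winners forward one cycle (assumed to be two years) so each race
# can be joined to the winner of the previous election in the same district.
prev <- transform(winners, year = year + 2)
names(prev)[names(prev) == "candidate"] <- "prev_winner"
names(prev)[names(prev) == "party"]     <- "prev_party"

merged <- merge(races, prev, by = c("year", "district"), all.x = TRUE)
merged$is_incumbent <- !is.na(merged$prev_winner) &
  merged$candidate == merged$prev_winner

# Race-level code: 1 = Democratic incumbent, -1 = Republican incumbent, 0 = none.
incum_by_race <- sapply(
  split(merged, list(merged$district, merged$year), drop = TRUE),
  function(d) {
    inc <- d[d$is_incumbent, ]
    if (nrow(inc) == 0) return(0)
    if (inc$party[1] == "DEM") 1 else -1
  }
)
incum_by_race
```

The earliest year in the toy data has no previous election to match against, which is the same reason 1976 drops out of the HOR sample.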

The Senate is more complicated because only one-third of the body is up for re-election in an election year. To make the Senate incumbency variable, I matched one of the two candidates that received the most votes in the last two Senate elections with the list of current candidates. Therefore, I had to exclude the first three Senate elections from my analysis. So the Senate sample consists of all Senate elections from 1982 to 2020. The Senate data has a total of 686 elections.

Lastly, I gathered data on presidential approval ratings from The American Presidency Project. You can download all of my cleaned and merged data on my GitHub.

head(HOR)
Table One: HOR Data
head(Senate)
Table Two: Senate Data

The dependent variable, Dem, is a dummy variable indicating if a Democrat won the election. There are six predictors in total. First, Pres is a dummy variable indicating if the current President is a Democrat. The second variable, Mid, is also a dummy variable indicating if it is a midterm election year. Pres.Mid is the interaction between Pres and Mid; this variable captures the effect of having a Democratic President during a midterm election. VS is the vote share that the winning candidate received in the last election. If the winning candidate was a Democrat, then VS is greater than zero, and if the winning candidate was a Republican, then the vote share was multiplied by negative one. Thus negative values of VS indicate a Republican won the last election. Incum is a categorical variable denoting if there was an incumbent in the current election. If there was a Democratic incumbent, Incum equals one, and if there was a Republican incumbent, then Incum equals negative one. Incum is equal to 0 if there is no incumbent in the current election. PA is the average Presidential approval rating taken over the entire year.
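
To make the coding concrete, here is a small sketch of how the interaction and the signed vote share could be built from raw inputs. The input columns (pres_party, midterm, last_winner_party, last_winner_share) and their values are hypothetical; only the output names Pres, Mid, Pres.Mid, and VS come from the data described above.

```r
# Hypothetical race-level inputs (not the author's data).
toy <- data.frame(
  pres_party        = c("DEM", "DEM", "REP"),
  midterm           = c(TRUE, FALSE, TRUE),
  last_winner_party = c("DEM", "REP", "REP"),
  last_winner_share = c(0.58, 0.61, 0.54)
)

# Pres: 1 if the sitting President is a Democrat, 0 otherwise.
toy$Pres <- as.integer(toy$pres_party == "DEM")

# Mid: TRUE in a midterm year.
toy$Mid <- toy$midterm

# Pres.Mid: 1 only when a Democratic President faces a midterm.
toy$Pres.Mid <- toy$Pres * toy$Mid

# VS: previous winner's vote share, negated when the winner was a Republican.
toy$VS <- ifelse(toy$last_winner_party == "DEM",
                 toy$last_winner_share,
                 -toy$last_winner_share)

toy[, c("Pres", "Mid", "Pres.Mid", "VS")]
```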

HOR Analysis

Since the dependent variable is a dummy variable, logistic regression is the appropriate model. It is essential to leave out part of the data as a test set to assess the true predictive power of the model. I kept three-fourths of the data to train the model and used one-fourth as a test set. A summary of the HOR model is given below.

inds <- sample(1:nrow(HOR),size = round(0.75*nrow(HOR)))
test <- HOR[-inds,]
HOR <- HOR[inds,]
mod.HOR <- glm(Dem~Pres+Mid+Pres.Mid+VS+Incum+PA,HOR,family = binomial())
summary(mod.HOR)
Table Three: HOR Regression Summary

The results are promising. All of the coefficients are statistically significant at the 0.001 level. The findings support the midterm loss hypothesis. Under a Democratic President, the probability that a Democrat wins in a midterm election decreases substantially. The model predicts that during a Democratic administration, with an average approval rating, a Democratic candidate running in a race without an incumbent will see a 13% decrease in the probability they get elected in a midterm year.

There is also evidence to support the incumbent advantage hypothesis. The model predicts that a Democratic candidate running under a Democratic president, with an average approval rating during a midterm year, will see a 40% increase in the probability they get elected if they are the incumbent.
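
Effects like these are differences between predicted probabilities for two scenarios, since logistic regression coefficients are on the log-odds scale. The sketch below shows one way to compute such a difference. It uses simulated data with made-up coefficients and leaves VS out for brevity, so the numbers it produces have nothing to do with the results reported in this post; sim, fit, and scen are names I introduced for illustration.

```r
set.seed(1)
n <- 2000

# Simulated stand-in for the election data (assumed structure, made-up effects).
sim <- data.frame(
  Pres  = rbinom(n, 1, 0.5),
  Mid   = rbinom(n, 1, 0.5),
  Incum = sample(c(-1, 0, 1), n, replace = TRUE),
  PA    = rnorm(n, 50, 8)
)
sim$Pres.Mid <- sim$Pres * sim$Mid
eta <- 0.3 * sim$Mid - 0.9 * sim$Pres.Mid + 1.5 * sim$Incum + 0.02 * (sim$PA - 50)
sim$Dem <- rbinom(n, 1, plogis(eta))

fit <- glm(Dem ~ Pres + Mid + Pres.Mid + Incum + PA,
           data = sim, family = binomial())

# Two scenarios: Democratic President, open seat, average approval,
# first in a non-midterm year and then in a midterm year.
scen <- data.frame(
  Pres = 1, Mid = c(0, 1), Pres.Mid = c(0, 1),
  Incum = 0, PA = mean(sim$PA)
)
p <- predict(fit, newdata = scen, type = "response")

# Change in the predicted probability of a Democratic win in a midterm year.
midterm_effect <- p[2] - p[1]
midterm_effect
```

Swapping Incum from 0 to 1 in scen, while holding the other columns fixed, would give the analogous incumbency effect.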

Now let’s examine the in-sample fit of the model. There are several ways to evaluate in-sample fit and out-of-sample predictive power. This analysis will use the McFadden R squared, classification error, and the ROC curve. The McFadden R squared associated with the model is quite large, with a value of 0.64. Unlike the standard OLS R squared, the McFadden R squared tends to take much lower values; values between 0.2 and 0.4 are commonly read as an excellent fit. So a value of 0.64 indicates a strong in-sample fit.

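Calc.Mcfadden.R.squared <- function(y, mod, x = NULL) {
  # x is unused; it is kept so existing calls keep working.
  # Intercept-only (null) model fitted to the same response
  nullmod <- glm(y ~ 1, family = "binomial")
  # McFadden R squared: 1 - logLik(fitted model) / logLik(null model)
  R2 <- 1 - as.numeric(logLik(mod)) / as.numeric(logLik(nullmod))

  return(R2)
}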
Calc.Mcfadden.R.squared(HOR$Dem,mod.HOR)
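
For reference, the function above follows the usual definition of the McFadden R squared, which compares the log-likelihood of the fitted model with that of an intercept-only model. The notation below is mine rather than the post's:

$$
R^2_{\text{McF}} = 1 - \frac{\ln \hat{L}(M_{\text{full}})}{\ln \hat{L}(M_{\text{null}})}
$$

Here $\hat{L}(M_{\text{full}})$ is the maximized likelihood of the model with all predictors and $\hat{L}(M_{\text{null}})$ is the maximized likelihood of the model with only an intercept. Both log-likelihoods are negative for a binary outcome, so the ratio lies between 0 and 1, and a better-fitting model pushes the statistic toward 1.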

Next, let’s look at the confusion matrix below. Using a threshold of 0.5, the misclassification error from the model is approximately 8.5% (463 elections). Furthermore, the model has similar levels of type one and type two errors. The rate of false positives (11%) is close to the false-negative rate (7%).

Con.Mat <- function(y, y_hat, threshold = 0.5) {
  # Convert predicted probabilities to 0/1 class predictions
  y_hat <- ifelse(y_hat > threshold, 1, 0)
  Y <- data.frame(y, y_hat)
  # Count each (actual, predicted) pair, then spread actual outcomes into columns
  Classification <- Y %>%
    count(y, y_hat) %>%
    pivot_wider(names_from = y, values_from = n)

  return(Classification)
}
Con.Mat(HOR$Dem,predict(mod.HOR,type = 'response'))
Table Four: In Sample HOR Confusion Matrix
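
The error rates quoted above can be derived from the same counts. The sketch below assumes the conventional definitions (false positive rate as a share of actual non-Democratic wins, false negative rate as a share of actual Democratic wins); the post does not say which definitions were used. error_rates and the toy vectors are hypothetical.

```r
# Misclassification, false positive, and false negative rates at a threshold.
error_rates <- function(y, p_hat, threshold = 0.5) {
  pred <- as.integer(p_hat > threshold)
  fp <- sum(pred == 1 & y == 0)  # predicted Democratic win, Republican won
  fn <- sum(pred == 0 & y == 1)  # predicted Republican win, Democrat won
  c(
    misclassification   = (fp + fn) / length(y),
    false_positive_rate = fp / sum(y == 0),
    false_negative_rate = fn / sum(y == 1)
  )
}

# Toy outcomes and predicted probabilities (made up for illustration).
y     <- c(1, 1, 1, 0, 0, 0, 0, 1)
p_hat <- c(0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.3, 0.7)

rates <- error_rates(y, p_hat)
rates
```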

I also calculated the area under the ROC curve (AUC). High AUC values indicate a good fit, while values close to 0.5 suggest the model does no better than random guessing. The model has an AUC of 0.96, which is very high.

roc.area(HOR$Dem,predict(mod.HOR,type = 'response'))
roc.plot(HOR$Dem,predict(mod.HOR,type = 'response'))
Figure One: In Sample HOR ROC Curve
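
For intuition, the AUC equals the probability that a randomly chosen Democratic win receives a higher predicted probability than a randomly chosen Democratic loss. A minimal base R sketch of that rank-based (Mann-Whitney) calculation is below. auc_rank is a hypothetical helper, not part of the verification package, though I would expect it to agree with the area that roc.area reports.

```r
# AUC from ranks: share of (positive, negative) pairs ordered correctly,
# with ties counted as one half through average ranks.
auc_rank <- function(y, p_hat) {
  r  <- rank(p_hat)   # average ranks handle tied predictions
  n1 <- sum(y == 1)   # number of positives
  n0 <- sum(y == 0)   # number of negatives
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: one of the four positive-negative pairs is ordered incorrectly.
auc_toy <- auc_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))
auc_toy
```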

Next, let’s examine the out-of-sample performance. The out-of-sample fit is of particular interest because models can overfit the data. Overfitting produces strong in-sample statistics but poor out-of-sample predictive power. The out-of-sample results are displayed below.

y_hat <- predict(mod.HOR,test,type = "response")
Con.Mat(test$Dem,y_hat)
roc.area(test$Dem,y_hat)
roc.plot(test$Dem,y_hat)
Table Five: Out Of Sample HOR Confusion Matrix
Figure Two: Out Of Sample HOR ROC Curve

The out-of-sample fit looks good as well. In the test sample, the model has a 10% misclassification rate (181 elections). Once again, the model does not display a large amount of type one error (13%) or type two error (9%). Lastly, the AUC is large, with a value of 0.95.

Senate Analysis

Next, let’s turn to the Senate. Once again, three-fourths of the data was used as a training set, and one-fourth was used as a test set. The results for the Senate model are displayed below.

inds <- sample(1:nrow(Senate),size = round(0.75*nrow(Senate)))
test <- Senate[-inds,]
Senate <- Senate[inds,]
mod.Senate <- glm(Dem~Pres+Mid+Pres.Mid+VS+Incum+PA,Senate,family = binomial())
summary(mod.Senate)
Table Six: Senate Regression Summary

The results are similar to the HOR model; however, the Mid variable is no longer statistically significant. An F-test was run to test the joint hypothesis that Mid and Pres.Mid can be excluded from the regression. The results suggest that Mid and Pres.Mid should be included in the model.

linearHypothesis(mod.Senate,c("MidTRUE = 0","Pres.Mid = 0"))
Figure Three: Senate F-Test
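
A common cross-check on a joint exclusion test is a likelihood-ratio test between the full model and a restricted model that drops both terms. This is a different technique from the linearHypothesis call above, shown here only as a sketch on simulated data with made-up coefficients; sim, full, and restricted are hypothetical names, and the output says nothing about the actual Senate results.

```r
set.seed(2)
n <- 600

# Simulated stand-in for the Senate data (assumed structure, made-up effects).
sim <- data.frame(
  Pres  = rbinom(n, 1, 0.5),
  Mid   = rbinom(n, 1, 0.5),
  Incum = sample(c(-1, 0, 1), n, replace = TRUE)
)
sim$Pres.Mid <- sim$Pres * sim$Mid
sim$Dem <- rbinom(n, 1, plogis(0.4 * sim$Mid - 1.2 * sim$Pres.Mid + 1.2 * sim$Incum))

# Full model includes the midterm terms; the restricted model drops both.
full       <- glm(Dem ~ Pres + Mid + Pres.Mid + Incum, data = sim, family = binomial())
restricted <- glm(Dem ~ Pres + Incum, data = sim, family = binomial())

# Likelihood-ratio test: the drop in deviance is compared with a
# chi-square distribution with 2 degrees of freedom.
lrt <- anova(restricted, full, test = "Chisq")
lrt
```

A small p-value in the second row would indicate that the two midterm terms jointly improve the fit.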

Once again, there is evidence supporting the midterm loss hypothesis and incumbent advantage hypothesis. The results suggest that a Democratic candidate running in a race without an incumbent during a Democratic Presidential administration will experience a 27% decrease in the probability of winning the election during a midterm year. A Democrat, running for the Senate when the President is a Democrat with average approval ratings, will have a 50% increase in the probability that they win the election if they are the incumbent.

The in-sample fit of the Senate model is good, though weaker than that of the HOR model. The McFadden R squared is 0.40. The Senate model has a misclassification error rate of 18% (97 elections). Lastly, the Senate model has an in-sample AUC of 0.88.

Calc.Mcfadden.R.squared(Senate$Dem,mod.Senate)
Con.Mat(Senate$Dem,predict(mod.Senate,type = 'response'))
roc.area(Senate$Dem,predict(mod.Senate,type = 'response'))
roc.plot(Senate$Dem,predict(mod.Senate,type = 'response'))
Table Seven: In Sample Senate Confusion Matrix
Figure Four: In Sample Senate ROC Curve

The out-of-sample results are also promising, though not as strong as those of the HOR model. The model has an out-of-sample misclassification rate of 22% (38 elections). The model also has a high AUC of 0.86.

y_hat <- predict(mod.Senate,test,type = "response")
Con.Mat(test$Dem,y_hat)
roc.area(test$Dem,y_hat)
roc.plot(test$Dem,y_hat)
Table Eight: Out Of Sample Senate Confusion Matrix
Figure Five: Out Of Sample Senate ROC Curve

Conclusion

The results presented here suggest the HOR and Senate models can be used to predict the 2022 midterms. The HOR model has better in-sample and out-of-sample performance than the Senate model. However, the results suggest that both models have predictive power. Before we can make our first forecast of the 2022 midterms, an additional step is needed. An input variable in the model is the Presidential approval rating. Therefore, to forecast the 2022 midterm elections, we need a forecast of the 2022 Presidential approval rating. In the next post, I will develop a model to predict Presidential approval ratings using standard time series methods.

Ben Albert

Hi, my name is Ben and I’m a data scientist. In this blog you will find posts about current events, politics, and economics. I hope that you enjoy the content.