24  Logistic regression

So far we have been using ordinary least squares (OLS) regression to examine the causal relationship between two variables. As we have discussed, OLS is a really powerful tool for this purpose, and many applied statisticians use it for much of their work. While it is a great general tool, like many general tools it has its limitations, and we oftentimes need more refined tools to account for them.

Let’s see what happens when we try to fit an OLS regression with a dichotomous (two-category) dependent variable. For this example, I am going to examine the relationship between how someone voted in the 2012 Presidential election and their partisan affiliation. So let me run a simple bivariate linear regression. This model will be what is called a Linear Probability Model (LPM). The underlying math behind OLS and LPMs remains the same. What changes, however, is that the dependent variable being dichotomous leads to differences in interpretation and has other underlying implications for the model.

First, I am going to recode the vote12 variable to contain two categories: did they vote for Romney or Obama? Anyone who was not asked the question or reported voting for someone else is coded as NA (missing). I am also going to recode pid7 so that those who said “Not sure” are marked as “Independents”, and then convert it into integers rather than keep it as a factor variable.

# make a new column called vote_obama...
    #* ...and make the default value NA
nes$vote_obama <- NA

# if voted for Romney, give the new column a value of 0
nes$vote_obama[nes$vote12 == "Mitt Romney"] <- 0

# if voted for Obama, give the new column a value of 1
nes$vote_obama[nes$vote12 == "Barack Obama"] <- 1

# Check to make sure everything looks right
summary(nes$vote12)
 Mitt Romney Barack Obama Someone Else    Not asked 
         364          505           68          241 
summary(nes$vote_obama)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  1.0000  0.5811  1.0000  1.0000     309 
# make a new column called pid7_int...
    #* ...and make the default value NA
nes$pid7_int <- NA

# if said "Not sure", make them an independent
nes$pid7[nes$pid7 == "Not sure"] <- "Independent"

# then remove labels from pid7 and store as pid7_int
nes$pid7_int <- as.numeric(nes$pid7)
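A quick aside on that last step: as.numeric() on a factor returns each value’s level position (1, 2, 3, …), not the label text. A toy example (not the NES data):

```r
# Toy example: as.numeric() on a factor returns level codes, not labels
f <- factor(
    c("Strong Democrat", "Independent", "Strong Republican"),
    levels = c("Strong Democrat", "Independent", "Strong Republican")
)
as.numeric(f) # 1 2 3 -- the position of each value's level
```

So pid7_int runs from 1 (Strong Democrat) up through 7 (Strong Republican), following the order of pid7’s levels.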

# Make a new column called white...
    #* ...and make the default value NA
nes$white <- NA

# if said "White", then make them White
nes$white[nes$race == "White"] <- "White"

# if said "Black", then make them Black
nes$white[nes$race == "Black"] <- "Black"

# Convert to factor
nes$white <- as.factor(nes$white)

# let's remove all NA rows from the dataset
nes_clean <- na.omit(nes)
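One caution here, since we passed the whole dataset: na.omit() drops a row if any column contains an NA, not just the columns used in the model. A toy example:

```r
# Toy example: na.omit() keeps only fully complete rows
toy <- data.frame(
    a = c(1, 2, NA),
    b = c(NA, 5, 6),
    c = c(7, 8, 9)
)
nrow(toy)          # 3
nrow(na.omit(toy)) # 1 -- only row 2 has no NAs anywhere
```

This is why nes_clean ends up with fewer rows than the number of non-missing values on vote_obama alone.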
# Run linear regression
bivariate_linear <- lm(
    formula = vote_obama ~ pid7_int # does party identification predict vote choice?
    , data = nes_clean
)

# Make a table of results
modelsummary(
    bivariate_linear # the regression I made above
    , notes = c(
        "Data source: NES dataset."
        , "Estimates from an Ordinary Least Squares regression."
        , "Standard errors in parentheses."
    ) # include these notes at the bottom of the table
    , stars = TRUE # include stars in table
    , output = "my_linear_bivariate_regression.docx"
)
Table 24.1: The effect of party identification on support for Obama
 (1)
(Intercept) 1.194***
(0.021)
pid7_int −0.175***
(0.005)
Num.Obs. 746
R2 0.628
R2 Adj. 0.627
AIC 336.9
BIC 350.8
Log.Lik. −165.469
RMSE 0.30
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Data source: NES dataset.
Credit: damoncroberts.com
Estimates from an Ordinary Least Squares regression.
Standard errors in parentheses.

Since the dependent variable is dichotomous, my interpretation is going to be slightly different. The model tells me that for each unit I move toward identifying as a strong Republican, the probability that I voted for Obama decreases by 0.175. This effect is statistically significant.

For each unit increase in [my independent variable], the probability of [the 1 category] [increases/decreases]. The effect [is/is not] statistically significant. Which means that it is relatively [implausible/plausible] that I would have estimated an effect that large or larger if there was truly no relationship between the variables.

Let’s see whether this model was a good one, though. We might first suspect that there are confounding variables I have missed. The most obvious one is race, which I excluded; I should fit a multivariate regression controlling for it.

However, there are some other things lurking. Let’s look at the predicted and actual values.

# make a table (data.frame) with predicted and actual values
predicted_plot <- data.frame(
    predicted_values = predict(bivariate_linear),
    actual_values = nes_clean$vote_obama
)
#plot predicted vs. actual values
ggplot(
    data = predicted_plot # take the data from predicted_plot
) + 
geom_point( # put points on the graph based on the data
    aes(
        x = predicted_values, # put predicted_values column on x-axis
        y = actual_values # put actual_values column on y-axis
    )
) +
geom_abline( # draw a line through the points
    intercept=0
    , slope=1
) +
theme_minimal() + # make it prettier with this theme
labs( # add labels to the plot
    x='Predicted Values'
    , y='Actual Values'
)

Figure 24.1: Predicted values versus actual values

There are a few things that stand out here:

  1. I am predicting lots of values that do not exist. In my variable, you either voted for Obama or you didn’t, yet the model predicts people that fall between those two values. The model can also predict values above 1 (voted for Obama) and below 0 (did not vote for Obama). These are not possible values and they make no sense.
  2. I have some really massive residuals (differences between my predicted values and my actual values) in some areas but not in others. This leaves me with unequal variance in my error term, which, as we discussed last week, is a problem.

To show that this is the case, I can use some of the model diagnostics from last week to prove it.

gg_diagnose(
    bivariate_linear
)

Figure 24.2: Diagnostics of LPM

Look at all of those plots – look at them! They are bad. I’ve got lots of instances with a residual of 0, but look at the variance on those residuals! To see that, I can look at my Normal Q-Q plot, which looks problematic, as does my Cook’s distance plot (remember: for this plot I do not want spikes to show up in clusters).
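If you want to convince yourself that the impossible predictions are baked into the LPM rather than a quirk of these particular data, here is a self-contained simulation (toy data, not the NES):

```r
# Toy simulation: an LPM can produce impossible fitted values
set.seed(42)
x <- rnorm(500)                              # a continuous predictor
y <- as.numeric(runif(500) < plogis(2 * x))  # a 0/1 outcome
lpm <- lm(y ~ x)                             # the linear probability model
range(fitted(lpm))                           # extends beyond [0, 1]
mean(fitted(lpm) < 0 | fitted(lpm) > 1)      # share of impossible predictions
```

Because the fitted line is straight, extreme enough values of the predictor always push the predictions past 0 or 1.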

So how can we address these problems created by the LPM? We can estimate a model that is designed to handle dichotomous dependent variables. This model is called the logistic regression model (sometimes called logit).

How do logistic regression models work? Well, it is relatively complex. But what you should know is that it operates much the same way an OLS does. The big difference that you should be aware of is that we can improve upon the LPM by ensuring that our predicted values stay between 0 and 1. This solves those two issues that we had prior. How does the logistic regression do this? It uses what is called a link function. Specifically, it uses a logit link function:

\[ \hat{Y} = \text{logit}^{-1}(\alpha + \beta_1 \times X_1 + \beta_2 \times X_2) \]

A perhaps easier way to visualize this would be:

\[ \ln\left(\frac{P_1}{1-P_1}\right) = \alpha + \beta_1 \times X_1 + \beta_2 \times X_2 \]

Where \(P_1\) is the probability of vote_obama being equal to 1. What is different here is that the relationship between our \(\beta\) coefficients and our \(\hat{Y}\) values is no longer 1 to 1. This means that the way we interpret the effect of our independent variables can no longer be stated as: “a one unit increase in our independent variable leads to a \(\beta\) unit increase in our dependent variable”. Instead, the link function means that the relationship is in terms of logged-odds. So the interpretation is now: “a one unit increase in our independent variable leads to a \(\beta\) unit change in the logged-odds of our dependent variable being equal to one.”
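To get a feel for the link function itself, base R has the logit and its inverse built in as qlogis() and plogis():

```r
# qlogis() is the logit (log-odds); plogis() is the inverse logit
p <- 0.75
qlogis(p)             # log(0.75 / 0.25) = log(3)
plogis(qlogis(p))     # back to 0.75
# the inverse logit squeezes any real number into (0, 1)
plogis(c(-10, 0, 10)) # near 0, exactly 0.5, near 1
```

No matter how large or small the linear predictor gets, the inverse logit keeps the prediction inside (0, 1).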

Without going too much into the algebra of this, this relatively simple link function does quite a lot to solve our problems that we get with the LPM! Let me show you!

Using the glm() function, we can run this logistic regression model.

# run logistic regression model
logistic_regression <- glm(
    formula = vote_obama ~ pid7_int # vote for obama as predicted by party identification
    , data = nes_clean # using this dataset
    , family = binomial # a logistic regression model
)

# make table of results
modelsummary(
    logistic_regression # the regression I made above
    , notes = c(
        "Data source: NES dataset."
        , "Estimates are from a logistic regression; interpreted as logged-odds."
        , "Standard errors in parentheses."
    ) # include these notes at the bottom of the table
    , stars = TRUE # include stars in table
    , output = "my_logistic_bivariate_regression.docx"
)
Table 24.2: The effect of party identification on vote choice
 (1)
(Intercept) 5.256***
(0.374)
pid7_int −1.306***
(0.090)
Num.Obs. 746
AIC 430.4
BIC 439.6
Log.Lik. −213.199
RMSE 0.29
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Data source: NES dataset.
Credit: damoncroberts.com
Estimates are from a logistic regression; interpreted as logged-odds.
Standard errors in parentheses.

How would we interpret the main part of the table we are interested in? The closer to a strong Republican you become, the logged-odds of voting for Obama decrease by 1.306 units. This effect is statistically significant.

But what the heck does it mean for the logged-odds to decrease? That is really hard for us to conceptualize. At least, it is much harder than a one unit change leading to a \(\beta\) unit change like we had with OLS. The second equation above shows us that the link function is essentially the natural log of the odds. So we can calculate the odds ratio by exponentiating the coefficient.

Remember that we can look inside our objects by using the str() function:

str(logistic_regression)
List of 30
 $ coefficients     : Named num [1:2] 5.26 -1.31
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "pid7_int"
 $ residuals        : Named num [1:746] 1.02 1.02 -1.28 -1.02 -2.03 ...
  ..- attr(*, "names")= chr [1:746] "1" "2" "3" "4" ...
 $ fitted.values    : Named num [1:746] 0.9811 0.9811 0.2186 0.0201 0.508 ...
  ..- attr(*, "names")= chr [1:746] "1" "2" "3" "4" ...
 $ effects          : Named num [1:746] -0.891 -14.571 -0.53 -0.14 -1.024 ...
  ..- attr(*, "names")= chr [1:746] "(Intercept)" "pid7_int" "" "" ...
 $ R                : num [1:2, 1:2] -8.03 0 -31.64 11.16
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "pid7_int"
  .. ..$ : chr [1:2] "(Intercept)" "pid7_int"
 $ rank             : int 2
 $ qr               :List of 5
  ..$ qr   : num [1:746, 1:2] -8.0318 0.0169 0.0515 0.0175 0.0622 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:746] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "pid7_int"
  ..$ rank : int 2
  ..$ qraux: num [1:2] 1.02 1.04
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-11
  ..- attr(*, "class")= chr "qr"
 $ family           :List of 12
  ..$ family    : chr "binomial"
  ..$ link      : chr "logit"
  ..$ linkfun   :function (mu)  
  ..$ linkinv   :function (eta)  
  ..$ variance  :function (mu)  
  ..$ dev.resids:function (y, mu, wt)  
  ..$ aic       :function (y, n, mu, wt, dev)  
  ..$ mu.eta    :function (eta)  
  ..$ initialize: language {     if (NCOL(y) == 1) { ...
  ..$ validmu   :function (mu)  
  ..$ valideta  :function (eta)  
  ..$ simulate  :function (object, nsim)  
  ..- attr(*, "class")= chr "family"
 $ linear.predictors: Named num [1:746] 3.9503 3.9503 -1.274 -3.8862 0.0321 ...
  ..- attr(*, "names")= chr [1:746] "1" "2" "3" "4" ...
 $ deviance         : num 426
 $ aic              : num 430
 $ null.deviance    : num 1020
 $ iter             : int 6
 $ weights          : Named num [1:746] 0.0185 0.0185 0.1708 0.0197 0.2499 ...
  ..- attr(*, "names")= chr [1:746] "1" "2" "3" "4" ...
 $ prior.weights    : Named num [1:746] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "names")= chr [1:746] "1" "2" "3" "4" ...
 $ df.residual      : int 744
 $ df.null          : int 745
 $ y                : Named num [1:746] 1 1 0 0 0 1 0 0 1 0 ...
  ..- attr(*, "names")= chr [1:746] "1" "2" "3" "4" ...
 $ converged        : logi TRUE
 $ boundary         : logi FALSE
 $ model            :'data.frame':  746 obs. of  2 variables:
  ..$ vote_obama: num [1:746] 1 1 0 0 0 1 0 0 1 0 ...
  ..$ pid7_int  : num [1:746] 1 1 5 7 4 1 6 5 3 7 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language vote_obama ~ pid7_int
  .. .. ..- attr(*, "variables")= language list(vote_obama, pid7_int)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "vote_obama" "pid7_int"
  .. .. .. .. ..$ : chr "pid7_int"
  .. .. ..- attr(*, "term.labels")= chr "pid7_int"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(vote_obama, pid7_int)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "vote_obama" "pid7_int"
 $ call             : language glm(formula = vote_obama ~ pid7_int, family = binomial, data = nes_clean)
 $ formula          :Class 'formula'  language vote_obama ~ pid7_int
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
 $ terms            :Classes 'terms', 'formula'  language vote_obama ~ pid7_int
  .. ..- attr(*, "variables")= language list(vote_obama, pid7_int)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "vote_obama" "pid7_int"
  .. .. .. ..$ : chr "pid7_int"
  .. ..- attr(*, "term.labels")= chr "pid7_int"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(vote_obama, pid7_int)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "vote_obama" "pid7_int"
 $ data             : tibble [746 × 54] (S3: tbl_df/tbl/data.frame)
  ..$ follow      : Factor w/ 4 levels "Most of the time",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..$ birthyr     : int [1:746] 1960 1980 1974 1951 1973 1936 1978 1943 1936 1938 ...
  ..$ turnout12   : Factor w/ 3 levels "Definitely voted",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..$ vote12      : Factor w/ 4 levels "Mitt Romney",..: 2 2 1 1 1 2 1 1 2 1 ...
  ..$ meet        : Factor w/ 5 levels "Extremely Likely",..: 1 5 2 5 3 1 5 5 3 5 ...
  ..$ march       : Factor w/ 2 levels "Have done this in past 4 years",..: 1 2 1 2 2 1 2 2 2 2 ...
  ..$ compromise  : Factor w/ 2 levels "Compromies to get things done",..: 1 1 2 1 2 1 1 1 1 1 ...
  ..$ ftobama     : int [1:746] 100 89 1 0 12 87 32 10 70 80 ...
  ..$ ftblack     : int [1:746] 100 61 61 70 50 75 53 80 88 67 ...
  ..$ ftwhite     : int [1:746] 100 64 58 70 50 90 52 80 90 74 ...
  ..$ fthisp      : int [1:746] 100 61 71 69 50 51 54 80 90 57 ...
  ..$ ftgay       : int [1:746] 96 62 55 49 5 85 52 70 86 27 ...
  ..$ fttrump     : int [1:746] 1 0 13 85 70 5 74 95 39 82 ...
  ..$ fthrc       : int [1:746] 76 69 1 0 3 87 6 10 49 3 ...
  ..$ ftsanders   : int [1:746] 84 71 13 40 31 97 51 10 90 41 ...
  ..$ ftpolice    : int [1:746] 51 56 89 85 80 60 97 90 91 99 ...
  ..$ ftfem       : int [1:746] 62 60 31 45 21 61 34 50 80 22 ...
  ..$ ftmuslim    : int [1:746] 20 61 22 40 12 72 43 50 81 53 ...
  ..$ ftsci       : int [1:746] 100 96 40 76 50 91 98 90 91 1 ...
  ..$ econnow     : Factor w/ 5 levels "1","2","3","4",..: 2 2 4 3 4 2 4 3 2 5 ...
  ..$ lcself      : Factor w/ 8 levels "Very liberal",..: 4 2 6 7 6 2 6 6 3 7 ...
  ..$ disc_b      : Factor w/ 6 levels "A great deal",..: 4 3 3 3 4 2 4 4 3 4 ...
  ..$ disc_h      : Factor w/ 6 levels "A great deal",..: 4 4 4 3 4 3 4 4 4 4 ...
  ..$ disc_g      : Factor w/ 6 levels "A great deal",..: 3 3 5 4 4 2 3 4 2 4 ...
  ..$ disc_w      : Factor w/ 6 levels "A great deal",..: 4 5 1 4 4 4 3 4 5 4 ...
  ..$ disc_m      : Factor w/ 6 levels "A great deal",..: 4 5 1 4 4 4 4 5 5 5 ...
  ..$ disc_fed    : Factor w/ 8 levels "Treats whites much better",..: 3 3 7 5 4 3 5 5 6 4 ...
  ..$ disc_police : Factor w/ 8 levels "Treats whites much better",..: 2 1 4 3 4 1 4 4 4 4 ...
  ..$ immig_numb  : Factor w/ 7 levels "Increased a lot",..: 2 2 1 4 3 3 4 5 5 7 ...
  ..$ terror_worry: Factor w/ 6 levels "Extremely worried",..: 5 5 6 3 3 4 1 1 6 6 ...
  ..$ healthspend : Factor w/ 8 levels "Increase a great deal",..: 1 1 7 4 7 2 7 5 2 4 ...
  ..$ finwell     : Factor w/ 8 levels "A great deal better",..: 6 6 1 5 6 5 5 5 7 2 ...
  ..$ warmcause   : Factor w/ 3 levels "Mostly by human activity",..: 1 1 2 2 2 3 2 2 3 3 ...
  ..$ freetrade   : Factor w/ 7 levels "Favor a great deal",..: 4 5 4 6 4 6 1 5 4 4 ...
  ..$ stopblack   : Factor w/ 5 levels "Never","Rarely",..: 4 4 3 3 3 2 1 3 3 2 ...
  ..$ stop_ever   : Factor w/ 4 levels "Has happened",..: 1 4 1 4 1 1 1 4 4 2 ...
  ..$ birthright_b: Factor w/ 8 levels "Favor a great deal",..: 2 8 5 8 8 8 6 8 8 7 ...
  ..$ bo_muslim   : Factor w/ 3 levels "Muslim","Not a Muslim",..: 2 2 2 2 2 2 2 2 2 1 ...
  ..$ amer_ident  : Factor w/ 5 levels "Extremely important",..: 4 4 1 2 2 2 1 1 1 1 ...
  ..$ marstat     : Factor w/ 6 levels "Married","Separated",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..$ employ      : Factor w/ 9 levels "Full-time","Part-time",..: 1 1 1 1 1 5 1 5 5 5 ...
  ..$ educ        : Factor w/ 6 levels "No HS","High school graduate",..: 5 5 2 2 4 6 6 2 2 2 ...
  ..$ gender      : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 2 1 2 ...
  ..$ race        : Factor w/ 8 levels "White","Black",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..$ pid3        : Factor w/ 5 levels "Democrat","Republican",..: 1 1 4 2 3 1 2 3 1 2 ...
  ..$ pid7        : Factor w/ 8 levels "Strong Democrat",..: 1 1 5 7 4 1 6 5 3 7 ...
  ..$ ideo5       : Factor w/ 6 levels "Very liberal",..: 6 2 4 5 4 2 4 4 3 5 ...
  ..$ pew_churatd : Factor w/ 8 levels "More than once a week",..: 6 6 2 6 2 1 4 2 4 1 ...
  ..$ faminc      : int [1:746] 4 12 10 10 8 10 11 5 5 6 ...
  ..$ weight      : num [1:746] 0.951 0.914 0.264 0.318 1.019 ...
  ..$ dpolice.new : Factor w/ 4 levels "Treats whites much better",..: 2 1 4 3 4 1 4 4 4 4 ...
  ..$ vote_obama  : num [1:746] 1 1 0 0 0 1 0 0 1 0 ...
  ..$ pid7_int    : num [1:746] 1 1 5 7 4 1 6 5 3 7 ...
  ..$ white       : Factor w/ 2 levels "Black","White": 2 2 2 2 2 2 2 2 2 2 ...
  ..- attr(*, "na.action")= 'omit' Named int [1:432] 2 3 6 7 16 33 38 48 54 56 ...
  .. ..- attr(*, "names")= chr [1:432] "2" "3" "6" "7" ...
 $ offset           : NULL
 $ control          :List of 3
  ..$ epsilon: num 1e-08
  ..$ maxit  : num 25
  ..$ trace  : logi FALSE
 $ method           : chr "glm.fit"
 $ contrasts        : NULL
 $ xlevels          : Named list()
 - attr(*, "class")= chr [1:2] "glm" "lm"

From this output I want to grab the coefficients, specifically the second one. Once I have it, I’ll take the exponent of it!

exp(logistic_regression$coefficients[2])
 pid7_int 
0.2708807 

The output is 0.271. This means that the odds ratio of voting for Obama as someone moves one unit closer to being a strong Republican is 0.271. For those that are familiar with odds, this might make sense. For the rest of us, we still might be a bit confused. An odds ratio of 1 would mean that I am neither more nor less likely to vote for Obama. Since the estimate is below that, it means that the odds of me voting for Obama decrease by about 72.9% (1 − 0.271) for each unit more Republican I become.
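The odds-ratio-to-percent-change conversion can be sketched in a couple of lines, using the rounded coefficient from the table above:

```r
# Odds ratio -> percent change in the odds
coef_logodds <- -1.306           # rounded coefficient from the table above
odds_ratio <- exp(coef_logodds)
round(odds_ratio, 3)             # ~0.271
round((odds_ratio - 1) * 100, 1) # ~ -72.9: the odds fall by about 73%
```

The same formula works for positive coefficients: an odds ratio of, say, 1.5 would mean the odds grow by 50%.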

As a shortcut, the modelsummary package lets me display my results in terms of odds ratios by including one extra option!

# make table of results
modelsummary(
    logistic_regression # the regression I made above
    , exponentiate = TRUE # show odds ratio instead
    , notes = c(
        "Data source: NES dataset."
        , "Credit: damoncroberts.com"
        , "Estimates are from a logistic regression; interpreted as odds ratios."
        , "Standard errors in parentheses."
    ) # include these notes at the bottom of the table
    , stars = TRUE # include stars in table
    , output = "odds_ratio_results.docx"
)
Table 24.3: The odds-ratio of voting for Obama based on your party identification
 (1)
(Intercept) 191.787***
(71.814)
pid7_int 0.271***
(0.024)
Num.Obs. 746
AIC 430.4
BIC 439.6
Log.Lik. −213.199
RMSE 0.29
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Data source: NES dataset.
Credit: damoncroberts.com
Estimates are from a logistic regression; interpreted as odds ratios.
Standard errors in parentheses.

One option would be to use the visreg package (Breheny and Burchett 2017) to plot those predicted probabilities:

visreg(
    logistic_regression # the logistic regression model above
    , "pid7_int" # the independent variable
    , scale = "response" # want to look at it in terms of possible values (0 or 1)
    , band = FALSE # do not show a band of confidence intervals
)

Figure 24.3: The predicted probability of voting for Obama (Visreg)

Or, I can use the jtools package (Long 2022) to make a plot of my predicted probabilities using the ggplot2 package (Wickham 2016) as a backend. This means that I can make adjustments to this plot the way I would with my other plots using ggplot2. Nice!

effect_plot(
    logistic_regression # use the results from this model
    , pred = pid7_int # the independent variable to look at
    , interval = FALSE # do not show confidence interval
) +
theme_minimal() + # apply the minimal theme
labs( # apply the following labels
    y = "Probability of voting for Obama"
    , x = "Party Identification"
)

Figure 24.4: Predicted probability of voting for Obama depending on party identification

Both plots show me the same thing: the more Republican I am, the lower the probability that I’d vote for Obama. The probability of voting for Obama if I am a strong Democrat is almost 1. If I am an independent, the probability is just above 0.5. And if I am a strong Republican, it is just above 0.
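We can reproduce those three points from the plots by pushing the rounded coefficients from Table 24.2 (intercept ≈ 5.256, slope ≈ −1.306) through the inverse logit:

```r
# Predicted probability of voting for Obama at three values of pid7_int
a <- 5.256   # intercept (rounded, from the table)
b <- -1.306  # slope (rounded, from the table)
plogis(a + b * 1) # strong Democrat:    ~0.98
plogis(a + b * 4) # independent:        ~0.51
plogis(a + b * 7) # strong Republican:  ~0.02
```

This is exactly what predict(logistic_regression, type = "response") does for every row of the data.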

While the interpretation of these models requires a bit more work, we can see that we are no longer coming up with super wonky predictions (values outside of 0 and 1). Because we don’t have those impossible predictions anymore, we can also be a bit more comfortable that our residuals aren’t super messed up anymore either!

Now, say that I do a multivariate logistic regression. The interpretation of the estimates changes slightly and can be a bit more complicated!

# run logistic regression model
logistic_regression <- glm(
    formula = vote_obama ~ pid7_int + white # vote for obama as predicted by party identification and race
    , data = nes_clean # using this dataset 
    , family = binomial # a logistic regression model
)

# make table of results
modelsummary(
    logistic_regression # the regression I made above
    , notes = c(
        "Data source: NES dataset."
        , "Estimates are from a logistic regression; interpreted as logged-odds."
        , "Standard errors in parentheses."
    ) # include these notes at the bottom of the table
    , stars = TRUE # include stars in table
    , output = "my_logistic_multivariate_regression.docx"
)
Table 24.4: The effect of party identification on vote choice
 (1)
(Intercept) 7.892***
(0.893)
pid7_int −1.275***
(0.091)
whiteWhite −2.924***
(0.794)
Num.Obs. 746
AIC 409.8
BIC 423.6
Log.Lik. −201.888
RMSE 0.28
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Data source: NES dataset.
Credit: damoncroberts.com
Estimates are from a logistic regression; interpreted as logged-odds.
Standard errors in parentheses.
visreg(
    logistic_regression # the logistic regression model above
    , "pid7_int" # the independent variable
    , scale = "response" # want to look at it in terms of possible values (0 or 1)
    , band = FALSE # do not show a band of confidence intervals
)

Figure 24.5: The predicted probability of voting for Obama (Visreg)

This plot demonstrates that the probability of voting for Obama decreases the more Republican an individual is when holding the white variable at its most common category (visreg’s default for a factor).

What if we don’t want to hold white at one fixed value? Well, we can plot this with a few tweaks to the function.

summary(nes_clean$white)
Black White 
  104   642 
visreg(
    logistic_regression # the logistic regression model above
    , "pid7_int" # the independent variable
    , by = "white"
    , breaks = c("White", "Black")
    , scale = "response" # want to look at it in terms of possible values (0 or 1)
    , band = FALSE # do not show a band of confidence intervals
    , overlay = TRUE # overlay the plots
)

Figure 24.6: Predicted probability of voting for Obama depending on party identification

This plot demonstrates that the probability of voting for Obama decreases the more Republican someone is. We see this occurs for both white and Black respondents. Though it appears that the predicted probability of voting for Obama remains relatively high for Black respondents, even those who identify as a “Strong Republican”.

Note you can make a similar graph using the effect_plot() function.
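You can also always compute predicted probabilities at exactly the covariate values you care about with predict(..., type = "response"). A self-contained sketch on toy data (the names d, x, g, and m here are made up for illustration, not from the NES):

```r
# Toy data: a 0/1 outcome, one continuous and one binary predictor
set.seed(1)
d <- data.frame(
    x = rnorm(300),
    g = factor(sample(c("A", "B"), 300, replace = TRUE))
)
d$y <- as.numeric(runif(300) < plogis(-1 + 2 * d$x + (d$g == "B")))

m <- glm(y ~ x + g, data = d, family = binomial)

# predicted probability at x = 0 for each group
predict(
    m,
    newdata = data.frame(x = 0, g = factor(c("A", "B"), levels = c("A", "B"))),
    type = "response"
)
```

This is handy when the scenario you want (say, an independent Black respondent) is not the one visreg conditions on by default.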

24.1 Some final notes:

  • OLS run with a dichotomous dependent variable is called the linear probability model (LPM).
  • Problems with the LPM:
    • Impossible predicted values for our dependent variable.
    • Heteroskedastic errors (unequal variance in our residuals).
  • Logistic regression:
    • Fixes the problems with the LPM.
    • Requires more work to interpret.
  • For an R script to replicate what we did in this exercise, go to this page.