11  2️⃣ Simple bivariate analyses

In science, we rarely want to understand only how one thing occurs. More often we want to understand and explain phenomena, which means looking not just at how one feature of the world varies, but at how it varies together with something else.

We first seek to understand each variable on its own through univariate analyses, that is, by summarizing our variables (see the summarizing data page). We then move on to examining how two or more variables vary together through bivariate (two variables) and multivariate (more than two variables) analyses.

Many new analysts want to jump right into examining correlations between two variables, or even into exploring how one variable may cause another. But, again, it is important to first understand our variables and their underlying processes before we move to more complicated analyses. If we skip that step, we can miss important insights about the data. We will see examples of the kinds of things univariate summaries can help us identify as we move through the term.

So again, recall we want to provide univariate summaries of our data before we even engage in bivariate summaries. What are the tools that we can use to perform bivariate summaries? Like with univariate descriptive statistics, we can use a number of visual or numerical representations. Let’s start with numerical.

11.1 Numerical representations

11.1.1 Cross tabulations (crosstabs)

One really great option for summarizing the relationship between categorical (discrete) variables is through a cross tabulation. The cross tabulation allows us to examine how various values of one variable map onto the values of another variable.

Let’s go through an example. Say we want to examine whether Barack Obama voters in the 2012 presidential election are more likely to believe that the police treat White citizens better than Black citizens.

There are a few ways that we can create a crosstab in R:

  1. table() function: this is the simplest code, but it also provides the least visual information

I am going to use the table() function and tell it to take the vote12 and disc_police variables from the nes dataset to create a crosstab. I tell it how to access those variables from the nes dataset by using the $ operator (e.g., nes$vote12 and nes$disc_police).

# Create crosstab with table
table( # use the table function
    nes$vote12, # from the nes dataset, use the vote12 variable
    nes$disc_police # from the nes dataset, use the disc_police variable
)
              
               Treats whites much better Treats whites moderately better
  Mitt Romney                         14                              48
  Barack Obama                       245                              94
  Someone Else                        11                              11
  Not asked                           67                              28
              
               Treats whites a little better Treats both the same
  Mitt Romney                             86                  204
  Barack Obama                            72                   86
  Someone Else                            17                   28
  Not asked                               46                   92
              
               Treats blacks a little better Treats blacks moderately better
  Mitt Romney                              9                               1
  Barack Obama                             4                               4
  Someone Else                             0                               1
  Not asked                                4                               2
              
               Treats Blacks much better Skipped
  Mitt Romney                          2       0
  Barack Obama                         0       0
  Someone Else                         0       0
  Not asked                            1       1
  2. xtabs() and ftable() functions: a bit more code, but easier to read
    1. I am going to first create the crosstab using the xtabs() function. I tell it to make a crosstab with the vote12 and disc_police variables, and I tell it which dataset these variables are in by using the data option. xtabs() takes a formula, so I put the ~ symbol at the front and list the variables to its right, separated by +. The first variable (vote12) defines the rows and the second variable (disc_police) defines the columns.
    2. I can then make the crosstab a bit more readable by passing the result of my xtabs() function to the ftable() function.

Notice that I do the second step in two different ways in the code below. In the first option, I store the crosstab produced by the xtabs() function in an object called tab, and I then tell the ftable() function to use the tab object to make the table. In the second option, I use the pipe operator (%>%) instead. I tell R to grab the nes dataset and then apply the xtabs() function to the vote12 and disc_police variables in it. I then use %>% again to pass the result of the xtabs() function directly to the ftable() function. This code is a bit more compact and easier to read. Either approach gets you the same result.

# Option 1: Less compact code
tab <- xtabs( # store the result in tab
    ~ vote12 + disc_police, # use the vote12 and disc_police variables
    data = nes # use the nes dataset for these variables
)
ftable( # Apply the ftable function to...
    tab # ... the tab object (defined above)
)

# Option 2: More compact code
nes %>% # grab the nes dataset
    xtabs( # perform the crosstab..
        ~ vote12 + disc_police, # use the vote12 and disc_police variables
    data = . # and use the data connected by the pipe operator (can just use .)
    ) %>%
    ftable()  #... and then make it look nicer with ftable
             disc_police Treats whites much better Treats whites moderately better Treats whites a little better Treats both the same Treats blacks a little better Treats blacks moderately better Treats Blacks much better Skipped
vote12                                                                                                                                                                                                                               
Mitt Romney                                     14                              48                            86                  204                             9                               1                         2       0
Barack Obama                                   245                              94                            72                   86                             4                               4                         0       0
Someone Else                                    11                              11                            17                   28                             0                               1                         0       0
Not asked                                       67                              28                            46                   92                             4                               2                         1       1
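
To see that table() and xtabs() count the same thing, here is a minimal sketch on a made-up data frame. The toy object and its vote and view columns are hypothetical and are not part of the nes dataset; they only illustrate that the first variable becomes the rows and the second becomes the columns in both functions.

```r
# Hypothetical toy data: five made-up respondents (not from the nes dataset)
toy <- data.frame(
    vote = c("Obama", "Romney", "Obama", "Obama", "Romney"),
    view = c("Whites better", "Same", "Same", "Whites better", "Same")
)

# Crosstab with table(): first argument = rows, second argument = columns
tab_table <- table(toy$vote, toy$view)

# Crosstab with xtabs(): first variable after ~ = rows, second = columns
tab_xtabs <- xtabs(~ vote + view, data = toy)

# Both approaches should hold identical counts, cell by cell
same_counts <- all(as.vector(tab_table) == as.vector(tab_xtabs))
print(tab_xtabs)
print(same_counts)
```

If same_counts is TRUE, the choice between the two functions is only about how you prefer to write the code and how the output is labelled.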
  3. datasummary_crosstab() from the modelsummary package (Arel-Bundock 2022): the code is much cleaner, and so is the output.

A third option for making a crosstab is to use the datasummary_crosstab() function from the modelsummary package. I personally think these crosstabs are a bit easier to read and a bit cleaner. As they are tables produced from the same package we used to do univariate descriptive statistics, we can use very similar code to produce these crosstabs!

First, I am going to take the nes dataset and tell R to apply the datasummary_crosstab() function to it. In the options for the function, I tell it to use the disc_police and vote12 variables; the variable to the left of the ~ (disc_police) defines the rows and the variable to the right (vote12) defines the columns. I also tell it to add a note to the bottom of the table that lets the reader know which dataset I used. Since I am using the pipe operator (%>%), R already knows that I am working with the nes dataset, so I can use the shortcut . for the data option instead of writing the name out. Finally, I tell R to save the crosstab in a Word document called bivariate_crosstab.docx. I can then open the file after I’ve run the code and copy and paste the results into my main report or assignment for the class!

nes %>% # grab the nes dataset
    datasummary_crosstab( # use the datasummary_crosstab on it
        disc_police ~ vote12, # using the disc_police and vote12 variables
        notes = "Data source: NES dataset.", # add notes to table
        data = ., # using the dataset connected by the pipe operator
        output = "bivariate_crosstab.docx" # store result in the word document bivariate_crosstab.docx
    )
disc_police                             Mitt Romney  Barack Obama  Someone Else  Not asked    All
Treats whites much better        N               14           245            11         67    337
                                 % row          4.2          72.7           3.3       19.9  100.0
Treats whites moderately better  N               48            94            11         28    181
                                 % row         26.5          51.9           6.1       15.5  100.0
Treats whites a little better    N               86            72            17         46    221
                                 % row         38.9          32.6           7.7       20.8  100.0
Treats both the same             N              204            86            28         92    410
                                 % row         49.8          21.0           6.8       22.4  100.0
Treats blacks a little better    N                9             4             0          4     17
                                 % row         52.9          23.5           0.0       23.5  100.0
Treats blacks moderately better  N                1             4             1          2      8
                                 % row         12.5          50.0          12.5       25.0  100.0
Treats Blacks much better        N                2             0             0          1      3
                                 % row         66.7           0.0           0.0       33.3  100.0
Skipped                          N                0             0             0          1      1
                                 % row          0.0           0.0           0.0      100.0  100.0
All                              N              364           505            68        241   1178
                                 % row         30.9          42.9           5.8       20.5  100.0
Data source: NES dataset. Credit: damoncroberts.com
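
Each “% row” line in the table above is a count divided by its row total. As a minimal sketch of that arithmetic, the code below rebuilds the first two rows of the table by hand and computes row percentages with base R’s prop.table(). The counts object is typed in here for illustration; in practice you would pass a table made from the nes dataset instead.

```r
# Counts copied from the first two rows of the crosstab above
counts <- matrix(
    c(14, 245, 11, 67,
      48,  94, 11, 28),
    nrow = 2,
    byrow = TRUE,
    dimnames = list(
        disc_police = c("Treats whites much better",
                        "Treats whites moderately better"),
        vote12 = c("Mitt Romney", "Barack Obama", "Someone Else", "Not asked")
    )
)

# margin = 1 divides each cell by its row total; multiply by 100 for percentages
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
print(row_pct)
```

Using margin = 2 instead would divide by the column totals, which answers a different question (what share of each candidate’s voters gave a particular response), so it is worth deciding which comparison you need before reading the percentages.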

Notice how we’ve got a few different options to choose from here. One additional thing that makes our interpretation a bit easier is to clean the variables. Remember that the primary goal of my analysis from the start was to examine whether people who voted for Barack Obama in the 2012 election are more likely to believe that the police treat White citizens better than Black citizens. This implies that we can compare folks who voted for Obama to those who did not, and only on whether or not they believe that the police treat Black Americans better. For my particular analysis here, I can therefore turn each of these two variables into a dichotomous variable: one variable that reflects whether the respondent voted for Barack Obama or not, and a second variable that reflects whether the respondent believes that the police treat Black Americans better than Whites or does not believe that.

Recall from the previous exercise on the basics of cleaning data that we can take advantage of the super nifty functions provided by the dplyr (Wickham et al. 2023) package for data cleaning.

I am going to take the nes dataset and am going to select() the two columns I am interested in examining.

Since I want to make new columns from the vote12 and disc_police columns, I can use the mutate() function. First, I am going to create a new column called vote12_clean that has a value of 1 if the respondent voted for Obama and a value of 0 if they voted for Mitt Romney or Someone Else. Respondents who were not asked do not match any of these conditions, so they end up as NA.

Then, I am going to add value labels with the factor() function where people that have a value of 0 for the newly created vote12_clean variable have a label of “Not Obama” and a label of “Obama” if they did vote for Obama (a value of 1).

I am going to do a similar thing with the disc_police variable. For people that do not indicate that the police treat Black Americans better, they will have a value of 0 in the new disc_police_clean variable. Those that say that police do treat Black Americans better will now have a value of 1 in the new disc_police_clean variable. I am going to use the factor() function to add labels to the values of 0 and 1 to make my life easier. I am going to tell it to add a label “Does not treat better” to the smallest value in the new disc_police_clean variable – so 0 – and add the label “Does treat better” to those who have a value of 1 in the new disc_police_clean variable.

# Clean the variables of interest in the nes dataset
nes_clean <- nes %>% # from the nes dataset...
    select( # ... select the following variables...
        vote12, # ... the vote12 variable...
        disc_police #... and the disc_police variable
    ) %>% # ... after they've been selected...
    mutate( #... I am now going to "mutate" or transform these columns into a new one
        #* vote12 - did the respondent vote for Obama in '12 or not?
            #** Coded as: 
                #*** 1 = Mitt Romney
                #*** 2 = Barack Obama
                #*** 3 = Someone Else
                #*** 4 = Not asked
            #** Recoded to:
                #*** 0 = Mitt Romney/Someone Else
                #*** 1 = Barack Obama
                #*** NA = Not Asked
        vote12_clean = case_when(
            vote12 == "Mitt Romney" ~ 0, # make values of Mitt Romney equal to 0
            vote12 == "Barack Obama" ~ 1, # make values of Barack Obama equal to 1
            vote12 == "Someone Else" ~ 0 # make values of Someone Else equal to 0
        ),
        vote12_clean = factor(
            vote12_clean, # take the newly created column
            labels = c(
                "Not Obama", # add this label to values of 0
                "Obama" # add this label to values of 1
            ),
            ordered = TRUE # the numerical value order matters here
        ),
        #* disc_police - belief that police treat Black Americans better than Whites
            #** Coded as:
                #*** 1 = Treat Whites much better
                #*** 2 = Treat Whites moderately better
                #*** 3 = Treats Whites a little better
                #*** 4 = Treats both the same 
                #*** 5 = Treats Blacks a little better
                #*** 6 = Treats Blacks moderately better
                #*** 7 = Treats Blacks much better
            #** Recoded to:
                #*** 0 = Treats Whites [much/moderately/a little better]/treats both the same
                #*** 1 = Treats Blacks [much/moderately/a little better]
        disc_police_clean = case_when(
            disc_police == "Treats whites much better" ~ 0, # make this response equal to 0
            disc_police == "Treats whites moderately better" ~ 0, # make this response equal to 0
            disc_police == "Treats whites a little better" ~ 0, # make this response equal to 0
            disc_police == "Treats both the same" ~ 0, # make this response equal to 0
            disc_police == "Treats blacks a little better" ~ 1, # make this response equal to 1
            disc_police == "Treats blacks moderately better" ~ 1, # make this response equal to 1
            disc_police == "Treats Blacks much better" ~ 1 # make this response equal to 1
        ),
        disc_police_clean = factor(
            disc_police_clean, # take the newly created column
            labels = c(
                "Does not treat better", # add this label if equal to 0
                "Does treat better" # add this label if equal to 1
            ),
            ordered =  TRUE # the numerical value order matters here
        )
    )
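
Before moving on, it can help to check how case_when() treats values that match none of the conditions. The following is a minimal sketch on a made-up four-row data frame (the toy object is hypothetical, not the nes dataset). One difference from the code above is an assumption of mine: I pass levels = c(0, 1) to factor() explicitly, so that the labels do not depend on which values happen to be present in the data.

```r
library(dplyr)

# Hypothetical toy data: one respondent per response category
toy <- data.frame(
    vote12 = c("Mitt Romney", "Barack Obama", "Someone Else", "Not asked")
)

toy_clean <- toy %>%
    mutate(
        # Values that match no condition ("Not asked") become NA
        vote12_num = case_when(
            vote12 == "Barack Obama" ~ 1,
            vote12 %in% c("Mitt Romney", "Someone Else") ~ 0
        ),
        # Attach labels; levels are stated explicitly (0 first, then 1)
        vote12_clean = factor(
            vote12_num,
            levels = c(0, 1),
            labels = c("Not Obama", "Obama")
        )
    )

print(toy_clean)
```

The “Not asked” row should show NA in both new columns, which is why such respondents appear in neither the “Not Obama” nor the “Obama” group in later tables.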

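Now I can make an even cleaner crosstab. Note that the All column still counts respondents who were not asked about their 2012 vote (they are NA in vote12_clean), which is why the Not Obama and Obama counts do not add up to it.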

nes_clean %>% # use the nes_clean dataset...
    datasummary_crosstab( # ... with the dataset, create a crosstab
        disc_police_clean ~ vote12_clean, #... and use the two cleaned variables
        notes = "Data source: NES dataset.", # add notes to table
        data = ., # and use the dataset connected by the pipe operator
        output = "cleaner_bivariate_crosstab.docx" # store the table in this word document
    )
disc_police_clean             Not Obama  Obama    All
Does not treat better  N            419    497   1149
                       % row       36.5   43.3  100.0
Does treat better      N             13      8     28
                       % row       46.4   28.6  100.0
All                    N            432    505   1178
                       % row       36.7   42.9  100.0
Data source: NES dataset. Credit: damoncroberts.com

11.2 Visual

11.2.1 Boxplots

Just as we can with univariate summaries, we can visually depict bivariate summaries. One of the more powerful ways to do this if you have at least one categorical variable is to use a boxplot. In the previous exercise, we used a boxplot for a univariate description. In that example, we only specified a variable for the x-axis. For a bivariate case, we can put a categorical variable on the x-axis and a variable on the y-axis.

Say we want to compare the murder rates of states that voted for Donald Trump in the 2016 Presidential election to those of states that did not. We can use a categorical variable that documents whether Trump won or lost the state and put it on the x-axis. For the y-axis, we can use a variable documenting the state’s murder rate.

We can use the states dataset as it has both of these variables. The first variable, trumpwin, documents whether Trump won or lost in a given state. The second variable, murderrate, documents the murder rate per capita in a state.

To make the box plot, we can run the following code:

states %>% # grab the states dataset
    mutate( # create a new column based on an existing one
        trumpwin_clean = factor( # create a column trumpwin_clean with value labels
            trumpwin, # use the existing trumpwin variable for this
            labels = c(
                "Loss", # for the first value (0), label it Loss
                "Won" # for the second value (1), label it Won
            ),
            ordered = TRUE # the order of the values and labels matter
        )
    ) %>% # after I've done this, now I can make a plot
    ggplot() + # make a ggplot object
    geom_boxplot( # make a box plot
        aes(
            x = trumpwin_clean, # where the x-axis is whether trump won or not
            y = murderrate # and the y-axis is the murder rate in the state
        ),
        fill = "#D3D3D3" # fill the boxplot with a light grey color
    ) +
    theme_minimal() + # add the minimal theme to look prettier
    labs( # clean up the labels
        x = "Trump won state", # change x-axis label to this
        y = "Murder rate per capita", # change y-axis label to this
        caption = "Data source: States dataset." # add caption
    )

Here is a breakdown of what this code does:

  1. I am going to grab the states dataset and use the pipe operator (%>%) to chain all of the code together. This makes it so that R knows that I am referring to the states dataset for the code connected by pipes.
  2. I am going to use some of the data cleaning skills learned in the previous exercise and add value labels to the trumpwin variable. I do this because I want it to be easier to keep track of what a 0 means for the trumpwin variable versus a 1. To add these value labels, I am going to use the built-in factor() function. With dplyr’s mutate() function (Wickham et al. 2023), I am able to tell R that I am going to create a new column using information from an existing variable. I create a new variable called trumpwin_clean that has the value labels defined by the factor() function: values of 0 in the trumpwin variable are labelled “Loss” in trumpwin_clean and values of 1 are labelled “Won”. I also set the ordered option equal to TRUE because I want to respect the order of the values in the original trumpwin variable.
  3. Now I can make a plot. I use the ggplot() function to tell R that, after it has done the previous things, I am going to make a plot.
  4. I use the geom_boxplot() function to tell R that the plot I want to make is a boxplot. Within that function, I provide a couple of options. I tell it that the x-axis of the boxplot should be the new variable that I created with the value labels, trumpwin_clean, and that the y-axis should be the murderrate variable. Finally, I tell it that I want the boxes to be filled in with a light grey color.
  5. I want to make the plot prettier, so I use the theme_minimal() function to apply that theme to the plot.
  6. I also add cleaner labels to the plot. I use the labs() function to tell it that I want the x-axis to be labelled “Trump won state”, the y-axis to be labelled “Murder rate per capita”, and to add a caption to the bottom that says “Data source: States dataset.”

Boxplot of Trump winning and murder rate for a state
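
A boxplot is a picture of a few numbers per group: the middle line is the median and the box edges are the first and third quartiles. As a rough numerical companion to the plot, here is a minimal sketch that computes those values by group with dplyr. The toy_states data frame and its eight murder rates are made up for illustration; they are not values from the states dataset.

```r
library(dplyr)

# Hypothetical toy data: eight made-up states (not the real states dataset)
toy_states <- data.frame(
    trumpwin = c(0, 0, 0, 0, 1, 1, 1, 1),
    murderrate = c(2.1, 3.4, 4.0, 5.2, 3.0, 4.8, 6.1, 7.5)
)

toy_summary <- toy_states %>%
    mutate(
        # Label 0 as "Loss" and 1 as "Won", as in the plotting code
        trumpwin_clean = factor(
            trumpwin,
            levels = c(0, 1),
            labels = c("Loss", "Won")
        )
    ) %>%
    group_by(trumpwin_clean) %>% # one summary row per group
    summarise(
        n = n(), # number of states in the group
        q1 = quantile(murderrate, 0.25), # lower edge of the box
        med = median(murderrate), # middle line of the box
        q3 = quantile(murderrate, 0.75) # upper edge of the box
    )

print(toy_summary)
```

Reading the table next to the plot is a useful check that the boxes show what you think they show.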

11.3 Scatterplot

If we have two continuous variables, then we can make a scatterplot instead.

Say, for example, we want to look at how people’s feelings toward Barack Obama relate to their feelings toward the LGBTQIA+ community.

Looking at the nes dataset, I see that there is a question that people were asked about their “temperature” toward Obama and toward Gay rights. These questions are referred to as feeling thermometers where respondents can choose on a scale from 0-100, how warmly they feel toward the person or group or how cold their feelings are toward them. There is a feeling thermometer for Obama, ftobama, and for the LGBTQIA+ community, ftgay, in the nes dataset.

Using functions from the ggplot2 package (Wickham 2016), I can make a scatter plot for these.

nes %>% # grab the nes dataset and apply the following
    ggplot() + # create a ggplot object
    geom_point( # make this plot a scatterplot
        aes(
            x = ftobama, # use the ftobama variable for the x-axis
            y = ftgay # use the ftgay variable for the y-axis
        ),
        shape = 21, # use a filled circle so that both color and fill apply
        color = "#000000", # add black for outline of points
        fill = "#D3D3D3", # add grey to fill of points
        alpha = 0.5 # add transparency to each point
    ) + 
    theme_minimal() + # use the minimal theme to make prettier
    labs( # adjust the labels
        x = "Feeling thermometer: Obama", # change the x-axis label to this
        y = "Feeling thermometer: LGBTQIA+", # change the y-axis label to this
        caption = "Data source: NES dataset." # add this as the caption
    )

Here’s what this code does:

  1. Take the nes dataset and apply the following to it.
  2. Make a ggplot() object
  3. Specifically, make a scatter plot with the geom_point() function. For this scatterplot, I want the x-axis to be the ftobama variable and the y-axis variable to be the ftgay variable. I want to make the outline color of the circles to be black and the fill color of the circles to be light grey. I want the dots to be somewhat transparent so I am going to set the alpha to 0.5.
  4. Then, I want to use the minimal theme for the plot, so I am going to use the theme_minimal() function.
  5. I also want to change the labels on the plot. So using the labs() function, I am going to tell the x-axis to be labelled as “Feeling thermometer: Obama”, the y-axis to be labelled as “Feeling thermometer: LGBTQIA+”, and add a caption to indicate that the data source is the “NES dataset.”

Figure 11.1: Feeling thermometer of Obama and the LGBTQIA+ community
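
One thing to keep in mind with scatterplots is that geom_point() can only draw a point when both variables are observed; rows with a missing value in either variable are dropped (ggplot2 prints a warning when this happens). As a minimal sketch, the code below counts how many rows would be plotted and how many would be dropped. The toy_nes data frame is made up for illustration, and whether ftobama and ftgay in the real nes dataset contain missing values is something you would need to check yourself.

```r
# Hypothetical toy data with some missing thermometer ratings
toy_nes <- data.frame(
    ftobama = c(0, 15, 50, NA, 85, 100),
    ftgay = c(10, NA, 50, 40, 90, 70)
)

# TRUE for rows where both variables are observed
plottable <- complete.cases(toy_nes[, c("ftobama", "ftgay")])

n_plotted <- sum(plottable) # rows that would appear as points
n_dropped <- sum(!plottable) # rows that would be left out of the plot

print(c(plotted = n_plotted, dropped = n_dropped))
```

Reporting the number of dropped rows alongside the figure lets the reader know how much of the sample the plot actually describes.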

11.4 Some final notes:

  • The choices and considerations involved in summarizing a single variable more than double when you summarize two variables together. You should think very carefully about what information you want to convey to your reader. This will help you know what steps you need to take (Should I clean a variable beforehand? Should I use a scatterplot or a boxplot? etc.)
  • You can find a complete working R file for this by going to this page.