18 😸 Midterm review!


Below are some sample questions that you should be able to answer before you take the midterm! This does not necessarily reflect the questions you may be asked on the midterm.

Why might I want to take the log of a variable?

Answer

To reduce the skew in the data. Taking the log brings large values closer to smaller values, so a right-skewed variable looks more like a normal distribution.
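A minimal sketch of this idea, assuming a small made-up list of incomes (the numbers and the `skewness` helper are hypothetical, not from the course data). It compares a simple skewness measure before and after taking the natural log:

```python
import math
import statistics

# Hypothetical right-skewed data: household incomes in dollars (made up).
incomes = [18_000, 22_000, 25_000, 31_000, 40_000, 52_000, 75_000, 120_000, 900_000]

def skewness(values):
    """Population skewness: the average cubed z-score."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Taking the log compresses the very large value toward the rest.
log_incomes = [math.log(v) for v in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)
print(f"skew of raw incomes:    {raw_skew:.2f}")
print(f"skew of logged incomes: {log_skew:.2f}")
```

The logged variable should show a smaller positive skew than the raw one, because the single very large income no longer sits as far from the rest.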

What does the central tendency of a univariate descriptive statistic refer to?

Answer

It is one way of describing a value I might expect to get if I randomly grabbed an observation from my data.

What are two calculations I can make to describe the central tendency of a variable?

Answer

Mean
Median
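As a quick illustration, here is a sketch using made-up house prices (hypothetical numbers). One very expensive house pulls the mean up, while the median stays near the typical house:

```python
import statistics

# Hypothetical house prices in dollars; the last one is an outlier.
house_prices = [150_000, 180_000, 200_000, 220_000, 1_250_000]

mean_price = statistics.mean(house_prices)      # sum divided by count
median_price = statistics.median(house_prices)  # middle value once sorted

print(f"mean:   {mean_price}")
print(f"median: {median_price}")
```

When the two differ this much, the variable is probably skewed, and the median is often the better description of a typical observation.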

What does the dispersion or the spread of a univariate descriptive statistic refer to?

Answer

It goes beyond the average value of a variable in my data: it tells me how spread out the observations are. I need to know more than what the median house price is; I may want to know whether all houses are around that median or whether there are some really cheap houses and some really expensive houses.

What are the two calculations I can make to describe the spread or dispersion of a variable?

Answer

Variance (\(\sigma^2\))
Standard deviation (\(\sigma\))
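A small sketch with two made-up variables (hypothetical numbers) that share the same mean but differ in spread. It uses the population formulas, matching the \(\sigma\) notation above; the sample versions (`statistics.variance` and `statistics.stdev`) divide by n − 1 instead of n:

```python
import statistics

# Two hypothetical variables with the same mean (200) but different spread.
clustered = [190, 195, 200, 205, 210]
spread_out = [50, 100, 200, 300, 350]

for name, values in [("clustered", clustered), ("spread_out", spread_out)]:
    variance = statistics.pvariance(values)  # average squared distance from the mean
    std_dev = statistics.pstdev(values)      # square root of the variance
    print(f"{name}: mean={statistics.fmean(values)}, "
          f"variance={variance}, sd={std_dev:.2f}")
```

The standard deviation is in the same units as the variable itself, which usually makes it easier to interpret than the variance.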

When you are asked to describe a variable, what are the two things that you should include to describe it?

Answer

Central tendency
Spread/dispersion

I need both because neither is sufficient on its own. Without the central tendency (mean or median), I won’t understand what the bulk of observations look like on that variable; without the spread/dispersion (variance or standard deviation), I won’t understand how spread out observations are from that central tendency.

An independent variable refers to what?

Answer

The variable that we think explains, has an effect upon or predicts another variable.

A dependent variable refers to what?

Answer

The variable that we think is the outcome, is explained by, or is dependent on some other variable.

Does a bivariate regression refer to a regression including two variables or more than two variables?

Answer

Two variables. Bi – two; variate – variables

What is a confounding variable?

Answer

A variable that affects both the dependent and independent variable. It is not a variable that is affected by either of the two.

What plot is appropriate for describing the bivariate relationship between a categorical variable and a continuous variable?

Answer

A two-way boxplot! The categorical variable goes on the x-axis and the continuous variable would be on the y-axis. Make sure to know which plots are most appropriate for different types of data!

How would I interpret Table 18.1 from a bivariate regression model?

Table 18.1: The effect of family income on feelings toward Hillary Clinton
	(1)
(Intercept)	43.643***
	(1.218)
faminc	−0.032
	(0.035)
Num.Obs.	1178
R2	0.001
R2 Adj.	0.000
AIC	11822.2
BIC	11837.4
Log.Lik.	−5908.086
RMSE	36.47
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Data source: Waffles dataset (McElreath 2020).
Credit: damoncroberts.com
Coefficient estimates from OLS.
Standard errors in parentheses

Answer

For every one-unit increase in family income, I would expect a 0.032-unit decrease in favorable feelings toward Hillary Clinton. The probability that the estimated effect of family income on feelings toward Hillary Clinton would be this large or larger if the true effect were actually 0 is 37.336% (a p-value of about 0.373). Because that is far above 0.05, the estimate is not statistically significant, and it seems quite plausible that the effect of income on feelings toward Hillary Clinton is actually zero.
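To see roughly where a p-value like this comes from, here is a sketch that divides the coefficient by its standard error and looks up a two-sided probability. It assumes a normal approximation to the t distribution (reasonable with 1,178 observations, but still an assumption) and uses the rounded values printed in Table 18.1:

```python
from statistics import NormalDist

# Rounded values as printed in Table 18.1.
coef = -0.032
std_error = 0.035

# The test statistic: how many standard errors the estimate is from zero.
t_stat = coef / std_error

# Two-sided p-value under a standard normal approximation (an assumption;
# regression software would use the t distribution and unrounded inputs).
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t statistic: {t_stat:.3f}")
print(f"approximate p-value: {p_value:.3f}")
```

With these rounded inputs the result is about 0.36, close to the 37.336 figure quoted above; the small gap is plausibly due to rounding in the table. Either way, the value is far above 0.05.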

For Table 18.1, what does the constant represent?

Answer

It is the y-intercept: the predicted value of feelings toward Hillary Clinton (the dependent variable) when every independent and control variable in my model equals zero. In other words, it is the part of the prediction that does not depend on the variables I include. Here it reflects the baseline feeling toward Hillary Clinton (43.643) for people who have 0 family income.
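A short sketch of how the constant enters a prediction, using the two estimates printed in Table 18.1. The function name and the income value of 10 are hypothetical, chosen only for illustration:

```python
# Estimates as printed in Table 18.1.
INTERCEPT = 43.643  # the constant
SLOPE = -0.032      # coefficient on faminc

def predicted_feeling(faminc):
    """Predicted feeling toward Hillary Clinton for a given family income."""
    return INTERCEPT + SLOPE * faminc

at_zero = predicted_feeling(0)   # the slope term drops out, leaving the constant
at_ten = predicted_feeling(10)   # hypothetical income value of 10 units

print(f"predicted feeling at faminc = 0:  {at_zero:.3f}")
print(f"predicted feeling at faminc = 10: {at_ten:.3f}")
```

When family income is zero, the prediction is exactly the constant, which is why it is read as the baseline.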

What does the p-value of a model represent and what does it tell me about statistical significance?

Answer

The p-value states the probability that I’d observe an effect (my \(\beta\) coefficient) that large or larger if the actual effect is zero.

Smaller values mean it is more implausible that I would have come up with a \(\beta\) coefficient this large if the effect of the independent variable on the dependent variable were actually zero.

This means that the smaller the p-value, the better for statistical significance! Usually the standard is: if your p-value is less than 0.05, then you have a statistically significant result on your hands.

What is a residual?

Answer

It is the difference between the observed value (what I have in my data) and the predicted value I get from my regression model (or line of best fit if I plot it). It reflects how well my particular regression fits my data. The larger the residuals, the worse my model is doing at predicting my observed values.
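As a sketch, the following fits a line of best fit by hand to five made-up points (hypothetical data) and computes each residual as observed minus predicted:

```python
import statistics

# Hypothetical data: x is the independent variable, y the dependent variable.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar = statistics.fmean(x)
y_bar = statistics.fmean(y)

# Ordinary least squares estimates for a bivariate regression.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Residual = observed value minus the value predicted by the line.
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

for xi, yi, pi, ri in zip(x, y, predicted, residuals):
    print(f"x={xi}: observed={yi}, predicted={pi:.1f}, residual={ri:+.1f}")
```

Positive residuals are points above the line and negative residuals are points below it; with an intercept in the model, they sum to zero.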

Knowing this about a residual, what does my standard error tell me?

Answer

The standard error is an estimate of how uncertain we are about our model. It tells us, when I am wrong (when my residual is not equal to zero), just how “off” I am. When I am wrong, is my residual huge or small? The smaller the standard error, the better. It would mean that, if I am off, my residuals aren’t all that large on average.
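The link between residuals and standard errors can be sketched with made-up numbers (hypothetical data). In a bivariate OLS regression, the typical size of the residuals gives the residual standard error, and the standard error of the slope is built from it:

```python
import math
import statistics

# Hypothetical data for a bivariate regression.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = statistics.fmean(x)
y_bar = statistics.fmean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)

# OLS slope and intercept.
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Residual variance: squared residuals divided by n - 2
# (two parameters, the intercept and the slope, were estimated).
residual_variance = sum(r ** 2 for r in residuals) / (n - 2)
residual_se = math.sqrt(residual_variance)    # typical size of a residual
slope_se = math.sqrt(residual_variance / sxx) # uncertainty about the slope

print(f"residual standard error: {residual_se:.3f}")
print(f"standard error of slope: {slope_se:.3f}")
```

Larger residuals inflate both quantities, which is why a model that misses by a lot also produces less certain coefficient estimates. The number reported in parentheses in a regression table is the coefficient's standard error (`slope_se` here).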

Say I give you Table 18.2 to interpret, how would you go about doing that?

Table 18.2: The effect of family income on feelings toward Hillary Clinton, conditional on gender.
	(1)
(Intercept)	40.746***
	(1.749)
faminc	−0.069
	(0.053)
genderFemale	5.702*
	(2.429)
faminc × genderFemale	0.063
	(0.071)
Num.Obs.	1178
R2	0.010
R2 Adj.	0.007
AIC	11815.2
BIC	11840.6
Log.Lik.	−5902.616
RMSE	36.30
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Data source: Waffles dataset (McElreath 2020).
Credit: damoncroberts.com
Coefficient estimates from OLS.
Standard errors in parentheses

Answer

When I am looking at male respondents (when genderFemale equals zero), for every one-unit increase in family income, there is a 0.069-unit decrease in feelings toward Hillary Clinton. This effect does not appear to be statistically significant. Female respondents with zero family income tend to report feelings toward Hillary Clinton that are 5.702 points higher than those of males with zero family income. This difference is statistically significant at the 0.05 level (note the star next to the coefficient). The interaction term tells us that the effect of a one-unit increase in family income is 0.063 points higher for women than for men, so the slope for women is −0.069 + 0.063 = −0.006. This difference in slopes does not appear to be statistically significant.
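One way to keep the pieces of an interaction model straight is to write out the prediction equation. The sketch below plugs in the estimates printed in Table 18.2; the function name and the coding of `female` as 0 or 1 are assumptions made for illustration:

```python
# Estimates as printed in Table 18.2.
INTERCEPT = 40.746
B_FAMINC = -0.069       # slope of income for males (female = 0)
B_FEMALE = 5.702        # female-male gap when income is zero
B_INTERACTION = 0.063   # how much the income slope differs for females

def predicted_feeling(faminc, female):
    """Predicted feeling; female is assumed to be coded 1 for women, 0 for men."""
    return (INTERCEPT
            + B_FAMINC * faminc
            + B_FEMALE * female
            + B_INTERACTION * faminc * female)

slope_male = B_FAMINC
slope_female = B_FAMINC + B_INTERACTION  # income slope among women

print(f"baseline, male with zero income:   {predicted_feeling(0, 0):.3f}")
print(f"baseline, female with zero income: {predicted_feeling(0, 1):.3f}")
print(f"income slope for males:   {slope_male:.3f}")
print(f"income slope for females: {slope_female:.3f}")
```

Each coefficient on its own describes only one slice of the data: the `faminc` coefficient is the income slope for men, the `genderFemale` coefficient is the gender gap at zero income, and the interaction is the difference between the two income slopes.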