18  😸 Midterm review!

Below are some sample questions that you should be able to answer before you take the midterm! This does not necessarily reflect the questions you may be asked on the midterm.

To reduce the apparent skew in the data. It brings large values closer to smaller values, which makes the distribution look more like a normal distribution.
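The answer above does not name the transformation. As a minimal sketch, assuming it refers to a log transformation, the hypothetical right-skewed values below show how taking logs pulls large values toward small ones. The data and the `skewness` helper are illustrative assumptions, not part of the course materials.

```python
import math

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, positive for a right skew."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data: each value doubles the previous one.
raw = [1, 2, 4, 8, 16, 32, 64]
logged = [math.log(v) for v in raw]  # assumed transformation: natural log

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_logged:.2f}")
```

Because the logged values are evenly spaced, their skewness is essentially zero, while the raw values are clearly right-skewed.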

It is one way of describing a value I might expect to get if I randomly grabbed an observation from my data.

  • Mean
  • Median
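As a small illustration of how the two measures can disagree, the sketch below uses hypothetical house prices (not course data) in which one expensive house pulls the mean well above the median.

```python
from statistics import mean, median

# Hypothetical house prices in thousands of dollars; one house is very expensive.
prices = [150, 160, 170, 180, 900]

avg_price = mean(prices)    # sensitive to the one extreme value
mid_price = median(prices)  # the middle value once the prices are sorted

print("mean:", avg_price)
print("median:", mid_price)
```

With skewed data like this, the median is usually closer to what a randomly grabbed observation looks like.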

It goes beyond saying what the average value is for a variable in my data; it tells me how spread out the values are. I need to know more than what the median house price is; I may want to know whether all houses are around that median or if there are some really cheap houses and some really expensive houses.

  • Variance (\(\sigma^2\))
  • Standard deviation (\(\sigma\))

  • Central tendency
  • Spread/dispersion

I need to do this because one of these things on its own is not sufficient: without the central tendency (mean or median), I won't understand what the bulk of observations look like on that variable, and without the spread/dispersion (variance or standard deviation), I won't understand how spread out observations are from that central tendency.
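To see why both pieces are needed, here is a minimal sketch with two hypothetical sets of house prices that share the same mean but differ a great deal in spread. It uses the population formulas (`pvariance`, `pstdev`) to match the \(\sigma^2\) and \(\sigma\) notation above; the numbers are made up for illustration.

```python
from statistics import mean, pvariance, pstdev

# Two hypothetical sets of house prices (thousands of dollars) with the same mean.
tight = [190, 195, 200, 205, 210]
wide = [50, 100, 200, 300, 350]

for name, prices in [("tight", tight), ("wide", wide)]:
    print(
        name,
        "mean:", mean(prices),
        "variance:", pvariance(prices),
        "std dev:", round(pstdev(prices), 1),
    )
```

Both sets have a mean of 200, so the central tendency alone cannot tell them apart; only the variance or standard deviation shows that the second set is far more dispersed.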

The variable that we think explains, has an effect upon or predicts another variable.

The variable that we think is the outcome, is explained by, or is dependent on some other variable.

Two variables. Bi: two; variate: variables.

A variable that affects both the dependent and independent variable. It is not a variable that is affected by either of the two.

A two-way boxplot! The categorical variable goes on the x-axis and the continuous variable would be on the y-axis. Make sure to know which plots are most appropriate for different types of data!

Table 18.1: The effect of family income on feelings toward Hillary Clinton
              (1)
(Intercept)   43.643***
              (1.218)
faminc        −0.032
              (0.035)
Num.Obs.      1178
R2            0.001
R2 Adj.       0.000
AIC           11822.2
BIC           11837.4
Log.Lik.      −5908.086
RMSE          36.47
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Data source: Waffles dataset (McElreath 2020).
Credit: damoncroberts.com
Coefficient estimates from OLS.
Standard errors in parentheses.

For every one-unit increase in family income, I would expect a 0.032-point decrease in feelings toward Hillary Clinton. The probability that the effect of family income on feelings toward Hillary Clinton would be this large or larger if the true effect were actually 0 is 37.336%. Because that is well above the usual 5% threshold, it seems quite plausible that the effect of income on feelings toward Hillary Clinton is actually zero.
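The link between the coefficient, its standard error, and the p-value can be sketched roughly as follows. This uses the rounded values printed in Table 18.1 and a normal approximation to the t distribution (an assumption that is reasonable here only because there are 1178 observations), so it lands near 0.36 rather than exactly on the probability reported above, which comes from the unrounded estimates.

```python
import math

# Rounded values copied from Table 18.1.
coef = -0.032       # estimated effect of faminc
std_error = 0.035   # its standard error

# Test statistic: how many standard errors the estimate sits away from zero.
t_stat = coef / std_error

def normal_cdf(z):
    """Standard normal CDF, used as a large-sample stand-in for the t distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Two-sided p-value: chance of an estimate at least this far from zero
# (in either direction) if the true effect were zero.
p_value = 2 * (1 - normal_cdf(abs(t_stat)))

print(f"t = {t_stat:.3f}, approximate p = {p_value:.3f}")
```

The estimate is less than one standard error away from zero, which is why the p-value is so large.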

It is the expected value of feelings toward Hillary Clinton (the dependent variable) when the independent and control variables I include in my model all equal zero. Here, it reflects the baseline feeling toward Hillary Clinton for people that have 0 family income. It is the y-intercept.
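One way to see the intercept at work is to write the fitted line from Table 18.1 as a function. This is a sketch using the rounded coefficients from the table; the function name `predicted_feeling` and the income values passed to it are illustrative assumptions.

```python
# Rounded coefficients copied from Table 18.1.
INTERCEPT = 43.643
SLOPE_FAMINC = -0.032

def predicted_feeling(faminc):
    """Predicted feeling toward Hillary Clinton for a given family income value."""
    return INTERCEPT + SLOPE_FAMINC * faminc

# At zero family income the prediction is exactly the intercept.
print(predicted_feeling(0))
# Ten units of family income later, the prediction has dropped by 10 * 0.032.
print(round(predicted_feeling(10), 3))
```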

The p-value states the probability that I'd observe an effect (my \(\beta\) coefficient) that large or larger if the actual effect is zero.

Smaller values mean it is more implausible that I would have come up with a \(\beta\) coefficient this large if the effect of the independent variable on the dependent variable were actually zero.

This means that the smaller the p-value, the better for statistical significance! Usually the standard is: if your p-value is less than 0.05, then you have a statistically significant result on your hands.
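The thresholds in the notes under the regression tables can be written out as a small helper. This is only a sketch of that legend (+, *, **, ***) and of the 0.05 rule of thumb; the function names are hypothetical.

```python
def significance_marker(p_value):
    """Return the marker that the table notes assign to a p-value."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    if p_value < 0.1:
        return "+"
    return ""  # no marker: not significant even at the 0.1 level

def is_significant(p_value, alpha=0.05):
    """Apply the conventional cutoff: significant if p is below alpha."""
    return p_value < alpha

for p in (0.0004, 0.03, 0.08, 0.37):
    print(p, repr(significance_marker(p)), is_significant(p))
```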

It is the difference between the observed value (what I have in my data) and the predicted value I get from my regression model (or line of best fit if I plot it). It reflects how well my particular regression fits to my data. The larger the residuals, the worse my model is doing in predicting my observed values.
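As a rough sketch, a residual is just observed minus predicted. The example below uses the rounded fitted line from Table 18.1 and three hypothetical respondents whose incomes and feelings are invented for illustration.

```python
# Rounded coefficients copied from Table 18.1.
INTERCEPT = 43.643
SLOPE_FAMINC = -0.032

# Hypothetical respondents: (family income, observed feeling toward Clinton).
respondents = [(0, 50.0), (10, 40.0), (20, 85.0)]

residuals = []
for faminc, observed in respondents:
    predicted = INTERCEPT + SLOPE_FAMINC * faminc
    residual = observed - predicted  # positive: the model predicted too low
    residuals.append(residual)
    print(f"faminc={faminc:>2}  observed={observed:5.1f}  "
          f"predicted={predicted:6.3f}  residual={residual:7.3f}")
```

The third respondent has a large residual, meaning the line does a poor job of predicting that person's answer.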

The standard error is an estimate of how uncertain we are about our model. It tells us, when I am wrong (when my residual is not equal to zero), just how "off" I am. When I am wrong, is my residual huge or small? The smaller the standard error, the better: it would mean that, if I am off, my residuals aren't all that large on average.
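To make the connection between residuals and standard errors concrete, here is a minimal hand-computed OLS sketch on hypothetical data. It separates two related quantities: the residual standard error (the typical size of a residual) and the standard error of the slope, which is the kind of number reported in parentheses under a coefficient. The data and variable names are assumptions for illustration only.

```python
import math

# Hypothetical data, small enough to check by hand.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residuals: observed minus predicted.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
ssr = sum(r ** 2 for r in residuals)

# Divide by n - 2 because two parameters (intercept and slope) were estimated.
residual_se = math.sqrt(ssr / (n - 2))

# Standard error of the slope: shrinks when residuals are small
# or when x varies a lot.
slope_se = residual_se / math.sqrt(s_xx)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"residual standard error = {residual_se:.3f}")
print(f"standard error of slope = {slope_se:.3f}")
```

Smaller residuals feed directly into a smaller standard error for the slope, which is the sense in which a small standard error signals less uncertainty.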

Table 18.2: The effect of family income on feelings toward Hillary Clinton, conditional on gender.
                        (1)
(Intercept)             40.746***
                        (1.749)
faminc                  −0.069
                        (0.053)
genderFemale            5.702*
                        (2.429)
faminc × genderFemale   0.063
                        (0.071)
Num.Obs.                1178
R2                      0.010
R2 Adj.                 0.007
AIC                     11815.2
BIC                     11840.6
Log.Lik.                −5902.616
RMSE                    36.30
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Data source: Waffles dataset (McElreath 2020).
Credit: damoncroberts.com
Coefficient estimates from OLS.
Standard errors in parentheses.

When I am looking at male respondents (when genderFemale equals zero), for every one-unit increase in family income, there is a 0.069-point decrease in feelings toward Hillary Clinton. This effect does not appear to be statistically significant. Female respondents with zero family income tend to report feelings toward Hillary Clinton that are 5.702 points higher than those of males with zero family income. This difference is statistically significant at the 0.05 level (one star in the table). The interaction term tells us that the effect of a one-unit increase in family income is 0.063 points more positive for women than for men, so the slope for women is -0.069 + 0.063 = -0.006. This difference in slopes does not appear to be statistically significant.
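The pieces of the interaction model can be combined into a single prediction function. The sketch below uses the rounded coefficients from Table 18.2 and assumes genderFemale is coded 1 for female respondents and 0 for male respondents; the function and variable names are hypothetical.

```python
# Rounded coefficients copied from Table 18.2.
INTERCEPT = 40.746
B_FAMINC = -0.069
B_FEMALE = 5.702
B_INTERACTION = 0.063

def predicted_feeling(faminc, female):
    """Predicted feeling toward Hillary Clinton from the interaction model."""
    g = 1 if female else 0  # assumed coding: 1 = female, 0 = male
    return (INTERCEPT
            + B_FAMINC * faminc
            + B_FEMALE * g
            + B_INTERACTION * faminc * g)

# Effect of one more unit of family income, separately by gender.
slope_male = B_FAMINC
slope_female = B_FAMINC + B_INTERACTION

print("baseline, male:  ", predicted_feeling(0, female=False))
print("baseline, female:", round(predicted_feeling(0, female=True), 3))
print("income slope, male:  ", slope_male)
print("income slope, female:", round(slope_female, 3))
```

The female baseline is the intercept plus the genderFemale coefficient, and the female slope is the faminc coefficient plus the interaction coefficient.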