In workshop #5 you’ll practice constructing and understanding logistic regression models.
A Data Science Researcher at Maastricht University is interested in investigating how variables such as statistics exam score, responsible exam score, prestige of the undergraduate institution, and gender affect admission into the new graduate program launched at the Institute of Data Science, called Responsible Data Science. We are going to work with a binary response variable (y): admit/don't admit.
We have generated hypothetical data, which can be obtained from this link.
This dataset has a binary response (outcome, dependent) variable called admit. There are four explanatory variables: stats, rank, responsible, and female. We will treat the variables stats and responsible as continuous. The variable rank takes on the values 1 through 4: institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest. The female variable is 0 for male students and 1 for female students.
Import the data:
library(readxl)
## Warning: package 'readxl' was built under R version 4.0.5
ResponsableDataScience <- read_excel("InstituteofDataScience.xlsx")
Check the ResponsableDataScience variables:
str(ResponsableDataScience)
## tibble [400 x 5] (S3: tbl_df/tbl/data.frame)
## $ admit : num [1:400] 0 1 1 1 0 1 1 0 1 0 ...
## $ responsible: num [1:400] 380 660 800 640 520 760 560 400 540 700 ...
## $ stats : num [1:400] 3.61 4.67 5 4.19 2.93 4 3.98 3.08 4.39 3.92 ...
## $ rank : num [1:400] 3 3 1 4 4 2 1 2 3 2 ...
## $ female : num [1:400] 0 1 1 1 0 1 1 0 1 0 ...
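A quick side note (this step is ours, not part of the workshop): rank is imported as numeric, but it is really an ordinal label for prestige. If you ever want R to treat it as categorical in a model, you could convert it to a factor. We assign to a new object here so that the outputs below still match:
# Optional: a categorical version of rank (not used in this workshop)
rank_as_factor <- factor(ResponsableDataScience$rank)
str(rank_as_factor)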
We can get basic descriptive statistics for the entire data set by using summary():
summary(ResponsableDataScience)
## admit responsible stats rank
## Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.217 1st Qu.:2.000
## Median :0.0000 Median :580.0 Median :3.585 Median :2.000
## Mean :0.3175 Mean :587.7 Mean :3.707 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:4.173 3rd Qu.:3.000
## Max. :1.0000 Max. :800.0 Max. :5.000 Max. :4.000
## female
## Min. :0.00
## 1st Qu.:0.00
## Median :0.00
## Mean :0.33
## 3rd Qu.:1.00
## Max. :1.00
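Note that summary() does not report the standard deviation. If you want that too (an optional extra, not required for this workshop), sapply() applies sd() to every column, which works here because all five columns are numeric:
# Standard deviation of each column
sapply(ResponsableDataScience, sd)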
Let's first have a look at the relationship between the stats variable and whether or not the student was admitted to the graduate program:
## Plotting the data
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
ggplot(ResponsableDataScience, aes(stats, admit)) +
geom_point()
Each dot represents a different student and shows how their grade in stats relates to whether or not they were admitted into the Graduate Program. A value of 1 marks students who were admitted and a value of 0 those who were not.
In order to fully understand the differences between linear and binomial regression, let's first model this relationship with a linear regression model. Let's try it and add it to the plot:
ggplot(ResponsableDataScience, aes(stats, admit)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) + # add a linear regression line; don't plot the standard error band
coord_cartesian(ylim = c(0,1)) # limit the plot
## `geom_smooth()` using formula 'y ~ x'
Super, quite an interesting plot! The linear model is not really a good fit, right? Indeed, it violates some of the assumptions of the linear model, such as linearity (i.e. this plot shows a non-linear relationship between the variables) and homoscedasticity (we could expect our residuals to vary across values of our independent variable). You are always required to fully check these assumptions before conducting any statistical analysis. HOWEVER, for the sake of simplicity, today we are not going to deal with that exercise.
Want to know how to deal with that? To get more information, please see this short tutorial.
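As a rough illustration (this sketch is ours, not part of the workshop), here is how you could start inspecting those two assumptions with R's built-in diagnostic plots, assuming you fit the linear model explicitly with lm():
# Fit the same linear model that geom_smooth drew above
lin_model <- lm(admit ~ stats, data = ResponsableDataScience)
plot(lin_model, which = 1) # residuals vs fitted: a curved pattern signals non-linearity
plot(lin_model, which = 3) # scale-location: a non-flat trend signals heteroscedasticity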
Thus, ignoring the fact that assumptions have been violated, let's convert the blue line into a binomial curve in the same plot, but now with the method glm (Generalized Linear Models, a family of models that are not linear, such as the logistic model):
ggplot(ResponsableDataScience, aes(stats, admit)) +
geom_point() +
geom_smooth(method = 'glm', se = FALSE, method.args = list(family = 'binomial')) + # binomial (logistic) fit; don't plot the standard error
coord_cartesian(ylim = c(0,1)) # limit the plot
## `geom_smooth()` using formula 'y ~ x'
Note: if you get the message `geom_smooth() using formula 'y ~ x'`, don't worry; your plot should still be displayed correctly in RStudio.
As you might notice, this blue line appears to fit our data much better. Instead of a linear regression (i.e. a straight line), we have now incorporated a binomial curve. This is essentially what we do in this kind of modelling; this model looks a little better at predicting admission based on stats scores. Indeed, the binomial model will predict values closer to 0 as not admitted, whereas it will predict students as admitted when the value is close to 1.
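For reference (the workshop does not spell out the formula, but this is the standard logistic model that geom_smooth fits with family = 'binomial'), the blue curve is
$$P(\text{admit}=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{stats})}}$$
The linear predictor β0 + β1 · stats lives on the log-odds scale, which is why the coefficients we obtain below are interpreted as log odds.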
Let's now formally build a simple logistic regression, so we can see how the likelihood that a student is admitted to the graduate program changes with a one-unit change in stats score.
Let's create a logit model to save our results as follows. The code is pretty similar to the linear regression one (though not identical ;)):
# MODEL ADMIT BY STATS
logitmodel <- glm(formula = admit ~ stats, # dependent variable ~ independent variable
data = ResponsableDataScience,
family = 'binomial')
Go ahead and look at the model's output:
# call the result object
summary(logitmodel)
##
## Call:
## glm(formula = admit ~ stats, family = "binomial", data = ResponsableDataScience)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1613 -0.1277 -0.0270 0.0424 3.1920
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -34.869 4.771 -7.308 2.71e-13 ***
## stats 8.708 1.205 7.227 4.93e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 117.27 on 398 degrees of freedom
## AIC: 121.27
##
## Number of Fisher Scoring iterations: 8
Model Output Interpretation
Call: this is R reminding us what model we ran and what options we specified.
Deviance residuals: a measure of model fit. This part of the output shows the distribution of the deviance residuals for the individual cases used in the model.
Coefficients, with their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. Both the intercept and stats are statistically significant. We can get the model coefficients from the model output as follows:
coef(logitmodel)
## (Intercept) stats
## -34.868533 8.707648
So 8.707648 is the log-odds coefficient of the variable stats, and it means that for a one-unit increase in stats, the log odds of being admitted to graduate school increase by 8.708.
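To make this concrete (a hypothetical worked example of ours, plugging in the coefficients printed above), we can turn the log odds for a given stats score into a predicted probability with the inverse logit:
# Predicted probability for a hypothetical student with stats = 4
log_odds <- -34.868533 + 8.707648 * 4 # intercept + slope * stats
plogis(log_odds) # inverse logit: about 0.49, i.e. roughly a 49% chance
# the model can do the same computation for us:
predict(logitmodel, newdata = data.frame(stats = 4), type = 'response')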
To derive odds ratios from log odds, you need to take the exponent of the coefficient (i.e. undo the log):
exp(coef(logitmodel))
## (Intercept) stats
## 7.190987e-16 6.049001e+03
Informally speaking, this number tells us that for a one-unit increase in stats grade, a student's odds of being admitted to the graduate program are about 6049 times higher. That is the way we interpret odds ratios. Learn more about log odds and odds ratios here.
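If you also want confidence intervals around these odds ratios (an optional extra, not covered in the workshop), you can combine coef() with confint(), which for a glm computes profile-likelihood intervals:
# Odds ratios with their 95% confidence intervals
exp(cbind(OR = coef(logitmodel), confint(logitmodel)))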
Now it is your turn: fit a simple logistic regression with responsible score as the independent variable and admit as the dependent variable. Interpret the results.
# insert here your code
This work is licensed under the CC BY-NC 4.0 Creative Commons License.