WELCOME

In workshop #5 you’ll practice constructing and understanding logistic regression models.


LOGISTIC REGRESSION

Case Study

A Data Science researcher at Maastricht University is interested in investigating how variables such as statistics exam score, responsible exam score, prestige of the undergraduate institution, and gender affect admission into the new graduate program launched at the Institute of Data Science, called Responsable Data Science. We are going to work with a binary response variable (y): admit/don’t admit.

Description of the data

We have generated hypothetical data, which can be obtained from this link

This dataset has a binary response (outcome, dependent) variable called admit. There are four explanatory variables: stats, responsible, rank, and female. We will treat the variables stats and responsible as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest. The variable female is 0 for male students and 1 for female students.

Import the data:

library(readxl)
## Warning: package 'readxl' was built under R version 4.0.5
ResponsableDataScience <- read_excel("InstituteofDataScience.xlsx")

Check the ResponsableDataScience variables:

str(ResponsableDataScience)
## tibble [400 x 5] (S3: tbl_df/tbl/data.frame)
##  $ admit      : num [1:400] 0 1 1 1 0 1 1 0 1 0 ...
##  $ responsible: num [1:400] 380 660 800 640 520 760 560 400 540 700 ...
##  $ stats      : num [1:400] 3.61 4.67 5 4.19 2.93 4 3.98 3.08 4.39 3.92 ...
##  $ rank       : num [1:400] 3 3 1 4 4 2 1 2 3 2 ...
##  $ female     : num [1:400] 0 1 1 1 0 1 1 0 1 0 ...

We can get basic descriptive statistics for the entire data set by using summary():

summary(ResponsableDataScience)
##      admit         responsible        stats            rank      
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.217   1st Qu.:2.000  
##  Median :0.0000   Median :580.0   Median :3.585   Median :2.000  
##  Mean   :0.3175   Mean   :587.7   Mean   :3.707   Mean   :2.485  
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:4.173   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :800.0   Max.   :5.000   Max.   :4.000  
##      female    
##  Min.   :0.00  
##  1st Qu.:0.00  
##  Median :0.00  
##  Mean   :0.33  
##  3rd Qu.:1.00  
##  Max.   :1.00
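
Since admit, rank, and female are coded as numbers, summary() treats them as continuous. For categorical variables, a contingency table is often more informative. As an optional sketch using base R’s xtabs(), we could cross-tabulate admission against institution rank:

# counts of admitted/not admitted students per institution rank
xtabs(~ admit + rank, data = ResponsableDataScience)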

The essence of logit models

Let’s first have a look at the relationship between the stats variable and whether or not the student was admitted to the graduate program:

## Plotting the data
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
ggplot(ResponsableDataScience, aes(stats, admit)) +
  geom_point()

Each dot represents a different student and shows how their grade in stats is connected with whether or not they were admitted into the Graduate Program. A value of 1 means the student was admitted; a value of 0 means they were not.

In order to fully understand the differences between linear and binomial regression, let’s first model this relationship using a linear regression model and add it to the plot:

ggplot(ResponsableDataScience, aes(stats, admit)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) + # linear regression, without plotting the standard error
  coord_cartesian(ylim = c(0,1)) # limit the plot
## `geom_smooth()` using formula 'y ~ x'

Super, quite an interesting plot! The linear model is not really a good fit, right? Indeed, it violates some of the assumptions of the linear model, such as linearity (the plot shows a non-linear relationship between the variables) and homoscedasticity (we can expect the variance of our residuals to change across values of our independent variable). You should always check these assumptions before conducting any statistical analysis. HOWEVER, for the sake of simplicity, we are not going to deal with that exercise today.

Want to know how to deal with that? For more information, please see this short tutorial.
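
If you are curious, a quick visual check is to fit the linear model explicitly and plot its residuals against the fitted values; a clear pattern (rather than a random scatter) suggests the assumptions do not hold. This is just an optional sketch, and the object name linmodel is our own choice:

# fit the linear model explicitly so we can inspect its residuals
linmodel <- lm(admit ~ stats, data = ResponsableDataScience)
plot(fitted(linmodel), resid(linmodel),
     xlab = 'Fitted values', ylab = 'Residuals')
abline(h = 0, lty = 2)  # residuals should scatter randomly around this line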

Thus, ignoring the fact that assumptions have been violated, let’s convert the blue line into a binomial curve in the same plot, now using the method glm (Generalized Linear Models, a family of models that extends linear regression to non-linear cases, such as the logistic model):

ggplot(ResponsableDataScience, aes(stats, admit)) +
  geom_point() +
  geom_smooth(method = 'glm', se = FALSE, method.args = list(family = 'binomial')) + # logistic fit, without plotting the standard error
  coord_cartesian(ylim = c(0,1)) # limit the plot
## `geom_smooth()` using formula 'y ~ x'

Note: if you get the message `geom_smooth() using formula ‘y ~ x’`, don’t worry; your plot should still be displayed correctly in RStudio.

As you might notice, this blue line appears to fit our data much better. Instead of a linear regression (a straight line), we have incorporated a binomial curve. This is essentially what we do in this kind of modelling, and this model looks a little better at predicting admission based on stats scores. Indeed, the binomial model predicts values close to 0 as not admitted, whereas it predicts students as admitted when the value is close to 1.
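
For intuition: the S-shaped curve that geom_smooth() draws here is the logistic (sigmoid) function, p = exp(a + b*x) / (1 + exp(a + b*x)), which maps any value of the linear predictor to a probability between 0 and 1. As a small sketch, you can draw its shape with base R’s plogis() (the range -6 to 6 is an arbitrary choice for illustration):

# the logistic function squeezes the whole real line into (0, 1)
curve(plogis(x), from = -6, to = 6,
      xlab = 'linear predictor (log odds)', ylab = 'probability')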

Simple logistic regression

Let’s now build a simple logistic regression formally, so we can see how the likelihood of a student being admitted to the graduate program changes with a one-unit change in stats score.

Let’s create a logit model and save our results as follows. The code is pretty similar to the linear regression one (though not identical ;)):

# MODEL ADMIT BY STATS
logitmodel <- glm(formula = admit ~ stats, # dependent variable ~ independent variable
                  data = ResponsableDataScience,
                  family = 'binomial')

Go ahead and look at the model’s output:

# call the result object
summary(logitmodel)
## 
## Call:
## glm(formula = admit ~ stats, family = "binomial", data = ResponsableDataScience)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1613  -0.1277  -0.0270   0.0424   3.1920  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -34.869      4.771  -7.308 2.71e-13 ***
## stats          8.708      1.205   7.227 4.93e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 117.27  on 398  degrees of freedom
## AIC: 121.27
## 
## Number of Fisher Scoring iterations: 8

Model Output Interpretation

  • Call, this is R reminding us what model we ran and what options we specified.

  • Deviance residuals, which are a measure of model fit. This part of the output shows the distribution of the deviance residuals for the individual cases used in the model.

  • Coefficients with their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. Both the intercept and stats are statistically significant. We can extract the model coefficients from the model output as follows:

coef(logitmodel)
## (Intercept)       stats 
##  -34.868533    8.707648

So 8.707648 is the log-odds coefficient of the variable stats, and it means that for a one-unit increase in stats, the log odds of being admitted to graduate school increase by 8.708.
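
You can verify this directly from the model: the difference in predicted log odds between two students whose stats scores differ by exactly one unit equals the coefficient. The scores 3 and 4 below are arbitrary illustrative choices:

# predicted log odds at stats = 3 and stats = 4
logodds <- predict(logitmodel,
                   newdata = data.frame(stats = c(3, 4)),
                   type = 'link')  # 'link' returns log odds
diff(logodds)  # about 8.708, the stats coefficient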

To derive odds ratios from log odds, you take the exponent of the coefficient (i.e. the inverse of the log):

exp(coef(logitmodel))
##  (Intercept)        stats 
## 7.190987e-16 6.049001e+03

Informally speaking, this number tells us that for a one-unit increase in stats grade, a student’s odds of being admitted to the graduate program are about 6049 times higher. That is how we interpret odds ratios. Learn more about log odds and odds ratios here.
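
To see what this means on the probability scale, we can compare predicted admission probabilities for stats scores one unit apart (the values 3.5 and 4.5 below are arbitrary illustrative choices):

# predicted admission probabilities for stats scores one unit apart
predict(logitmodel,
        newdata = data.frame(stats = c(3.5, 4.5)),
        type = 'response')

With the coefficients above, this gives roughly 0.01 versus 0.99: the odds ratio per unit stays constant, but the change in probability depends on where you are on the curve.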

Exercise 1 – Using the steps explained above, run a simple logistic regression model using responsible score as the independent variable and admit as the dependent variable. Interpret the results.

# insert your code here

This work is licensed under the CC BY-NC 4.0 Creative Commons License.