In workshop #5 you’ll practice constructing and understanding regression models.
Create simple and multiple linear regression models with lm()
.
Explore the outputs of your models, including coefficients, confidence intervals, and p-values.
Use your model to make predictions for new data.
The baseball.csv file contains Major League Baseball Data from the 1986 and 1987 seasons. The data originally come from the ISLR
package.
Using the following command in R, load the baseball.excel data into R and store it as a new object called hitters.
library(readxl) # import library
## Warning: package 'readxl' was built under R version 4.0.5
baseball <- read_excel("Baseball.xlsx") # import the baseball.xlsx file
First, you want to get to know the data set and see whether importing the dataset was successful and worked out correctly. To do this, first take a look at the first 6 rows of the dataset(s) by printing it/them to the console.
Which command allows you to do this? The output should look like the table below:
# add your command here
Next, you take a look at the structure of the data in order to see the number of observations the dataset baseball
has, as well as the number and type of variables.
Which code can you use? The output should look like…or give you exactly this information!
Now, you see that this data contains 50 observations and 20 variables. Also, we can clearly see that most of them are numerical and few of them are characters
(characters = text/string). Furthermore, none of the numerical variables are categorical ones, while all of the character variables are categorical. However, since the categorical variables are considered characters, they can’t be effectively used in further analysis. Therefore, we need to convert them into factors.
Note, due to the fact that we import the file from excel into R, R does not directly recognize categorical variables as factor and therefore we need to convert them ourselves.
Does the code class tell you that all the 3 categorical variables have been changed into a factor? If yes, then perfect! Another way to check it is having a look again the structure of our baseball dataset.
str(baseball)
## tibble [50 x 20] (S3: tbl_df/tbl/data.frame)
## $ AtBat : num [1:50] 215 458 586 568 399 579 496 510 551 313 ...
## $ Hits : num [1:50] 51 114 159 158 102 174 119 126 171 78 ...
## $ HmRun : num [1:50] 4 13 12 20 3 7 8 2 13 6 ...
## $ Runs : num [1:50] 19 67 72 89 56 67 57 42 94 32 ...
## $ RBI : num [1:50] 18 57 79 75 34 78 33 44 83 41 ...
## $ Walks : num [1:50] 11 48 53 73 34 58 21 35 94 12 ...
## $ Years : num [1:50] 1 4 9 15 5 6 7 11 13 12 ...
## $ CAtBat : num [1:50] 215 1350 3082 8068 670 ...
## $ CHits : num [1:50] 51 298 880 2273 167 ...
## $ CHmRun : num [1:50] 4 28 83 177 4 32 36 44 128 35 ...
## $ CRuns : num [1:50] 19 160 363 1045 89 ...
## $ CRBI : num [1:50] 18 123 477 993 48 337 280 519 900 321 ...
## $ CWalks : num [1:50] 11 122 295 732 54 218 165 256 917 170 ...
## $ League : Factor w/ 2 levels "A","N": 1 1 2 2 1 2 2 2 2 2 ...
## $ Division : Factor w/ 2 levels "E","W": 1 2 1 2 2 1 2 2 1 2 ...
## $ PutOuts : num [1:50] 116 246 181 105 211 ...
## $ Assists : num [1:50] 5 389 13 290 9 479 371 358 149 206 ...
## $ Errors : num [1:50] 12 18 4 10 3 5 29 20 5 7 ...
## $ Salary : num [1:50] 70 475 1043 775 80 ...
## $ NewLeague: chr [1:50] "A" "A" "N" "N" ...
Great. It seems like we are ready to run some basic summary statistics of our dataset now, in order to get a first impression of the data distribution.
Which code can you use? The output should look like…or give you exactly this information!
## AtBat Hits HmRun Runs
## Min. : 19.0 Min. : 1.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:288.0 1st Qu.: 77.0 1st Qu.: 5.25 1st Qu.: 35.75
## Median :418.5 Median :105.0 Median :10.00 Median : 52.50
## Mean :407.6 Mean :108.3 Mean :11.04 Mean : 54.14
## 3rd Qu.:541.8 3rd Qu.:147.2 3rd Qu.:15.50 3rd Qu.: 67.75
## Max. :642.0 Max. :211.0 Max. :31.00 Max. :107.00
## RBI Walks Years CAtBat
## Min. : 0.00 Min. : 0.00 Min. : 1.00 Min. : 19.0
## 1st Qu.: 33.00 1st Qu.:26.25 1st Qu.: 3.25 1st Qu.: 806.2
## Median : 46.00 Median :34.50 Median : 6.00 Median :1993.0
## Mean : 50.60 Mean :40.66 Mean : 7.10 Mean :2767.2
## 3rd Qu.: 69.75 3rd Qu.:56.75 3rd Qu.:11.00 3rd Qu.:4730.2
## Max. :116.00 Max. :94.00 Max. :16.00 Max. :8068.0
## CHits CHmRun CRuns CRBI
## Min. : 4 Min. : 1.00 Min. : 2.0 Min. : 3.0
## 1st Qu.: 202 1st Qu.: 18.75 1st Qu.: 100.5 1st Qu.: 75.0
## Median : 542 Median : 38.00 Median : 277.0 Median : 264.0
## Mean : 762 Mean : 63.58 Mean : 368.6 Mean : 335.6
## 3rd Qu.:1282 3rd Qu.: 83.00 3rd Qu.: 581.8 3rd Qu.: 475.8
## Max. :2273 Max. :305.00 Max. :1135.0 Max. :1234.0
## CWalks League Division PutOuts Assists
## Min. : 1.0 A:25 E:24 Min. : 0.0 Min. : 0.0
## 1st Qu.: 65.0 N:25 W:26 1st Qu.: 116.0 1st Qu.: 7.0
## Median :163.0 Median : 225.5 Median : 46.5
## Mean :253.3 Mean : 286.5 Mean :130.9
## 3rd Qu.:380.5 3rd Qu.: 312.0 3rd Qu.:216.5
## Max. :917.0 Max. :1236.0 Max. :487.0
## Errors Salary NewLeague
## Min. : 0.00 Min. : 70.0 Length:50
## 1st Qu.: 5.00 1st Qu.: 167.5 Class :character
## Median : 7.50 Median : 542.5 Mode :character
## Mean : 9.54 Mean : 608.8
## 3rd Qu.:12.75 3rd Qu.: 768.8
## Max. :29.00 Max. :2127.3
At this point, we are very interested on exploring the Baseball dataset. We are wondering if there is a relationship between the number of Hits
a player had and his Salary.
Lets create a scatterplot for this.
library(ggplot2) # import ggplot2 library for visualisation
## Warning: package 'ggplot2' was built under R version 4.0.5
ggplot(baseball, aes(x = Hits, y = Salary)) +
geom_point(col = "pink") +
labs(title = "Is there any relationship between baseball player Hits vs Salary?",
subtitle = "Plotting your data is the first step to figuring out",
caption = "R course Venlo Course")
Note again, that library() only works if you have already installed the related package previously, in this case install.packages(“ggplot2”)
.
Interpretation
The y-axis is the amount of salary (the dependent variable is always on the y-axis) and the x-axis is the total Hits. Each pink dot represents one player the baseball dataset. Glancing at this data, you probably notice that salary are higher for players that hit a lot. That’s interesting to know, but:
We can do this by drawing a line through the chart above, one that runs roughly through the middle of all the data points. We do this as follows:
ggplot(baseball, aes(x = Hits, y = Salary)) +
geom_point(col = "pink") +
geom_smooth(method='lm', # Add linear regression line
se = FALSE) + # # Don't add shaded confidence region
labs(title = "Is there any relationship between baseball player Hits vs Salary?",
subtitle = "Plotting your data with the best fitted line",
caption = "R course Venlo Course")
## `geom_smooth()` using formula 'y ~ x'
The resulting blue line (see plot below) is called the regression line and it’s the line that best fits the data. In other words, the blue line is the best explanation of the relationship between the independent variable (Hits)
and dependent variable (Salary)
.
In addition to drawing the line, in statistics, once you have calculated the slope and y-intercept to form the best-fitting regression line in a scatterplot, you can then interpret their values:
\(Y\)= \(\beta_{0}\) +\(\beta_{1}\)*\(x_{1}\) + \(\varepsilon\)
\(Salary\) = \(\beta_{0}\) +\(\beta_{1}\)*\(x_{Hits}\) + \(\varepsilon\)
In this case, \(\beta_{0}\) is the intercept, the value from where you start measuring (or in the figure the value where the regression line starts when X (in this case the Hits
= 0). \(\beta_{1}\) is the slope of the best-fitting line. The slope measures the change of Salary
with respect to the Hits
as predicted by the line.
Stated differently, the line illustrates the salary increase (\(\beta_{1}\)) for every one more hits a player did achieve it.
At this point, I am kind of curious about the relationship: is there any possibility to suggest that we can predict the amount of money player earn and number of hits?
We are going to create a model that uses the number of hits
a player does as an indicator of increasing in player salary Salary
. I believe this model makes sense since we firstly show a linear relationship between both variables and I would assume that the better a player does, the higher increase in salary. Let’s go ahead and try to create our first simple linear regression
What is a linear regression?
A linear regression is a statistical model that analyzes the relationship between a dependent variable (Y) and one or more other variables (often explanatory variables or independent \(x_{1} + x_{2} + \cdots + x_{n}\)). You make this kind of relationships in your head all the time, for example when you calculate the age of a child based on her height, you are assuming the older she is, the taller she will be. Or just the example taken above as a way to hypothesize an expected relationship between Salary
and Hits.
As mentioned before, we attempt to create a linear model to see this relationship. This model will predict Salary
, using, in our case, the explanatory variable hits
.
Using the guide below, we are going to conduct a regression analysis predicting a player’s Salary
as a function of his Hits
value. We are going to save the result to an object called baseball_simple
:
baseball_simple <- lm(formula = Salary ~ Hits, # linear regression formula (Dependent ~ Independent)
data = baseball) # the name of my dataset
With the command summary(baseball_simple)
you can see detailed information on the model’s performance and coefficients.
summary(baseball_simple)
##
## Call:
## lm(formula = Salary ~ Hits, data = baseball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -546.9 -337.7 -127.9 183.6 1861.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 262.886 176.440 1.490 0.1428
## Hits 3.195 1.499 2.131 0.0382 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 489.6 on 48 degrees of freedom
## Multiple R-squared: 0.08646, Adjusted R-squared: 0.06743
## F-statistic: 4.543 on 1 and 48 DF, p-value: 0.0382
Coefficients
We can see the Coefficients of the baseball_simple
as following:
baseball_simple$coefficients
## (Intercept) Hits
## 262.885665 3.194528
Thus, we have a our mathematical equation as salary
variables as our dependent variable and our explanatory variable Hits
.
Salary=B0+B1*Hits
\(Salary\) = \(\beta_{0}\) +\(\beta_{1}\)*\(x_{Hits}\) + \(\varepsilon\)
\(Salary\) = \(262.886\) +\(3.19\)*\(x_{Hits}\)
Intercept: This values will be always in the equation, unless you explictly erase it from the model
Hits: One variable included into our linear model. In this particular case we just have added one.
1. Create a scatterplot showing the relationship (including a line) between HmRun and Salary using the code just learned. The output should look like this:
## `geom_smooth()` using formula 'y ~ x'
2. Conduct a regression analysis predicting a player’s Salary
as a function of its HmRun
value. Save the result to an object called baseball_simple1 and ask R to give you the output (which should look like below). Interpret the output.
3. Use the summary()
function to print additional summary information from your baseball_simple1
object.
4. Interpret the quality of the model based on your baseball_simple1
model result please.
5. Using the code below, plot the relationship between the model fitted values and the observed values.
A nice tutorial for a simple regression is here
Multiple linear regression is an extension of simple linear regression. It is used to predict a dependent variable (y) on the basis of multiple explanatory/independent variables (x).
When including four predictor variables (x), the prediction of y is expressed by the following equation:
\(Salary\)= \(\beta_{0}\) +\(\beta_{1}\)\(x_{1}\) +\(\beta_{2}\)\(x_{2}\) +\(\beta_{3}\)\(x_{3}\) +\(\beta_{4}\)\(x_{4}\) + \(\varepsilon\)
The “\(\beta\)” values are called the regression weights (or \(\beta\) coefficients). They measure the association between the explanatory variables and the dependent variable.
Independent variables from the baseball file:
Hits
- Number of hits in yearCWalks
- Number of walks during his careerAssists
- Number of assists in yearErrors
- Number of errors in yearTo do this, use the formula = ‘y ~ x1 + x2 + x3 + X4’
notation. Assign the result to an object called baseball_multiple.
# add your command here
Again, create a summary() baseball_multiple
in order to check its summary result!
This work is licensed under the CC BY-NC 4.0 Creative Commons License.