WELCOME

In workshop #5 you’ll practice constructing and understanding regression models.


IMPORT DATASET

The baseball.csv file contains Major League Baseball Data from the 1986 and 1987 seasons. The data originally come from the ISLR package.

Using the following command in R, load the baseball.excel data into R and store it as a new object called hitters.

library(readxl) # import library
## Warning: package 'readxl' was built under R version 4.0.5
baseball <- read_excel("Baseball.xlsx") # import the baseball.xlsx file

DATA PREPARATION

First, you want to get to know the data set and see whether importing the dataset was successful and worked out correctly. To do this, first take a look at the first 6 rows of the dataset(s) by printing it/them to the console.

Which command allows you to do this? The output should look like the table below:

# add your command here

Next, you take a look at the structure of the data in order to see the number of observations the dataset baseball has, as well as the number and type of variables.

Which code can you use? The output should look like…or give you exactly this information!

Now, you see that this data contains 50 observations and 20 variables. Also, we can clearly see that most of them are numerical and few of them are characters (characters = text/string). Furthermore, none of the numerical variables are categorical ones, while all of the character variables are categorical. However, since the categorical variables are considered characters, they can’t be effectively used in further analysis. Therefore, we need to convert them into factors.

Note, due to the fact that we import the file from excel into R, R does not directly recognize categorical variables as factor and therefore we need to convert them ourselves.

Exercise 1: Can you please convert all three categorical variables into a factor? Remember from workshop 2 the code?

Does the code class tell you that all the 3 categorical variables have been changed into a factor? If yes, then perfect! Another way to check it is having a look again the structure of our baseball dataset.

str(baseball)
## tibble [50 x 20] (S3: tbl_df/tbl/data.frame)
##  $ AtBat    : num [1:50] 215 458 586 568 399 579 496 510 551 313 ...
##  $ Hits     : num [1:50] 51 114 159 158 102 174 119 126 171 78 ...
##  $ HmRun    : num [1:50] 4 13 12 20 3 7 8 2 13 6 ...
##  $ Runs     : num [1:50] 19 67 72 89 56 67 57 42 94 32 ...
##  $ RBI      : num [1:50] 18 57 79 75 34 78 33 44 83 41 ...
##  $ Walks    : num [1:50] 11 48 53 73 34 58 21 35 94 12 ...
##  $ Years    : num [1:50] 1 4 9 15 5 6 7 11 13 12 ...
##  $ CAtBat   : num [1:50] 215 1350 3082 8068 670 ...
##  $ CHits    : num [1:50] 51 298 880 2273 167 ...
##  $ CHmRun   : num [1:50] 4 28 83 177 4 32 36 44 128 35 ...
##  $ CRuns    : num [1:50] 19 160 363 1045 89 ...
##  $ CRBI     : num [1:50] 18 123 477 993 48 337 280 519 900 321 ...
##  $ CWalks   : num [1:50] 11 122 295 732 54 218 165 256 917 170 ...
##  $ League   : Factor w/ 2 levels "A","N": 1 1 2 2 1 2 2 2 2 2 ...
##  $ Division : Factor w/ 2 levels "E","W": 1 2 1 2 2 1 2 2 1 2 ...
##  $ PutOuts  : num [1:50] 116 246 181 105 211 ...
##  $ Assists  : num [1:50] 5 389 13 290 9 479 371 358 149 206 ...
##  $ Errors   : num [1:50] 12 18 4 10 3 5 29 20 5 7 ...
##  $ Salary   : num [1:50] 70 475 1043 775 80 ...
##  $ NewLeague: chr [1:50] "A" "A" "N" "N" ...

DATA EXPLORATION

Great. It seems like we are ready to run some basic summary statistics of our dataset now, in order to get a first impression of the data distribution.

Which code can you use? The output should look like…or give you exactly this information!

##      AtBat            Hits           HmRun            Runs       
##  Min.   : 19.0   Min.   :  1.0   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.:288.0   1st Qu.: 77.0   1st Qu.: 5.25   1st Qu.: 35.75  
##  Median :418.5   Median :105.0   Median :10.00   Median : 52.50  
##  Mean   :407.6   Mean   :108.3   Mean   :11.04   Mean   : 54.14  
##  3rd Qu.:541.8   3rd Qu.:147.2   3rd Qu.:15.50   3rd Qu.: 67.75  
##  Max.   :642.0   Max.   :211.0   Max.   :31.00   Max.   :107.00  
##       RBI             Walks           Years           CAtBat      
##  Min.   :  0.00   Min.   : 0.00   Min.   : 1.00   Min.   :  19.0  
##  1st Qu.: 33.00   1st Qu.:26.25   1st Qu.: 3.25   1st Qu.: 806.2  
##  Median : 46.00   Median :34.50   Median : 6.00   Median :1993.0  
##  Mean   : 50.60   Mean   :40.66   Mean   : 7.10   Mean   :2767.2  
##  3rd Qu.: 69.75   3rd Qu.:56.75   3rd Qu.:11.00   3rd Qu.:4730.2  
##  Max.   :116.00   Max.   :94.00   Max.   :16.00   Max.   :8068.0  
##      CHits          CHmRun           CRuns             CRBI       
##  Min.   :   4   Min.   :  1.00   Min.   :   2.0   Min.   :   3.0  
##  1st Qu.: 202   1st Qu.: 18.75   1st Qu.: 100.5   1st Qu.:  75.0  
##  Median : 542   Median : 38.00   Median : 277.0   Median : 264.0  
##  Mean   : 762   Mean   : 63.58   Mean   : 368.6   Mean   : 335.6  
##  3rd Qu.:1282   3rd Qu.: 83.00   3rd Qu.: 581.8   3rd Qu.: 475.8  
##  Max.   :2273   Max.   :305.00   Max.   :1135.0   Max.   :1234.0  
##      CWalks      League Division    PutOuts          Assists     
##  Min.   :  1.0   A:25   E:24     Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 65.0   N:25   W:26     1st Qu.: 116.0   1st Qu.:  7.0  
##  Median :163.0                   Median : 225.5   Median : 46.5  
##  Mean   :253.3                   Mean   : 286.5   Mean   :130.9  
##  3rd Qu.:380.5                   3rd Qu.: 312.0   3rd Qu.:216.5  
##  Max.   :917.0                   Max.   :1236.0   Max.   :487.0  
##      Errors          Salary        NewLeague        
##  Min.   : 0.00   Min.   :  70.0   Length:50         
##  1st Qu.: 5.00   1st Qu.: 167.5   Class :character  
##  Median : 7.50   Median : 542.5   Mode  :character  
##  Mean   : 9.54   Mean   : 608.8                     
##  3rd Qu.:12.75   3rd Qu.: 768.8                     
##  Max.   :29.00   Max.   :2127.3

At this point, we are very interested on exploring the Baseball dataset. We are wondering if there is a relationship between the number of Hits a player had and his Salary. Lets create a scatterplot for this.

library(ggplot2) # import ggplot2 library for visualisation
## Warning: package 'ggplot2' was built under R version 4.0.5
ggplot(baseball, aes(x = Hits, y = Salary)) +
  geom_point(col = "pink") +
  labs(title = "Is there any relationship between baseball player Hits vs Salary?",
       subtitle = "Plotting your data is the first step to figuring out",
       caption = "R course Venlo Course")

Note again, that library() only works if you have already installed the related package previously, in this case install.packages(“ggplot2”).

Interpretation

The y-axis is the amount of salary (the dependent variable is always on the y-axis) and the x-axis is the total Hits. Each pink dot represents one player the baseball dataset. Glancing at this data, you probably notice that salary are higher for players that hit a lot. That’s interesting to know, but:

by how much is the salary higher? How can we answer with some degree of certainty how much a player typically earns when he hits a certain amount of balls?

We can do this by drawing a line through the chart above, one that runs roughly through the middle of all the data points. We do this as follows:

ggplot(baseball, aes(x = Hits, y = Salary)) +
  geom_point(col = "pink") +
  geom_smooth(method='lm',  # Add linear regression line 
              se = FALSE) + # # Don't add shaded confidence region
  labs(title = "Is there any relationship between baseball player Hits vs Salary?",
       subtitle = "Plotting your data with the best fitted line",
       caption = "R course Venlo Course")
## `geom_smooth()` using formula 'y ~ x'

The resulting blue line (see plot below) is called the regression line and it’s the line that best fits the data. In other words, the blue line is the best explanation of the relationship between the independent variable (Hits) and dependent variable (Salary).

Exercise 2: Based on this plot, do you expect there to be a relationship between these variables? Which kind of test would you run in order to quantitatively support your answer?


In addition to drawing the line, in statistics, once you have calculated the slope and y-intercept to form the best-fitting regression line in a scatterplot, you can then interpret their values:

\(Y\)= \(\beta_{0}\) +\(\beta_{1}\)*\(x_{1}\) + \(\varepsilon\)

\(Salary\) = \(\beta_{0}\) +\(\beta_{1}\)*\(x_{Hits}\) + \(\varepsilon\)

In this case, \(\beta_{0}\) is the intercept, the value from where you start measuring (or in the figure the value where the regression line starts when X (in this case the Hits = 0). \(\beta_{1}\) is the slope of the best-fitting line. The slope measures the change of Salary with respect to the Hits as predicted by the line.

Stated differently, the line illustrates the salary increase (\(\beta_{1}\)) for every one more hits a player did achieve it.

At this point, I am kind of curious about the relationship: is there any possibility to suggest that we can predict the amount of money player earn and number of hits?

We are going to create a model that uses the number of hits a player does as an indicator of increasing in player salary Salary. I believe this model makes sense since we firstly show a linear relationship between both variables and I would assume that the better a player does, the higher increase in salary. Let’s go ahead and try to create our first simple linear regression


SIMPLE LINEAR REGRESSION

What is a linear regression?

A linear regression is a statistical model that analyzes the relationship between a dependent variable (Y) and one or more other variables (often explanatory variables or independent \(x_{1} + x_{2} + \cdots + x_{n}\)). You make this kind of relationships in your head all the time, for example when you calculate the age of a child based on her height, you are assuming the older she is, the taller she will be. Or just the example taken above as a way to hypothesize an expected relationship between Salary and Hits.

As mentioned before, we attempt to create a linear model to see this relationship. This model will predict Salary, using, in our case, the explanatory variable hits.

Using the guide below, we are going to conduct a regression analysis predicting a player’s Salary as a function of his Hits value. We are going to save the result to an object called baseball_simple:

baseball_simple <- lm(formula = Salary ~ Hits, # linear regression formula (Dependent ~ Independent)
                data = baseball) # the name of my dataset

With the command summary(baseball_simple) you can see detailed information on the model’s performance and coefficients.

summary(baseball_simple)
## 
## Call:
## lm(formula = Salary ~ Hits, data = baseball)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -546.9 -337.7 -127.9  183.6 1861.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  262.886    176.440   1.490   0.1428  
## Hits           3.195      1.499   2.131   0.0382 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 489.6 on 48 degrees of freedom
## Multiple R-squared:  0.08646,    Adjusted R-squared:  0.06743 
## F-statistic: 4.543 on 1 and 48 DF,  p-value: 0.0382

Interpretation focus on the coefficients

Coefficients

We can see the Coefficients of the baseball_simple as following:

baseball_simple$coefficients
## (Intercept)        Hits 
##  262.885665    3.194528

Thus, we have a our mathematical equation as salary variables as our dependent variable and our explanatory variable Hits.

Salary=B0+B1*Hits

\(Salary\) = \(\beta_{0}\) +\(\beta_{1}\)*\(x_{Hits}\) + \(\varepsilon\)

\(Salary\) = \(262.886\) +\(3.19\)*\(x_{Hits}\)

Intercept: This values will be always in the equation, unless you explictly erase it from the model

Hits: One variable included into our linear model. In this particular case we just have added one.

Exercise 3: You are asked to answer the question: “Is there a relationship between the number of home runs a player scores and salary” using simple linear regression.

1. Create a scatterplot showing the relationship (including a line) between HmRun and Salary using the code just learned. The output should look like this:

## `geom_smooth()` using formula 'y ~ x'

2. Conduct a regression analysis predicting a player’s Salary as a function of its HmRun value. Save the result to an object called baseball_simple1 and ask R to give you the output (which should look like below). Interpret the output.

3. Use the summary() function to print additional summary information from your baseball_simple1 object.

4. Interpret the quality of the model based on your baseball_simple1 model result please.

5. Using the code below, plot the relationship between the model fitted values and the observed values.

A nice tutorial for a simple regression is here


MULTIPLE REGRESSION

Multiple linear regression is an extension of simple linear regression. It is used to predict a dependent variable (y) on the basis of multiple explanatory/independent variables (x).

When including four predictor variables (x), the prediction of y is expressed by the following equation:

\(Salary\)= \(\beta_{0}\) +\(\beta_{1}\)\(x_{1}\) +\(\beta_{2}\)\(x_{2}\) +\(\beta_{3}\)\(x_{3}\) +\(\beta_{4}\)\(x_{4}\) + \(\varepsilon\)

The “\(\beta\)” values are called the regression weights (or \(\beta\) coefficients). They measure the association between the explanatory variables and the dependent variable.

Exercise 4: Conduct a regression analysis predicting a player’s Salary as a function of the following independent variables.

Independent variables from the baseball file:

  1. Hits - Number of hits in year
  2. CWalks - Number of walks during his career
  3. Assists - Number of assists in year
  4. Errors - Number of errors in year

To do this, use the formula = ‘y ~ x1 + x2 + x3 + X4’ notation. Assign the result to an object called baseball_multiple.

# add your command here

Again, create a summary() baseball_multiple in order to check its summary result!


This work is licensed under the CC BY-NC 4.0 Creative Commons License.