WELCOME

Welcome to workshop number 3: Introduction to inferential statistics with R.

Learning outcomes:

By the end of this demo, you should be able to:

Get started quickly with hypothesis testing using R Markdown

Artwork by @allison_horst

R Markdown provides a unified authoring framework for data science, combining your code, its results, and your prose commentary. R Markdown documents are fully reproducible and support dozens of output formats, like PDFs, Word files, slideshows, and more.

Using the Student’s t-test

R can handle the various versions of the t-test using the t.test() function. This function can be used for one-sample tests as well as for two-sample tests, paired or unpaired.
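
As a rough sketch of those variants, the calls below run a one-sample, an unpaired two-sample, and a paired test. The vectors `before` and `after` and the reference mean `mu = 12` are made-up values for illustration only; they are not workshop data.

```r
# Hypothetical measurements, invented for illustration only
before <- c(12.1, 11.4, 13.0, 12.6, 11.9, 12.8)
after  <- c(12.9, 11.8, 13.4, 13.1, 12.0, 13.5)

# One-sample test: is the mean of `before` different from 12?
one_sample <- t.test(before, mu = 12)

# Two-sample unpaired test (Welch's test by default, i.e. var.equal = FALSE)
unpaired <- t.test(before, after)

# Paired test: each element of `before` is matched with one in `after`
paired <- t.test(before, after, paired = TRUE)

# Each call returns an object of class "htest"; `method` names the test run
one_sample$method
unpaired$method
paired$method
```

Printing any of the three objects gives the familiar report with the t statistic, degrees of freedom, p-value and confidence interval.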

DEMO 1: (Unpaired) Independent samples t-test

Research question: Is there any (statistical) difference in the mean amount of orange juice poured into a 33 cl glass between Maastricht and Amsterdam coffee lovers?

The independent samples t-test compares the mean of one group with the mean of another, distinct group.
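
For reference, in its pooled-variance (Student) form, which assumes equal variances in both groups, the test statistic can be written as follows. Here \(\bar{x}_i\), \(s_i^2\) and \(n_i\) denote the mean, variance and size of sample \(i\); this notation is introduced here and is not part of the workshop material.

\[
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
\]

Under the null hypothesis of equal means, this statistic follows a t distribution with \(n_1 + n_2 - 2\) degrees of freedom.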


YOUR TURN

Complete the steps below to perform a t-test comparing means. In an R Markdown document, add the R code for each step.

Step 1: Formulate the hypothesis

  • The null hypothesis:

\(H_0:\) ….

  • The alternative hypothesis:

\(H_a:\) ….

Step 2: State the level of significance

The level of significance of a test, also called its alpha level or \(\alpha\), is the probability of making a false-positive (Type I) error, that is, of rejecting the null hypothesis when it is in fact true. The alpha level is typically chosen before the experiment and determines the critical value used to reject the null hypothesis in favor of the alternative, regardless of the observed data.
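
To make the link between \(\alpha\) and the critical value concrete, here is a minimal sketch. It assumes \(\alpha = 0.05\), a two-sided test, and 18 degrees of freedom (two groups of 10 observations with pooled variance); none of these values are fixed by the text above.

```r
# Assumed values, chosen for illustration
alpha <- 0.05   # significance level
dof   <- 18     # degrees of freedom: n1 + n2 - 2 with n1 = n2 = 10

# Two-sided critical value: reject H0 when |t| exceeds it
t_crit <- qt(1 - alpha / 2, df = dof)
round(t_crit, 3)
# 2.101
```

A smaller \(\alpha\) pushes the critical value further out, which makes the null hypothesis harder to reject.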

Step 3: Collect data from different cities

# Define the orange juice samples
orange_maastricht <- c(28, 31, 28, 37, 30,
                       33, 25, 33, 24, 30)

orange_amsterdam <- c(24, 31, 25, 37, 30,
                      33, 23, 32, 24, 30)

print(orange_amsterdam)
##  [1] 24 31 25 37 30 33 23 32 24 30
print(orange_maastricht)
##  [1] 28 31 28 37 30 33 25 33 24 30

Step 4: Conduct t-test for comparing means

# Read the official documentation for t.test()
?t.test
## starting httpd help server ... done

The output below shows the results of t.test():

## 
##  Two Sample t-test
## 
## data:  orange_maastricht and orange_amsterdam
## t = 0.51925, df = 18, p-value = 0.6099
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.046056  5.046056
## sample estimates:
## mean of x mean of y 
##      29.9      28.9
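
The call that produced this output is not shown. The title "Two Sample t-test" (rather than "Welch Two Sample t-test") and df = 18 are consistent with a pooled-variance test, so a plausible reconstruction, assuming `var.equal = TRUE` was used, is:

```r
orange_maastricht <- c(28, 31, 28, 37, 30, 33, 25, 33, 24, 30)
orange_amsterdam  <- c(24, 31, 25, 37, 30, 33, 23, 32, 24, 30)

# Assumption: equal variances, which gives the classic Student's t-test
res <- t.test(orange_maastricht, orange_amsterdam, var.equal = TRUE)
res
```

Without `var.equal = TRUE`, `t.test()` runs Welch's test, which does not assume equal variances and usually reports non-integer degrees of freedom.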

Step 5: Conclusion & Model Output Interpretation
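
One way to reach the conclusion is to compare the p-value with the chosen significance level. The sketch below assumes \(\alpha = 0.05\) and a pooled-variance test (`var.equal = TRUE`); both are assumptions, so adjust them to match your own Steps 2 and 4.

```r
orange_maastricht <- c(28, 31, 28, 37, 30, 33, 25, 33, 24, 30)
orange_amsterdam  <- c(24, 31, 25, 37, 30, 33, 23, 32, 24, 30)

alpha <- 0.05  # assumed significance level
res <- t.test(orange_maastricht, orange_amsterdam, var.equal = TRUE)

# Decision rule: reject H0 only when the p-value falls below alpha
if (res$p.value < alpha) {
  decision <- "Reject H0: the mean amounts differ between the two cities."
} else {
  decision <- "Fail to reject H0: no evidence of a difference in means."
}
decision

# The confidence interval for the difference in means tells the same story
res$conf.int
```

With the p-value of 0.6099 reported in Step 4, the null hypothesis is not rejected at the 5% level, and the 95% confidence interval (-3.05 to 5.05) includes 0.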


DEMO 2: Gapminder dataset

Research question: Is there any (statistical) difference in the mean Life Expectancy between South Africa and Ireland?

Import Gapminder dataset:

library(gapminder)
data("gapminder")

Filter for the two countries of interest:

library(dplyr) # import the necessary packages
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydat <- filter(gapminder, country == 'South Africa' | country == 'Ireland')

Get the average life expectancy for each country:

summarize(filter(mydat, country == 'South Africa'),
          Avg_Life_SouthAfrica = mean(lifeExp, na.rm = TRUE))
## # A tibble: 1 x 1
##   Avg_Life_SouthAfrica
##                  <dbl>
## 1                 54.0
summarize(filter(mydat, country == 'Ireland'),
          Avg_Life_Ireland = mean(lifeExp, na.rm = TRUE))
## # A tibble: 1 x 1
##   Avg_Life_Ireland
##              <dbl>
## 1             73.0

Conduct a t-test to compare the means:

t.test(lifeExp ~ country, data = mydat)
## 
##  Welch Two Sample t-test
## 
## data:  lifeExp by country
## t = 10.067, df = 19.109, p-value = 4.466e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  15.07022 22.97794
## sample estimates:
##      mean in group Ireland mean in group South Africa 
##                   73.01725                   53.99317
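
The object returned by `t.test()` can also be stored and inspected piece by piece, which helps when writing up results. A short sketch, reusing the same filtering step (the name `res` is introduced here for illustration):

```r
library(gapminder)
library(dplyr)

mydat <- filter(gapminder, country %in% c("South Africa", "Ireland"))

# Welch's test is the default because var.equal = FALSE
res <- t.test(lifeExp ~ country, data = mydat)

res$estimate   # group means
res$conf.int   # 95% confidence interval for the difference in means
res$p.value    # p-value of the two-sided test
```

Here the p-value is far below any conventional \(\alpha\), and the confidence interval (15.07 to 22.98 years) does not contain 0, so the difference in mean life expectancy is statistically significant.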

Scatter plot and Pearson correlation

Artwork by @allison_horst

In statistics, you deal with a lot of data, and the hard part is finding patterns in it. Several statistical tools can help, but before using any of them you should look for basic patterns. As you learned, a scatter plot and a correlation coefficient are a good way to do that.

Let’s create a simple scatter plot:

plot(gapminder$lifeExp,gapminder$gdpPercap)

Adding a fitted regression line to the scatter plot gives more information about the slope and direction of the relationship.

plot(gapminder$lifeExp,gapminder$gdpPercap)
abline(lm(gdpPercap~lifeExp,gapminder), col ="blue")

What about creating a scatter plot with the ggplot2 library?

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
ggplot(gapminder,aes(x = lifeExp, y = gdpPercap)) + 
  geom_point() + 
  facet_wrap(~continent) + 
  geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'

Another one:

ggplot(
  data = gapminder,
  mapping = aes(x = lifeExp, y = gdpPercap)
) +
  geom_point(alpha = .1) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Relationship between GDP per capita and Life Expectancy",
    subtitle = "Gapminder dataset - Worldwide",
    x = "Life Expectancy (years)",
    y = "GDP per capita (US$, inflation-adjusted)"
  ) +
  theme_minimal()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Because both variables are continuous, we will use Pearson’s product-moment correlation to quantify the strength of the relationship between them. There are a few ways to do this in R, but we will only consider one method here.

Please use the cor() function to calculate the correlation between the two variables:

# correlation between both variables
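
One possible answer, as a minimal sketch: `cor()` uses Pearson's method by default, and `method = "pearson"` is spelled out here only for clarity. The name `r` is introduced for illustration.

```r
library(gapminder)

# Pearson correlation between life expectancy and GDP per capita
r <- cor(gapminder$lifeExp, gapminder$gdpPercap, method = "pearson")
r
```

The coefficient lies between -1 and 1; its sign gives the direction of the linear relationship and its magnitude the strength.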

Now test whether the correlation between these two variables is statistically significant. Use cor.test():

# Calculating Pearson's product-moment correlation
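
A possible sketch of the test follows. `cor.test()` tests the null hypothesis that the true correlation is 0; the name `ct` is introduced here for illustration.

```r
library(gapminder)

# Pearson's product-moment correlation test (H0: true correlation is 0)
ct <- cor.test(gapminder$lifeExp, gapminder$gdpPercap, method = "pearson")

ct$estimate   # sample correlation coefficient
ct$conf.int   # 95% confidence interval for the correlation
ct$p.value    # p-value of the two-sided test
```

As with `t.test()`, the result is an "htest" object, so printing `ct` gives the full report in one go.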

This work is licensed under the CC BY-NC 4.0 Creative Commons License.