Welcome to workshop number 3: Introduction to inferential stats with R.
Learning outcomes:
By the end of this demo, you should be able to:
Create and open an R Markdown document.
Interpret the output of an independent-samples t-test for comparing means.
Calculate and interpret Pearson correlations and scatter plots.
R Markdown provides a unified authoring framework for data science, combining your code, its results, and your prose commentary. R Markdown documents are fully reproducible and support dozens of output formats, like PDFs, Word files, slideshows, and more.
R for Data Science: Read Chapter 27 to get more information about R Markdown.
If you want to get more familiar with working through a hypothesis test by hand, have a look at the following tutorial:
Statistics, probability, significance, likelihood: words mean what we define them to mean
R can handle the various versions of the t-test using the t.test() command. This function covers one-sample tests as well as two-sample tests, both paired and unpaired.
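As a rough sketch of those variants, the calls below use two small made-up vectors (`before` and `after` are hypothetical, not workshop data) to show the one-sample, independent two-sample, and paired forms. The reference value `mu = 12` is also just an assumption for illustration.

```r
# Hypothetical measurements, for illustration only
before <- c(12.1, 11.4, 13.0, 12.7, 11.9, 12.3)
after  <- c(12.9, 11.8, 13.4, 13.1, 12.0, 13.0)

# One-sample test: is the mean of `before` different from 12?
one_sample <- t.test(before, mu = 12)

# Two-sample test treating the vectors as independent groups
# (the default is Welch's test, which does not assume equal variances)
two_sample <- t.test(before, after)

# Paired test: each element of `before` is matched with the
# element at the same position in `after`
paired <- t.test(before, after, paired = TRUE)

# The `method` element records which variant was run
one_sample$method
two_sample$method
paired$method
```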
Research question: Is there any (statistical) difference in the mean amount of orange juice poured into a 33 cl glass between Maastricht and Amsterdam coffee lovers?
The independent-samples t-test compares the mean of one group to the mean of another, distinct group.
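For reference, under the additional assumption that both groups share the same variance (an assumption; Welch's version of the test relaxes it), the test statistic can be written as:

\[
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
\]

Here \(\bar{x}_i\), \(s_i^2\), and \(n_i\) are the sample mean, sample variance, and sample size of group \(i\), and \(t\) is compared against a t distribution with \(n_1 + n_2 - 2\) degrees of freedom.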
Complete the steps below to perform a t-test comparing means. Add the R code to an R Markdown document.
Step 1: Formulate the hypothesis
\(H0:\) ….
\(Ha:\) ….
Step 2: State the level of significance
The level of significance of a test, also called the alpha level or \(\alpha\) level, is the probability of making a false-positive error, that is, of rejecting the null hypothesis when it is in fact correct. The alpha level is typically decided in advance of an experiment and fixes the critical value used to reject the null hypothesis in favour of the alternative, regardless of the sample.
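To make the link between \(\alpha\) and the critical value concrete, here is a small sketch. It assumes a two-sided test at \(\alpha = 0.05\) with 18 degrees of freedom (two groups of 10 observations under a pooled-variance test); both values are assumptions chosen for illustration, and `reject_h0` is a helper introduced here.

```r
alpha <- 0.05   # assumed significance level
df <- 18        # assumed degrees of freedom: n1 + n2 - 2 with n1 = n2 = 10

# Two-sided test: split alpha across both tails of the t distribution
t_crit <- qt(1 - alpha / 2, df)
t_crit

# Reject H0 only if the observed |t| exceeds the critical value
reject_h0 <- function(t_obs) abs(t_obs) > t_crit
reject_h0(0.5)
reject_h0(2.5)
```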
Step 3: Collect data from different cities
# Define orange sample
orange_maastricht <- c(28,31,28,37,30,
33,25,33,24,30)
orange_amsterdam <- c(24,31,25,37,30,
33,23,32,24,30)
print(orange_amsterdam)
## [1] 24 31 25 37 30 33 23 32 24 30
print(orange_maastricht)
## [1] 28 31 28 37 30 33 25 33 24 30
Step 4: Conduct t-test for comparing means
# read official documentation for t-test method
?t.test()
## starting httpd help server ... done
The output below shows the results of t.test():
##
## Two Sample t-test
##
## data: orange_maastricht and orange_amsterdam
## t = 0.51925, df = 18, p-value = 0.6099
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.046056 5.046056
## sample estimates:
## mean of x mean of y
## 29.9 28.9
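The call that produced this output is not shown. A likely reconstruction, given the "Two Sample t-test" title and df = 18, is a pooled-variance test with `var.equal = TRUE`; treat that argument as an inference from the output rather than the workshop's confirmed code. The name `orange_test` is introduced here.

```r
orange_maastricht <- c(28, 31, 28, 37, 30, 33, 25, 33, 24, 30)
orange_amsterdam  <- c(24, 31, 25, 37, 30, 33, 23, 32, 24, 30)

# var.equal = TRUE requests the pooled-variance (Student) test with
# df = n1 + n2 - 2; without it, t.test() defaults to Welch's test
orange_test <- t.test(orange_maastricht, orange_amsterdam, var.equal = TRUE)
orange_test
```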
Step 5: Conclusion & Model Output Interpretation
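One way to phrase the conclusion is to compare the p-value with the chosen significance level. The sketch below assumes \(\alpha = 0.05\) and the pooled-variance test (`var.equal = TRUE`); both are assumptions, since the workshop leaves them open.

```r
orange_maastricht <- c(28, 31, 28, 37, 30, 33, 25, 33, 24, 30)
orange_amsterdam  <- c(24, 31, 25, 37, 30, 33, 23, 32, 24, 30)

alpha <- 0.05  # assumed significance level
res <- t.test(orange_maastricht, orange_amsterdam, var.equal = TRUE)

# Decision rule: reject H0 when the p-value is below alpha
if (res$p.value < alpha) {
  conclusion <- "Reject H0: the mean amounts differ between the two cities."
} else {
  conclusion <- "Fail to reject H0: no evidence of a difference in mean amounts."
}
conclusion

# The 95% confidence interval for the difference in means tells the same story:
# it contains 0 exactly when the two-sided test is not significant at alpha = 0.05
res$conf.int
```

With the reported p-value of 0.6099 and a confidence interval that spans 0, the second branch applies: the samples give no evidence that the mean amount poured differs between Maastricht and Amsterdam.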
Research question: Is there any (statistical) difference in mean life expectancy between South Africa and Ireland?
Import Gapminder dataset:
library(gapminder)
data("gapminder")
Filter for the two countries of interest:
library(dplyr) # import the necessary packages
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydat <- filter(gapminder, country == 'South Africa' | country == 'Ireland')
Get the average for each country:
summarize(filter(mydat, country == 'South Africa'),
Avg_Life_SouthAfrica = mean(lifeExp, na.rm = TRUE))
## # A tibble: 1 x 1
## Avg_Life_SouthAfrica
## <dbl>
## 1 54.0
summarize(filter(mydat, country == 'Ireland'),
Avg_Life_Ireland = mean(lifeExp, na.rm = TRUE))
## # A tibble: 1 x 1
## Avg_Life_Ireland
## <dbl>
## 1 73.0
Conduct the t-test for comparing means:
t.test(data = mydat, lifeExp ~ country)
##
## Welch Two Sample t-test
##
## data: lifeExp by country
## t = 10.067, df = 19.109, p-value = 4.466e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 15.07022 22.97794
## sample estimates:
## mean in group Ireland mean in group South Africa
## 73.01725 53.99317
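To pull individual numbers out of this result for a written conclusion, the test can be stored in an object. This is a minimal sketch; `life_test` and `mean_diff` are names introduced here, and the `%in%` filter is simply an equivalent way to select the two countries.

```r
library(gapminder)
library(dplyr)

mydat <- filter(gapminder, country %in% c("South Africa", "Ireland"))

# Store the Welch test result instead of only printing it
life_test <- t.test(lifeExp ~ country, data = mydat)

life_test$p.value    # p-value of the two-sided test
life_test$conf.int   # 95% CI for mean(Ireland) - mean(South Africa)

# Estimated difference in means, in the same order as the CI
mean_diff <- unname(life_test$estimate[1] - life_test$estimate[2])
mean_diff
```

In the output above, the interval from 15.07 to 22.98 years excludes 0 and the p-value is far below 0.05, so the difference in mean life expectancy between the two countries is statistically significant.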
In statistics, you deal with a lot of data, and the hard part is finding patterns in it. Several statistical tools can help identify these patterns, but before using any of them you should look for basic patterns first. As you learned, a scatter plot and a correlation coefficient are two simple ways to do so.
Let’s create a simple scatter plot:
plot(gapminder$lifeExp,gapminder$gdpPercap)
Adding a regression line to the scatter plot gives more information about the slope and direction of the relationship:
plot(gapminder$lifeExp,gapminder$gdpPercap)
abline(lm(gdpPercap~lifeExp,gapminder), col ="blue")
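The slope of that line can also be read off numerically. A small sketch, where `fit` is a name introduced here:

```r
library(gapminder)

# Same model as the one passed to abline() above
fit <- lm(gdpPercap ~ lifeExp, data = gapminder)

# Intercept and slope of the fitted line; a positive slope means
# GDP per capita tends to rise with life expectancy
coef(fit)
```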
What about creating a scatter plot with the ggplot2 library?
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
ggplot(gapminder,aes(x = lifeExp, y = gdpPercap)) +
geom_point() +
facet_wrap(~continent) +
geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'
Another example, with a smoothed trend line and labelled axes:
ggplot(
data = gapminder,
mapping = aes(x = lifeExp, y = gdpPercap)
) +
geom_point(alpha = .1) +
geom_smooth(se = FALSE) +
labs(
title = "Relationship between GDP per capita and Life Expectancy",
subtitle = "Gapminder dataset - Worldwide",
x = "Life expectancy (years)",
y = "GDP per capita (US$, inflation-adjusted)"
) +
theme_minimal()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Because both are continuous variables, we will use Pearson’s product-moment correlation to quantify the strength of the relationship between them. There are a few ways to do this in R, but we will only consider one method here.
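For intuition, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks that definition against `cor()` on two short made-up vectors (`x` and `y` are hypothetical, not Gapminder data).

```r
# Hypothetical data, for illustration only
x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.3)
y <- c(2.0, 2.9, 3.7, 5.1, 4.9, 6.8)

# Pearson's r from its definition: the sum of cross-products of deviations
# divided by the square root of the product of the sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in equivalent
r_builtin <- cor(x, y, method = "pearson")

all.equal(r_manual, r_builtin)
```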
Can you use the cor() function to calculate the correlation between the two variables?
# correlation between both variables
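A possible solution, assuming the two variables are `lifeExp` and `gdpPercap` as in the scatter plots (`r` is a name introduced here):

```r
library(gapminder)

# Pearson correlation is the default method of cor()
r <- cor(gapminder$lifeExp, gapminder$gdpPercap)
r
```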
Can you test the correlation between these two variables? Use cor.test():
# Calculating Pearson's product-moment correlation
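A possible solution, again assuming `lifeExp` and `gdpPercap` are the two variables of interest (`ct` is a name introduced here):

```r
library(gapminder)

# Tests H0: the true correlation is 0, using Pearson's method by default
ct <- cor.test(gapminder$lifeExp, gapminder$gdpPercap)

ct$estimate   # sample correlation coefficient
ct$p.value    # p-value of the two-sided test
ct$conf.int   # 95% confidence interval for the correlation
```

If the p-value is below the chosen significance level, the null hypothesis of zero correlation is rejected.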
This work is licensed under the CC BY-NC 4.0 Creative Commons License.