One of the things that blew my mind the first time I learned about it is bootstrapping. Although very counterintuitive at first glance, it makes perfect sense to estimate the accuracy of your sample estimates, given that you have a representative sample. In this blog post I explain how you can calculate confidence intervals for any difference in estimate between two samples, using the simpleboot R package.
First we load in the packages. I load in the simpleboot package for performing the two-sample bootstrap and I will use ggplot2 for demonstration.
library(simpleboot)
library(ggplot2)
Next, I generate two samples of data where we can perform the bootstrap on. I also plot their densities to show how these two samples differ from each other.
sample1 <- data.frame(sample = 'first sample', values = rnorm(1200,24,5))
sample2 <- data.frame(sample = 'second sample', values = rnorm(1600,20,5))
samples <- rbind(sample1,sample2)
ggplot(samples,aes(x = values, col = sample, fill = sample, group = sample)) + geom_density(alpha=0.2)
Next up, I bootstrap the mean difference of the two samples. I resample (with replacement of course) 2500 times.
bootstrapped <- two.boot(sample1$values, sample2$values, mean, 2500)
The bootstrapped object is a list. The most interesting elements are the initial sample mean (t0), the sample differences (t), the amount of resamples (R), the data, the seed, etc.
In the next part of the code, I turn the vector of sample differences into a data frame so that I can use it in ggplot. I plot the differences as a histogram and add three vertical lines: the initial sample mean and the confidence interval boundaries (p = 0.05).
bootstrapped_mean_diff <- data.frame(bootstrapped$t)
colnames(bootstrapped_mean_diff) <- 'mean_diffs'
ggplot(bootstrapped_mean_diff, aes(x=mean_diffs)) +
geom_histogram(bins=20, alpha=0.8) +
geom_vline(xintercept = bootstrapped$t0, size = 1.5) +
geom_vline(xintercept = quantile(bootstrapped_mean_diff$mean_diffs, 0.05), size = 1, linetype = 'dashed') +
geom_vline(xintercept = quantile(bootstrapped_mean_diff$mean_diffs, 0.95), size = 1, linetype = 'dashed') +
xlab('sample mean') + ylab('bootstrapped count')
Given this information we got from bootstrapping our samples, we can reject the null hypothesis that both means are the same. On a p = 0.05 level, we can see that both samples have a different mean.
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction into R, and less into the Data Science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!
Great success!
That is a very good tip especially to those fresh to the blogosphere. Short but very accurate info… Thank you for sharing this one. A must read article!
Thanks Calvin!
Hey there! I just wish to offer you a huge thumbs up for the great information you have got here on this post. I will be returning to your website for more soon.
Your article helped me a lot, is there any more related content? Thanks! https://accounts.binance.com/ph/register?ref=V3MG69RO
I loved as much as you will receive carried out right here.
The sketch is attractive, your authored material stylish.
nonetheless, you command get got an impatience over that you wish be delivering the following.
unwell unquestionably come further formerly again as
exactly the same nearly very often inside case you shield this hike.
my web page :: vpn special