Disclaimer: I know enough about statistics to know that I do not know anywhere near enough about statistics.

I've been trying to figure out how to do this for quite a while. I've done a lot of searching and come across some great questions on Cross Validated, but no great answers. Just pearls of wisdom like so:

"A simple way to do sample size calculation for any method (non-parametric or otherwise) is to use a bootstrap approach"

that left me going, "Great! But how?!"

I understood the principle of bootstrapping itself (it's just Monte Carlo, but with resampling), but I really couldn't translate that into figuring out a sample size.

After lots more searching I found this paper on Sample Size / Power Considerations [PDF], by Elizabeth Colantuoni, that rather marvellously had a bootstrap approach, using R, to figuring out sample size / power!

In Elizabeth's example the distribution isn't normal, but it is parametric because there is a regression model for the distribution. So it still threw me for a bit as to how I could use bootstrapping to determine sample size for non-normal and non-parametric data. After reading through her paper a few times I think I finally get it (see disclaimer). Since statistical power is "the probability that the test will reject the null hypothesis when the null hypothesis is false", I figure I can use any suitable statistical test within the bootstrap. I was interested in using non-parametric tests that compare medians between two groups, such as Mood's Median and Mann-Whitney-Wilcoxon (although I gather there are other ways to determine sample size for Mann-Whitney-Wilcoxon). In simpler terms, this meant that statistical power would be the proportion of bootstrap replicates in which I'd get P<0.05.

So using Elizabeth's R code as a basis I ended up with something like so:

#Non-normal Group 1
data1 <- c(24.0,4.0,5.0,4.0,2.5,2.0,10.0,6.0,2.0,14.0,6.5,3.0,6.0,3.0,4.0,5.0)
#Non-normal Group 2
data2 <- c(0,0,0,0,0,0,0,8,6,6,4,4,4,4)

#Below based on code from Elizabeth Colantuoni
#http://www.biostat.jhsph.edu/~ejohnson/regression/Sample%20Size%20Power%20Considerations.pdf
#Estimates power as the proportion of bootstrap replicates with p < 0.05.
#reps = number of bootstrap replicates, size = observations drawn per group.
power = function(sample1, sample2, reps=500, size=10) {
    results  <- sapply(1:reps, function(r) {
        #Resample each group independently, with replacement
        resample1 <- sample(sample1, size=size, replace=TRUE)
        resample2 <- sample(sample2, size=size, replace=TRUE)
        #Unpaired test = Mann-Whitney-Wilcoxon (rank sum) test
        test <- wilcox.test(resample1, resample2, paired=FALSE)
        test$p.value
    })
    #Proportion of replicates that reject the null at the 0.05 level
    sum(results<0.05)/reps
}

#Find power for a sample size of 100 per group
power(data1, data2, reps=1000, size=100)

The slight difference I have is that I'm resampling twice (once from each group), with the same sample size used for both groups.
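The function above gives the power for one sample size; to actually pick a sample size you can repeat it over a range of candidate sizes and take the smallest one that reaches the power you want. Below is a minimal sketch of that idea, not part of Elizabeth's paper. The helper name `boot_power`, the grid of sizes, the 0.80 target and the seed are all my own assumptions for illustration, and I pass `exact = FALSE` only to avoid the ties warning that `wilcox.test` gives on small resamples.

```r
# Pilot data copied from the example above
data1 <- c(24.0,4.0,5.0,4.0,2.5,2.0,10.0,6.0,2.0,14.0,6.5,3.0,6.0,3.0,4.0,5.0)
data2 <- c(0,0,0,0,0,0,0,8,6,6,4,4,4,4)

# Hypothetical helper: bootstrap power for one per-group sample size
boot_power <- function(sample1, sample2, reps = 500, size = 10, alpha = 0.05) {
    p_values <- replicate(reps, {
        resample1 <- sample(sample1, size = size, replace = TRUE)
        resample2 <- sample(sample2, size = size, replace = TRUE)
        # exact = FALSE: resamples will usually contain ties, so ask for the
        # normal approximation directly instead of triggering a warning
        wilcox.test(resample1, resample2, paired = FALSE, exact = FALSE)$p.value
    })
    mean(p_values < alpha)  # proportion of replicates that reject the null
}

set.seed(1)                    # arbitrary seed, only for reproducibility
sizes  <- seq(5, 50, by = 5)   # assumed grid of candidate per-group sizes
powers <- sapply(sizes, function(n) boot_power(data1, data2, reps = 500, size = n))

target <- 0.80                 # assumed target power
print(data.frame(size = sizes, power = powers))

# Smallest candidate size whose estimated power reaches the target
# (NA if none of the candidates gets there)
sizes[which(powers >= target)[1]]
```

The power estimates are themselves noisy (each is based on `reps` replicates), so near the target it is worth increasing `reps` or using a finer grid before settling on a number.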


For a while I was uncertain of a couple of things when using a bootstrap approach to determine sample size:

  1. Why isn't the difference / effect size taken into account?
  2. What if the null hypothesis isn't actually false?

For 1, I mean that if you are determining sample size / power for a parametric test you need to specify the effect size, but here, using the bootstrap, we have not needed to specify it explicitly. For parametric tests the type of test, power, sample size, effect size and significance level (alpha / Type I error) are all interlinked: to determine the power, the other four must be specified. The test type and effect size describe the distributions of the two groups being compared. In the bootstrap we don't need to describe these, because by resampling we use the observed distributions directly, so the effect size is whatever difference the existing data happen to show. (Alpha we still specify: it is the 0.05 threshold applied to the p-values when counting the results.)
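To see what effect size the bootstrap is implicitly working with, you can summarise the two observed samples directly. This is only a sketch: the two summaries I chose (difference in medians, and the proportion of pairs where group 1 beats group 2 with ties counted as half) are my own choice of illustration, and the variable names are made up.

```r
# Pilot data copied from the example above
data1 <- c(24.0,4.0,5.0,4.0,2.5,2.0,10.0,6.0,2.0,14.0,6.5,3.0,6.0,3.0,4.0,5.0)
data2 <- c(0,0,0,0,0,0,0,8,6,6,4,4,4,4)

# Observed difference in medians between the two groups
median_diff <- median(data1) - median(data2)

# All pairwise differences (one row per data1 value, one column per data2 value)
diffs <- outer(data1, data2, "-")

# Proportion of pairs where group 1 is larger, counting ties as half.
# A value of 0.5 would mean neither group tends to be larger.
prob_superiority <- mean(diffs > 0) + 0.5 * mean(diffs == 0)

print(median_diff)
print(prob_superiority)
```

If the real difference between the groups is smaller than what these samples show, the power estimated from them will be too optimistic.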

For 2, what I mean is: if the definition of statistical power is "the probability that the test will reject the null hypothesis when the null hypothesis is false", what if the null hypothesis is actually true and we are committing a Type I error by rejecting it? Here I was having a complete brain freeze! This is exactly what the acceptable alpha (Type I error) level of 0.05 is for: it limits how often that happens. Also, for what it is worth, at this stage of the game we aren't actually using the test to make a decision about the difference in medians between the groups; we are only interested in finding the sample size that will enable us to make that decision later.
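One way to convince yourself of this is to run the same bootstrap in a situation where the null hypothesis is true by construction: draw both resamples from the pooled data, so there is no real difference between the "groups". The rejection rate should then land somewhere near alpha rather than near the power. This is a sketch under my own assumptions (pooling as the way to simulate the null, 2000 replicates, 30 per group, an arbitrary seed, and `exact = FALSE` to avoid ties warnings).

```r
# Pilot data copied from the example above
data1 <- c(24.0,4.0,5.0,4.0,2.5,2.0,10.0,6.0,2.0,14.0,6.5,3.0,6.0,3.0,4.0,5.0)
data2 <- c(0,0,0,0,0,0,0,8,6,6,4,4,4,4)

# Pool the groups so both resamples come from the same distribution,
# i.e. the null hypothesis of "no difference" holds by construction
pooled <- c(data1, data2)

set.seed(1)    # arbitrary seed, only for reproducibility
reps <- 2000   # assumed number of replicates
size <- 30     # assumed per-group sample size
alpha <- 0.05

p_values <- replicate(reps, {
    a <- sample(pooled, size = size, replace = TRUE)
    b <- sample(pooled, size = size, replace = TRUE)
    wilcox.test(a, b, paired = FALSE, exact = FALSE)$p.value
})

# Proportion of false rejections; expected to be close to alpha
type1_rate <- mean(p_values < alpha)
print(type1_rate)
```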