The trouble with science is that you need to do things properly. I’m working on a paper at the moment where we measured some phase diagrams. We’ve known what the results are for ages now, but because we have to do it properly we have to quantify how certain we are. Yes, that’s right. ERRORS!

I’ve come on a long way with statistics, I’ve learned to love them, I
defy anyone to truly love errors. However, I took a step closer this
month after discovering *bootstrapping*. It’s a name that has long
confused me, I seem to see it everywhere. It comes from the phrase “to
pull yourself up by your boot straps”. My old
friend says it’s “a
self-sustaining process that proceeds without external help”. We’ll see
why that’s relevant in a moment.

**Doing errors “properly”**

Calculating errors properly is often a daunting task. You can spend
thousands on the software, many people make careers out of it. This will
often involve creating a statistical model and all sorts of clever
stuff. I really don’t have much of a clue about this and, to be honest,
I just want a reasonable error bar that doesn’t undersell, or oversell,
my data. Also, in my case, I have to do quite a bit of arithmetic
gymnastics to convert my raw data into a final number so knowing where
to start with models is beyond me.

**Bootstrapping**

I think this is best introduced with an example. Suppose we have
measured the heights of ten women and we want to make an estimate of the
average height of the population. For the sake of argument our numbers
are:

135.8 | 145.0 | 160.2 | 160.9 | 145.6 |

156.3 | 170.5 | 192.7 | 174.3 | 138.2 |

in cm

The mean is 157.95cm, the standard deviation is 16.88cm. Suppose we don’t have anything except these numbers. We don’t necessarily want to assume a particular model (Normal distribution in this case), we just want to do the best with what we have.

The key step with bootstrapping is to make a new “fake” data set by
randomly selecting from the original (allowing duplicates). If the
measurements are all independent and randomly distributed etc, then the
fake data set can be thought of as an alternate version of the data. It
is a data set that you *could* have taken the first time if you’d
happened to get a different sample of people. Each fake set is thought
equally likely. So let’s make a fake set:

156.3 | 192.7 | 160.9 | 135.8 | 135.8 |

156.3 | 156.3 | 170.5 | 156.3 | 192.7 |

Mean=161.36cm, standard deviation = 18.5935

As you can see, there’s quite a bit of replication of data. For larger sets it doesn’t look quite so weird. On average you keep about 60% of the original data and the rest is replicated. Now let’s do this again lots and lots of times (say 10000) using different fake data sets each time, generating different means and standard deviations. We can make a histogram

From this distribution we can estimate the error on the mean to whatever confidence interval we like. If it’s 67% (+/- sigma) then we can say that the error on the mean is +/-5.2cm. Incidentally that’s nearly what we’d get if we’d assumed a normal distribution and done 16.88/sqrt(10). Strangely the mean of the means is not 157.95 as the input data was, but 160.2. This is interesting because I drew the example data from a normal distribution centred at 160cm.

We can also plot the bootstrapped standard deviation.

What’s interesting about this is that the average is std=15.2 whereas the actual standard deviation that I used for the data was 19.5. I guess this is an artefact of the tiny data set. That said 19.5 looks within “error”.

So, without making any assumptions about the model we’ve got a way of getting an uncertainty in measurements where all we have is the raw data. This is where the term bootstrap comes in; the error calculation was a completely internal process. If it all seems a bit too good to be true then you’re not alone. It took statisticians a while to accept bootstrapping and I’m sure it’s not always appropriate. For me it’s all I’ve got and it’s relatively easy.

To make these figures I used a python code that you can get here. Data here.

Update: It’s been pointed out to me that working out the error on the standard deviation is a bit dodgy. I think that the distribution is interesting - “what standard deviations could I have measured in a sample of 10?” - but perhaps one should be a little careful extrapolating to the population values. Like I said, I’m not a statistician!