You have probably been seeing a lot of polls lately. They all end by saying something like “we sampled 1,000 people and our margin of error is 3.8% or 4.5%” or some other weird percentage. I thought I would take some time to explain where this number comes from and what it means. I want to start by saying that the technical term isn’t “margin of error” but rather “margin of sampling error,” and the key words are “sampling error.” Although it may not seem directly connected to the pandemic, sampling error is a fundamental concept in statistics that must come up when trying to estimate Covid 19 prevalence, for example. This post won’t get too much into the math, but math eventually rears its head when discussing sampling error, so I will have some future posts that are a little more math-centric.
Anyway, statisticians like to talk about a “population” – that’s what you are trying to understand by taking a “sample.” We can’t test everybody in the United States for the antibodies to the virus that causes Covid, so we test a “sample”. From the results of that sample we try to estimate the “true” result – the actual number of people that have caught the disease. For example, suppose we find that 10% of our sample test positive and we “jump” to the conclusion that, heck, probably 10% of the whole population is positive. Are we really jumping?
The answer depends on how the sample was taken! But if it was taken “randomly” (and I will have to have a post on just what that means, since it’s actually a tricky concept), the answer is “probably not, we are really not jumping to conclusions,” and this is true even if the sample seems very small compared to the actual population size. And yes, it seems like magic that a sample size of 1,000 or so allows one to make reasoned judgements about populations in the hundreds of millions, i.e. that you can poll 1,000 people and make reasoned judgements about the opinions or the health of the 210 million adults in the United States.
But it is true. A fundamental result (perhaps the fundamental result) in statistics says that the results from relatively small random samples come pretty close to the true result for the whole population under some pretty general and very reasonable assumptions. And this is true no matter how large the population you are sampling. And yes, I’ll repeat it: it does seem like magic that what matters is the sample size itself rather than, say, the size of the sample relative to the size of the population.
In fact, if you take a “random” sample of about 1,000 people from the adult population of the United States (about 210 million people), the chance of being off by more than 3 percentage points in either direction is, roughly speaking, 1 in 20. Go to about 2,400 people and then 19 times out of 20 you are within about 2% of the correct answer; to get within about 1% you need roughly 9,600 people. All this means is that if you have the time and money to increase the sample size to what still seems ridiculously small relative to the population size, you can make the chance of being wrong ridiculously small too. So I hope you can see why sampling can be so powerful in determining the hidden occurrence of Covid 19 infections, for example, and why polling, if done properly, can work.
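If you would rather see this than take my word for it, here is a small simulation sketch in Python. It is my own illustration, not how any polling firm works: the function name `fraction_of_polls_off`, the 50-50 split, and the number of simulated polls are all assumptions chosen for the example. It pretends the population is so large that each person polled is an independent coin flip, runs many polls of 1,000 people, and counts how often a poll misses the true value by more than 3 percentage points.

```python
import random


def fraction_of_polls_off(sample_size, true_share, max_miss, num_polls, seed=0):
    """Simulate many random polls and return the fraction that miss badly.

    Assumption for this sketch: the population is large enough that each
    respondent is an independent draw that says "yes" with probability
    true_share.
    """
    rng = random.Random(seed)
    misses = 0
    for _ in range(num_polls):
        # Count the "yes" answers in one simulated poll.
        yes_count = sum(rng.random() < true_share for _ in range(sample_size))
        # Work in head counts: is this poll off by more than max_miss?
        if abs(yes_count - true_share * sample_size) > max_miss * sample_size:
            misses += 1
    return misses / num_polls


if __name__ == "__main__":
    # 2,000 simulated polls of 1,000 people each from an evenly split population.
    share_off = fraction_of_polls_off(1000, 0.5, 0.03, 2000)
    print(f"Polls off by more than 3 points: {share_off:.1%}")
```

The printed figure should land in the neighborhood of 1 in 20, though the exact value depends on the random seed.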
OK, as a mathematician I need to say this: mathematics isn’t magic, it just seems that way sometimes. And for what it is worth, if I had to pick a single result in all of mathematics where any mathematician can understand relatively easily why it is true and yet still have trouble believing it, it would be this one.
But I need to reiterate that when looking at the results of any survey: (a) you need to be sure they did a random, unbiased sample, and (b) even if they did that, you need to keep in mind that almost all reported sampling results use a 1/20 chance of being off by more than their “margin of error.” Finally, (c) it’s worth keeping in mind that if they did a random, unbiased sample, there is only a very, very small chance of them being off by twice their margin of error.
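To put a rough number on “very, very small,” here is a sketch that assumes the poll’s error follows the usual bell curve (the normal distribution, which is the standard approximation for random samples of this size) and that the reported margin of error corresponds to 1.96 standard errors, the conventional 19-out-of-20 level. The function name `chance_of_missing_by` is just mine for illustration.

```python
import math


def chance_of_missing_by(multiple_of_margin, z_for_margin=1.96):
    """Chance that a poll is off by more than a multiple of its margin of error.

    Assumptions for this sketch: the sampling error is approximately normal,
    and the margin of error equals z_for_margin standard errors.
    """
    z = multiple_of_margin * z_for_margin
    # Two-sided tail probability of a standard normal: P(|Z| > z).
    return math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    for multiple in (1, 2):
        chance = chance_of_missing_by(multiple)
        print(f"Off by more than {multiple}x the margin: about 1 in {1 / chance:,.0f}")
```

Under those assumptions, missing by more than the margin itself happens about 1 time in 20, while missing by more than twice the margin happens on the order of once in ten thousand polls.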
(Technical note: These calculations assume you are looking at a more or less equally split population. The numbers needed would change slightly if you were sampling a population with a more extreme split, such as 75%-25%. But, roughly speaking, the error is inversely proportional to the square root of the sample size, and the population size doesn’t figure into it!)
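For the curious, the standard textbook formula behind numbers like these, under the usual normal approximation, says the margin of error is about 1.96 times the square root of p(1 - p)/n, where p is the true split and n is the sample size. A minimal sketch follows; the function name `margin_of_error` and the example sample sizes are my own choices for illustration. Notice that the population size does not appear anywhere.

```python
import math


def margin_of_error(sample_size, split=0.5, z=1.96):
    """Approximate 19-times-out-of-20 margin of sampling error.

    Assumptions for this sketch: a simple random sample from a population
    much larger than the sample, and the normal approximation (z = 1.96).
    """
    return z * math.sqrt(split * (1 - split) / sample_size)


if __name__ == "__main__":
    for n, split in [(1000, 0.5), (1000, 0.75), (4000, 0.5)]:
        print(f"n = {n:>5}, split = {split:.0%}: +/- {margin_of_error(n, split):.1%}")
```

Quadrupling the sample from 1,000 to 4,000 cuts the margin in half, which is what the square root is telling you, and the 75-25 split gives a somewhat smaller margin than the 50-50 split at the same sample size.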
So stay tuned for more posts that go deeper into the magic and mystery of how sampling works!