A very, very, very fast-spreading disease whose severe outcomes take longer to develop – what happens to reported rates of hospitalization?

As a precaution against what will surely be misreported by the press, here is a thought experiment with some numbers.

The UK is reporting a doubling time of 2 days. Suppose there are 100 Omicron cases now; then in 2 weeks we would have 12,800 people testing positive on a given day if everyone was tested. (Seven doublings means multiplying by 2*2*2*2*2*2*2 = 2^7 = 128.) Suppose, as may very well be true, that Omicron is substantially less virulent than Delta. Let’s assume a hospitalization rate of, say, 1% instead of Delta’s 2.3%, with hospitalization happening 2 weeks after a positive test. (Hospitalizations lag positive tests by about 2 weeks for all versions of Covid so far.) Then, from our initial 100 cases, there would be only one hospitalization in the two weeks that follow, because 1% of 100 is 1!

But what does this mean? Should we say that Omicron doesn’t seem to result in any significant hospitalizations? Of course not, the signal is hidden in the noise. But I betcha that we can expect some of the more innumerate press to say that Omicron has “a vanishingly small chance of being hospitalized”. After all, 1/12,800 is pretty darn small – it just isn’t what is going on. That one hospitalization comes from the 100 people who tested positive two weeks ago, not from the 12,800 who tested positive today, so the rate that matters is still 1/100.
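Here is a minimal sketch of the thought experiment in Python. The variable and function names are mine, and the 1% rate, 2-day doubling time, and 14-day lag are the assumptions stated above, not measured values. It compares the misleading ratio (today’s hospitalizations over today’s cases) with the ratio against the right group of people (the cases from 14 days earlier).

```python
# Thought experiment: cases double every 2 days, and
# hospitalizations lag positive tests by 14 days.
DOUBLING_TIME_DAYS = 2
LAG_DAYS = 14
HOSPITALIZATION_RATE = 0.01  # assumed 1%, as in the thought experiment
INITIAL_CASES = 100


def daily_cases(day: int) -> float:
    """People testing positive on `day` under pure exponential growth."""
    return INITIAL_CASES * 2 ** (day / DOUBLING_TIME_DAYS)


def daily_hospitalizations(day: int) -> float:
    """Admissions on `day` come from people who tested positive LAG_DAYS earlier."""
    return HOSPITALIZATION_RATE * daily_cases(day - LAG_DAYS)


day = 14
cases_today = daily_cases(day)
hospitalized_today = daily_hospitalizations(day)

# Wrong denominator: today's cases have not had time to get sick yet.
naive_rate = hospitalized_today / cases_today
# Right denominator: the people who tested positive 14 days ago.
lagged_rate = hospitalized_today / daily_cases(day - LAG_DAYS)

print(f"cases today: {cases_today:,.0f}")
print(f"naive rate:  {naive_rate:.5%}")
print(f"lagged rate: {lagged_rate:.2%}")
```

The naive rate comes out at 1 in 12,800 even though the true rate built into the sketch never moved from 1%.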

The Base Rate Fallacy: X% of new Covid cases are among the vaccinated is a BS statement

Suppose you see a headline that says something like: “50% of our 100,000 new Covid cases were among the vaccinated”? Should you be concerned that the vaccine isn’t working anymore? The answer is: absolutely not – well, not without a lot more information. This statement is an example of using numbers to confuse rather than illuminate. And the best way to understand that this is almost certainly a totally meaningless statistic,  perhaps even rising to the level of complete BS,  is to use a technique I’ve explained before – think about what a statement would mean at extremes. 

So here is an extreme situation to use to think about this statement. Imagine someone is publishing this “statistic” about a place where, say, 99% of a 20,000,000 population were vaccinated and yes they had 100,000 new cases they were reporting on. The “statistic” in our headline is saying that, of the 100,000 new cases, 50,000 of them were in the vaccinated population (50% of 100,000 cases), and so 50,000 were in unvaccinated people.

So, first off, we can calculate the total number of unvaccinated people is 200,000 (1% of 20mil) and 19,800,000 people were vaccinated (our 99% vaccinated rate => .99*20mil vaccinated people). Then, the odds of getting sick if you are vaccinated is:

50,000/19,800,000 or about .0025 =1/4%

I.e. really low. But if you are unvaccinated the odds are:

50,000/200,000 = 25% 

or 100X greater and really really high. 

Thus, for this hypothetical example, you would know that someone is deliberately trying to confuse you or is simply unaware of the effect choosing the wrong size for the bottom of the fraction (the denominator) has on percentages!  

If the denominator you chose in a calculation is the wrong one, you have fallen victim to what is called “the base rate fallacy.” In this case, the “statistic” used the total number of cases of Covid (100,000) as the denominator, not the total number of vaccinated people (19,800,000). You simply cannot divide by 100,000 to find out the odds of getting the disease if you are vaccinated, because your “population” of vaccinated people is 19,800,000, not 100,000. And when you divide by 100,000 when you are supposed to divide by 19,800,000, you get 50% instead of about 1/4%: you are off by about 200 fold!
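The arithmetic for this hypothetical place can be sketched in a few lines of Python. The variable names are mine and every number is the made-up one from the example, not real data.

```python
# Hypothetical numbers from the example, not real data.
population = 20_000_000
vaccinated_share = 0.99
new_cases = 100_000
share_of_cases_in_vaccinated = 0.50

vaccinated = round(population * vaccinated_share)
unvaccinated = population - vaccinated

cases_vaccinated = round(new_cases * share_of_cases_in_vaccinated)
cases_unvaccinated = new_cases - cases_vaccinated

# Right denominators: the size of each group, not the number of cases.
risk_vaccinated = cases_vaccinated / vaccinated
risk_unvaccinated = cases_unvaccinated / unvaccinated
relative_risk = risk_unvaccinated / risk_vaccinated

print(f"risk if vaccinated:   {risk_vaccinated:.2%}")
print(f"risk if unvaccinated: {risk_unvaccinated:.2%}")
print(f"unvaccinated are about {relative_risk:.0f}x more likely to be a case")
```

Change `vaccinated_share` to something like 0.5 and the same “50% of cases” headline means the vaccine is doing nothing, which is exactly why the headline by itself tells you so little.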

Base rate fallacies come up all the time in thinking about medical statistics. They are, for example, at the root of the “paradox of the false positive,” which I talked about before (https://garycornell.com/2020/05/28/testing-4-i-tested-positive-do-i-really-have-the-disease/). Recall that having a positive test result for a disease isn’t enough data to make a decision – you need to know how rare the disease is in a population.

To sum up:

Any statement about “odds” or “probability” is meaningful only when you know they have used the right size of the sample to divide by. Denominators matter!

When will we get to herd immunity?

I haven’t written about the pandemic in a while because, well, we have vaccines that work pretty damn well – even against the incredibly contagious delta variant. People just need to get vaccinated. I could do a post every day that just repeats that 500 times I suppose.

But I was talking to someone and they asked just how bad the delta variant could be for the United States. First off, what is absolutely clear is that:

Delta is so contagious that, until we get to herd immunity, if you don’t have some sort of immunity or don’t take strong precautions, i.e. N95 masks, social distancing, you will catch it. 

So the most important question is when we will reach herd immunity? That’s actually not an easy question to answer and what answer you get depends on the model you use for herd immunity. And all the models depend on questions we don’t yet have complete answers to,  for example: how rare is it that a vaccinated person gets reinfected and if they do get reinfected, how likely will it be that they can transmit it? Similarly, if a person already had a version of Covid and isn’t vaccinated, how likely are they to get reinfected and then transmit delta?  And you can also ask: how likely is a child under 12 who catches delta to transmit the virus etc. The point is, the number of groups you can use and how they transmit delta in your model can grow, and then the model becomes very complicated. At that point, large-scale computer simulations are often the best way to get an answer for your model. 

But there is some reason to believe that the naive model for calculating herd immunity I discussed here https://garycornell.com/2020/10/27/herd-immunity-1/ using an R0 of roughly 7 (https://www.thelancet.com/pdfs/journals/lanres/PIIS2213-2600(21)00328-3.pdf) will work pretty well. One can see for example how well it matches up with the Institute for Health Metrics projections which are based on very sophisticated computer simulations. 

Using the model I described in my blog and an R0 of 7, we need 85.71% (1-1/7) of the population to be immune – to not be transmitting the virus to other people – before herd immunity kicks in. So when will 85.71% of the population not be transmitting the virus?
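For anyone who wants to try other values of R0, here is a tiny Python sketch of the naive threshold 1-1/R0. The function name is mine, and treating any R0 of 1 or less as needing no immunity at all is my simplification.

```python
def herd_immunity_threshold(r0: float) -> float:
    """Naive herd immunity threshold: the share of people who must be immune."""
    if r0 <= 1:
        # An outbreak with R0 <= 1 dies out without any immunity (simplification).
        return 0.0
    return 1 - 1 / r0


threshold = herd_immunity_threshold(7)
print(f"{threshold:.2%}")  # 85.71%
```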

I want to explain how one might get a handle on this number in the rest of this blog. I’m going to make the following simplifying assumption:

  • I will assume that herd immunity happens when 85.71% of the population over 12 is vaccinated or has had a version of Covid.

This assumes that vaccinated people and people who have had Covid are not contributing significantly to the transmission of delta, and that transmission from children under 12 also isn’t significant enough to block herd immunity. If these assumptions are false and people in these groups do contribute to transmission significantly, it will make herd immunity happen much later, but based on what I have read so far, current thinking seems to be that this is unlikely.

There are about 329 million people in the United States and about 280 million of them are over the age of 12.

So since 85.71% of the 280 million people over 12 is about 240 million (.8571*280mil = 239,988,000), we have to find out when 240 million people are not transmitting it to the remaining 40 million people over 12.

According to the CDC, as I write this, about 185 million people 12 and older have received at least one shot and about 161 million are fully vaccinated. I’m going to assume therefore that we can take 185 million people out of the equation. Let’s call these people category “A”. Category A lets us remove a lot of people from our 240 million goal – if only it were more.

Our goal shrinks to:

240mil – A = 240mil – 185mil = 55mil

So we are down to a goal of 55 million more people being or becoming immune before we get to herd immunity.

This 55 million people goal is made up of two groups in our model. Those who have already had Covid and those who are vulnerable and will get it in the months to come. 

To analyze this number, we need to first figure out how many people have gotten Covid and aren’t vaccinated. Let’s call the number of people that are 12 and older, aren’t vaccinated, but have been infected by Covid, B.  This means the number of people who will get sick going forward before we get to our  goal of herd immunity is:

55mill – B

 Let’s call this number “V” for vulnerable.

V = 55mill – B

Now we get to the joys of modeling. I have searched for good information on how many people are in group B (have gotten Covid but aren’t vaccinated), but have come up short. There just doesn’t seem to be any good numbers on the size of group B. 

But all is not lost: there are good estimates on the total number of people who have been infected by Covid, we just don’t know how to distribute them between groups A (vaccinated) and B (unvaccinated). 

The best estimates I have seen are that between two and three times the number of people who have tested positive (roughly 33mil) actually have had COVID. This means it is reasonable to assume between 67 and 100 million people in the United States have had a version of Covid. Let’s be as optimistic as possible and assume that 100 million people over the age of 12 have had some version of Covid. 

But we still don’t know how to split these 100 million people between groups A and B. We have to do this because if they are in group A we have already removed them from the equation, and we don’t want to count them twice! How do we proceed? Here’s what we are assuming:

  • The “odds” of having been infected with Covid if you are over 12 is 

100mil/280mill = .357 

So, of the 185 million people in our vaccinated group A, we will assume 35.7% of them have already had Covid:

.357*A = .357*185mil  = about 66 million people in group A have had Covid

(Yes, I know that people in group A probably took better precautions, or got vaccinated before they could catch Covid, so their infection rates are lower than group B’s, but you can change this number to take this into account if you want.)

The rest of these 100 million people are exactly the people who have had Covid but aren’t vaccinated i.e. group B. So 

B = 100mil – 66mil = 34mil

This means that, with our assumptions, group B has about 34 million people!

So now let’s calculate V – the people who will get sick from delta before we get to herd immunity with our assumptions. In our model, V is equal to:

V = 240mil – A – B

or

V = 240mil – 185mil – 34mil

so

V = 55mil – 34mil = 21mil

Our model predicts 21 million more people over 12 will get delta before we get to herd immunity! 
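If you want to plug in your own assumptions, the whole back-of-envelope model fits in a few lines of Python. This is only a sketch: the variable names are mine, all quantities are in millions, and it uses the exact 1-1/7 rather than the rounded .8571, so the results differ from the hand calculation by small rounding amounts.

```python
# All quantities in millions of people; the inputs are the assumptions made above.
population_over_12 = 280.0
r0 = 7
vaccinated_A = 185.0      # at least one shot
ever_infected = 100.0     # optimistic estimate of people over 12 who have had Covid

goal = (1 - 1 / r0) * population_over_12             # people who must be immune
infected_share = ever_infected / population_over_12  # "odds" of having had Covid
infected_in_A = infected_share * vaccinated_A        # already counted in group A
B = ever_infected - infected_in_A                    # had Covid, not vaccinated
V = goal - vaccinated_A - B                          # still to be infected

print(f"goal: {goal:.0f}mil, B: {B:.0f}mil, V: {V:.0f}mil")
```

To take into account that vaccinated people probably had lower infection rates, lower `infected_share` for group A and watch B grow and V shrink.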

Let’s check our simple model against the very sophisticated Institute for Health Metrics (IHME) model which goes until November 2021 (https://covid19.healthdata.org/global). Their model predicts there will be another 50,000 deaths by November 1st and since the death rate is roughly 1/10 of the hospitalization rate, their projections imply 500,000 or so hospitalizations (https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2778237) by November 1. 

Now compare this to the result of our analysis. What we got was that 21 million more people who have no immunity will get sick from delta before we get to herd immunity. This implies that there will be roughly 630,000 hospitalizations from delta (best knowledge is that 3% of those infected are hospitalized) and 63,000 more deaths before herd immunity kicks in (using the current knowledge that deaths run at roughly 1/10 of hospitalizations). These numbers are consistent with the IHME numbers, which probably means our model, simple though it may be, is realistic. Also, if you believe the IHME model and this analysis, we probably won’t get to herd immunity until the beginning of 2022. And, alas, if our analysis is right:

hospitals in areas with low vaccination rates i.e. where most of the people in group V live, will not just be overwhelmed by sick patients, they will break completely under the burden. 

Feel free to make your own assumptions in this model and change the values for the variables accordingly. But I believe this isn’t a bad model and it gives a good picture of how &^%$ things are going to get in the United States because of the number of people over 12 who are unvaccinated.

Move to a single dose now!

I get so mad when the people in charge don’t seem to do the obvious logical reasoning from the facts. But it is actually often even worse than that. Too often, even if they do know what logic requires, they won’t follow through on the conclusions that those facts and logic imply. For the latest case in point, consider the following:

  1. We now have some really good evidence that a single dose of the mRNA vaccines conveys really good immunity (https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(21)00448-7/fulltext) (even if it may be short-lived unless people get a booster shot).
  2. The more people get some protection, the fewer mutations escape into the wild. (Common sense but always worth quoting Fauci: “Viruses don’t mutate unless they replicate. And if you can suppress that by a very good vaccine campaign, then you could actually avoid this deleterious effect that you might get from the mutations.”)
  3. We will have a lot more doses in a few more months. Actually, we will have more than enough mRNA doses to vaccinate everyone in the United States (and then some) by the end of the summer or early fall with massive increases coming starting in May. (And that doesn’t count new vaccines coming down the pike like Novavax’s!)

Ergo, why the ^%&$ aren’t we moving to a single dose of the mRNA vaccines now, saving the second dose for when we have no supply constraints and, of course, completely stopping the idiotic reserving of any second doses if that is still being done?

Reserve the J&J Vaccine for people who are more likely to engage in high risk activities

We now have a one-dose vaccine that we are confident is reasonably effective on younger people while, as usual, being somewhat less confident that it is effective on older people. Moreover, not only does the J&J vaccine require only one dose, it has no fancy requirements for transporting it. Even more, reasoning by analogy with the similar Oxford/AstraZeneca vaccine, it may reduce transmission significantly. So what should we do with it?

Here’s a modest proposal and before you dismiss it as crazy, note that it is actually backed by some really interesting mathematical models of disease transmission and some really good empirical evidence on contacts between groups as well as data on how Covid-19 is transmitted by different age groups.

Use the J&J vaccine to vaccinate people more likely to engage in riskier behavior or less likely to get a second vaccine and start doing this immediately once the J&J vaccine is approved. More generally, reserve it for people under 50.

Here’s one way to implement this idea: every day pick a city with a reasonably large airport. Show up at the airport with a “swat” team and vaccinate everyone who passes through it with the J&J vaccine. Do it for the bus and train terminal in the same city on the same day if there is one. Extend the idea by showing up in front of bars and restaurants if they are open in that city and offer to vaccinate everyone in that bar or restaurant. Do this by mobilizing the national guard and the commissioned corps of the public health service – think of it as analogous to a military mission to “secure” a city.

Picking the city at random helps a little to prevent people from gaming the system I suppose. But that isn’t really the point. We shouldn’t care if people game the system. There are actually two points to keep in mind. The first point is that since the J&J vaccine doesn’t require any fancy storage capabilities, it’s certainly practical. The needed logistics are well within the capabilities of the national guard and the public health service. 

The second point is the key though – it’s because it is the best way to break the back of the pandemic by greatly reducing transmission rates. Why? Because some really good models of disease transmission predict that lowering the infection rates among people who are more likely to transmit the disease is the best way to break the back of a pandemic! If you think about it for a second, you probably don’t need any fancy mathematics: this clearly works by lowering transmission rates quickly. So, yes, I really am advocating giving the people likely to engage in risky – even stupid – behavior the J&J vaccine and not giving it to people who might be at higher risk.  Save the Moderna and Pfizer vaccines for high-risk people of course but don’t give them the J&J vaccine even if it takes longer to vaccinate the high risk population as a result of this choice.

So why is this a really good idea from the point of view of turning the pandemic around in the quickest possible way? Well, it is certainly reasonable to conjecture these kinds of people are less likely to show up for the second dose of the Moderna or Pfizer vaccine, but that isn’t actually the reason to act quickly to vaccinate such people with the one-dose J&J vaccine. The real reason to vaccinate them goes back to mathematical models that were developed around 12 years ago. One of the best was done by Jan Medlock and Alison P. Galvani and was published in Science (Science 25 Sep 2009: Vol. 325, Issue 5948, pp. 1705-1708, DOI: 10.1126/science.1175570), but a far better treatment of their ideas may be found in Medlock’s powerpoint presentation here: http://people.oregonstate.edu/~medlockj/other/flu.pdf. (You do need some knowledge of differential equations though.) And, in case you are wondering if this mathematical treatment leads to a result that is way too theoretical and not backed by “real” evidence, Mossong et al. (https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0050074) showed “a consistent pattern of contact frequency by age, with a gradual rise in the number of contacts in children, a peak among 10- to 19-y-olds, followed by a fall to a lower plateau in adults until the age of 50 and a sharp decrease after that age.” And, while I suppose one can argue that Mossong et al. is too old to trust fully, a recent paper (https://science.sciencemag.org/content/sci/early/2021/02/01/science.abe8372.full.pdf) showed that 65% of Covid-19 infections came from people between the ages of 20-49 and concluded that “Targeting interventions – including transmission-blocking vaccines – to adults aged 20-49 is an important consideration in halting resurgent epidemics and preventing COVID-19-attributable deaths.”

So let’s start by using the easily administered, one-dose J&J vaccine on the people most likely to spread the disease, and more generally people who are less than 50, while reserving the Moderna and Pfizer vaccines for people at higher risk! (If we ever run through those people, we can use the J&J vaccine for people who already have been infected by Covid of course: https://www.medrxiv.org/content/10.1101/2021.02.05.21251182v1.)

Bad %^$# happens a lot i.e. why a “vaccine side effect” probably isn’t one

As I write this more than the equivalent of a 9/11 catastrophe happened yesterday (12/10). It’s horrible and it’s going to get worse. The head of the CDC predicts this level of deaths for the next 60-90 days and even the conservative model used by the IHME says it is likely we will have more than 500,000 deaths in the United States by April 1. 

And yet, there is light at the end of this dark, dark tunnel: we are about to roll out a massive vaccination effort based on what can only be described as one of the greatest triumphs of modern science. We have two vaccines that are based on a new technique that will be applicable to many viruses, not just SARS-CoV-2. If widely adopted, these (and other vaccines that are coming soon) will stop the horror. Granted not fast enough, but it will happen and could (should?) be completed by the end of the 3rd or very early in the 4th quarter of 2021.

Unless people don’t get it. 

The problem is that surveys show many people will be reluctant to get the vaccine. Yes, it seems that the vaccine will probably make you feel awfully crappy for 48 hours after the second shot, but that isn’t the only problem. People worry about really bad things happening because of the vaccine, and it is hard for people to understand that random ^%$& happens a lot. They say: “Oh, my friend’s father got this vaccine and had a stroke two days later.” Or, “I just saw on the news that some healthy 35 year old had a stroke a week after getting the vaccine.”

They confuse correlation with causation. Why? Well unfortunately people of all ages get strokes and if you are vaccinating millions of people some of them will get strokes within a few days of getting the vaccine. How many? We can actually calculate roughly how many! From https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3250269/: “Among adults ages 35 to 44, the incidence of stroke is 30 to 120 of 100,000 per year, and for those ages 65 to 74, the incidence is 670 to 970 of 100,000 per year over 75 years”

So, if we use these numbers and take the midpoint for people 35-44, about 75 people per 100,000 who are 35-44 will have a stroke in a year. There are about 45 million people 35-44 in the United States. That means there will be about 75*(45million/100,000) = 33,750 strokes a year in people between 35-44 in the United States, or almost 100 a day. These strokes have nothing to do with a person getting a vaccine. And the rate of strokes in people over 75 is about 10 times higher. Because there are about 35 million people over 75 in the United States, we would expect about (820*35million/100,000) = 287,000 strokes a year, or about 800 strokes a day, that have nothing to do with a vaccine.
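The same arithmetic as a short Python sketch, in case you want to try other age groups. The function name is mine; the rates and population sizes are just the ones used in the paragraph above.

```python
def expected_per_day(rate_per_100k_per_year: float, population: int) -> float:
    """Expected background events per day, from a yearly rate per 100,000 people."""
    events_per_year = rate_per_100k_per_year * population / 100_000
    return events_per_year / 365


strokes_per_year_35_44 = 75 * 45_000_000 / 100_000
strokes_per_day_35_44 = expected_per_day(75, 45_000_000)
strokes_per_day_older = expected_per_day(820, 35_000_000)

print(f"35-44: {strokes_per_year_35_44:,.0f} a year, about {strokes_per_day_35_44:.0f} a day")
print(f"older group: about {strokes_per_day_older:.0f} a day")
```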

So please, please keep in mind when hearing anecdotes about side effects from these miracle vaccines that bad ^&%$ happens randomly a lot.

I want to end by showing you a table for background rates on a lot of bad ^%&$ you might see as being “caused” by these vaccines, even terrible ones like death. Every time you hear an anecdote about some bad side effect after a Covid 19 vaccine please think about this table (taken from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2861912/). For example, within one week after vaccinating 10,000,000 people, you will likely have around 1 person keel over and die for no apparent reason and, if all of them were pregnant women, almost 27,800 miscarriages.

Predicted numbers of coincident, temporally associated events after a single dose of a hypothetical vaccine, based upon background incidence rates

Number of coincident events since a vaccine dose (within 1 day, within 7 days, within 6 weeks), followed by the baseline rate used for the estimate:

Event | Within 1 day | Within 7 days | Within 6 weeks | Baseline rate used for estimate
Guillain-Barré syndrome (per 10 million vaccinated people) | 0·51 | 3·58 | 21·50 | 1·87 per 100 000 person-years (all ages; UK Health Protection Agency data)
Optic neuritis (per 10 million female vaccinees) | 2·05 | 14·40 | 86·30 | 7·5 per 100 000 person-years in US females (table 2)
Spontaneous abortions (per 1 million vaccinated pregnant women) | 397 | 2780 | 16 684 | Based on data from the UK (12% of pregnancies)
Sudden death within 1 h of onset of any symptoms (per 10 million vaccinated people) | 0·14 | 0·98 | 5·75 | Based upon UK background rate of 0·5 per 100 000 person-years (table 2)
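You can roughly reproduce rows of this table from the baseline rates alone. Here is a minimal Python sketch; the function name is mine, and the 365.25-day year is my assumption about how the paper converted person-years to days, so expect small rounding differences from the published figures.

```python
DAYS_PER_YEAR = 365.25  # assumption: the paper's exact convention isn't stated here


def coincident_events(rate_per_100k_person_years: float,
                      people: int,
                      window_days: int) -> float:
    """Events expected by pure coincidence within `window_days` of a dose."""
    events_per_year = rate_per_100k_person_years * people / 100_000
    return events_per_year * window_days / DAYS_PER_YEAR


VACCINATED = 10_000_000
baseline_rates = {
    "Guillain-Barre syndrome": 1.87,  # per 100,000 person-years
    "Sudden death": 0.5,              # per 100,000 person-years
}

for event, rate in baseline_rates.items():
    row = [coincident_events(rate, VACCINATED, days) for days in (1, 7, 42)]
    print(event, [round(x, 2) for x in row])
```

The Guillain-Barré numbers come out at about 0.51, 3.58 and 21.5, and sudden death at roughly 1 within a week, with no vaccine effect anywhere in the calculation.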

Statistics in the Pfizer Data – how good do they show the vaccine to be?

Both the UK and the FDA have released enough information so that one can make a good bet on how the Pfizer vaccine worked. (See https://www.fda.gov/media/144245/download for example.) It makes for fascinating and informative reading. I am not competent to comment on the medical aspects described there other than to say that when I first started reading about possible vaccines many months ago, I never found any virologist who predicted we would have a vaccine that was more than 70% effective. To have a vaccine that is likely about 95% effective for people 18-64 is nothing short of a medical miracle: we really lucked out.

However, when you look at the key statistical table (“Table 8: Subgroup Analyses of Second Primary Endpoint: First COVID-19 Occurrence From 7 Days After Dose 2, by Subgroup”) things get murkier. More precisely, what one sees is exactly what I thought would happen: the signal becomes really bad for people over 65 and completely useless for people over 75. Here’s an excerpt of that table, and then I will try to explain what is going on:

What I need to explain is how to think about what that “95% CI” in the last column means and why it is so important. “CI” stands for confidence interval and is the key when a statistician looks at data and tries to tease out signal from noise. The ideas behind a confidence interval are simple, although how to define it precisely and then calculate it, is a bit tricky. 

In a nutshell, when we pull a single number from a bunch of measurements – whether it is the average weight of what’s in a bunch of boxes of cereal or how effective a vaccine is – we know that number isn’t going to be perfect. So what we want to do and, well, should do is not focus on that single number but give a range around it, built so that if we did the experiment repeatedly, the range we compute would contain the true value about 19 times out of 20. A range built this way is what statisticians call a “95% confidence interval”. The more data you have, the tighter you can make your confidence interval!
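To make the “19 times out of 20” idea concrete, here is a small simulation in Python using the cereal box example. Everything in it is my own illustration: the 500 gram true average, the 10 gram spread, the sample of 50 boxes, and the common normal-approximation recipe (mean plus or minus 1.96 standard errors). It is not the method used for the vaccine efficacy intervals.

```python
import random
import statistics


def confidence_interval_95(sample: list[float]) -> tuple[float, float]:
    """Normal-approximation 95% interval for the mean of `sample`."""
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / len(sample) ** 0.5
    return mean - 1.96 * standard_error, mean + 1.96 * standard_error


def coverage(true_mean: float = 500.0, spread: float = 10.0,
             boxes: int = 50, experiments: int = 2000, seed: int = 1) -> float:
    """Fraction of repeated experiments whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        sample = [rng.gauss(true_mean, spread) for _ in range(boxes)]
        low, high = confidence_interval_95(sample)
        if low <= true_mean <= high:
            hits += 1
    return hits / experiments


fraction_covering = coverage()
print(f"intervals containing the true average: {fraction_covering:.1%}")
```

The printed fraction should land close to 95%. Raising `boxes` makes each interval narrower without changing that fraction much, which is the sense in which more data buys a tighter interval.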

So now let’s look at some individual lines from the table above and tease out just what the signal is. The major line is for people 18-64, and we had enough cases to say that our 94.6% efficacy average for the vaccine has a 95% confidence interval that runs from 89.1 to 97.7. So what the biostatisticians who analyzed the data are telling us is that, roughly speaking, if we bet that this vaccine is between 89.1% and 97.7% effective for this group, this is an awfully good way to bet and we will win 95% of the time. These are astonishingly good numbers and we all have a lot to be thankful for. (Although having data by age deciles would have been better, they don’t have enough data to do that even in this bigger group I suspect.)

But then we have the next two lines and they, unfortunately, confirm what I wrote about here (https://garycornell.com/2020/10/22/we-are-unlikely-to-have-a-vaccine-that-is-proven-effective-for-seniors-for-a-long-time-unless-dramatic-action-is-taken-now/). For people 65 to 74, while the average number (92.9%) looks great, the confidence interval does not. It says, roughly speaking, that a bet that the efficacy is between 53.2 and 99.8 is a good bet. Or, I would say, you really don’t have a great way to bet. This kind of confidence interval says that they didn’t have enough cases in this group to really say much at all, and so the confidence range is too large to be really useful.

And when we get to people over 75, what they describe isn’t a confidence interval, it’s a joke. A confidence interval of -12.1 to 100 is a lot like saying they threw a bunch of darts at a dart board at random and did everything from hit bystanders (i.e. the vaccine made things worse) to perfect protection. Who would make any bets on what is going on in this situation? They simply didn’t have enough cases to say anything meaningful and so what they say is just totally useless.

But I don’t want to end on a depressing note!  My friends who think about these questions feel pretty strongly that while the vaccine will likely be less effective in people over 65 than it is in younger people, the dropoff won’t be great enough to make a big difference. For example, if it is 20-25% less effective in these age groups (which they think is the worst case scenario), you still get a vaccine that is roughly between 70% and 75% effective – which is still pretty darn good.  

Still I wish they had enrolled enough people >65 to have a better signal!

Obviously great great news but we aren’t home free yet

Efficacy in the 18-55 year old group is much higher than expected! That augurs well for seniors even without efficacy data. But we are a long way from home free. Issues to keep in mind:

  • Only 25 million people (50 million doses/2) worldwide can be treated in the first few months. Only 500 million people worldwide in the first year (1 billion doses/2).
  • It’s an extremely difficult vaccine to transport: https://garycornell.com/2020/08/30/back-of-envelope-calculation-the-number-and-the-costs-of-freezers-needed-for-the-pfizer-vaccine/
  • Duration of immunity is completely up in the air, as is the effect on preventing serious cases and deaths – and again no real info on seniors for the reasons I’ve written about at length. But no reason not to be hopeful.
  • Long term safety data won’t be available for a while – probably mid 2021. First responders will be participants in one of the largest safety aka Phase 4 trials ever. But again there is no reason not to be hopeful.

But it couldn’t be better news: everything I have read indicated that people were hoping for a 70% efficacy signal at best, and Pfizer got a 90% signal!

Just what is a “margin of error” anyway-Sampling 1

You have probably been seeing a lot of polls lately. They all end by saying something like “we sampled 1,000 people and our margin of error is 3.8% or 4.5%” or some other weird percentage. I thought I would take some time to explain where this number comes from and what it means. I want to start by saying that the technical term isn’t “margin of error” but rather “margin of sampling error,” and the keywords are “sampling error.” Although it seems not directly connected to the pandemic, “sampling error” is a fundamental concept in statistics that must come up in, for example, trying to find Covid 19 prevalence. This post won’t get too much into the math, but eventually math will rear its head when discussing sampling error, so I will have some future posts that are a little more math centric.

Anyway, statisticians like to talk about a “population” – that’s what you are trying to understand by taking a “sample.”  We can’t test everybody in the United States for the antibodies to the virus that causes Covid, so we test a “sample”. From the results of that sample we try to estimate the “true” result – the actual number of people that have caught the disease. For example, suppose we find that 10% of our sample test positive and we “jump” to the conclusion that, heck, probably 10% of the whole population is positive. Are we really jumping?

The answer depends on how the sample was taken! But if it was taken “randomly” (and I will have to have a post on just what that means, it’s actually a tricky concept), the answer is “probably no, we are really not jumping to conclusions,” and this is true even if the sample seems so small compared to the actual population size. And yes, it seems like magic that a sample size of 1,000 or so allows one to make reasoned judgements about populations in the hundreds of millions, i.e. that you can poll 1,000 people and make reasoned judgements on how the 210 million adult population of the United States feels or is.

But it is true. A fundamental result – perhaps the fundamental result in statistics – says that the results from relatively small-sized random samples come pretty close to the true result for the whole population under some pretty general and very reasonable assumptions. And this is true no matter how large the population you are sampling. And yes, I’ll repeat it: it does seem like magic that it is the sample size, rather than, say, the size of the sample relative to the size of the population, that matters.

In fact, if you take a “random” sample of about 1,000 people from the adult population of the United States (about 210 million people), the odds of being off by more than 3% in either direction are, roughly speaking, 1/20. Go to about 2,400 people and then 19/20 times you are within about 2% of the correct answer; to get within about 1% you need roughly 9,600 people. All this means is that if you had the time and money to increase the sample size to what still seems ridiculously small relative to the population size, you can make the chance of you being wrong also ridiculously small. So I hope you can see why sampling can be so powerful in determining the hidden occurrence of Covid 19 infections, for example, and that polling, if done properly, can work.
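For the curious, here is a minimal Python sketch of the usual textbook formula for the margin of sampling error of a simple random sample, 1.96*sqrt(p*(1-p)/n). The function names are mine, and the formula assumes a truly random sample from a population much larger than the sample; real pollsters make further adjustments that this ignores.

```python
import math


def margin_of_sampling_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """19-times-out-of-20 margin of error for a simple random sample of size n.

    p is the true split in the population (0.5 is the worst case) and
    z = 1.96 is the multiplier for a 1-in-20 chance of being further off.
    """
    return z * math.sqrt(p * (1 - p) / n)


def sample_size_for(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest sample size whose margin of error is at most `margin`."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


for n in (1_000, 2_400):
    print(f"n = {n:>5}: margin is about {margin_of_sampling_error(n):.1%}")

print(f"75-25 split, n = 1000: about {margin_of_sampling_error(1_000, p=0.75):.1%}")
print(f"sample needed for a 1% margin: about {sample_size_for(0.01):,}")
```

Notice that the population size appears nowhere in the formula, and that halving the margin takes four times as many people.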

OK, as a mathematician I need to say this: mathematics isn’t magic, it just seems that way sometimes. And for what it is worth, if I had to pick a single result in all of mathematics where any mathematician can understand relatively easily why it is true and yet still have trouble believing it, it would be this result.

But I need to reiterate that when looking at the results of any survey: (a) you need to be sure they did a random, unbiased sample, and (b) even if they did that, you need to keep in mind that almost all reported sampling results use a 1/20 chance of being off by more than their “margin of error.” Finally, (c) it’s worth keeping in mind that if they did a random, unbiased sample, there is only a very, very small chance of them being off by twice their margin of error.

(Technical note: These calculations assume you are looking at the result for a more or less equally split population. The numbers needed would change slightly if you were doing a sample where you had a more extreme split such as 75-25%. But, roughly speaking, the error is inversely proportional to the square root of the sample size – and the population size doesn’t figure into it!)

So stay tuned for more posts that go deeper into the magic and mystery of how sampling works!