Getting Started

You have a data sample. From it, you want to calculate a confidence interval for the population mean value. What’s the first thing you think about? It’s usually a t-test.

But the t-test has several requirements, one of which is that the sampling distribution of the mean is nearly normal (either the population is normal, or the sample is reasonably large). In practice, that’s not always true, and so the t-test may not always deliver optimal results.

To work around that kind of limitation, use the bootstrap method. It has only one important requirement: that the sample approximates the population…
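As a quick sketch of how the bootstrap works in practice, here's a percentile bootstrap for the mean. The sample below is synthetic and the 95% level is my choice for illustration; neither comes from the article:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sample; in practice this would be your observed data.
sample = rng.exponential(scale=10, size=50)

# Percentile bootstrap: resample with replacement many times,
# compute the mean of each resample, and take the middle 95%.
n_boot = 10_000
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(n_boot)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

Note that nothing here assumes normality; the only assumption is the one stated above, that the sample resembles the population.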

As of right now (December 2020), the leading mRNA vaccines for COVID-19 are made by Pfizer-BioNTech and Moderna. Their efficacy is estimated to be around 94–95%.

[Figure: vaccine efficacy likelihood, peaking at the most likely vaccine efficacy]

But how was that number calculated? It turns out the basic value is pretty easy to compute. I'll show you how to do that, and then I'll estimate how confident we can be that the value is right.

Randomized trials

To test a vaccine, you need to do a randomized blind trial. Gather tens of thousands of people. Divide them into two nearly equal groups. One group will receive the vaccine. The other group (the control) will receive an injection that looks exactly like the vaccine, but doesn’t actually do anything.

The control group shows what happens when there is no vaccine
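To make the basic calculation concrete: efficacy is one minus the risk ratio between the two groups. The counts below are illustrative, roughly in line with the published Pfizer-BioNTech Phase 3 results, not exact figures from this article:

```python
# Vaccine efficacy is one minus the risk ratio between the groups.
# These counts are illustrative, roughly the size of the
# Pfizer-BioNTech Phase 3 trial; not exact published figures.
n_vaccine = 21_700   # participants in the vaccine group
n_placebo = 21_700   # participants in the placebo (control) group

cases_vaccine = 8    # COVID-19 cases in the vaccine group
cases_placebo = 162  # COVID-19 cases in the control group

risk_vaccine = cases_vaccine / n_vaccine
risk_placebo = cases_placebo / n_placebo

efficacy = 1 - risk_vaccine / risk_placebo
print(f"Estimated efficacy: {efficacy:.1%}")  # about 95%
```

With equal group sizes, the risk ratio reduces to the ratio of case counts, which is why the headline number is so easy to reproduce.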

This is part 2 of this article.

To recap: there’s a pandemic going on, 1% of people have the virus. There’s a test that can detect the virus, and the test is 99% reliable (for both positive and negative results).

But this time, when you take the test, the result is negative. How much can you trust that result?

Ideal case

If you know nothing else besides the test result, then a negative result is very reliable: 99.9898%, which is basically 100%.

I won't repeat the analysis here; please refer to Part 1. But again, this is the ideal-case scenario. What happens in reality?
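The ideal-case number can be checked with a direct application of Bayes' theorem, using only the figures from the recap (1% prevalence, 99% reliability for both outcomes):

```python
p_virus = 0.01    # prevalence: 1% of people have the virus
p_correct = 0.99  # test reliability, for both positive and negative results

# Total probability of a negative result:
# P(neg) = P(neg | healthy) P(healthy) + P(neg | virus) P(virus)
p_negative = p_correct * (1 - p_virus) + (1 - p_correct) * p_virus

# Bayes' theorem: P(healthy | negative)
p_healthy_given_neg = p_correct * (1 - p_virus) / p_negative
print(f"{p_healthy_given_neg:.6%}")  # 99.989798%
```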

In the real world

Let’s say there’s a virus pandemic sweeping through the population, and 1% of people have the virus. Let’s say there’s a test for this condition, and the test is 99% reliable, meaning — out of 100 tested cases, the test will be correct in 99 cases, and will be wrong in 1 case. The reliability is the same (99%) for both positive and negative results.

You take the test, and the result comes back positive — the test says you have the virus. And that’s all the information you have. What’s the probability you actually do have the virus? …
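Under these assumptions, Bayes' theorem answers the question directly. A minimal sketch:

```python
p_virus = 0.01    # prevalence: 1% of people have the virus
p_correct = 0.99  # reliability for both positive and negative results

# Total probability of a positive result:
# P(pos) = P(pos | virus) P(virus) + P(pos | healthy) P(healthy)
p_positive = p_correct * p_virus + (1 - p_correct) * (1 - p_virus)

# Bayes' theorem: P(virus | positive)
p_virus_given_pos = p_correct * p_virus / p_positive
print(p_virus_given_pos)  # 0.5
```

The false positives from the large healthy majority exactly balance the true positives from the small infected minority, which is why the result lands at a coin flip.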

Sometimes trends need to be removed from timeseries data, in preparation for the next steps, or as part of the data cleaning process. If you can identify the trend, simply subtract it from the data; the result is detrended data.

If the trend is linear, you can find it via linear regression. But what if the trend is not linear? We’ll see what we can do about that in a few moments.

But first, the simple case…

Linear trend

Here’s timeseries data with a trend:

Let's load it up and see what it looks like:

import pandas as pd
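The article's dataset isn't available here, so this sketch builds a synthetic series with a linear trend, fits the trend by least-squares regression (`np.polyfit`), and subtracts it:

```python
import numpy as np
import pandas as pd

# The article's actual dataset isn't shown, so build a synthetic
# series: a linear trend plus Gaussian noise.
rng = np.random.default_rng(0)
t = np.arange(200)
series = pd.Series(0.5 * t + 3 + rng.normal(0, 5, size=t.size))

# Fit a degree-1 polynomial (the linear trend) via least squares.
slope, intercept = np.polyfit(t, series, deg=1)
trend = slope * t + intercept

# Subtracting the fitted trend leaves the detrended data.
detrended = series - trend
print(f"fitted slope = {slope:.3f}, detrended mean = {detrended.mean():.6f}")
```

The fitted slope recovers the true value (0.5) closely, and the detrended series has zero mean by construction of the least-squares fit.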

Math is hard, let's go shopping — for tutorials, that is. I definitely wish I had read this tutorial before trying some things in Python that involve extremely large numbers (binomial probabilities for large values of n), only to watch my code crash.

But wait, I hear you saying, Python can handle arbitrarily large numbers, limited only by the amount of RAM. Sure, as long as those are all integers. Now try to mix some float values in, for good measure, and the snake starts barfing. Arbitrarily large integers mixed with ordinary floats are not fun in vanilla Python.
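One common workaround (my assumption here, not necessarily the tutorial's approach) is to do the arithmetic in log space with `math.lgamma`, so the enormous binomial coefficient never materializes as a float:

```python
from math import exp, lgamma, log

def binom_pmf(n: int, k: int, p: float) -> float:
    """Binomial probability computed in log space, so the huge
    binomial coefficient never appears as a float."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    log_pmf = log_coeff + k * log(p) + (n - k) * log(1 - p)
    return exp(log_pmf)

# For n this large, computing C(n, k) directly and multiplying by
# p**k would overflow or underflow floats; log space has no trouble.
print(binom_pmf(1_000_000, 500_000, 0.5))  # about 0.000798
```

Only the final `exp` converts back to a float, and by then the value is comfortably within float range.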

Florin Andrei

Graduated Physics. Working in the computer industry. Studying statistics, visualizations, data science.
