C3PO: Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!
Han: Never tell me the odds!
Here's the scene for those who haven't seen it or may have forgotten. Superficially this is just a fun movie moment dismissing 'boring' data analysis, but there's actually an interesting dilemma here. Even the first time you watch Empire you know that Han can pull it off. But even if we deeply believe that Han will make it through, is C3PO's analysis wrong? Clearly Han believes it's dangerous: 'They'd have to be crazy to follow us.' None of the pursuing TIE fighters make it through, which provides pretty strong evidence that C3PO's numbers are not off. So what are we missing?
What's missing is that we know Han is a badass! C3PO isn't wrong, he's just forgetting to add essential information to his calculation. The question now is: can we find a way to avoid C3PO's error without dismissing probability entirely, as Han proposes? To answer this we'll have to model both how C3PO thinks and what we believe about Han, then find a method to blend those models.
We'll start by taking apart C3PO's reasoning. We know C3PO well enough by this point in the film to realize that he's not just making numbers up. C3PO is fluent in over 6 million forms of communication, and that takes a lot of data to support. We can assume then that he has actual data to back up his claim of 'approximately 3,720 to 1'. Because C3PO says that 3,720 to 1 is only the approximate odds of successfully navigating an asteroid field, we know that the data he has only gives him enough information to suggest a range of possible rates of success.
The only outcomes that C3PO is considering are successfully navigating the asteroid field or not. If we want to look at the various possible probabilities of success given the data C3PO has, the distribution we're going to use is the Beta distribution. We can define C3PO's reasoning with the following equation:
P(RateOfSuccess|Successes) = Beta(α,β)
The Beta distribution is parameterized with an α (number of times success observed) and a β (the number of times failure is observed). This distribution tells us which rates of success are most likely given the data we have. We can't really know what's in C3PO's head, but let's assume that not too many people have actually made it through an asteroid field and in general not that many people try (because it's crazy!). We're going to say that C3PO has records of two people surviving and 7,440 people ending their trip through the asteroid field in a glorious explosion. Below is a plot of the probability density function that represents C3PO's belief in the true rate of success when entering an asteroid field.
For any ordinary pilot entering an asteroid field this looks bad. In Bayesian terms, C3PO's estimate of the true rate of success given observed data is referred to as the likelihood.
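To see just how bad, here is a minimal sketch of C3PO's belief in Python, using only the standard library. It takes the counts assumed above (2 survivors, 7,440 failures) as the Beta parameters; the helper `beta_log_pdf` and the variable names are my own, introduced just for illustration.

```python
import math

def beta_log_pdf(x, alpha, beta):
    """Log density of Beta(alpha, beta) at x, for 0 < x < 1."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return log_norm + (alpha - 1) * math.log(x) + (beta - 1) * math.log1p(-x)

# Counts assumed in the text: 2 survivors, 7,440 explosions.
c3po_alpha, c3po_beta = 2, 7440

# Mean and mode of a Beta distribution (the mode formula needs alpha, beta > 1).
c3po_mean = c3po_alpha / (c3po_alpha + c3po_beta)
c3po_mode = (c3po_alpha - 1) / (c3po_alpha + c3po_beta - 2)

# Compare the density at the most likely rate with the density at a 1% rate.
density_at_mode = math.exp(beta_log_pdf(c3po_mode, c3po_alpha, c3po_beta))
density_at_1pct = math.exp(beta_log_pdf(0.01, c3po_alpha, c3po_beta))

print(f"mean rate of success: {c3po_mean:.6f}")
print(f"most likely rate:     {c3po_mode:.6f}")
print(f"density at mode vs. at 1%: {density_at_mode:.1f} vs. {density_at_1pct:.3g}")
```

Working in log space with `math.lgamma` avoids overflowing the huge factorial-like terms that show up when β is in the thousands. Under these counts even a 1% survival rate is, for practical purposes, ruled out.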
But Han is a badass
The problem with C3PO's analysis is that his data is on all pilots, and Han is far from your average pilot. If we can't put a number to Han's 'badass' then our analysis is broken, not just because Han makes it (we have p-values to blame for that), but because we believe he's going to. Statistics is a tool to aid and organize our reasoning and beliefs about the world. If our statistical analysis not only contradicts our reasoning and beliefs but also fails to change them, then something is wrong with our analysis.
Why do we believe Han will make it? Because Han makes it through everything that's happened so far. What makes Han Solo, Han Solo is that no matter how unlikely he is to make it through something he still succeeds! We have a prior belief that Han will survive the asteroid field. The prior probability is something that is very controversial for people outside of Bayesian analysis. Many people feel that just 'making up' a prior is not objective. This scene from Empire is an object lesson in why it is even more absurd to throw out our prior beliefs. Imagine watching Empire the first time, getting to this scene and having a friend sincerely tell you that 'whelp, Han is dead now'. It is worth pointing out again: C3PO is not entirely wrong. If your friend said 'whelp, those TIE fighters are dead now', you would likely chuckle in agreement.
Now we have to come up with an estimate for our prior probability that Han will successfully navigate the asteroid field. We do have a real problem though: we have a lot of reasons for believing Han will survive but no numbers to back that up. We have to make a guess. Let's start with some sort of upper bound on his badassness. If we believed it was impossible for Han to die, the movie would become boring. At the other end, I personally feel much more strongly about Han being able to make it than C3PO does about him failing. I'm going to say I roughly feel that Han has a 20,000:1 chance of making it through a situation like this. We'll use another Beta distribution to express this for two reasons. First, my beliefs are very approximate, so I'm okay with the true rate of survival being variable. Second, it makes the calculations we need to do later much easier. Here is our distribution for our prior probability that Han will make it:
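To put rough numbers on that prior, here is a small sketch. It assumes that '20,000:1' translates to a Beta distribution with α = 20,000 and β = 1, which is my reading of the odds rather than something the data forces on us. It also leans on a handy fact: when β = 1, the cumulative distribution of Beta(α, 1) is simply x^α, so no special libraries are needed. The names `prior_cdf` and `prob_above` are just illustrative.

```python
# Assumed prior for Han: 20,000 imagined successes for every 1 failure.
prior_alpha, prior_beta = 20000, 1

# Mean of a Beta distribution: alpha / (alpha + beta).
prior_mean = prior_alpha / (prior_alpha + prior_beta)

def prior_cdf(x):
    """P(rate <= x) under Beta(prior_alpha, 1); valid only because beta == 1."""
    return x ** prior_alpha

# How much of our belief sits above a 99.99% survival rate?
prob_above = 1 - prior_cdf(0.9999)

print(f"prior mean rate of success: {prior_mean:.5f}")
print(f"P(rate > 0.9999): {prob_above:.3f}")
```

Most of the prior's weight is piled up right against 1, which matches the feeling that Han almost always makes it, while still leaving a sliver of room for him to fail.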
Creating suspense with a posterior
We have now established what C3PO believes (likelihood) and modeled our own beliefs in Han (prior probability), but we need a way to combine these. By combining beliefs we create what is called our posterior distribution. In this case the posterior models our sense of suspense upon learning the likelihood from C3PO. The purpose of C3PO's analysis is in part to poke fun at his analytical thinking, but also to create a sense of real danger. Our prior alone would leave us completely unconcerned for Han, but when we adjust it based on C3PO's data we get a new belief in the real danger. The formula for the posterior is actually very simple and intuitive:
Posterior = Likelihood⋅Prior
The only thing not explicitly stated in this formula is that we usually want to normalize everything so that the total probability adds up to 1. It also turns out that combining our two Beta distributions in this way, including the normalization, is remarkably easy.
Beta(α_posterior, β_posterior) = Beta(α_likelihood + α_prior, β_likelihood + β_prior)
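In code, the update really is just addition. The sketch below treats C3PO's records as counts of successes and failures added onto the prior's parameters, which is the standard Beta-Binomial way to arrive at this formula. The `BetaBelief` type and `combine` function are hypothetical names of my own, and the Beta(20000, 1) prior is the assumed translation of my '20,000:1' guess.

```python
from typing import NamedTuple

class BetaBelief(NamedTuple):
    """A belief about a rate of success, stored as Beta parameters."""
    alpha: float  # weight on success
    beta: float   # weight on failure

def combine(prior: BetaBelief, data: BetaBelief) -> BetaBelief:
    """Add the observed success/failure counts to the prior's parameters."""
    return BetaBelief(prior.alpha + data.alpha, prior.beta + data.beta)

c3po = BetaBelief(alpha=2, beta=7440)      # counts assumed in the text
han = BetaBelief(alpha=20000, beta=1)      # assumed reading of '20,000:1'

posterior = combine(han, c3po)
print(posterior)
```

Because the result is itself a Beta distribution, the normalization comes for free: there is no integral to compute by hand.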
And here is what our final, posterior, belief looks like:
Combining our C3PO belief with our 'Han is badass' belief we find that we have a much less extreme position than either of these. Our posterior belief is a roughly 73% chance of survival, which means we still think Han has a good shot of making it, but we're much more nervous.
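For anyone who wants to check the arithmetic, here is a rough sketch that summarizes the posterior. It assumes the parameters used throughout (2 and 7,440 from C3PO, plus an assumed Beta(20000, 1) prior for Han) and uses `random.betavariate` from the standard library to simulate draws. The seed and the number of draws are arbitrary choices of mine.

```python
import math
import random

# Posterior parameters: prior plus C3PO's counts (all values assumed above).
post_alpha, post_beta = 20000 + 2, 1 + 7440
total = post_alpha + post_beta

# Closed-form mean and standard deviation of a Beta distribution.
post_mean = post_alpha / total  # about 0.729
post_sd = math.sqrt(post_mean * (1 - post_mean) / (total + 1))

# Simulate draws to get a rough central 95% range for the rate of success.
random.seed(1980)
draws = sorted(random.betavariate(post_alpha, post_beta) for _ in range(10_000))
low, high = draws[249], draws[9749]

print(f"posterior mean: {post_mean:.3f} (sd {post_sd:.4f})")
print(f"rough 95% range: {low:.3f} to {high:.3f}")
```

The range is narrow because both beliefs carry thousands of 'observations'. The suspense comes from where the posterior is centered, not from how wide it is.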
This post has been pretty light on math, but the real aim was to introduce the idea of Bayesian priors and show that they are as rational as believing that Han Solo isn't facing certain doom by entering the asteroid field. At the same time we can't just throw away the information that C3PO has to share with us. The only sane way to understand the situation is to combine our belief in Han with the information C3PO has provided. This concept is a fundamental principle of Bayesian analysis.
This article first appeared on Will Kurt's Count Bayesie blog