Probability Theory

How to Become a Bayesian

On the fundamental difference between frequentist and Bayesian statistics

Christian Graf
Towards Data Science
6 min read · Jul 23, 2020


The difference between the Bayesian and the frequentist approach lies in the interpretation of probability. Photo by Burst on Unsplash

The Bayesian approach to statistics is a powerful alternative to the frequentist approach. In this post, we will explore the foundations of the Bayesian viewpoint and how it differs from the frequentist one. We will not derive Bayes' theorem in a purely technical way, but instead try to understand the underlying principles.

It all starts with probability. Probability theory is the foundation of statistics and data analysis, and it is already at the definition of probability that the Bayesian and the frequentist ways part. How can there be different viewpoints on probability, you may ask; after all, probability should be something objective, not subjective. This is only partially true. Suppose you are asked to assign a probability to the outcome of the following two series of coin flips (Heads H, Tails T):

  • T T T T T
  • H T T H T

This question is not well defined and leaves room for ambiguity. If you calculate the probability of the exact sequence, you arrive at the conclusion that each sequence of n fair coin flips has a probability of (1/2)^n, i.e. 1/32 in this example. However, in daily life we would be much more surprised by an outcome of five tails in a row. Perhaps we would rather answer the question: what is the probability of observing a certain number of tails? This question is addressed by the binomial distribution, which gives a probability of 1/32 for the first sequence (five tails) and 10/32 for the second (three tails). Suddenly, the second outcome is ten times more likely. This example shows that a problem does not have one intrinsic probability; the probability always has to be specified, and it may depend on the individual viewpoint, or on the specific question you want to answer. And we have not even touched the implicit assumptions that the coin is fair and never lands on its side.
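The two readings of the question can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper `binom_pmf` is our own illustrative function, not a library API:

```python
from math import comb

n = 5  # number of coin flips

# Reading 1: probability of one exact sequence of n fair-coin flips is (1/2)^n
p_exact = (1 / 2) ** n
print(p_exact)  # 0.03125, i.e. 1/32

# Reading 2: probability of observing a given number of tails,
# from the binomial distribution
def binom_pmf(k, n, p=0.5):
    """P(exactly k tails in n flips of a coin with tail probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_five_tails = binom_pmf(5, n)   # T T T T T -> 5 tails
p_three_tails = binom_pmf(3, n)  # H T T H T -> 3 tails
print(p_five_tails, p_three_tails)  # 0.03125 and 0.3125, i.e. 1/32 and 10/32
```

Under the first reading both sequences are equally likely; under the second, the mixed sequence is ten times more likely, which is exactly the ambiguity described above.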

We established that the interpretation of probability may depend on the question you want to answer. How does this differ between the frequentist and the Bayesian approach?

  • The frequentist sees a probability as the limit of its relative frequency in a large number of trials.
  • The Bayesian sees the probability as a degree of belief.

What does this mean? Imagine you want to infer the probability of a coin showing heads. A typical frequentist conclusion after an experiment would be:

After observing 58 heads out of 100 coin flips, I estimate the probability of observing heads to be 58%.

A Bayesian may rather say:

From previous experience we most probably have a fair coin. After observing 58 heads out of 100 coin flips, I update my prior belief. The most probable value now is 54%.
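The Bayesian update above can be sketched with a conjugate Beta-binomial calculation. The article does not state which prior is used, so the Beta(50, 50) prior below, a belief tightly concentrated around a fair coin, is an illustrative assumption that happens to reproduce the roughly 54% figure:

```python
# Conjugate Beta-binomial update (sketch; the Beta(50, 50) prior is an
# illustrative assumption, chosen to encode a strong belief in a fair coin).
a_prior, b_prior = 50, 50
heads, tails = 58, 42

# The posterior of a Beta(a, b) prior after observing heads/tails successes
# and failures is Beta(a + heads, b + tails).
a_post = a_prior + heads   # 108
b_post = b_prior + tails   # 92

# Most probable value (posterior mode) of a Beta(a, b) with a, b > 1
mode = (a_post - 1) / (a_post + b_post - 2)
print(round(mode, 2))  # 0.54
```

The data alone point at 58%, the prior alone at 50%; the posterior mode lands in between, which is the "updating of prior belief" the quote describes.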

We will look at another example to make this difference clearer.

The Raffle Problem


Imagine you participate in a raffle where people buy tickets, some of which win a prize. You observe the game for some time in order to estimate your chance of winning. You see 10 people buy a ticket, 5 of whom win. Following a frequentist approach, a reasonable estimate for the chance of winning is 50%: you take your data into account and divide the 5 winning tickets by the 10 bought tickets. But what if you had observed only one person buying a ticket, and that ticket won? What would your estimate for the chance of winning be then? Clearly the intuitive approach used before no longer works, because you know from experience that a winning rate of 100% is highly unlikely.

How can we incorporate this prior knowledge into our estimate? In 1772, David Hume arrived at the conclusion [1]:

If we be, therefore, engaged by arguments to put trust in past experience, and make it the standard of our future judgement, these arguments must be probable only […]

Meaning that we need a probabilistic viewpoint if we want to combine our data with prior knowledge. In the following, we will see how this can be achieved within a Bayesian framework. A step in the right direction for our raffle problem is to ask: what is the probability of observing our data given a certain model? To answer this, we first have to set up a model.

The Likelihood

The raffle problem can be described by a binomial model and we can write the probability of observing r winning tickets out of N bought tickets conditioned on the winning chance q as:

$$p(r \mid N, q) = \binom{N}{r}\, q^{r}\, (1 - q)^{N - r}$$

Binomial distribution

The probability of our data given a model is called the likelihood. For our two cases from above (N=10, r=5 and N=1, r=1) the likelihood looks as follows:

Likelihood for the two scenarios of the raffle problem

In the first case, the value of q with the highest likelihood is q = 0.5, which matches our frequentist estimate, while extreme values close to 0 or 1 have low likelihood. In the second case, the likelihood is highest for q = 1, but lower values of q also have a non-zero likelihood. It is important to note that the likelihood is not a probability distribution: the areas under the shown likelihood curves are smaller than one.
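That the likelihood is not a probability distribution in q is easy to verify numerically. The sketch below integrates the binomial likelihood over q for both scenarios (the `likelihood` and `area` helpers are our own illustrative functions):

```python
from math import comb

def likelihood(q, n, r):
    """Binomial likelihood of r winning tickets out of n at winning chance q."""
    return comb(n, r) * q**r * (1 - q) ** (n - r)

def area(n, r, steps=10_000):
    """Numerically integrate the likelihood over q in [0, 1] (midpoint rule)."""
    dq = 1 / steps
    return sum(likelihood((i + 0.5) * dq, n, r) for i in range(steps)) * dq

# Case 1: N=10, r=5 -- peaks at q=0.5, area ~1/11
# Case 2: N=1,  r=1 -- peaks at q=1,   area ~1/2
print(area(10, 5), area(1, 1))  # both clearly smaller than 1
```

In fact, the area under a binomial likelihood viewed as a function of q is always 1/(N+1), never 1, which is why a separate normalization step is needed.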

Bayes' Theorem

Are we already done here? No. The likelihood gives us the probability of our data given our model, p(data|q), and is not a probability distribution. What we really want to know is the probability of our model given our data, p(q|data). The difference between the two is substantial: you want to choose the model with the highest probability given your data, not the model that merely describes the data best. To transform the likelihood into a probability distribution, we need Bayes' theorem:

$$p(q \mid \text{data}) = \frac{p(\text{data} \mid q)\, p(q)}{p(\text{data})}$$

Bayes' theorem

p(data) is called the marginal likelihood, or the model evidence, and p(q) is called the prior. The prior is exactly where we can bring prior knowledge into the equation, and it is what distinguishes p(q|data) from the likelihood. The marginal likelihood properly normalizes the distribution.
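Putting the pieces together, here is a minimal grid-based sketch of Bayes' theorem for the single-ticket scenario (N=1, r=1). The uniform prior is an illustrative choice, not something the raffle problem dictates:

```python
from math import comb

def likelihood(q, n, r):
    """Binomial likelihood of r winning tickets out of n at winning chance q."""
    return comb(n, r) * q**r * (1 - q) ** (n - r)

# Discretize the winning chance q on [0, 1]
steps = 20_000
dq = 1 / steps
grid = [(i + 0.5) * dq for i in range(steps)]

n, r = 1, 1                # the single observed ticket that won
prior = [1.0 for _ in grid]  # uniform prior density p(q) (illustrative choice)

# Numerator of Bayes' theorem: p(data|q) * p(q)
unnorm = [likelihood(q, n, r) * p for q, p in zip(grid, prior)]

# Marginal likelihood p(data): integrate the numerator over q
evidence = sum(unnorm) * dq

# Posterior p(q|data): a proper probability distribution
posterior = [u / evidence for u in unnorm]

print(sum(posterior) * dq)  # ~1.0: normalization worked
mean = sum(q * p for q, p in zip(grid, posterior)) * dq
print(round(mean, 3))       # 0.667: well below the naive 100% estimate
```

With a uniform prior the posterior is 2q, whose mean of 2/3 already tempers the naive frequentist estimate of 100%; a prior encoding real experience with raffles would pull it down further.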

Conclusions

We saw that the fundamental difference between the frequentist and the Bayesian approach to statistics lies in the interpretation of probability. To think in a Bayesian way, you have to adapt the way you interpret probability. While the frequentist defines probability as a long-run frequency, a Bayesian sees it as a degree of belief. This means that a Bayesian is especially open to including previous knowledge in their calculations via the prior, and can also assign probabilities to events that occur only once. I hope these concepts help you on your way to becoming a true Bayesian.

You can find a full example of a Bayesian analysis here.

Further Readings:

Edwin T. Jaynes, Probability Theory: The Logic of Science

References:

[1] David Hume, An Enquiry Concerning Human Understanding, 1772
