Statistical inference is the use of data analysis to draw conclusions about the probability distribution underlying the data. A very common tool for this is the method of maximum likelihood: we assume a distribution type for the data and then estimate the parameters of that distribution that best fit the data. We often use this concept for econometric models as well. Sometimes, however, the model is too complex for maximum likelihood to be a practical way of estimating its parameters. A tool we can use in this case is Bayesian statistical inference. In this article I will explain what Bayesian statistical inference is and what it can be used for.
Bayes’ rule
The most important tool used in Bayesian statistical inference is Bayes’ rule. Bayes’ rule states the following: for events A and B, we have
$$ P(A | B) = \frac{P(B|A) P(A)}{P(B)}. $$
This follows directly from the definition of conditional probability, because $P(A, B) = P(A | B) P(B) = P(B | A) P(A)$. Rearranging the last equality gives Bayes’ rule. We will come back to Bayes’ rule when talking about priors and posteriors, so keep it in mind.
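As a quick numerical illustration of Bayes’ rule, the sketch below computes $P(A|B)$ for a made-up testing example; all probabilities are assumptions chosen purely for illustration.

```python
# A made-up example of Bayes' rule: P(A | B) = P(B | A) * P(A) / P(B).
# Suppose A = "a person has a rare condition" and B = "a test comes back positive".
p_a = 0.01              # prior probability of the condition (assumed)
p_b_given_a = 0.95      # test sensitivity (assumed)
p_b_given_not_a = 0.05  # false positive rate (assumed)

# P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule.
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A | B) = {p_a_given_b:.3f}")  # roughly 0.161
```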
Statistical analysis
Econometricians use models to explain certain phenomena based on data. The simplest and most standard model is the linear regression model. Here we have independent variables $X$; these can be, for example, height, weight, gender, years of education, etc. We also have a dependent variable $y$, which is the variable we try to explain based on $X$. An example of $y$ is income.
There are many possible adjustments one can make to this standard model to suit the data better. For example, if $y$ is a binary variable a linear regression model is not appropriate, because the model could produce predictions for $y$ that are negative or larger than 1, which is not possible. In this case we can model the probability that $y = 1$ and ensure that the model output is between 0 and 1 by using a functional transformation. The two most common choices are the normal cumulative distribution function and the logistic function, leading to the probit and logit models respectively. Both of these functions squeeze the output of the model into the interval between 0 and 1.
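To see how such a transformation works, here is a minimal sketch of the logistic function squeezing arbitrary values of the linear predictor into the interval (0, 1); the input values are made up for illustration.

```python
import numpy as np

def logistic(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values of the linear predictor X @ beta, some far outside [0, 1].
linear_predictor = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(logistic(linear_predictor))  # approximately [0.018 0.269 0.5 0.731 0.982]
```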
Other examples of extensions to the standard model are the tobit model, the two-stage model, the panel data model and the multinomial model. It is even possible to add a spatial component to the model to better capture the dependence between neighboring observations; an example is that the behavior of members of the same family is often similar. Statistical analysis of most of the models mentioned here is possible using maximum likelihood, but it can also be done using Bayesian statistical methods.
The likelihood
A very important part of maximum likelihood estimation is the likelihood function, and it is equally vital for Bayesian methods, where it is needed to calculate the posterior distribution. The likelihood is a function that describes how likely it is to observe the data given the parameter values. If we have independent observations, for discrete random variables it is just the product of the probabilities of observing each observation. For continuous random variables we substitute the value of the probability density function instead.
From Bayes’ rule we know that $P(\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)}$. Here $D$ represents the data and $\theta$ represents the model parameters. Looking at the equation, we see that the term $P(D | \theta)$ is the probability of observing the data given the parameters of the distribution. This is exactly the likelihood. There are three more parts to the equation; these will be explained in the next section.
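As a concrete illustration, the sketch below evaluates the (log-)likelihood of simulated i.i.d. normal data under two different parameter settings; the data, the distribution and the parameter values are assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=100)  # simulated data with assumed parameters

def log_likelihood(data, mu, sigma):
    """Log-likelihood of i.i.d. normal data: the sum of the log densities of the observations."""
    return np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

# The likelihood is higher near the parameter values that generated the data.
print(log_likelihood(data, mu=2.0, sigma=1.5))
print(log_likelihood(data, mu=0.0, sigma=1.5))  # noticeably lower
```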
Priors and posteriors
We will start with $P(D)$. This is known as the marginal likelihood, and it gets its name from the fact that it is the likelihood, weighted by the prior, integrated over the parameter space: $P(D) = \int P(D | \theta) P(\theta) \, d\theta$. This is often a very hard integral to calculate, but fortunately it can be left out: since the data $D$ are given, $P(D)$ does not depend on $\theta$ and only acts as a normalizing constant. We then find $P(\theta | D) \propto P(D | \theta) P(\theta)$.
The second part on the right hand side of the equation is known as the prior distribution. The prior describes our belief about the distribution that the parameters follow. For example, if we know that a parameter can only take values between 0 and 1, we can propose a beta distribution as a prior for this parameter. If we have no prior beliefs, we can use so-called uninformative priors; in the case of the beta distribution, we can use the standard uniform distribution. For problems with an unbounded parameter space we can use, for example, a normal distribution with a very large variance. The large variance makes sure that a wide range of parameter values still has a non-negligible probability. If, however, we do have a strong belief that the parameter should follow a certain distribution, we can choose the prior accordingly.
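As a sketch of the prior choices just described, the snippet below sets up a beta prior for a parameter restricted to $[0, 1]$, a standard uniform as an uninformative alternative, and a wide normal for an unbounded parameter; all hyperparameter values are illustrative.

```python
from scipy import stats

# A Beta(2, 5) prior for a parameter known to lie between 0 and 1 (hyperparameters are illustrative).
informative_prior = stats.beta(a=2, b=5)

# The standard uniform is the Beta(1, 1) special case and acts as an uninformative prior on [0, 1].
uninformative_prior = stats.uniform(loc=0, scale=1)

# For an unbounded parameter, a normal prior with a very large variance is close to uninformative.
vague_prior = stats.norm(loc=0, scale=100)

print(informative_prior.pdf(0.2), uninformative_prior.pdf(0.2), vague_prior.pdf(0.2))
```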
The part on the left hand side of the equation is called the posterior. This is the part we are interested in. With maximum likelihood we only obtain the single parameter value that makes the data most likely. With the posterior distribution we know exactly how the parameters behave given the data, which is far more powerful. If we know the posterior distribution, we can repeatedly sample from it and take the mean to get a good estimate of the parameters.
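To make the idea concrete, here is a minimal sketch for a single unknown mean with known variance: it evaluates likelihood times prior on a grid of parameter values, normalizes on that grid, and then samples from the resulting posterior and takes the mean. The simulated data, the grid and the vague normal prior are all illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=1.0, scale=1.0, size=50)  # simulated data; known sigma = 1

# Grid of candidate values for the unknown mean.
theta_grid = np.linspace(-3, 3, 2001)

# Unnormalized log posterior: log likelihood plus a vague N(0, 10^2) log prior (both illustrative).
log_lik = np.array([np.sum(stats.norm.logpdf(data, loc=t, scale=1.0)) for t in theta_grid])
log_prior = stats.norm.logpdf(theta_grid, loc=0.0, scale=10.0)
log_post = log_lik + log_prior

# Normalize on the grid, then draw from the discretized posterior and take the mean.
post = np.exp(log_post - log_post.max())
post /= post.sum()
draws = rng.choice(theta_grid, size=5000, p=post)
print(draws.mean())  # close to the sample mean of the data
```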
Model estimation
In this article we will only discuss how to estimate the parameters of a linear regression model using Bayesian statistical inference. Let us consider the following model:
$$y = X \beta + \varepsilon $$
where $\varepsilon \sim N(0, \sigma^2 I_n)$ with $\sigma^2$ known, to make the calculations easier. We now assume a normal distribution as a prior for $\beta$, so we have $\beta \sim N(\underline{\beta}, \sigma^2 \underline{\Sigma}^{-1})$. Note that we denote prior parameters with an underline. We now write down the likelihood:
$$L(\beta) = (2\pi \sigma^2)^{-n/2} e^{- \frac{1}{2 \sigma^2} (y - X \beta)^T (y - X \beta)}$$
We then find the following posterior distribution for $\beta$:
$$ P(\beta | D) \propto L(\beta) P(\beta) \propto \exp\bigg(- \frac{1}{2 \sigma^2} \bigg( (y - X \beta)^T (y - X \beta) + (\beta - \underline{\beta})^T \underline{\Sigma} (\beta - \underline{\beta})\bigg)\bigg) $$
$$ \propto \exp\bigg(- \frac{1}{2 \sigma^2} \left(\beta - (X^T X + \underline{\Sigma})^{-1}(X^T y + \underline{\Sigma} \underline{\beta})\right)^T (X^T X + \underline{\Sigma}) \left(\beta - (X^T X + \underline{\Sigma})^{-1}(X^T y + \underline{\Sigma} \underline{\beta})\right)\bigg)$$
From this we see that the posterior distribution is a normal distribution with mean $(X^T X + \underline{\Sigma})^{-1}(X^T y + \underline{\Sigma} \underline{\beta})$ and covariance matrix $\sigma^2 (X^T X + \underline{\Sigma})^{-1}$, which can be read off from the above equation. We can sample from this posterior distribution to get a set of draws for $\beta$, and taking the mean of these draws gives us a good estimate of $\beta$ in our model. From the spread of the draws we can also say something about the accuracy of the estimate.
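To illustrate the derivation above, here is a minimal sketch that samples from this closed-form posterior using numpy. The simulated data, the true coefficients and the prior hyperparameters $\underline{\beta}$ and $\underline{\Sigma}$ used here are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data for illustration: known sigma^2, true beta = (1, 2).
n, sigma2 = 200, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Prior hyperparameters (illustrative): beta ~ N(beta_0, sigma^2 * inv(Sigma_0)).
beta_0 = np.zeros(2)
Sigma_0 = 0.01 * np.eye(2)  # small prior precision = vague prior

# Posterior: N(m, sigma^2 * inv(A)) with A = X'X + Sigma_0 and m = inv(A) (X'y + Sigma_0 beta_0).
A = X.T @ X + Sigma_0
m = np.linalg.solve(A, X.T @ y + Sigma_0 @ beta_0)
cov = sigma2 * np.linalg.inv(A)

# Sample from the posterior and summarize the draws.
draws = rng.multivariate_normal(mean=m, cov=cov, size=10_000)
print("posterior mean:", draws.mean(axis=0))
print("posterior std:", draws.std(axis=0))
```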