# Bayesian inference

Bayesian inference is a type of statistical inference that uses Bayes rule to update the probability of a hypothesis as observations are made and more information becomes available.

# Bayes Rule

There are useful ways to interpret Bayes rule outside of its treatment as an equation for computing conditional probabilities (which one might consider a frequentist interpretation). Under the Bayesian interpretation, probabilities are considered “degrees of belief”.

$P(H|E) = \frac{P(E|H)P(H)}{P(E)}$

- $H$ represents a hypothesis
- $P(H)$ is the prior probability of the hypothesis $H$ (being correct) before observing the data $E$
- $E$ represents newly observed data, or evidence, information not included in the prior
- $P(H|E)$ is the posterior probability, the probability of $H$ given the newly observed data $E$
- $P(E|H)$ is the likelihood, the probability of observing the data $E$ under the hypothesis $H$. With $H$ fixed it is a probability distribution over $E$; with $E$ fixed and $H$ varying it is the likelihood function
- $P(E)$ is the marginal likelihood. This factor is the same for every hypothesis under consideration, so it acts only as a normalizing constant
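
To make the roles of these terms concrete, here is a minimal sketch of a discrete Bayesian update. The two hypotheses (a fair coin versus a coin biased towards heads), their heads probabilities, and the flip sequence are all made-up assumptions chosen for illustration.

```python
# Hypothetical setup: is the coin fair, or biased towards heads?
priors = {"fair": 0.5, "biased": 0.5}    # P(H), an assumed 50/50 prior
p_heads = {"fair": 0.5, "biased": 0.8}   # P(heads | H), assumed values


def update(beliefs, observation):
    """Return the posterior P(H | E) after one observed flip E in {'H', 'T'}."""
    # Likelihood P(E | H) of this observation under each hypothesis.
    likelihood = {
        h: p_heads[h] if observation == "H" else 1.0 - p_heads[h]
        for h in beliefs
    }
    # Marginal likelihood P(E): a single number shared by all hypotheses.
    evidence = sum(likelihood[h] * beliefs[h] for h in beliefs)
    return {h: likelihood[h] * beliefs[h] / evidence for h in beliefs}


posterior = dict(priors)
for flip in "HHTH":
    # The posterior after one flip serves as the prior for the next flip.
    posterior = update(posterior, flip)
    print(flip, posterior)
```

Note that `evidence` is computed once per observation and divides every hypothesis equally, which is why it only rescales the beliefs so that they sum to one.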

# Formal Construction

- $x$: a data point, can be a vector
- $\theta$: parameter of the data point’s distribution $x \sim p(x|\theta)$
- $\alpha$: hyperparameter of the parameter distribution $\theta \sim p(\theta|\alpha)$, can be a vector
- $X$: the data, a set of $n$ observed data points
- $\tilde{x}$: a new data point whose distribution we want to predict

It is common to see Bayes rule written using these variables in place of the $H$ and $E$ seen above. That is,

$p(\theta|X) = \frac{p(X|\theta)p(\theta)}{p(X)}$

Note here that we use a lower case p instead of P. This notational difference is used to indicate that instead of talking about strict probabilities P, we are now working with densities p (also commonly written as a function f).

Here we have the following components, now representing densities in the context of some model $\mathcal{M}$ (which we do not state explicitly in the distributions):

## $p(\theta \vert X)$ :: Posterior Distribution

The posterior distribution is a density over the possible values of the parameter $\theta$ that takes both the prior and the observed data into account. This distribution represents what we’ve learned from the data about which model parameters are likely.

## $p(X \vert \theta)$ :: Likelihood

This is the probability distribution of the data under the model with the given parameters $\theta$. This value/function is typically interpreted in one of two ways:

- As a function of the data, $X$: when $\theta$ is fixed, this is a parametrized conditional density over the possible values of the data. For any particular set of observations we can compute the probability of having observed those data under the (fixed) model parameters.
- As a function of the parameter, $\theta$: when the data $X$ are fixed, this is typically termed the likelihood function. Here $\theta$ is the argument that varies while the observations are held fixed. Viewed this way the function is not a probability density over $\theta$ (nothing guarantees that it integrates to one when the conditioning value is the thing being varied), but it is useful for comparing how probable the observed data are under different model parameters.
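
The difference between the two views can be checked numerically. The sketch below assumes a binomial model with $n = 10$ trials (an illustrative choice, not something fixed by the discussion above): summing over the data with $\theta$ fixed gives one, while integrating over $\theta$ with the data fixed does not.

```python
from math import comb


def binom_pmf(k, n, theta):
    """p(k | theta): probability of k successes in n Bernoulli(theta) trials."""
    return comb(n, k) * theta**k * (1.0 - theta) ** (n - k)


n = 10

# View 1: theta fixed, data varying. A probability distribution over k.
theta_fixed = 0.3
total_over_data = sum(binom_pmf(k, n, theta_fixed) for k in range(n + 1))

# View 2: data fixed, theta varying. This is the likelihood function.
k_observed = 7
grid_size = 10_000
width = 1.0 / grid_size
# Midpoint rule for the integral of the likelihood over theta in [0, 1].
area_over_theta = sum(
    binom_pmf(k_observed, n, (i + 0.5) * width) * width
    for i in range(grid_size)
)

print(total_over_data)  # sums to 1 up to floating point error
print(area_over_theta)  # close to 1 / (n + 1), not 1
```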

## $p(\theta)$ :: Prior Distribution

The prior distribution is a distribution over the model parameters before any observations are made. This distribution encodes initial subjective beliefs about the parameters, and is one of the defining factors of Bayesian inference.

When the prior is approximately accurate, it can help a model remain useful even when there is a small amount of data available. In such a case the prior will be able to influence the posterior in a meaningful way. However, as more and more data become available, the prior becomes less and less meaningful. The data begin to reveal reality more clearly, and the prior has less influence on shaping the posterior.
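
A small sketch of this effect, assuming a Bernoulli likelihood with a conjugate Beta prior (so the posterior is again a Beta distribution with a closed-form mean). The two priors and the fixed 30% success rate in the data are hypothetical values picked to make the trend visible.

```python
def posterior_mean(a, b, successes, n):
    """Mean of the Beta(a + successes, b + n - successes) posterior."""
    return (a + successes) / (a + b + n)


# Two hypothetical priors: one vague, one strongly favouring large theta.
vague_prior = (1.0, 1.0)
strong_prior = (30.0, 10.0)

gaps = []
for n in (10, 100, 1_000, 10_000):
    successes = 3 * n // 10  # assumed data: 30% successes at every sample size
    m_vague = posterior_mean(*vague_prior, successes, n)
    m_strong = posterior_mean(*strong_prior, successes, n)
    gaps.append(abs(m_vague - m_strong))
    print(n, round(m_vague, 4), round(m_strong, 4))
```

With little data the two posterior means disagree substantially; as $n$ grows, both approach the observed success rate and the gap between them shrinks.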

## $p(X)$ :: Marginal Likelihood

This is the marginal likelihood, or the prior predictive distribution, of the data $X$. This naming convention is due to the fact that its value is derived from *marginalizing over* the parameter $\theta$:

$p(X) = \int{p(X|\theta)p(\theta)d\theta}$

In other words, it is the likelihood function averaged over the parameters with respect to their prior. Intuitively, for some fixed data this value represents the probability of observing those data under the model at hand, before committing to any particular parameter value. Similarly, from the prior predictive distribution perspective it represents what we would expect $X$ to look like before actually having made any observations.
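
As a sanity check on the integral, the sketch below evaluates it numerically for an assumed binomial likelihood with a Beta$(2, 2)$ prior and compares the result with the known closed form for that conjugate pair. The model, the hyperparameters, and the data ($7$ successes in $10$ trials) are illustrative assumptions.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_pdf(theta, a, b):
    """Density of the Beta(a, b) prior p(theta)."""
    return theta ** (a - 1) * (1.0 - theta) ** (b - 1) / exp(log_beta(a, b))


def marginal_numeric(k, n, a, b, grid_size=20_000):
    """Midpoint-rule approximation of the integral of p(X | theta) p(theta)."""
    width = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        theta = (i + 0.5) * width
        likelihood = comb(n, k) * theta**k * (1.0 - theta) ** (n - k)
        total += likelihood * beta_pdf(theta, a, b) * width
    return total


def marginal_exact(k, n, a, b):
    """Closed form for the Beta-binomial marginal likelihood."""
    return comb(n, k) * exp(log_beta(a + k, b + n - k) - log_beta(a, b))


k, n, a, b = 7, 10, 2.0, 2.0
numeric = marginal_numeric(k, n, a, b)
exact = marginal_exact(k, n, a, b)
print(numeric, exact)
```

Summing `marginal_exact` over every possible $k$ gives one, which matches the reading of $p(X)$ as a predictive distribution over data sets.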

One interesting question here is how the model affects this value. If the value does not depend on any particular parameters $\theta$, is it the same under all models? The answer is no. Often (as has been the case above) the model context is implicit in the probabilistic statements being made. Making this dependence explicit, the marginal likelihood for a model $\mathcal{M}$ is

$p(X | \mathcal{M}) = \int{p(X|\mathcal{M},\theta)p(\theta|\mathcal{M})d\theta}$

The conditioning on the model identity here is important, as it describes the context from which our priors and likelihood come. Under different models, these distributions will be different, and writing the model identity explicitly in the conditioning shows why this is the case.

## Remarks

It’s important to keep the goal of Bayesian inference in mind when reasoning about all these components and their definitions. At the end of the day, Bayesian inference is a method of inference, *inferring* things about the way a particular process works and can be explained. As a result, the goal is to be able to describe a framework that allows one to use observations from the world to inform a model of how the world works (here the “world” can be any process). We typically phrase this learning scenario in terms of some model and its parameters. Loosely speaking, the model represents some structural relation between internal parameters and the real world.

[One might consider the “model” to incorporate global structure about a process and define its overall capacity, while tuning its parameters allows for local control. This global and local trade-off (i.e. how much control the high level model is willing to give to its parameters) can change across models, and is indicative of its capacity. Giving more influence to the parameters corresponds to higher model capacity and flexibility, as fewer structural assumptions are made by the high level model itself. However, the trade-off here is that the model can be quite difficult to train, as the parameter space is large (see learning theory for what the size of the parameter space says about generalization guarantees). For a model with less flexibility, stronger assumptions are made initially, and finding parameters that work well enough is a much easier task.

So which model to choose then, a flexible one or one that’s easy to train? This is one of the core problems in machine learning, more formally known as the bias-variance trade-off. When you increase the capacity of a model you reduce the inherent structural *bias* present within the model, but you simultaneously increase the *variance* (or stochasticity of sorts) of the model. When you decrease the capacity of the model, you take on a less noisy view of the world where stronger assumptions are made, thus reducing the variance of the model. However, by making stronger assumptions you are inherently restricting yourself to fewer ways to think about the world, increasing the chance you are far off (i.e. the bias) from reality.]

When our prior is uniform, the posterior is directly proportional to the likelihood, so the likelihood alone determines the shape of the posterior. In this case there is no prior information to use; every value of the parameters is considered equally likely beforehand.

$\text{posterior} \propto \text{likelihood} \times \text{prior}$
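
The proportionality can be turned directly into a simple grid approximation: evaluate likelihood times prior on a grid of $\theta$ values and normalize. The sketch below assumes Bernoulli data ($7$ successes in $10$ trials) and compares a uniform prior with a hypothetical Beta$(2, 2)$-shaped prior; the function and variable names are illustrative.

```python
def grid_posterior(likelihood, prior, grid):
    """Normalize likelihood * prior over a grid of theta values."""
    unnormalized = [likelihood(t) * prior(t) for t in grid]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Assumed data: 7 successes in 10 Bernoulli trials.
k, n = 7, 10
grid = [(i + 0.5) / 200 for i in range(200)]


def likelihood(theta):
    return theta**k * (1.0 - theta) ** (n - k)


# Uniform prior versus a prior shaped like Beta(2, 2), peaked at 0.5.
uniform = grid_posterior(likelihood, lambda t: 1.0, grid)
informative = grid_posterior(likelihood, lambda t: t * (1.0 - t), grid)

# Under the uniform prior the posterior is just the normalized likelihood.
lik_total = sum(likelihood(t) for t in grid)
normalized_likelihood = [likelihood(t) / lik_total for t in grid]

mode_uniform = grid[uniform.index(max(uniform))]
mode_informative = grid[informative.index(max(informative))]
print(mode_uniform, mode_informative)
```

The values returned here are probability masses over grid cells rather than densities; dividing by the cell width would recover an approximate density.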

# Posterior Predictive Distribution

The posterior predictive distribution is the distribution of a new data point marginalized over the posterior:

$p(\tilde{x} | X) = \int{p(\tilde{x}|\theta,X) p(\theta|X)d\theta}$

When sampling a new data point from a distribution (the one from which the data $X$ came), the sample will depend on a particular value of the parameters $\theta$; $p(\tilde{x} | \theta)$ is the sampling distribution. (Given $\theta$, the new point is typically assumed independent of the observed data, so $p(\tilde{x}|\theta,X) = p(\tilde{x}|\theta)$.) However, this distribution assumes a fixed parameter $\theta$ for the model, and thus to realize the distribution over $\tilde{x}$ we must choose a point estimate for $\theta$ to condition on. Choosing any single $\theta$, however, ignores our uncertainty about the value of $\theta$ under the data, which is captured by the posterior distribution. This is the goal of the posterior predictive distribution: to incorporate the uncertainty of the parameter under the data (captured by the posterior) into the distribution of a new data point sampled from the model. This distribution will generally be wider than a predictive distribution where only a point estimate of $\theta$ is used.
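
The widening can be seen in a case where everything is available in closed form. The sketch below assumes a Normal likelihood with known standard deviation and a Normal prior on the mean $\theta$; the observations and hyperparameters are made up. It also simulates the integral by drawing $\theta$ from the posterior and then $\tilde{x}$ given $\theta$.

```python
import random

# Assumed model: x ~ Normal(theta, sigma^2) with known sigma,
# and prior theta ~ Normal(mu0, tau0^2). All numbers are illustrative.
sigma = 2.0
mu0, tau0 = 0.0, 10.0
X = [4.1, 5.3, 3.8, 6.0, 4.6]  # made-up observations

n = len(X)
# Conjugate update: the posterior p(theta | X) is Normal(mu_n, tau_n_sq).
tau_n_sq = 1.0 / (1.0 / tau0**2 + n / sigma**2)
mu_n = tau_n_sq * (mu0 / tau0**2 + sum(X) / sigma**2)

# Plug-in predictive: condition on the single point estimate theta = mu_n.
plug_in_var = sigma**2
# Posterior predictive: integrating over theta adds the posterior variance.
predictive_var = sigma**2 + tau_n_sq

# Monte Carlo version of the same integral.
rng = random.Random(0)
draws = []
for _ in range(50_000):
    theta = rng.gauss(mu_n, tau_n_sq**0.5)  # theta ~ p(theta | X)
    draws.append(rng.gauss(theta, sigma))   # x_new ~ p(x_new | theta)
mean_draw = sum(draws) / len(draws)
simulated_var = sum((d - mean_draw) ** 2 for d in draws) / len(draws)

print(plug_in_var, predictive_var, simulated_var)
```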

# Prior Predictive Distribution

The prior predictive distribution is the distribution of a new data point marginalized over the prior:

$p(\tilde{x}) = \int{p(\tilde{x}|\theta) p(\theta)d\theta}$

Notice that this is exactly the same as the posterior predictive distribution, just without the conditional factor of $X$. The marginalization is still over $\theta$, but now we are not incorporating the observed data into the model. Note also that this looks very similar to the marginal likelihood above, just without the fixed observations $X$ and in its place a new sample. This is why the marginal likelihood of the data $X$ is also sometimes referred to as the prior predictive distribution of the data.
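
Sampling from the prior predictive distribution follows the integral directly: draw $\theta$ from the prior, then draw a data point given that $\theta$. The sketch below assumes a Bernoulli likelihood with a Beta$(2, 5)$ prior, for which the probability of a success is known to equal the prior mean; the hyperparameters are hypothetical.

```python
import random

rng = random.Random(0)
a, b = 2.0, 5.0  # hypothetical Beta prior hyperparameters


def sample_prior_predictive(rng):
    """Draw one data point without using any observed data."""
    theta = rng.betavariate(a, b)            # theta ~ p(theta)
    return 1 if rng.random() < theta else 0  # x ~ p(x | theta)


draws = [sample_prior_predictive(rng) for _ in range(100_000)]
estimate = sum(draws) / len(draws)
exact = a / (a + b)  # closed form for this conjugate pair
print(estimate, exact)
```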

# Model Selection & Comparison

As alluded to above, although appearing independent of any specific model parameters, the marginal likelihood is typically *not* the same across models. The structural assumptions encoded in the likelihood and prior remain present in the estimate of the probability of having observed the data under the *model*. So the question is: what does it mean to compare marginal likelihoods across models? Who’s to say any model has reasonable ability to objectively evaluate the likelihood of data under itself?
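
To see that the marginal likelihood really does differ across models, the sketch below evaluates it for the same data under two assumed models. Both use a Bernoulli likelihood; they differ only in the prior on $\theta$ (uniform versus tightly concentrated near $0.5$). The models, data, and names are illustrative assumptions.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def marginal_likelihood(k, n, a, b):
    """p(X | M) for one specific sequence of n Bernoulli outcomes with
    k successes, under a Beta(a, b) prior on theta."""
    return exp(log_beta(a + k, b + n - k) - log_beta(a, b))


k, n = 9, 10  # assumed data: 9 successes in 10 trials

ml_flexible = marginal_likelihood(k, n, 1.0, 1.0)  # M1: uniform prior
ml_rigid = marginal_likelihood(k, n, 50.0, 50.0)   # M2: theta believed near 0.5

# The ratio of marginal likelihoods is commonly called the Bayes factor.
ratio = ml_flexible / ml_rigid
print(ml_flexible, ml_rigid, ratio)
```

For these lopsided data the model that insists on $\theta \approx 0.5$ assigns the observations a lower marginal likelihood, even though neither number refers to a specific value of $\theta$.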

## Model

It might be helpful to define a model. A **model** is a parametric family of probability distributions, any one of which could explain our data $X$.

# Maximum a posteriori Estimation

Maximum a posteriori estimation (MAP) is a method of **Bayesian inference** for estimating the parameters of a statistical model under some observed data. This method generalizes maximum likelihood estimation to incorporate prior information in the form of a **prior distribution** over the parameter in question. MLE is equivalent to the MAP estimate when a uniform prior is used; no prior assumptions about the value of the parameter are made.

MAP estimation differs from MLE in that we model the parameter in question as a random variable, taking on a value according to beliefs encoded in the prior distribution. The goal is to compute a posterior distribution $p(\theta|X)$ to find out how the observed data change our prior beliefs about the parameter in question. This posterior distribution incorporates the likelihood $p(X|\theta)$ and the prior $p(\theta)$ by means of Bayes rule:

$p(\theta | X) = \frac{p(X|\theta)p(\theta)}{p(X)} = \frac{p(X|\theta)p(\theta)}{\int_{\Theta}{p(X|\theta)p(\theta)d\theta}}$

MAP estimation then obtains a point estimate for $\theta$ by finding the mode of the posterior distribution $p(\theta|X)$:

$\hat{\theta} = {\arg\max}_{\theta \in \Theta} p(\theta | X) = {\arg\max}_{\theta \in \Theta}\frac{p(X|\theta)p(\theta)}{p(X)} = {\arg\max}_{\theta \in \Theta}p(X|\theta)p(\theta)$

where the denominator can be removed as it does not depend on $\theta$. The MAP procedure (for tractable cases) follows the same outline as described in MLE except the prior is included in steps one and two.
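
A minimal sketch of MAP estimation, assuming Bernoulli data with a Beta$(a, b)$ prior. Rather than the analytic procedure, it uses a brute-force grid search over the log of likelihood times prior, which is an illustrative shortcut and not the outline referred to above. The data and priors are made up.

```python
from math import log


def log_unnormalized_posterior(theta, k, n, a, b):
    """log p(X | theta) + log p(theta), dropping terms constant in theta."""
    log_likelihood = k * log(theta) + (n - k) * log(1.0 - theta)
    log_prior = (a - 1.0) * log(theta) + (b - 1.0) * log(1.0 - theta)
    return log_likelihood + log_prior


def map_estimate(k, n, a, b, grid_size=10_000):
    """Grid search for the posterior mode on the open interval (0, 1)."""
    grid = [i / grid_size for i in range(1, grid_size)]
    return max(grid, key=lambda t: log_unnormalized_posterior(t, k, n, a, b))


k, n = 7, 10  # assumed data: 7 successes in 10 trials
theta_mle = k / n
theta_map_uniform = map_estimate(k, n, 1.0, 1.0)   # Beta(1, 1) is uniform
theta_map_informed = map_estimate(k, n, 5.0, 5.0)  # hypothetical prior centred on 0.5

print(theta_mle, theta_map_uniform, theta_map_informed)
```

With the uniform prior the grid search lands on the MLE, while the informative prior pulls the estimate towards $0.5$; for this conjugate pair the mode is known in closed form as $(a + k - 1)/(a + b + n - 2)$.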