Mixture model
Mixture models are probabilistic latent variable models that incorporate the assumption that subpopulations exist within the overall population. As such, they model the data distribution as a mixture of multiple distributions (a mixture distribution), where each contributing distribution models a particular subpopulation. Mixture models are LVMs in that the presence of subpopulations is latent; there is no explicit information in the data about how each point should be associated with any particular subpopulation. The goal is to make inferences about the properties of the (hypothesized) subpopulations using only observations of the overall population, which in turn yields a model of the overall distribution.
Mixture Distribution
A mixture distribution is the probability distribution (of a random variable) formed from the weighted combination (“mixture”) of two or more component distributions. It is important to note that this is not the same as the combination/sum of two or more random variables; that is instead known as a convolution and is a different scenario entirely (i.e. the familiar setting of finding the density of a random variable Z = X + Y).

To elaborate, when adding random variables, you can find the density by asking “how likely is it that Z = X + Y = z?” for some particular value z. One has to consider the many combinations of values that X and Y could jointly take on to sum to z, which typically involves integrating over their joint density to accumulate the likelihoods of all the satisfying scenarios. Contrast this with the case of the mixture density, where the likelihood we assign to the event Z = z is simply a weighted sum of the component likelihoods at z. Here the value that Z takes on has nothing to do with the sum of multiple underlying variables; we are only stating that its distribution can be described as a weighted sum of component distributions. There are no cases to enumerate over how underlying variables combine to produce z, because those variables do not exist.

You can think about this in terms of modeling the distribution of news articles: each article has a latent topic that we assume it was sampled from. All of the latent topic distributions can be combined via a weighted aggregation to form the distribution of news articles as a whole, yet no sampled article is ever treated as a “sum” of topics or of any other variables. The variable we’re trying to model is not a sum of other variables; it simply has a distribution that can be written as the weighted sum of multiple individual distributions.
Formally, we have a set of density functions $p_1(x), \dots, p_K(x)$ and weights $w_1, \dots, w_K$ such that $w_k \geq 0$ and $\sum_{k=1}^{K} w_k = 1$. The mixture density of these components is then defined as

$$p(x) = \sum_{k=1}^{K} w_k \, p_k(x).$$
This is nothing more than a simple weighted aggregation of the component densities.
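To make the weighted aggregation concrete, here is a minimal sketch in Python; the two Gaussian components and their weights are illustrative assumptions, not anything prescribed by the definition:

```python
import numpy as np
from scipy.stats import norm

# Illustrative two-component mixture: weights w_k >= 0 that sum to one,
# and a density p_k for each component
weights = np.array([0.3, 0.7])
components = [norm(loc=-2.0, scale=1.0),   # p_1
              norm(loc=3.0, scale=0.5)]    # p_2

def mixture_pdf(x):
    """Weighted sum of the component densities evaluated at x."""
    return sum(w * comp.pdf(x) for w, comp in zip(weights, components))

print(mixture_pdf(0.0))  # density of the mixture at x = 0
```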
Sampling
To sample from a mixture distribution, one simply selects a particular component according to the initial (prior) weights and then draws a sample from that component distribution. This is precisely how the random variable that we are modeling is defined. Note that no sums of samples are taken to arrive at the final sample; only one specific distribution is involved in generating any particular point.
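A minimal sketch of this two-step procedure, again assuming two illustrative Gaussian components:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-component Gaussian mixture (illustrative parameters)
weights = np.array([0.3, 0.7])   # prior component weights
means   = np.array([-2.0, 3.0])
stds    = np.array([1.0, 0.5])

def sample_mixture(n):
    # Step 1: sample a component index for each draw from Categorical(weights)
    ks = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw each point from its chosen component alone;
    # no sums across components are ever taken
    return rng.normal(means[ks], stds[ks])

print(sample_mixture(5))
```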
General Mixture Model Formulation
- $N$ observed random variables $x_1, \dots, x_N$, each (assumed to be) distributed according to a mixture of $K$ components, with the components belonging to the same parametric family
- $N$ latent random variables $z_1, \dots, z_N$ identifying the mixture component to which each observation belongs. Each latent variable is distributed according to a $K$-dimensional categorical distribution
- $K$ mixture weights $w_1, \dots, w_K$ representing the prior probabilities of any particular component having been responsible for generating a data point. These values must sum to one.
- $K$ sets of parameters $\theta_1, \dots, \theta_K$, each specifying the parameters of one of the individual mixture components
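Putting these pieces together, the generative story for a single observation, written with the symbols from the list above, is:

```latex
% Generative process for one observation x_i under the general mixture model
z_i \sim \mathrm{Categorical}(w_1, \dots, w_K), \qquad
x_i \mid z_i = k \;\sim\; p(x \mid \theta_k)

% Marginalizing out the latent assignment recovers the mixture density
p(x_i) = \sum_{k=1}^{K} p(z_i = k)\, p(x_i \mid z_i = k)
       = \sum_{k=1}^{K} w_k\, p(x_i \mid \theta_k)
```

Summing over all possible values of the latent assignment recovers exactly the mixture density defined earlier.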
Gaussian Mixture Model
A specific and commonly used mixture model is the Gaussian Mixture Model (GMM). The GMM is an LVM that assumes points in the dataset are sampled from one of $K$ Gaussian distributions. Whether or not this is perfectly accurate (e.g. the data are sampled from more or fewer than $K$ distributions, the data are not Gaussian, etc.), it is often practically useful when the model is approximately correct. Here the identity of the distribution that generated each point is a latent variable under the model; it is an unobserved factor that must be inferred from the data, where each point is modeled as the pair $(x_i, z_i)$.
Here $x_i$ is the observed point in Euclidean space and $z_i \in \{1, \dots, K\}$ is the latent variable describing the implicit distribution from which $x_i$ is assumed to have been sampled.
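As a sketch of how this looks in practice, scikit-learn’s GaussianMixture can fit the $K$ Gaussians and infer an assignment for each point; the synthetic data and the choice $K = 2$ below are assumptions for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic data: two latent Gaussian subpopulations in R^2 (illustrative)
X = np.vstack([
    rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3.0, 3.0], scale=1.0, size=(150, 2)),
])

# Fit a 2-component GMM; the assignments z_i are never observed directly
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.weights_)   # estimated mixture weights w_k
print(gmm.means_)     # estimated component means
z = gmm.predict(X)    # inferred latent assignments z_i for each x_i
```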