The Dirichlet process provides an elegant approach to modelling group assignments and clustering effects. Here a complicated distribution is modelled as a mixture of simpler distributions, and the use of countably infinite mixtures obviates the need to determine an "adequate" number of components in advance.
Standard clustering algorithms like k-means or finite Gaussian mixture models assume a fixed number of components which has to be chosen a priori. This is problematic for at least two reasons: (a) it relies on the underlying assumption that there is a single "correct" number of components to partition the data into; (b) it is extremely difficult to formulate a notion of the usefulness of a partition without strong assumptions about the nature of the problem.
Enter non-parametric Bayes.
In traditional Bayesian statistics model parameters are modelled as random variables. As the true value of the parameter is unknown, this uncertainty is expressed as randomness. For a parameter $\theta$ taking values in a parameter space $\Theta$ we make the modelling assumption that $\theta$ is distributed according to a specific distribution $Q$ on $\Theta$. The distribution $Q$ is called the prior distribution of the model parameter $\theta$. A Bayesian model therefore consists of a family of models $(P_\theta)_{\theta \in \Theta}$ (the observation models) and a prior distribution $Q$. Data is generated in two stages:

$$\theta \sim Q, \qquad x_1, \dots, x_n \mid \theta \sim P_\theta \quad \text{i.i.d.}$$
The objective is then to determine the posterior distribution of $\theta$ given some data $x_1, \dots, x_n$,

$$Q(\theta \mid x_1, \dots, x_n) \propto \left( \prod_{i=1}^n p(x_i \mid \theta) \right) Q(\theta).$$
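As a concrete illustration of the prior-to-posterior update, here is a minimal sketch of the simplest conjugate case, a Beta prior on the success probability of Bernoulli observations (the model and parameter names are illustrative, not from the text above):

```python
# Minimal sketch, assuming a Beta-Bernoulli conjugate model: the prior
# Beta(a, b) on the success probability theta combines with 0/1 data into
# a closed-form Beta posterior -- the posterior is a full distribution,
# not a single best guess.

def beta_bernoulli_posterior(a, b, data):
    """Return the posterior Beta parameters after observing 0/1 outcomes."""
    successes = sum(data)
    failures = len(data) - successes
    return a + successes, b + failures

# Uniform prior Beta(1, 1); observe 7 successes in 10 trials.
a_post, b_post = beta_bernoulli_posterior(1, 1, [1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
posterior_mean = a_post / (a_post + b_post)  # (1 + 7) / (2 + 10) = 2/3
```

The posterior mean $2/3$ sits between the prior mean $1/2$ and the empirical frequency $7/10$, which is exactly the compromise the Bayesian update encodes.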
Note that the improved model for the distribution of the parameter after observing the data still contains uncertainty; in the Bayesian framework this uncertainty is expressed by the posterior being a distribution rather than a single best guess based on the data.
Non-parametric Bayesian models are Bayesian models where the parameter space $\Theta$ is infinite-dimensional. Here the prior probability distribution $Q$ is defined on an infinite-dimensional space $\Theta$. Such a distribution on an infinite-dimensional space is a stochastic process with paths in $\Theta$. These are typically a little harder to define than distributions on $\mathbb{R}^d$.
Clustering and mixture models
In clustering one makes the basic assumption that the data can be partitioned in a natural way into clusters, where each data point $x_i$ belongs to exactly one cluster. This is often expressed by introducing a helper variable $z_i$ which indicates which cluster the $i$th data point belongs to. As these variables are unobserved in the dataset they are referred to as latent variables. The distribution characterizing cluster $k$ can then be described as

$$P(x \mid z = k) = P_k(x).$$
With $c_k = P(z = k)$, the probability that a newly generated data point is generated by cluster $k$, the distribution of the random variable $x$ can be expressed in the form

$$P(x) = \sum_{k \in K} c_k \, P_k(x).$$
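The two-stage generative view above can be sketched directly in code: first draw the latent label $z$ according to the weights $c_k$, then draw $x$ from the corresponding cluster distribution. The specific weights and Gaussian components below are made up for illustration:

```python
import random

# Sketch of the two-stage generative process of a finite mixture.
# The mixing weights c_k and component means are illustrative choices.
weights = [0.5, 0.3, 0.2]   # c_k, must sum to 1
means = [-4.0, 0.0, 5.0]    # one unit-variance Gaussian per cluster

def sample_mixture(rng):
    # Stage 1: draw the latent cluster label z with probabilities c_k.
    z = rng.choices(range(len(weights)), weights=weights)[0]
    # Stage 2: draw the observation x from cluster z's distribution.
    x = rng.gauss(means[z], 1.0)
    return z, x

rng = random.Random(0)
draws = [sample_mixture(rng) for _ in range(1000)]
```

Histogramming the `x` values of `draws` would show three bumps whose relative masses match the weights `[0.5, 0.3, 0.2]`.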
Such models are called mixture models. If the index set $K$ is finite, the mixture is called a finite mixture, otherwise an infinite mixture.
The events $\{z = k\}$ are clearly mutually exclusive, therefore $\sum_{k \in K} c_k = 1$ with $c_k \ge 0$. The set of all points $(c_1, \dots, c_K)$ satisfying these conditions is called the $(K-1)$-dimensional simplex.
Under the assumption that all $P_k$ belong to a parametric family $\{p(\cdot \mid \theta) : \theta \in \Theta\}$ with a conditional probability density $p(x \mid \theta)$, $x$ has density

$$p(x) = \sum_{k \in K} c_k \, p(x \mid \theta_k).$$
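Evaluating such a mixture density is just a weighted sum of component densities. A small sketch with Gaussian components (parameters chosen for illustration), including a crude numerical check that the result is itself a probability density:

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def mixture_density(x, weights, params):
    """p(x) = sum_k c_k * p(x | theta_k), params = list of (mean, std)."""
    return sum(c * normal_pdf(x, m, s) for c, (m, s) in zip(weights, params))

# Illustrative weights c_k and component parameters theta_k = (mean, std).
weights = [0.5, 0.3, 0.2]
params = [(-4.0, 1.0), (0.0, 1.0), (5.0, 1.0)]

# Sanity check: the mixture density should integrate to 1 (Riemann sum).
step = 0.01
total = sum(mixture_density(-15 + i * step, weights, params) * step
            for i in range(3001))
```

Because the weights sum to one and each component integrates to one, `total` comes out numerically very close to 1.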
This can also be written as a discrete probability measure $G$ on $\Theta$ in terms of atomic Dirac measures $\delta_{\theta_k}$,

$$G = \sum_{k \in K} c_k \, \delta_{\theta_k} \tag{1}$$
with $\sum_{k \in K} c_k = 1$. $G$ is called the mixing measure. With this we can write the mixture as

$$p(x \mid G) = \int_\Theta p(x \mid \theta) \, G(d\theta).$$
If $\mathcal{G}$ is a set of discrete probability measures on $\Theta$, then $\{p(\cdot \mid G) : G \in \mathcal{G}\}$ is a model in the sense discussed above, a mixture model. If $G$ is such that no more than a finite number of coefficients $c_k$ are non-zero, it is a finite mixture model.
A Bayesian mixture model is a mixture model where the random mixing measure $G$ is distributed according to some prior $Q$. The prior $Q$ is therefore the law of $G$. In order to generate an infinite mixture model we have to produce a random $G$, and that means two random sequences $(c_k)_{k \in \mathbb{N}}$ and $(\theta_k)_{k \in \mathbb{N}}$. By choosing a probability measure $G_0$ on $\Theta$ we sample the elements $\theta_k \sim G_0$ i.i.d.
In order to generate the sequence $(c_k)_{k \in \mathbb{N}}$ we use the stick-breaking construction: We start with a sequence $(\nu_k)_{k \in \mathbb{N}}$ of probability distributions on $[0, 1]$. We generate $c_1$ by sampling $c_1 = v_1 \sim \nu_1$. In order to ensure that $\sum_k c_k \le 1$, $c_2$ has to be drawn from a distribution over $[0, 1 - c_1]$. In general, after observing $c_1, \dots, c_{k-1}$, $c_k$ must be drawn from a distribution over $[0, 1 - \sum_{j=1}^{k-1} c_j]$. We achieve this by sampling $v_k$ from $\nu_k$ and scaling with the width of the remaining interval:

$$c_k = v_k \prod_{j=1}^{k-1} (1 - v_j).$$
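The stick-breaking procedure above translates almost literally into code. A minimal sketch, truncated to a fixed number of weights and using $\nu_k = \mathrm{Beta}(1, \alpha)$ as the break-fraction distribution (the choice that will define the Dirichlet process below; the truncation level and $\alpha$ are illustrative):

```python
import random

def stick_breaking(alpha, num_weights, rng):
    """Truncated stick-breaking: c_k = v_k * prod_{j<k} (1 - v_j)."""
    weights = []
    remaining = 1.0                      # length of stick still unbroken
    for _ in range(num_weights):
        v = rng.betavariate(1.0, alpha)  # fraction of the remainder to break off
        weights.append(remaining * v)
        remaining *= 1.0 - v             # shrink the stick
    return weights

rng = random.Random(42)
c = stick_breaking(alpha=2.0, num_weights=100, rng=rng)
```

The remaining stick length shrinks geometrically, so after 100 breaks the truncated weights already sum to 1 up to a negligible remainder; smaller `alpha` concentrates mass on the first few weights.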
The name stick-breaking construction stems from the fact that one may think of the interval $[0, 1]$ as a stick from which pieces of length $c_k$ are repeatedly broken off. The basic parameter distribution on $[0, 1]$ is the beta distribution $\mathrm{Beta}(a, b)$. The homogeneous random probability measure defined by choosing $\nu_k = \mathrm{Beta}(1, \alpha)$ for all $k$ is called the Dirichlet process.
Definition: Dirichlet Process
For $\alpha > 0$, $G_0$ a probability measure on $\Theta$ and $\mathrm{Beta}(1, \alpha)$ the beta distribution, the random discrete probability measure $G$ (as in (1)) generated by

$$v_k \sim \mathrm{Beta}(1, \alpha), \qquad c_k = v_k \prod_{j=1}^{k-1} (1 - v_j), \qquad \theta_k \sim G_0 \quad \text{i.i.d.}$$

is called a Dirichlet process $\mathrm{DP}(\alpha, G_0)$ with base measure $G_0$ and concentration $\alpha$.
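Putting the pieces together, a (truncated) draw from a Dirichlet process pairs stick-breaking weights with atoms sampled i.i.d. from the base measure. A minimal sketch, assuming a standard-normal base measure $G_0$ and an illustrative truncation level:

```python
import random

def sample_dp(alpha, num_atoms, rng):
    """Truncated draw from DP(alpha, G0) with G0 = N(0, 1) (illustrative)."""
    weights, atoms = [], []
    remaining = 1.0
    for _ in range(num_atoms):
        v = rng.betavariate(1.0, alpha)   # stick-breaking fraction
        weights.append(remaining * v)
        remaining *= 1.0 - v
        atoms.append(rng.gauss(0.0, 1.0)) # atom location theta_k ~ G0
    return weights, atoms

def draw_from_g(weights, atoms, n, rng):
    # G is discrete, so draws repeat atoms -- this is what induces the
    # clustering behaviour of Dirichlet process mixture models.
    return rng.choices(atoms, weights=weights, k=n)

rng = random.Random(7)
weights, atoms = sample_dp(alpha=1.0, num_atoms=200, rng=rng)
draws = draw_from_g(weights, atoms, 50, rng)
```

Even with 200 available atoms, the 50 draws hit only a handful of distinct values, because a few stick-breaking weights dominate; this is the Dirichlet process acting as a prior over clusterings with no fixed number of components.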