*If you don’t want to hear about all this and would rather go straight to an example, check out this JavaScript implementation of a Dirichlet process Gaussian mixture model used to cluster the McDonald’s menu based on nutritional data: live demo.*

The Dirichlet process provides a very interesting approach to understanding group assignments and to modelling clustering effects. Here a complicated distribution is modelled as a mixture of simpler distributions. The use of countably infinite mixtures obviates the need to determine an “adequate” number of components in advance.

Standard clustering algorithms like **k-means** or **finite Gaussian mixture models** assume a *fixed* number of components, which has to be chosen a priori. This is problematic for at least two reasons: (a) it relies on the underlying assumption that there is a single “correct” number of components to partition the data into, and (b) it is extremely difficult to formulate a notion of the usefulness of a partition without strong assumptions about the nature of the problem.

*Enter non-parametric Bayes.*

### Non-parametric Bayes

In traditional Bayesian statistics, model parameters are modelled as random variables. As the true value of the parameter is unknown, this uncertainty is expressed as randomness. For a parameter $\theta$ taking values in a parameter space $T$, we make the modelling assumption that $\theta$ is distributed according to a specific distribution $Q$. The distribution $Q$ is called the **prior distribution** of the model parameter $\theta$. A Bayesian model therefore consists of a family of models $\{P_\theta \mid \theta \in T\}$ parametrized by the parameter $\theta$ (the **observation models**) and a prior distribution $Q$. Data is generated in two stages

$$
\theta \sim Q, \qquad X_1, X_2, \ldots \mid \theta \overset{\text{iid}}{\sim} P_\theta \tag{1}
$$

The objective is then to determine the **posterior distribution** of $\theta$ given some data $x_1, \ldots, x_n$

$$
Q(\mathrm{d}\theta \mid x_1, \ldots, x_n) = \frac{\prod_{i=1}^n p(x_i \mid \theta)\, Q(\mathrm{d}\theta)}{\int_T \prod_{i=1}^n p(x_i \mid \theta')\, Q(\mathrm{d}\theta')} \tag{2}
$$

Note that the *improved* model for the distribution of the parameter *after* observing the data still contains uncertainty, and this uncertainty is expressed in the Bayesian framework by the posterior being a distribution rather than a single best guess based on the data.
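The two-stage generative process of (1) and the posterior update of (2) can be illustrated with the simplest conjugate pair: a Beta prior on the success probability of Bernoulli observations. This is a minimal sketch; the Beta/Bernoulli choice and all numbers here are illustrative, not part of the discussion above.

```python
import random

# Stage 1: draw the parameter theta from the prior Q = Beta(a, b).
a, b = 2.0, 2.0
random.seed(0)
theta = random.betavariate(a, b)

# Stage 2: draw observations x_1, ..., x_n i.i.d. from P_theta = Bernoulli(theta).
n = 100
data = [1 if random.random() < theta else 0 for _ in range(n)]

# For this conjugate pair the posterior (2) is available in closed form:
# theta | x_1, ..., x_n  ~  Beta(a + #successes, b + #failures).
heads = sum(data)
post_a, post_b = a + heads, b + (n - heads)

# The posterior is again a distribution, not a point estimate;
# its mean is one convenient summary.
post_mean = post_a / (post_a + post_b)
print(post_a, post_b, round(post_mean, 3))
```

For non-conjugate models the posterior has no closed form and must be approximated, e.g. by Markov chain Monte Carlo.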

Non-parametric Bayesian models are Bayesian models where the parameter space $T$ is infinite-dimensional. Here the prior probability distribution $Q$ is defined on the infinite-dimensional space $T$. Such a distribution on an infinite-dimensional space is a stochastic process with paths in $T$. These are typically a little harder to define than distributions on $\mathbb{R}^d$.

### Clustering and mixture models

In clustering one makes the basic assumption that the data can be partitioned in a natural way where each data point belongs to exactly one cluster $k$. This is often expressed by introducing a helper variable $Z_i$ which indicates which cluster the $i$-th data point belongs to. As these variables are *unobserved* in the dataset, they are referred to as **latent variables**. The distribution characterizing the $k$-th cluster can then be described as

$$
P(X_i \in A \mid Z_i = k) = P_k(A) \tag{3}
$$

With $c_k$ the probability that a newly generated data point is generated by cluster $k$, the distribution of the random variable $X$ can be expressed in the form

$$
P(X \in A) = \sum_k c_k \, P_k(A) \tag{4}
$$

Such models are called **mixture models**. If the index set of the clusters is finite, the mixture is called a **finite mixture**, otherwise an **infinite mixture**.
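Sampling from a mixture model follows the latent-variable description directly: first draw the cluster indicator $Z$, then draw the observation from that cluster's distribution. A minimal sketch for a finite Gaussian mixture, with weights and component parameters chosen purely for illustration:

```python
import random

# A finite Gaussian mixture: mixing weights c_k and cluster parameters.
weights = [0.5, 0.3, 0.2]                        # c_k >= 0, summing to 1
params = [(-3.0, 1.0), (0.0, 0.5), (4.0, 2.0)]   # (mean, std) per cluster

random.seed(1)

def sample_point():
    # Latent variable Z: pick a cluster with probability c_k ...
    z = random.choices(range(len(weights)), weights=weights)[0]
    # ... then draw the observation from that cluster's distribution P_k.
    mu, sigma = params[z]
    return z, random.gauss(mu, sigma)

samples = [sample_point() for _ in range(1000)]
```

In a real dataset only the second coordinate of each pair would be observed; the indicators `z` are exactly the latent variables that clustering tries to recover.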

The events $\{Z = k\}$ are clearly mutually exclusive, therefore $c_k \geq 0$ with $\sum_k c_k = 1$. The set of all points satisfying these conditions is called the $(K-1)$-dimensional **simplex**

$$
\Delta_{K-1} = \Big\{ (c_1, \ldots, c_K) \in \mathbb{R}^K \;\Big|\; c_k \geq 0, \; \sum_{k=1}^K c_k = 1 \Big\} \tag{5}
$$

Under the assumption that all $P_k$ belong to a parametric family $\{p(\,\cdot \mid \phi) : \phi \in \Omega\}$ with a conditional probability density $p(x \mid \phi)$, $X$ has density

$$
p(x) = \sum_k c_k \, p(x \mid \phi_k) \tag{6}
$$

This can also be written as a discrete probability measure on $\Omega$ in terms of **atomic Dirac measures** $\delta_{\phi_k}$,

$$
G = \sum_k c_k \, \delta_{\phi_k} \tag{1}
$$

with $\delta_\phi(A) = 1$ if $\phi \in A$ and $0$ otherwise. $G$ is called the **mixing measure**. With this we can write the mixture as

$$
p(x) = \int_\Omega p(x \mid \phi) \, G(\mathrm{d}\phi) \tag{7}
$$

If $T$ is a set of discrete probability measures on $\Omega$, then each $G \in T$ defines a model in the sense discussed above, a **mixture model**. If $G$ is such that no more than a finite number of the coefficients $c_k$ is non-zero, it is a **finite mixture model**.

### Bayesian mixtures

A **Bayesian mixture model** is a mixture model where the random mixing measure $G$ is distributed according to some prior $Q$. The prior $Q$ is therefore the law of $G$. In order to generate an infinite mixture model we have to produce a random $G$, and that means two random sequences $(c_k)_{k \in \mathbb{N}}$ and $(\phi_k)_{k \in \mathbb{N}}$. By choosing a probability measure $G_0$ on $\Omega$ we sample the atoms i.i.d.

$$
\phi_1, \phi_2, \ldots \overset{\text{iid}}{\sim} G_0 \tag{8}
$$

In order to generate the sequence $(c_k)$ we use the **stick-breaking construction**: We start with a sequence of probability distributions $(H_k)$ on $[0, 1]$. We generate $c_1$ by sampling $c_1 \sim H_1$. In order to ensure that $\sum_k c_k = 1$, $c_2$ has to be drawn from a distribution over $[0, 1 - c_1]$. In general, after observing $c_1, \ldots, c_{k-1}$, $c_k$ must be drawn from a distribution over $[0, 1 - \sum_{j<k} c_j]$. We achieve this by sampling $V_k$ from $H_k$ and scaling with the width of the remaining interval

$$
V_k \sim H_k, \qquad c_k = V_k \Big(1 - \sum_{j=1}^{k-1} c_j\Big) = V_k \prod_{j=1}^{k-1} (1 - V_j) \tag{9}
$$

The name stick-breaking construction stems from the fact that one may think of the interval $[0, 1]$ as a stick from which pieces of length $c_k$ are repeatedly broken off. The basic parameter distribution on $[0, 1]$ is the so-called **beta distribution** $\text{Beta}(a, b)$. The homogeneous random probability measure defined by choosing $H_k = \text{Beta}(1, \alpha)$ for all $k$ is called the **Dirichlet process**.

**Definition: Dirichlet Process**

*For $\alpha > 0$, a probability measure $G_0$ on $\Omega$ and $\text{Beta}(1, \alpha)$ the beta distribution, the random discrete probability measure $G$ (as in (1)) generated by*

$$
V_k \overset{\text{iid}}{\sim} \text{Beta}(1, \alpha), \qquad c_k = V_k \prod_{j=1}^{k-1} (1 - V_j), \qquad \phi_k \overset{\text{iid}}{\sim} G_0 \tag{10}
$$

*is called a Dirichlet process $\mathrm{DP}(\alpha, G_0)$ with base measure $G_0$ and concentration $\alpha$.*
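The definition translates almost line by line into a sampler. Since we cannot generate countably many atoms, the construction below truncates the sequence at a fixed level; this is a sketch under that truncation assumption, and the choice of base measure in the example (a standard normal) is purely illustrative.

```python
import random

def stick_breaking_dp(alpha, base_sampler, truncation=500, rng=random):
    """Truncated stick-breaking draw from a Dirichlet process DP(alpha, G_0).

    Returns (weights, atoms) approximating G = sum_k c_k * delta_{phi_k}.
    The exact process has countably many atoms; truncation keeps the
    first `truncation` of them, whose weights sum to just under 1.
    """
    weights, atoms = [], []
    remaining = 1.0  # length of the stick not yet broken off
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha)   # V_k ~ Beta(1, alpha)
        weights.append(remaining * v)     # c_k = V_k * prod_{j<k} (1 - V_j)
        atoms.append(base_sampler())      # phi_k ~ G_0, here the base measure
        remaining *= 1.0 - v
    return weights, atoms

# Example: base measure G_0 = standard normal, concentration alpha = 2.
random.seed(3)
w, phi = stick_breaking_dp(2.0, lambda: random.gauss(0.0, 1.0))
```

Smaller `alpha` concentrates the mass on the first few atoms (most of the stick is broken off early), while larger `alpha` spreads it over many atoms; this is how the concentration parameter controls the effective number of clusters.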