Non-parametric Clustering and Dirichlet Processes

If you don’t want to hear about all this and would rather go straight to an example, check out this JavaScript implementation of a Dirichlet process Gaussian mixture model used to cluster the McDonald’s menu based on its nutritional data: live demo.

The Dirichlet process provides a very interesting approach to understanding group assignments and to modelling clustering effects. Here a complicated distribution is modelled as a mixture of simpler distributions. The use of countably infinite mixtures obviates the need to determine an “adequate” number of components.

Standard clustering algorithms like k-means or finite Gaussian mixture models assume a fixed number of components, which has to be chosen a priori. This is problematic for at least two reasons: (a) it relies on the underlying assumption that there is a single “correct” number of components to partition the data into, and (b) it is extremely difficult to formulate a notion of the usefulness of a partition without strong assumptions about the nature of the problem.

Enter non-parametric Bayes.

Non-parametric Bayes

In traditional Bayesian statistics model parameters are modelled as random variables. As the true value of the parameter is unknown, this uncertainty is expressed as randomness. For a parameter \theta taking values in T we make the modelling assumption that \theta is distributed according to a specific distribution Q. The distribution Q is called the prior distribution of the model parameter \theta. Bayesian models therefore consist of a family of models parametrized by the parameter \theta (the observation models) and a prior distribution Q. Data is generated in two stages:

(1)   \begin{align*} \theta &\sim Q\\ x|\theta &\sim P_\theta \end{align*}

The objective is then to determine the posterior distribution of \theta given some data \mathcal{D}=\{x_1,\dots,x_n\}

(2)   \begin{equation*} Q(\theta|\mathcal{D}) \end{equation*}

Note that the improved model for the distribution of the parameter \theta after observing the data \mathcal{D} still contains uncertainty, and this uncertainty is expressed in the Bayesian framework by the posterior being a distribution rather than a single best guess based on the data.
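As a concrete illustration of the two-stage generative scheme in (1) and the posterior in (2), here is a minimal sketch in Python using a conjugate Beta–Bernoulli pair, where the posterior is available in closed form. The prior hyperparameters and sample size are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior: theta ~ Q = Beta(a0, b0); hyperparameters chosen arbitrarily for illustration
a0, b0 = 2.0, 2.0

# Stage 1: draw the parameter from the prior
theta = rng.beta(a0, b0)

# Stage 2: draw data x | theta ~ Bernoulli(theta)
n = 50
data = rng.binomial(1, theta, size=n)

# Posterior Q(theta | D): for the Beta-Bernoulli pair this is again a Beta
# distribution -- a full distribution over theta, not a single point estimate
a_post = a0 + data.sum()
b_post = b0 + n - data.sum()
print(f"true theta     = {theta:.3f}")
print(f"posterior      = Beta({a_post:.0f}, {b_post:.0f})")
print(f"posterior mean = {a_post / (a_post + b_post):.3f}")
```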

Non-parametric Bayesian models are Bayesian models where the parameter space T is infinite-dimensional. Here the prior probability distribution Q is defined on an infinite-dimensional space T. Such a distribution on an infinite-dimensional space is a stochastic process with paths in T. These are typically a little harder to define than distributions on \mathbb{R}^d.

Clustering and mixture models

In clustering one makes the basic assumption that the data can be partitioned in a natural way, where each data point x_i belongs to exactly one cluster k. This is often expressed by introducing a helper variable z_i = k which indicates which cluster k the ith data point belongs to. As these variables are unobserved in the dataset they are referred to as latent variables. The distribution characterizing the cluster k can then be described as

(3)   \begin{equation*} x_i \,|\, z_i = k \sim P_k := P(\,\cdot\,|z_i=k) \end{equation*}

Writing c_k for the probability that a newly generated data point is generated by cluster k, the distribution of the random variable X can be expressed in the form

(4)   \begin{equation*} P = \sum_{k \in I} c_k P_k \end{equation*}

Such models are called mixture models. If the index set I is finite, the mixture is called a finite mixture, otherwise an infinite mixture.
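A minimal sketch of this generative view, following (3) and (4): first draw a latent cluster assignment z_i from the weights c_k, then draw x_i from the corresponding component P_k. The three Gaussian components and their weights below are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Mixture weights c_k and component parameters theta_k = (mu_k, sigma_k);
# all values are arbitrary choices for this sketch
c = np.array([0.5, 0.3, 0.2])
mus = np.array([-4.0, 0.0, 3.0])
sigmas = np.array([1.0, 0.5, 1.5])

n = 1000
# Latent cluster assignments z_i ~ Categorical(c)
z = rng.choice(len(c), size=n, p=c)
# Observations x_i | z_i = k ~ P_k = N(mu_k, sigma_k^2)
x = rng.normal(mus[z], sigmas[z])

print("empirical cluster frequencies:", np.bincount(z) / n)
```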

The c_k describe mutually exclusive events, therefore \sum_k c_k = 1 with 0 \le c_k \le 1. The set of all points satisfying these conditions is called the (|I|-1)-dimensional simplex.

(5)   \begin{equation*} \Delta := \left\{ (c_k)_{k \in I} \Big| c_k \ge 0 \text{ and } \sum_k c_k = 1 \right\} \end{equation*}

Under the assumption that all P_k belong to a parametric family \{P_\theta \,|\, \theta \in \Omega_\theta \} with conditional probability densities p_k(x) = p(x|\theta_k), P has the density

(6)   \begin{equation*} p(x) = \sum_{k \in I} c_k p(x| \theta_k) \end{equation*}

The weights and component parameters can also be collected into a discrete probability measure on \Omega_\theta, written in terms of atomic Dirac measures \delta_{\theta_k},

(7)   \begin{equation*} \theta \sim \Theta := \sum_{k \in I} c_k \delta_{\theta_k}  \end{equation*}

with (c_k)_{k \in I} \in \Delta. \Theta is called the mixing measure. With this we can write the mixture as

(8)   \begin{equation*} p(x) = \int p(x|\theta)\Theta(d\theta) \end{equation*}
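Computationally, the mixing measure in (7) is just a list of weights c_k and atoms \theta_k, and the integral in (8) collapses to the weighted sum (6). A sketch, reusing the Gaussian components from above and SciPy's normal density (both arbitrary choices for this example):

```python
import numpy as np
from scipy.stats import norm

# Mixing measure Theta = sum_k c_k * delta_{theta_k}: weights plus atoms
weights = np.array([0.5, 0.3, 0.2])            # (c_k), a point in the simplex
atoms = [(-4.0, 1.0), (0.0, 0.5), (3.0, 1.5)]  # theta_k = (mu_k, sigma_k)

def mixture_density(x):
    # p(x) = int p(x | theta) Theta(d theta) = sum_k c_k p(x | theta_k)
    return sum(c * norm.pdf(x, loc=mu, scale=sigma)
               for c, (mu, sigma) in zip(weights, atoms))

print(mixture_density(0.0))
```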

If T is a set of discrete probability measures on \Omega_\theta, then M = \{ P_\Theta \,|\, \Theta \in T\} is a model in the sense discussed above, a mixture model. If T only contains measures for which no more than a finite number K<\infty of coefficients is non-zero, it is a finite mixture model.

Bayesian mixtures

A Bayesian mixture model is a mixture model where the random mixing measure \Theta is distributed according to some prior Q. The prior Q is therefore the law of \Theta. In order to generate an infinite mixture model we have to produce a random \Theta, and that means generating two random sequences (c_k) and (\theta_k). The atoms \theta_k are obtained by choosing a probability measure G on \Omega_\theta and sampling i.i.d.

(9)   \begin{equation*} \theta_1, \theta_2, \dots \sim_{\text{iid}} G \end{equation*}

In order to generate the sequence (c_k) we use the stick-breaking construction: We start with a sequence of probability distributions (H_k) on [0,1]. We generate c_1 by sampling from H_1. In order to ensure that \sum_k c_k = 1, c_2 has to be drawn from a distribution over [0,1-c_1]. In general, after observing \{c_1,\dots,c_{k-1}\}, c_k must be drawn from a distribution over [0,1-c_1-\dots-c_{k-1}]. We achieve this by sampling \tilde{c}_k from H_k and scaling with the width of the interval I_k=[0,1-c_1-\dots-c_{k-1}]

(10)   \begin{align*} \tilde{c}_k &\sim H_k \\ c_k &= \tilde{c}_k |I_k| \end{align*}
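A sketch of this recursion: the remaining stick length |I_k| shrinks with every break, so the partial sums of the c_k never exceed one. The choice of H_k below (all equal to a Beta(1, 5) distribution) is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def stick_breaking(sample_h, num_breaks):
    """Generate weights c_1, ..., c_{num_breaks} by repeatedly breaking the unit stick.

    sample_h() draws the relative break point c~_k from H_k (here the same H for every k)."""
    weights = []
    remaining = 1.0                  # |I_k|: the length of stick that is still unbroken
    for _ in range(num_breaks):
        c_tilde = sample_h()         # c~_k ~ H_k, a value in [0, 1]
        c_k = c_tilde * remaining    # c_k = c~_k * |I_k|
        weights.append(c_k)
        remaining -= c_k             # the next piece is broken off what is left
    return np.array(weights)

# Example: every H_k equal to Beta(1, 5) -- an arbitrary choice for this sketch
w = stick_breaking(lambda: rng.beta(1.0, 5.0), num_breaks=20)
print(w.sum())   # close to, but never above, 1 after finitely many breaks
```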

The name stick-breaking construction stems from the fact that one may think of the interval [0,1] as a stick from which pieces of length c_k are repeatedly broken off. The standard choice of distribution on [0,1] is the so-called beta distribution \beta. The homogeneous random probability measure defined by choosing H_1 = H_2 = \dots to be a beta distribution is called the Dirichlet process.

Definition: Dirichlet Process
For \alpha > 0, G a probability measure on \Omega_\theta and \beta the beta distribution, the random discrete probability measure \Theta (as in (7)) generated by

(11)   \begin{align*} &\tilde{c}_1, \tilde{c}_2, \dots \sim_{iid} \beta(1,\alpha) \text{ and }c_k = \tilde{c}_k \prod_{j=1}^{k-1} (1-\tilde{c}_j) \\ &\theta_1, \theta_2, \dots \sim_{iid} G \end{align*}

is called a Dirichlet process DP(\alpha,G) with base measure G and concentration \alpha.
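Putting the two pieces together, a draw from DP(\alpha, G) can be approximated by truncating the stick-breaking construction in (11) after a finite number of breaks. In the sketch below the base measure G is a standard normal and \alpha = 2; both are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_dp(alpha, sample_g, truncation=200):
    """Truncated stick-breaking approximation of Theta ~ DP(alpha, G).

    Returns the weights c_k and atoms theta_k of the discrete measure
    Theta = sum_k c_k * delta_{theta_k}."""
    c_tilde = rng.beta(1.0, alpha, size=truncation)               # c~_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - c_tilde)[:-1]))
    weights = c_tilde * remaining                                 # c_k = c~_k * prod_{j<k} (1 - c~_j)
    atoms = sample_g(truncation)                                  # theta_k ~ G, i.i.d.
    return weights, atoms

# Example: base measure G = N(0, 1) and concentration alpha = 2 (arbitrary choices)
weights, atoms = sample_dp(alpha=2.0, sample_g=lambda size: rng.normal(0.0, 1.0, size=size))
print("mass captured by the truncation:", weights.sum())
print("largest atoms:", atoms[np.argsort(weights)[::-1][:5]])
```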