Multivariate Data Clustering for the Gaussian Mixture Model

This paper discusses a soft clustering problem for multivariate independent random data satisfying a Gaussian mixture model. The theory recommends estimating the model parameters by the maximum likelihood method and using the "plug-in" approach for data clustering. Unfortunately, the problem of computing the maximum likelihood estimate is not completely solved in the multivariate case. This work proposes a new constructive multi-stage procedure for this task. The procedure includes statistical distribution analysis of a large number of univariate projections of the observations, geometric clustering of the multivariate sample and application of the EM algorithm. The accuracy of the proposed methods is analysed by means of Monte-Carlo simulation.


Introduction
This paper analyses a sample clustering problem in the case where the distribution of the observed random vector satisfies a Gaussian mixture model. The clustering problem is closely connected with the mixture model identification problem. It is natural to use the MLE (maximum likelihood estimator) to calculate the parameters. Unfortunately, finding the maximum likelihood estimate is a complicated task in the multivariate case, because the dimension of the parameter is large. Traditionally the EM (expectation maximisation) algorithm is used to accomplish this task. But this algorithm converges locally, and a sufficiently precise initial value of the parameter should be used to ensure convergence to the MLE, see (Boyles, 1983). There is no widely accepted good method for initializing the parameters, see (Vlassis and Likas, 2000), so the calculation of the initial parameter value is the main problem in clustering a multivariate Gaussian mixture. Usually the following initialization methods are used: the random multistart method, a k-means or other heuristic method, and a hierarchical clustering start. Some procedures combine the EM algorithm with incremental addition of components to the mixture.
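As an illustration, one common initialization mentioned above is a k-means start. The sketch below is a minimal example, not the paper's procedure; the farthest-point seeding and the small diagonal regularization of the covariance estimates are our own assumptions. It converts a hard k-means partition into starting values (p_j, M_j, R_j) for the EM algorithm:

```python
import numpy as np

def kmeans_start(X, q, n_iter=20):
    """Illustrative k-means start for EM on a Gaussian mixture:
    returns initial weights, means and covariance matrices."""
    n, d = X.shape
    # farthest-point seeding: deterministic and robust enough for a sketch
    centers = [X[0]]
    for _ in range(1, q):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # assign each observation to the nearest centre, then update centres
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(q):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # initial mixture parameters (p_j, M_j, R_j) from the hard partition
    p = np.array([(labels == j).mean() for j in range(q)])
    M = centers
    R = [np.cov(X[labels == j].T) + 1e-6 * np.eye(d) for j in range(q)]
    return p, M, R
```

Any of the listed alternatives (random multistart, hierarchical start) would replace only the partitioning step; the conversion of a hard partition into mixture parameters stays the same.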
Despite the popularity of clustering methods, most clustering software is based on the data-mining approach or heuristic procedures. Only a few probabilistic model-based clustering packages designed for Gaussian mixture model data are available: NORMIX, EMMIX and MCLUST. NORMIX uses a random start method, EMMIX uses a random start or a k-means start from random centres, and the MCLUST package uses hierarchical model-based clustering to initialize the EM algorithm, see (Wolfe, 1967), (McLachlan et al, 1999) and (Fraley and Raftery, 1999).
The authors of this paper propose a new probabilistic clustering algorithm for multivariate data satisfying a Gaussian mixture model. This constructive procedure has several stages and combines ideas of data projection, geometric clustering and application of the EM algorithm. A large number of projection directions is used to avoid loss of the data clustering structure.
Let us introduce some notation. Let O(1), ..., O(n) be the observed objects. Clustering is a partitioning of these objects into several groups that are homogeneous in a certain sense. In the case of geometric clustering, the sample data are matrices [ρ_{i,j}], i, j = 1, ..., n, of distances (or pseudo-distances) between the observed objects O(i) and O(j). Denoting the set of observation numbers by N = {1, ..., n}, one of the most popular geometric clustering methods is minimization of the sum of mean distances inside the clusters:

$$\sum_{j=1}^{q} \frac{1}{|K_j|} \sum_{i,k \in K_j} \rho_{i,k} \;\to\; \min_{K_1,\dots,K_q}. \qquad (1)$$

Here and later we denote the number of elements of a set A by |A|. The probabilistic model is applied under the assumption that the observed objects belong to some aggregate that consists of q non-intersecting classes. If a d-dimensional feature vector X corresponds to every object and ν denotes the number of the class to which the object belongs, then the vector X and the number ν are random. When the features X(k) of the observed objects O(k), k = 1, ..., n, are known and the class numbers ν(k) are unknown, we have a clustering problem. We restrict ourselves to the case where the objects are selected independently. So we have a sample X = {X(1), ..., X(n)} consisting of independent copies of the random column vector X. In the case of hard clustering, we have to estimate the unknown values ν(k) ∈ {1, ..., q}, i.e., to obtain ν̂(k) = ν̂(k, X), k = 1, ..., n. In the case of soft clustering, the a posteriori probabilities π(j, X(k)), j = 1, ..., q, k = 1, ..., n, are estimated. These probabilities are defined by the equality

$$\pi(j, x) = P\{\nu = j \mid X = x\}, \quad j = 1, \dots, q, \; x \in R^d. \qquad (2)$$

Let Φ_j denote the conditional distribution of X given ν = j. If Φ_j is a Gaussian distribution with density ϕ_j, j = 1, ..., q, then the distribution density f of the vector X satisfies the equality

$$f(x) = f(x, \theta) = \sum_{j=1}^{q} p_j\, \varphi_j(x). \qquad (3)$$

Here p_j are the a priori probabilities of the observed object belonging to class K_j, i.e., p_j = P{ν = j}, and θ = (p_j, M_j, R_j, j = 1, ..., q) is the parameter vector of the multivariate model, where M_j and R_j denote the mean and the covariance matrix corresponding to the normal density ϕ_j. It follows from (3) and the law of total probability that

$$\pi(j, x) = \frac{p_j\, \varphi_j(x)}{f(x)} \qquad (4)$$
for every j = 1, ..., q, x ∈ R^d. In this case the "plug-in" approach can be applied in clustering:

$$\hat\pi(j, x) = \pi(j, x, \hat\theta), \qquad (5)$$

where θ̂ is a statistical estimate of the multivariate parameter θ. It is natural to use the maximum likelihood approach to estimate this parameter:

$$\hat\theta_{ML} = \arg\max_{\theta} \sum_{k=1}^{n} \ln f(X(k), \theta). \qquad (6)$$

But if the data dimension d is high, the calculation of estimate (6) is a complicated task in practice. Usually the recurrent EM algorithm is used to solve this problem. This algorithm recalculates the estimates π̂ and θ̂ using expression (5) and the formulas

$$\hat p_j = \frac{1}{n}\sum_{k=1}^{n} \hat\pi(j, X(k)), \quad
\hat M_j = \frac{\sum_{k=1}^{n} \hat\pi(j, X(k))\, X(k)}{\sum_{k=1}^{n} \hat\pi(j, X(k))}, \quad
\hat R_j = \frac{\sum_{k=1}^{n} \hat\pi(j, X(k))\, (X(k) - \hat M_j)(X(k) - \hat M_j)'}{\sum_{k=1}^{n} \hat\pi(j, X(k))}. \qquad (7)$$

But the EM algorithm estimator converges to statistic (6) only if the initial value of the estimate θ̂ is close to θ_ML. So we have to obtain a sufficiently precise initial value of θ̂ or π̂(·) by some other method. This problem is partly solved in the univariate case. For example, in (Rudzkis and Radavicius, 1995) a recurrent component allocation procedure is proposed. This procedure combines the EM algorithm with non-parametric estimation of the density f and is effective enough. The use of this procedure in the multivariate case is rather complicated, especially if the dimension of the data is high. The dimension can be reduced by using the projection pursuit technique, see (Rudzkis and Radavicius, 1999) and (Rudzkis and Radavicius, 2003); however, in the general case part of the information on the data clusters is lost after passing to a lower-dimensional space. If we do not restrict ourselves to one projection, but project many times onto various subspaces of lower dimension, the loss of information can be avoided. We are going to discuss a multivariate clustering method based on the analysis of univariate data projections, a kind of tomography approach.
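The E-step (5) and the M-step recurrences (7) can be sketched as follows. This is a minimal NumPy illustration of one standard EM iteration for a Gaussian mixture, not the authors' implementation; all function names are ours:

```python
import numpy as np

def gauss_pdf(X, m, R):
    """Density of N(m, R) evaluated at the rows of X."""
    d = m.size
    diff = X - m
    Rinv = np.linalg.inv(R)
    expo = -0.5 * np.einsum('ni,ij,nj->n', diff, Rinv, diff)
    norm = ((2 * np.pi) ** d * np.linalg.det(R)) ** -0.5
    return norm * np.exp(expo)

def em_step(X, p, M, R):
    """One EM iteration: the E-step computes the posteriors of (5),
    the M-step recomputes p_j, M_j, R_j as in (7)."""
    n, d = X.shape
    q = len(p)
    # E-step: pi(j, x) = p_j * phi_j(x) / f(x)
    dens = np.stack([p[j] * gauss_pdf(X, M[j], R[j]) for j in range(q)], axis=1)
    pi = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted moment estimates
    w = pi.sum(axis=0)                      # effective cluster sizes
    p_new = w / n
    M_new = (pi.T @ X) / w[:, None]
    R_new = []
    for j in range(q):
        diff = X - M_new[j]
        R_new.append((pi[:, j, None] * diff).T @ diff / w[j])
    return pi, p_new, M_new, R_new
```

Iterating `em_step` from a good starting value of θ is exactly the local procedure whose initialization the rest of the paper addresses.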

Employing the Inversion Formula
The probability density f can be obtained from the characteristic function using the inversion formula

$$f(x) = \frac{1}{(2\pi)^d} \int_{R^d} e^{-i\,t'x}\, f^*(t)\, dt, \qquad (8)$$

where f*(t) = E e^{it'X} denotes the characteristic function of the random vector X.
For each τ ∈ R^d the scalar product τ'X is a univariate random variable. We denote the distribution density of this variable by f_τ. Let f* and f*_τ be the characteristic functions of f and f_τ respectively. There is a one-to-one correspondence between densities and characteristic functions, and since

$$f^*_\tau(u) = f^*(u\tau), \qquad (9)$$

the distribution density f is uniquely defined by the set of densities {f_τ, τ ∈ B} of univariate projections of the random vector X, where B denotes the unit sphere in the space R^d. Making use of the inversion formula (8) and passing to a spherical coordinate system, we obtain the equality

$$f(x) = \frac{1}{(2\pi)^d} \int_B \int_0^{\infty} u^{d-1}\, e^{-iu\,\tau'x}\, f^*_\tau(u)\, du\, ds(\tau), \qquad (10)$$

where ∫_B ds denotes the surface integral over τ ∈ B. From the parametric estimation of the univariate projected densities f_τ we also obtain estimates of the characteristic functions f*_τ. Putting these estimates into the right-hand side of expression (10) and replacing the surface integral by a sum, we obtain a formula for calculating the estimate f̂:

$$\hat f(x) = \frac{c_d}{(2\pi)^d\, |T|} \sum_{\tau \in T} \int_0^{\infty} u^{d-1}\, e^{-iu\,\tau'x}\, \hat f^*_\tau(u)\, du. \qquad (11)$$

Here T is a finite set of points uniformly distributed on the sphere B, and c_d = 2π^{d/2}/Γ(d/2) is the surface area of B, where Γ denotes the gamma function. From equality (3) it follows that the densities f_τ are mixtures of univariate Gaussian densities. Let ϕ_{j,τ} denote the density of the univariate Gaussian distribution N(m_j(τ), σ²_j(τ)), where m_j(τ) = τ'M_j and σ²_j(τ) = τ'R_jτ. Then

$$f_\tau(x) = \sum_{j=1}^{q} p_j\, \varphi_{j,\tau}(x), \qquad (12)$$

where θ(τ) = (p_j, m_j(τ), σ²_j(τ), j = 1, ..., q). After estimating θ(τ), τ ∈ T, by some univariate Gaussian mixture analysis method (e.g., the method proposed in (Rudzkis and Radavicius, 1995)), from (12), replacing the distribution densities by characteristic functions, we obtain the estimates f̂*_τ, and it remains only to make use of (11). Unfortunately, the estimate of the distribution density f(x) calculated this way is non-parametric, and it can be used in data clustering only indirectly, together with other methods. For data clustering we need estimates not only of the density f(x), but also of its components ϕ_j. The components can also be defined by expression (11), replacing f and f*_τ with ϕ_j and ϕ*_{j,τ}
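For a Gaussian mixture, the projected parameters in (12) and the characteristic function f*_τ entering (11) are available in closed form. A minimal sketch (function names are ours):

```python
import numpy as np

def project_params(tau, p, M, R):
    """Parameters of the univariate mixture of the projection tau'X,
    as in (12): m_j(tau) = tau'M_j, sigma_j^2(tau) = tau'R_j tau."""
    m = np.array([tau @ Mj for Mj in M])
    s2 = np.array([tau @ Rj @ tau for Rj in R])
    return p, m, s2

def cf_projection(u, p, m, s2):
    """Characteristic function f*_tau(u) of the projected Gaussian
    mixture: sum_j p_j exp(i u m_j - s2_j u^2 / 2)."""
    return sum(pj * np.exp(1j * u * mj - 0.5 * s2j * u ** 2)
               for pj, mj, s2j in zip(p, m, s2))
```

Plugging `cf_projection` with estimated θ̂(τ) into the radial integral of (11) gives the density estimate f̂; the integral itself can be evaluated analytically for Gaussian components.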
respectively; however, in practice the problem of numbering compatibility of the components ϕ_{j,τ} among the various projection directions τ ∈ T can arise. The mentioned procedure of consecutive component isolation (see (Rudzkis and Radavicius, 1995)), applied in the analysis of the univariate projection τ'X, assigns the first number to the component whose mean is closest to the mode of f_τ. In the general case, while projecting in various directions, Gaussian distributions corresponding to different clusters can have this property. Just as when using the EM algorithm for clustering multivariate data, the problem disappears if we succeed in obtaining a sufficiently precise data clustering which allows a compatible grouping of the data projections τ'X(t), t = 1, ..., n. Later these clusters of univariate data can be improved using the EM algorithm, and the inversion formula is applied in the way described above. So, the core problem is obtaining the initial estimate π̂(·). The authors suggest using the geometric clustering approach for this task. The pseudo-distance matrices are based on the clustering results of the univariate projections.

Geometric Clustering Procedure
Let us describe the proposed hard data clustering procedure. Let ρ(x, y) be a non-negative pseudo-distance function defined for all x, y ∈ R^d. In the general case, this function does not satisfy the triangle inequality. Let us divide the set of observation numbers N into non-intersecting subsets K_1, ..., K_q so as to minimize the value of the functional

$$\sum_{j=1}^{q} \frac{1}{|K_j|} \sum_{t,k \in K_j} \rho(X(t), X(k)). \qquad (13)$$

Minimization algorithms are widely analysed in the literature, and we will not detail them in this paper.
The main task is the selection of the distance function ρ. Let us consider the observations x, y ∈ X to be the closer to one another, the more similar the estimates of the a posteriori classification probabilities of their projections are. So

$$\rho(x, y) = \max_{\tau \in T} \rho_\tau(x, y), \qquad (14)$$

where ρ_τ(x, y) measures the difference between the vectors of estimated classification probabilities (π̂_τ(j, x), j = 1, ..., q) and (π̂_τ(j, y), j = 1, ..., q), and π̂_τ(j, x) denotes the estimate of the probability

$$\pi_\tau(j, x) = P\{\nu = j \mid \tau'X = \tau'x\}. \qquad (15)$$

Thus, the preliminary hard clustering of a sample consists of the following stages:
• selection of a set of projection directions T ⊂ B;
• calculation of the maximum likelihood estimates θ̂(τ), or their approximations, for each τ ∈ T;
• estimation of the a posteriori probabilities π̂_τ(·, ·) by means of the "plug-in" technique;
• calculation of a set of clusters K_1, ..., K_q by minimizing (13).
Next, analogously to the case of training samples, it is possible to calculate the initial multivariate estimate θ̂ of the parameter θ defined by the equalities

$$\hat p_j = \frac{|K_j|}{n}, \quad
\hat M_j = \frac{1}{|K_j|} \sum_{s \in K_j} X(s), \quad
\hat R_j = \frac{1}{|K_j|} \sum_{s \in K_j} (X(s) - \hat M_j)(X(s) - \hat M_j)', \qquad (16)$$

where j = 1, ..., q. Substituting the projections τ'X(s) for X(s) in (16), we define the preliminary compatible estimates θ̂(τ), τ ∈ T.
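The pseudo-distance computation above can be sketched as follows. The exact form of ρ_τ is not fixed in the text, so the maximum absolute difference of the posterior vectors is used here as one plausible choice:

```python
import numpy as np

def pseudo_distances(Pi):
    """Pi: estimated posteriors of shape (T, n, q), i.e. pi_tau(j, X(k))
    for every direction tau, observation k and class j.
    rho_tau(x, y) is taken here as max_j |pi_tau(j,x) - pi_tau(j,y)|
    (an illustrative choice), and rho(x, y) is the max over tau as in (14).
    Returns the (n, n) matrix rho and the per-direction matrices."""
    # (T, n, n): per-direction distances between posterior vectors
    per_tau = np.abs(Pi[:, :, None, :] - Pi[:, None, :, :]).max(axis=-1)
    return per_tau.max(axis=0), per_tau
```

The resulting (n, n) matrix is then fed to any minimizer of the geometric clustering functional (13), e.g., an exchange or medoid-type algorithm.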
Using the "plug-in" approach (5), from θ̂ we obtain a preliminary soft clustering estimate π̂. This estimate can be improved by the recurrent EM algorithm until the new estimate becomes stable. Let us denote this new estimate by π̂_EM.
Applying the inversion formula in clustering, we first fix the estimates p̂_j and improve the other elements of θ̂(τ) by the EM algorithm. We calculate the statistics ϕ̂_j by replacing f̂*_τ in formula (11) with the estimates of the characteristic functions corresponding to the obtained estimates of θ(τ). These statistics define the estimate of π by the equalities

$$\hat\pi_I(j, x) = \frac{\hat p_j\, \hat\varphi_j(x)}{\sum_{l=1}^{q} \hat p_l\, \hat\varphi_l(x)}, \quad j = 1, \dots, q. \qquad (17)$$

Simulation Results
It is not difficult to analyse the precision of the proposed estimates by means of simulation. We take the mean error of soft classification as the precision measure of the estimation of the a posteriori probabilities:

$$\delta_\pi = E\, \frac{1}{qn} \sum_{j=1}^{q} \sum_{t=1}^{n} \bigl|\hat\pi(j, X(t)) - \pi(j, X(t))\bigr|. \qquad (18)$$

Having selected the model parameter θ and the sample size n, we generate independent random samples X^(s) = (X^(s)(1), ..., X^(s)(n)), s = 1, ..., r, and obtain realizations π̂^(1), ..., π̂^(r) of the estimates analysed. The empirical analogue of functional (18) is the statistic

$$\hat\delta_\pi = \frac{1}{rqn} \sum_{s,j,t} \bigl|\hat\pi^{(s)}(j, X^{(s)}(t)) - \pi(j, X^{(s)}(t))\bigr|,$$

where the sum is taken over j = 1, ..., q, t = 1, ..., n, s = 1, ..., r. We used r = 10 in all the experiments of this study.
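The empirical analogue of (18) reduces to a simple average over replications, observations and classes; the sketch below assumes the mean absolute deviation of the estimated posteriors from the true ones as the error measure:

```python
import numpy as np

def soft_error(pi_hat, pi_true):
    """Empirical mean soft-classification error: pi_hat and pi_true
    have shape (r, n, q), where r is the number of replications,
    n the sample size and q the number of classes."""
    r, n, q = pi_true.shape
    return np.abs(pi_hat - pi_true).sum() / (r * n * q)
```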
In practice we meet not only the clustering problem, but also the problem of estimating the parameters or the distribution density of a Gaussian mixture. Using (7), from the classification probabilities π̂ we can obtain the estimate θ̂ of the mixture parameter. As the precision measure for this parameter, as well as for the parametric density estimate, we regard the error

$$\delta_\theta = E \int_{R^d} \bigl(f(x, \hat\theta) - f(x, \theta)\bigr)^2\, dx,$$

with the empirical analogue

$$\hat\delta_\theta = \frac{1}{r} \sum_{s=1}^{r} \int_{R^d} \bigl(f(x, \hat\theta^{(s)}) - f(x, \theta)\bigr)^2\, dx,$$

which is not difficult to calculate, because the integral can be expressed analytically.
The main purpose of this research is to compare the precision of the estimates π̂, π̂_EM and π̂_ML. The latter estimate is obtained by applying the "plug-in" approach to the maximum likelihood estimate of θ. In some cases, for small sample sizes, not only the above-mentioned estimates but also the value of statistic (17) was calculated. The maximum likelihood estimator is asymptotically efficient, but for a small sample size its errors can on average be greater than those of other estimators, including π̂_I.
In the precision study, the maximum likelihood estimates of the parameters of the univariate Gaussian mixtures were used. The EM algorithm with the initial value equal to the true value of θ(τ) was used to obtain the estimates θ̂(τ). Unfortunately, this is impossible in a real application, which is why we also used software for estimating θ(τ) based on the technique described in (Rudzkis and Radavicius, 1995). The latter cases are denoted by "*". Table 1 presents the errors of the methods based on the maximum likelihood pseudo-estimates and on the real estimates.
In the simulation process, it has been noticed that the clustering probability estimates obtained by the geometric clustering procedure are more stable if the maximum in formula (14) is calculated after rejecting the part of the directions τ with the highest value of the sum Σ_{t,k=1}^{n} ρ_τ(X(t), X(k)). The number of rejected directions can be selected using the maximum likelihood criterion. This modification of the method is natural, because without it even one direction with very inaccurate estimates of the classification probabilities π̂_τ(·, ·) can determine the value of the distance function ρ.
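The direction-rejection modification can be sketched as follows; the rejected fraction is hard-coded here for illustration, whereas the paper selects the number of rejected directions by the maximum likelihood criterion:

```python
import numpy as np

def rho_with_rejection(per_tau, reject_frac=0.1):
    """per_tau: per-direction pseudo-distance matrices of shape (T, n, n).
    Drop the directions tau with the largest total
    sum_{t,k} rho_tau(X(t), X(k)), then take the max over the rest,
    as in the stabilized version of (14)."""
    totals = per_tau.sum(axis=(1, 2))            # total distance per direction
    n_keep = max(1, int(len(totals) * (1 - reject_frac)))
    keep = np.argsort(totals)[:n_keep]           # directions with smallest totals
    return per_tau[keep].max(axis=0)
```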
The proposed methods were analysed on a variety of mixtures with different properties: non-overlapping, partly overlapping or highly overlapping clusters; similar or significantly different cluster probabilities; small or large sample size; various structures of the covariance matrices. Let us take one typical case of a mixture with partly overlapping clusters and analyse the dependence of the method's precision on the sample size n. The results of this study are given in Table 1. The parameters of the Gaussian mixture used are d = 5 and q = 4, i.e., dim θ = 83. With such a dimension of θ, we consider a sample of n = 100 to be very small, n = 200 small, and n = 500 of medium size. The results for an extremely small size n = 50 are also presented for comparison. The Monte-Carlo simulation study has shown that it is enough to use 200-500 univariate projections for geometric clustering of 5-dimensional data. Therefore we used a set T of 500 uniformly distributed directions in further research.
The comparison error plot of the maximum likelihood method and the proposed geometric clustering method π̂* is presented in Fig. 1. The plot shows that the geometric clustering method has a larger δ_π error than the maximum likelihood estimator, but it is more precise by the δ_θ criterion.
The following conclusions are drawn from the given error table and error plot:
• the maximum likelihood method is the best method to estimate the classification probabilities;
• in the case of a small or medium sample, the proposed geometric clustering method, improved by the EM algorithm, yields the same errors as the maximum likelihood estimator (pseudo-estimator, to be precise, because it is calculated using the true values of the parameter θ). This conclusion is valid both for the estimator π̂_EM, which has been calculated relying on the assumption that the MLE can be found for univariate data, and for the estimator π̂*_EM, which is a sample function. Therefore the authors of the paper suggest using exactly this statistic, combining the univariate data clustering method of (Rudzkis and Radavicius, 1995), geometric clustering and the EM algorithm, for multivariate data;
• the clustering method for univariate data described in (Rudzkis and Radavicius, 1995) is precise enough, and it can be used to calculate pseudo-distances in the geometric clustering procedure, because the errors of the estimators π̂ and π̂* differ only slightly;
• though the maximum likelihood estimator π̂_ML is the best one for estimating the classification probabilities, it is not the best for estimating the distribution density. The statistics π̂ and π̂* often yield lower errors δ_θ than π̂_ML.
The simulation results have also indicated that the geometric clustering algorithm works properly even without the EM algorithm improvement when there exist directions in which the clusters of the projected data are well separated.
The proposed methods were also compared with the maximum likelihood estimator and with the random start EM procedure popular in statistical software. The random start procedure uses a random partitioning of the observations as the initial parameter of the EM algorithm. For large data samples, subsampling of the data was used as suggested in (McLachlan et al, 1999). This limits the effect of the central limit theorem, which would make the randomly selected starts similar for each component in large samples. In many cases the random start method should be repeated 20-200 times to calculate the MLE of the parameter. But if the a priori probabilities of the clusters are significantly different, the random start procedure does not find a cluster with a small a priori probability value even after a large number of start points (our analysis is based on 4000 start points). The simulation results are displayed in Fig. 2. If the sample size is small and the probabilities of the clusters are significantly different, the proposed EM algorithm initialization method is better than the random start method. Using the proposed procedure improved by the EM algorithm, the MLE is calculated. The parameters of the mixtures are given in the appendix.
We will briefly review the analysis of the results based on the inversion formula method. Already after the first experiments it was observed that the distribution density estimate defined by formula (11) is not smooth and additional smoothing is needed. The authors of the paper propose to introduce an additional multiplier e^{-hu²} under the integral sign, where h is a small value, a so-called smoothing bandwidth. Such an expression of the smoothing multiplier is especially convenient in the case of Gaussian distribution mixtures, because we can still find the exact value of the integral (11). In practice, introduction of such a multiplier simply increases the estimates of the cluster variances by the value 2h. The simulation analysis indicated that it is optimal to select the smoothing bandwidth value depending on the scale parameter of the data. A reasonable choice is c·λ_max, where λ_max is the maximum eigenvalue of the data covariance matrix. The parameter c can be selected using the maximum likelihood criterion. The Monte-Carlo research has also indicated that the same smoothing bandwidth value cannot be used for estimating the density values of separate components of the mixture. It should be chosen adaptively, depending on the properties of the component. The smoothing bandwidth selection problem requires more comprehensive study.
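The convenience of the multiplier e^{-hu²} can be checked directly: attaching it to the characteristic function of N(m, σ²) gives exactly the characteristic function of N(m, σ² + 2h), i.e., the component variance is inflated by 2h. A small numeric verification:

```python
import numpy as np

def cf(u, m, s2):
    """Characteristic function of N(m, s2)."""
    return np.exp(1j * u * m - 0.5 * s2 * u ** 2)

def smoothed_cf(u, m, s2, h):
    """CF of N(m, s2) with the smoothing multiplier exp(-h u^2) attached:
    exp(i u m - s2 u^2 / 2) * exp(-h u^2) = CF of N(m, s2 + 2h)."""
    return cf(u, m, s2) * np.exp(-h * u ** 2)
```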

Fig. 1. Error plot of the maximum likelihood and geometric clustering procedures.

Table 1
Error dependence on the sample size. (In the simulation, the pseudo-estimate was calculated by means of the EM algorithm with the true value of θ as the initial value.)