Clustering high-dimensional data is a challenging task in data mining, and clustering high-dimensional categorical data is even more challenging because it is more difficult to measure the similarity between categorical objects. Most algorithms assume feature independence when computing the similarity between data objects, or rely on computationally demanding techniques such as PCA designed for numerical data. Hierarchical clustering algorithms are often based on similarity measures computed on a common feature space, which is not effective for clustering high-dimensional data. Subspace clustering algorithms discover feature subspaces for clusters, but are mostly partition-based; i.e. they do not produce a hierarchical structure of clusters. In this paper, we propose a hierarchical algorithm for clustering high-dimensional categorical data, based on a recently proposed information-theoretic concept named holo-entropy. The algorithm introduces new ways of exploiting entropy, holo-entropy and attribute weighting in order to determine the feature subspace of a cluster and to merge clusters even when their feature subspaces differ. The algorithm is tested on UCI datasets and compared with several state-of-the-art algorithms. Experimental results show that the proposed algorithm yields higher efficiency and accuracy than the competing algorithms and offers better reproducibility.

The aim of clustering analysis is to group objects so that those within a cluster are much more similar than those in different clusters. Clustering has been studied extensively in the statistics, data-mining and database communities, and numerous algorithms have been proposed (Schwenkera and Trentin,

Hierarchical techniques are often used in cluster analysis. They aim to establish a hierarchy of partitions, using either a bottom-up or a top-down approach. In the bottom-up approach, each object is initially represented as a singleton cluster. Clusters are then successively merged based on their similarities until all the objects are grouped into a single cluster, or until some special stopping conditions are satisfied. In the top-down approach, on the other hand, larger clusters are successively divided to generate smaller and more compact clusters until some stopping conditions are satisfied or each cluster contains a single object. Traditionally, in the bottom-up approach, the pairwise similarity (or distance) employed for merging clusters is calculated on a common feature space, and feature relevance or feature selection is addressed prior to the clustering process. In this research, we investigate feature relevance with respect to individual clusters, and cluster merging where each cluster may have its own relevant features. This is a very important issue in the hierarchical clustering of high-dimensional data. In this paper the terms ‘feature’, ‘attribute’ and sometimes ‘dimension’ are used interchangeably.

Automatically determining the relevance of attributes in a categorical cluster will be investigated in this paper. Conventional similarity measures, defined on the whole feature space under the assumption that features are of equal importance, are in many cases not suitable for clustering high-dimensional data. In real-world applications, different clusters may lie in different feature subspaces with different dimensions, meaning that the significance or relevance of an attribute is not the same for different clusters. A cluster might be related to only a few dimensions (its most relevant dimensions) while the other (unimportant) dimensions contain random values. Attribute weighting is employed to deal with these issues, but most weighting methods have been designed solely for numeric data clustering (Huang

In this paper, we propose a novel algorithm named Hierarchical Projected Clustering for Categorical Data (HPCCD). The HPCCD algorithm has been designed to deal with three main issues. The first is analysing attribute relevance based on the holo-entropy theory (Wu and Wang,

The rest of the paper is organized as follows: Section

In this section, we briefly review major existing works related to our research. Many clustering algorithms for categorical data are based on the partition approach. The most popular of these is Huang's K-modes algorithm (Huang,

In addition to the various K-modes algorithms, k-ANMI has also been proposed (He

Projected clustering is a major technique for high-dimensional data clustering, whose aim is to discover the clusters and their relevant attributes simultaneously. A projected cluster is defined by both its data points and the relevant attributes (Bouguessa and Wang,

Many clustering algorithms based on the hierarchical technique have been proposed. ROCK (Guha

Information theory is also frequently used in many clustering models (Yao

Finally, the MGR algorithm (Qin

In this section, we provide a brief description of information entropy, and give a more detailed explanation of the concepts of mutual information, total correlation and holo-entropy (Wu and Wang,

Entropy is a measure of the uncertainty of a system state. As formulated in information theory (Shannon,
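To make the entropy computations used throughout the paper concrete, the following sketch computes the Shannon entropy of a single categorical attribute. Natural logarithms are assumed here, which is consistent with values such as 0.6931 (≈ ln 2) appearing in the tables later in the paper; the function name is ours, not the paper's.

```python
from collections import Counter
from math import log

def entropy(values):
    """Shannon entropy (natural log) of one categorical attribute,
    estimated from the empirical value frequencies."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())
```

For example, a constant attribute has entropy 0, and a two-valued attribute with equal frequencies has entropy ln 2.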

Many high-dimensional data clustering approaches are based on the attribute independence hypothesis (Barbar

In this paper, we develop a new method that utilizes the holo-entropy for hierarchical clustering. Based on the analysis of the intrinsic relevance of features and objects in subclusters, we develop a principled approach for selecting clusters to merge and determining the feature subspace of the merged cluster. The holo-entropy is used to measure the intrinsic relevance. The main idea is that the holo-entropy of two subclusters originating from the same class should be much smaller than if the two subclusters originate from different classes. Therefore, the two subclusters with minimal holo-entropy are selected to merge in the hierarchical clustering. In order to describe our algorithm, the following notation and definitions are introduced.
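Under Wu and Wang's formulation, the holo-entropy of a set of attributes is the sum of the joint entropy and the total correlation, which reduces to the sum of the marginal attribute entropies. A minimal sketch under that reading (function names are ours):

```python
from collections import Counter
from math import log

def attr_entropy(column):
    """Shannon entropy (natural log) of one attribute column."""
    n = len(column)
    return -sum((c / n) * log(c / n) for c in Counter(column).values())

def holo_entropy(objects):
    """Holo-entropy of a cluster of categorical objects (tuples):
    the sum of the marginal entropies of its attributes, i.e.
    joint entropy + total correlation, per Wu and Wang."""
    columns = list(zip(*objects))
    return sum(attr_entropy(col) for col in columns)
```

Merging two subclusters from the same class should yield a small holo-entropy, since their attribute values largely agree.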

We use

The mutual information (He

The conditional mutual information (Watanabe,

According to Watanabe’s proof (Watanabe,

Based on Wu and Wang (

An example dataset, Dataset1.

No. object | Label | |||
1 | 0 | 1 | 0 |
2 | 0 | 2 | 2 |
3 | 0 | 2 | 1 |
4 | 0 | 2 | 1 |
5 | 2 | 3 | 2 |
6 | 2 | 3 | 3 |
7 | 1 | 3 | 0 |
8 | 0 | 3 | 3 |

Moreover,

From this example, we can see that the holo-entropy of two merged subclusters from the same (homogeneous) class is much smaller than that of merged subclusters from different (heterogeneous) classes. This indicates that holo-entropy is an effective measurement for the compactness of a subspace in a cluster. In what follows, we will use the holo-entropy for subspace detection. Moreover, we will employ the soft clustering method (Domeniconi

In this section, we present our Hierarchical Projected Clustering algorithm for Categorical Data (HPCCD) in detail. To illustrate the ideas underlying the algorithm, we also provide a working example with 11 objects from the dataset Soybean, as shown in Table

desired number of clusters;

Initialization (Grouping the data into subclusters

A: relevant subspace selection

A1: Detect the relevant subspace of each subcluster of

A2: Assign weights to the attributes in the relevant subspace;

B: compactness calculation (calculate compactness binding weight with

holo-entropy);

Choose the most compact pair of subclusters to merge and update set

until satisfaction of the termination condition of

The details of each step will be described in the following sub-sections.

Cluster initialization is a necessary step in our approach, as it makes the information-theoretic estimation of attribute relevance meaningful and also reduces the number of cluster-merging steps. For the initialization of our agglomerative clustering algorithm, the dataset is first divided into small subclusters. In order to ensure that objects in the same subcluster are as similar as possible, we use the following categorical data similarity measure for initialization. The similarity between any two objects

Set

Use Eq. (

the clusters

Allocate

Allocate

In this algorithm, Eq. (
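The similarity equation itself is not reproduced in this excerpt; as a hedged stand-in, a common choice for categorical objects is the simple-matching similarity, which counts the fraction of attributes on which two objects agree:

```python
def matching_similarity(x, y):
    """Simple-matching similarity between two categorical objects:
    the fraction of attributes on which they take the same value.
    (An illustrative stand-in for the paper's equation, which may differ.)"""
    assert len(x) == len(y)
    return sum(a == b for a, b in zip(x, y)) / len(x)
```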

11 samples from the Soybean dataset.

No. object | Clusters | Class label | |||||
1 | 0 | 1 | 1 | 0 | 1 | 1 |
2 | 0 | 1 | 2 | 1 | 1 | 1 |
3 | 0 | 1 | 1 | 1 | 1 | 1 |
4 | 0 | 1 | 2 | 0 | 1 | 1 |
5 | 0 | 1 | 0 | 2 | 2 | 1 |
6 | 0 | 1 | 1 | 0 | 2 | 1 |
7 | 0 | 1 | 0 | 2 | 2 | 1 |
8 | 1 | 0 | 1 | 4 | 0 | 2 |
9 | 0 | 0 | 1 | 4 | 0 | 2 |
10 | 0 | 0 | 1 | 4 | 0 | 2 |
11 | 1 | 1 | 1 | 4 | 0 | 2

The similarity measure of Eq. (

In this subsection, we address the problem of optimally determining the subspace (Domeniconi

Subspace clustering (Gan and Wu,

In order to determine the relevant attributes for each cluster, we need to find an optimal division that separates characteristic attributes from noisy ones. Although a great deal of work on subspace clustering has been reported, attribute relevance analysis in the existing approaches is often based on analysing the variance of each individual attribute while assuming that attributes are mutually independent (Barbar

Let the whole feature space of a cluster (or cluster candidate)

In both cases, the smaller the value of

Obviously, this measure is designed to strike a balance between the sizes of the relevant and non-relevant subspaces.
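A sketch of such a division is given below. We assume that attributes with low normalized entropy are the relevant ones and, in place of the paper's optimized split criterion, use the mean normalized entropy as an illustrative threshold:

```python
def relevant_subspace(norm_entropies, threshold=None):
    """Select the relevant attributes of a cluster as those with low
    normalized entropy. The paper minimizes a specific balance criterion;
    as an illustrative stand-in we threshold at the mean normalized
    entropy. Returns the indices of the relevant attributes."""
    if threshold is None:
        threshold = sum(norm_entropies) / len(norm_entropies)
    return [j for j, h in enumerate(norm_entropies) if h <= threshold]
```

On the normalized entropies of the Soybean subclusters given below, this simple rule already separates the low-entropy (relevant) attributes from the high-entropy ones.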

The optimization (minimization, in fact) of Eq. (

For the Soybean dataset, after initialization, the dataset is divided into three subclusters

The normalized entropy of attributes for each subcluster.

Clusters | |||||
0 | 0 | 1 | 1 | 0 |
0 | 0 | 0.6365 | 1 | 0 |
1 | 0.8113 | 0 | 0 | 0

The optimization in Eq. (

Calculate attribute information entropy

Use Eq. (

Use Eq. (

To select the subclusters to merge, we introduce a new measure, named intrinsic compactness. It differs from conventional variance-based compactness measures in that it incorporates both the feature relevance in the relevant subspace and the size of the cluster. For the calculation of intrinsic compactness, we assign a weight to each relevant attribute.

The relevant subspace vectors of each subcluster.

Clusters | |||||
y | y | n | n | y |
y | y | n | n | y |
n | n | y | y | y

Attribute weighting, widely used in soft subspace clustering (Domeniconi

In order to take the size of the subclusters into account, we propose a simple approach for computing attribute weights over the relevant subspace. The proposed approach calculates the weights from the entropy of the attributes in the relevant subspace and the number of objects in the corresponding subcluster. This method is motivated by effectiveness in practical applications rather than by theoretical needs, as the attribute weight for a merged subcluster is closely related to the entropy of the attribute and the sizes of the two subclusters. The weight of an attribute is proportional to the size of the subcluster and inversely proportional to the entropy. In other words, the formula can be written as

With the above formulas, we consider both the attribute entropy and the size of the subclusters in computing the weight of an attribute in a merged cluster. The initial importance of each attribute is modulated by the size of the subcluster. An important attribute coming from a large subcluster is assigned a larger weight. On the other hand, an important attribute coming from a small subcluster is assigned a proportionally smaller weight. The contribution of the selected relevant subspace of each subcluster to the merged cluster is thus better balanced.
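A minimal sketch of this weighting scheme follows. The exact formula in the paper may differ; here each raw weight is simply made proportional to the subcluster size and inversely proportional to a shifted entropy, then normalized:

```python
def attribute_weights(entropies, n_objects):
    """Illustrative attribute weighting: proportional to the subcluster
    size, inversely proportional to the (shifted) attribute entropy,
    normalized to sum to 1. The +1 shift avoids division by zero for
    zero-entropy attributes; the paper's exact formula may differ."""
    raw = [n_objects / (h + 1.0) for h in entropies]
    total = sum(raw)
    return [r / total for r in raw]
```

With equal entropies the weights are equal (as in the first row of the weight table below), and a lower-entropy attribute always receives a larger weight than a higher-entropy one from the same subcluster.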

Continuing with the example given in Section

Relevant subspace vectors of merged subclusters.

Clusters | |||||
y | y | n | n | y |
y | y | y | y | y |
y | y | y | y | y

Relevant subspace weights for each merged subcluster.

Clusters | |||||
0.3333 | 0.3333 | * | * | 0.3333 |
0.1667 | 0.1667 | 0.1667 | 0.1667 | 0.3333 |
0.1429 | 0.1429 | 0.1905 | 0.1905 | 0.3333

The main phase of weight calculation is described as follows:

Calculate the size of

Calculate the size of

Find the union of

Use Eqs. (

Use Eq. (

As subcluster merging criterion, most existing hierarchical algorithms (Guha

Information entropy of attributes for each merged subcluster.

Clusters | |||||
0 | 0 | * | * | 0.6829 |
0.5623 | 0.6616 | 0.5623 | 1.0397 | 0.6931 |
0.5938 | 0.6829 | 0.5938 | 0.9557 | 0.6829

Compactness of merged subclusters.

Clusters | |||
– | 0.2276 | 0.7020 |
– | – | 0.7067 |
– | – | –

Given the symmetry of the intrinsic compactness measure, the compactness matrix need only be stored as an upper triangular matrix (see Table

Use Eq. (

The last step of our algorithm is merging clusters. First, the minimal value in the inter-cluster compactness matrix is found; the row and column of that minimal value then give the labels of the corresponding subclusters. This pair of subclusters most likely comes from a homogeneous class, and is selected for merging. The entropies, subspaces, weights and the inter-cluster compactness matrix are then updated for the next iteration. The main phase of cluster merging is described as follows:

Find minimal compactness

merge subclusters

Update
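The merging loop above can be sketched as follows, where `compactness(a, b)` stands in for the paper's weighted holo-entropy-based intrinsic compactness of the merged pair (the function name and loop structure are ours):

```python
def agglomerate(clusters, compactness, k):
    """Greedy agglomerative loop: repeatedly merge the pair of
    subclusters whose merged compactness is minimal, until k clusters
    remain. `compactness(a, b)` is any measure of the merged pair;
    in the paper it is the weighted holo-entropy."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: compactness(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]                          # drop the absorbed subcluster
    return clusters
```

In a full implementation the entropies, subspaces, weights and compactness matrix would be updated incrementally rather than recomputed from scratch at each step.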

In this section, we report experimental results of HPCCD on nine datasets from UCI Machine Learning Repository and comparison with four state-of-the-art algorithms. We will first describe the experimental design (Section

Besides the HPCCD algorithm, we tested four state-of-the-art algorithms, MGR (Qin

Nine real-life datasets obtained from the UCI Machine Learning Repository (UCI Machine Learning Repository,

UCI dataset description.

Dataset name | Number of objects | Number of attributes | Number of classes
Soybean | 47 | 35 | 4
Votes | 435 | 16 | 2
Mushroom | 8124 | 22 | 2
Nursery | 12960 | 8 | 4
Zoo | 101 | 16 | 7
Chess | 3196 | 36 | 2
Hayes-roth | 132 | 4 | 3
Balance scale | 625 | 4 | 3
Car evaluation | 1728 | 6 | 4

Our implementation runs on a desktop computer with a 2.4 GHz CPU and 4 GB of memory. To eliminate the effect of random factors, we ran each algorithm 10 times (with random initialization) on every dataset, and all results shown are averages.

We use two criteria, clustering accuracy and

For the sake of comparing clustering results against external criteria, we introduce another clustering criterion, the Adjusted Rand Index (

The closer the clustering result is to the real classes, the larger the value of the corresponding
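For reference, the Adjusted Rand Index can be computed from the contingency table of the two labelings as follows (this is the standard formulation, not code from the paper):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same objects.
    Returns 1.0 for identical partitions and about 0 for random ones."""
    n = len(labels_true)
    # Pair counts within each cell of the contingency table.
    contingency = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    # Pair counts within each row (true classes) and column (clusters).
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)
```

Note that the ARI is invariant to a relabeling of the clusters, which is why identical partitions with swapped labels still score 1.0.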

In this subsection, we will report and analyse the clustering results of HPCCD on the various datasets mentioned above. Tables

The Zoo dataset contains 101 objects and comprises the classes Mammal, Bird, Fish, Invertebrate, Insect, Amphibian and Reptile. HPCCD obtains four clusters that correspond perfectly to the original classes Mammal, Invertebrate, Insect and Reptile and three other clusters that are quite pure in terms of the majority class. The distribution of the data points in each cluster is given in Table

Results of HPCCD on the Zoo dataset.

Cluster | Instances | Classes | Accuracy | ARI | ||||||
| | Mam | Bir | Fis | Inv | Ins | Amp | Rep | |
1 | 41 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | ||
2 | 4 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | ||
3 | 10 | 0 | 0 | 0 | 0 | 0 | 8 | 2 | ||
4 | 21 | 0 | 20 | 1 | 0 | 0 | 0 | 0 | 0.960 | 0.963
5 | 13 | 0 | 0 | 0 | 13 | 0 | 0 | 0 | ||
6 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | ||
7 | 5 | 0 | 0 | 4 | 0 | 0 | 0 | 1

Results of HPCCD on the Soybean dataset.

Cluster | Instances | Classes | Accuracy | ARI | |||
1 | 10 | 10 | 0 | 0 | 0 | ||
2 | 10 | 0 | 0 | 10 | 0 | 1.0 | 1.0
3 | 17 | 0 | 0 | 0 | 17 | ||
4 | 10 | 0 | 10 | 0 | 0

Clustering accuracy of five algorithms on nine datasets.

Algorithm | Zoo | Vote | Soybean | Mushroom | Chess | Nursery | Car-evaluation | Hayes-roth | Balance-scale
HPCCD | 0.8641 | 0.652 |
MGR | 0.931 | 0.828 | * | 0.667 | 0.534 | 0.53 | 0.485 | 0.635 |
K-modes | 0.6930 | 0.8344 | 0.8510 | 0.5179 | 0.5475 | 0.4287 | 0.5410 | 0.4621 | 0.6336
COOLCAT | 0.8900 | 0.7816 | 0.9362 | 0.5220 | 0.5228 | 0.3775 | 0.4394 |
LIMBO | 0.9109 | 0.8230 | 0.5222 | 0.4974 | 0.4621 | 0.6096

Clustering ARI of five algorithms on nine datasets.

Algorithm | Zoo | Vote | Soybean | Mushroom | Chess | Nursery | Car-evaluation | Hayes-roth | Balance-scale
HPCCD | 0.5302 | 0.0428 | 0.0923 |
MGR | 0.9617 | 0.4279 | * | 0.1254 | 0.0036 | 0.1680 | 0.0129 | 0.0392 | 0.1011
K-modes | 0.4782 | 0.4463 | 0.6586 | 0.0533 | 0.0082 | 0.0565 | 0.0252 | 0.0915 |
COOLCAT | 0.8586 | 0.3154 | 0.8214 | 0.0018 | 0.0018 | 0.0083 | 0.0500 | .00261 |
LIMBO | 0.8318 | 0.4159 | 0.0082 | 0.0793 | 0.0285 | 0.0381 | 0.0684

For the Soybean dataset, our algorithm achieves an accuracy of 1, and the

Tables

The good performance of HPCCD shown in the previous section is due largely to the effective selection of the relevant subspace for each cluster. In this section, we provide some preliminary analysis of the benefit that these relevant subspaces bring to the characterization of the clusters. For this purpose, we build global feature subspaces from the individual feature subspaces and perform clustering using a different clustering algorithm, to test whether the feature subspaces derived from the results of HPCCD improve the results of other clustering algorithms. Without loss of generality, we report clustering results obtained by the K-modes algorithm.

For simplicity, we consider the intersection and union sets of all the clusters' relevant subspaces. The features in the intersection set, named the core features, are relevant to all the clusters. Features in the union set are named principal features; principal features thus include core features. Principal features that are not core features may be relevant only to some clusters and contribute to those clusters' structures. The set of principal features thus corresponds to a global feature selection, and the set of core features, if not empty, provides the common features relevant to the cluster structure. It is expected that using the set of principal features will improve the clustering results compared to using the full feature space, because it avoids the use of noise features from the full feature space. It is also expected that using only the set of core features will still generate high-quality clusters, as these features are essential to all clusters.
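The principal and core feature spaces can be obtained directly as the union and intersection of the per-cluster relevant subspaces (the function name is ours):

```python
def principal_and_core(subspaces):
    """Given the relevant subspace (list of attribute indices) of each
    cluster, return (principal, core): the sorted union and the sorted
    intersection of all the subspaces."""
    principal = set().union(*subspaces)
    core = set(subspaces[0]).intersection(*subspaces[1:])
    return sorted(principal), sorted(core)
```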

The following experiment confirms the above expectations. We tested the K-modes algorithm on the principal feature subspace and the core feature subspace of the nine datasets. For the Zoo data, Table

Relevant subspace of each cluster for the Zoo dataset.

Cluster | Relevant subspace
1 | 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 14
2 | 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 13, 15, 16
3 | 2, 3, 4, 6, 7, 8, 9, 10, 12, 14, 15, 16
4 | 1, 2, 3, 4, 8, 9, 10, 11, 12, 13, 14
5 | 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14
6 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16
7 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16

This result also seems compatible with our a priori knowledge of the data. Based on the details of the Zoo data from the UCI Machine Learning Repository (Andritsos

Table

Relevant subspace of each cluster for the Soybean dataset.

Cluster | Dimensions of relevant subspace
1 | 2, 3, 4, 5, 7, 11, 12, 20, 21, 22, 23, 24, 26, 27, 28
2 | 2, 3, 4, 7, 11, 12, 20, 21, 23, 24, 25, 26, 27, 28, 35
3 | 2, 3, 8, 11, 12, 21, 22, 23, 24, 25, 26, 27, 28, 35
4 | 2, 3, 7, 11, 12, 20, 22, 23, 25, 26, 27, 28, 35

Clustering accuracy on full feature space, principal feature space and core feature space.

Dataset | Full feature space | Principal feature space | Core feature space
Zoo | 0.6930 | 0.6930 | 0.8811
Votes | 0.8344 | 0.8344 | 0.8851
Soybean | 0.8510 | 1.0 | 0.8085
Chess | 0.5475 | 0.5663 | 0.5222
Nursery | 0.4287 | 0.5301 | 0.342
Car evaluation | 0.5410 | 0.6887 | 0.7
Hayes-roth | 0.4621 | 0.5075 |
Balance scale | 0.6336 | 0.5056 |

Based on the discussions above, our approach not only precisely detects the clusters and their relevant features, but also discovers the principal feature space, and the core feature space. The core feature space determines the key cluster structure, because it is relevant to all clusters. This is very important in knowledge mining from high-dimensional data.

Most hierarchical clustering algorithms tend to be based on similarity measures computed on a common feature space, which is not effective for clustering high-dimensional data. In this paper, we have proposed a new way of exploring information entropy, holo-entropy, attribute selection and attribute weighting in order to extract the feature subspace of each cluster and to merge clusters that have different feature subspaces. The new algorithm differs from existing mainstream hierarchical clustering algorithms in its use of a weighted holo-entropy in place of pairwise-similarity-based measures for merging two subclusters. The advantages of our algorithm are as follows. First, it takes the interrelationships of the attributes into account and avoids the conditional independence hypothesis, an implicit hypothesis made by most existing hierarchical clustering algorithms. Secondly, it employs the entropy and holo-entropy to detect the relevant subspace, and finds the principal feature space and the core feature space within the whole feature space of the corresponding subclusters. Thirdly, it uses intrinsic compactness as the criterion for merging subclusters, rather than a traditional similarity measurement. Our experiments demonstrate the effectiveness of the new algorithm in terms of clustering accuracy and of the analysis of the relevant subspaces obtained.

This project is supported by the National Natural Science Foundation of China (61170130) and the State Key Laboratory of Rail Traffic Control and Safety (Contract No. RCS2012K005), Beijing Jiaotong University.