## 1 Introduction and Problem Statement

*et al.*, 2011; Yu *et al.*, 2012; Fukunaga, 2013; Cover and Hart, 1967; Derrac *et al.*, 2012; Santos *et al.*, 2013). It has been widely used for data analysis in many fields, including anthropology, biology, economics, marketing, and medicine. Typical applications include disease classification, document retrieval, image processing, market segmentation, scene analysis, and web access pattern analysis (Guan *et al.*, 2013; Li *et al.*, 2013; Shrivastava and Tyagi, 2014; Hruschka *et al.*, 2006; Li *et al.*, 2006; Lingras *et al.*, 2005; Choi *et al.*, 2012).

*et al.*, 2005; Lu *et al.*, 2011). For categorical data, the main difficulty is estimating the attribute weights based on the statistics of categories in a cluster. In fact, in the existing methods (Bai *et al.*, 2011a; Xiong *et al.*, 2011), an attribute is weighted solely according to the mode category for that attribute. Consequently, the weights easily yield a biased indication of the relevance of attributes to clusters. To solve this problem, it is necessary to analyse the relationship between attributes and clusters.

## 2 Related Work

*et al.* develop the Classification EM algorithm (CEM) to estimate the parameters of a mixture model based on the classification likelihood approach. In Gan *et al.* (2005), Gan *et al.* present a genetic K-modes algorithm named GKMODE. It introduces a K-modes operator in place of the normal crossover operator and finds a globally optimal partition of a given categorical dataset into a specified number of clusters. Other extensions of the K-modes algorithm include fuzzy centroids (Kim *et al.*, 2004) using fuzzy logic, more effective initialization methods (Cao *et al.*, 2009; Bai *et al.*, 2011b) for the K-modes and fuzzy K-modes, attribute weighting (He *et al.*, 2011) and the use of genetic programming techniques for fuzzy K-modes (Gan *et al.*, 2009).

*et al.*, 2005), a K-means-like clustering algorithm for categorical data that optimizes an objective function based on mutual information sharing. Most of the conventional algorithms (Shrivastava and Tyagi, 2014; Hruschka *et al.*, 2006; Xiong *et al.*, 2011; Wu and Wang, 2013) involve optimizing an objective function defined on a pairwise measure of the similarity between objects. Unfortunately, this optimization problem is usually NP-complete and requires the use of heuristic methods in practice. In such solutions, the focus is primarily on the relationship between objects and clusters, while attribute relevance within a cluster is often ignored (Barbar *et al.*, 2002; Qin *et al.*, 2014; Ganti *et al.*, 1999; Greenacre and Blasius, 2006). Moreover, the lack of an intuitive method for determining the number of clusters and high time complexity are common challenges for this type of algorithm.

*et al.*, 2004; Parsons *et al.*, 2004) forming its feature subspace. For example, HARP (Yip *et al.*, 2004) is a hierarchical projected clustering algorithm based on the assumption that two data points are likely to belong to the same cluster if they are very similar to each other along many dimensions. However, when the number of relevant dimensions per cluster is much lower than the dataset dimensionality, such an assumption may not be valid, because relevant information concerning these data points in a large subspace is lost. Some projected clustering algorithms, such as PROCLUS (Aggarwal *et al.*, 1999) and ORCLUS (Aggarwal and Yu, 2002), require the user to provide the average dimensionality of the subspaces, which is very difficult to establish in real-life applications. PCKA (Bouguessa and Wang, 2009), a distance-based projected clustering algorithm, was recently proposed to improve the quality of clustering when the dimensionalities of the clusters are much lower than that of the dataset. However, it requires users to provide values for some input parameters, such as the number of nearest neighbours of a 1D point, which may significantly affect its performance. These algorithms depend on pairwise similarity measures that do not take correlations between attributes into account.

*et al.*, 1999) is an agglomerative hierarchical clustering algorithm based on the extension of the pairwise similarity measure. It extends the Jaccard coefficient similarity measure by exploiting the concept of neighbourhood. Its performance depends heavily on the neighbour threshold, and its time complexity depends on the number of neighbours. These parameters are difficult to estimate in real applications. Instead of using pairwise similarity measures, the K-modes algorithm (Guha *et al.*, 1999; Zhang and Fang, 2013) defines a similarity between an individual categorical object and a set of categorical objects. When the clusters are well established, this approach has the advantage of being more meaningful. The performance of the K-modes algorithms relies heavily on the initialization of the K modes. DHCC (Xiong *et al.*, 2011) is a divisive hierarchical clustering algorithm in which a meaningful object-to-clusters similarity measure is defined. DHCC is capable of discovering clusters embedded in subspaces, and is parameter-free with linear time complexity. However, DHCC's space complexity is a limitation, as it depends on the square of the number of values of all the categorical attributes.

*et al.*, 2000; Barbar *et al.*, 2002). The goal of these approaches is to seek an optimum grouping of the objects such that the entropy is the smallest. An entropy-based fuzzy clustering algorithm (EFC) is proposed in Yao *et al.* (2000). EFC calculates entropy values and implements clustering analysis based on the degree of similarity. It requires the setting of a similarity threshold to control the similarity among the data points in a cluster. This parameter, which affects the number of clusters and the clustering accuracy, is very difficult to determine. The COOLCAT algorithm (Barbar *et al.*, 2002) employs the notion of entropy in assigning unclustered objects. A given data object is assigned to a cluster such that the entropy of the resulting clustering is minimal. The incremental assignment terminates when every object has been placed in some cluster. The clustering quality depends heavily on the input order of the data objects. The LIMBO algorithm (Andritsos *et al.*, 2004) is a hierarchical clustering algorithm based on the concept of an Information Bottleneck (IB), which quantifies the relevant information preserved in clustering results. It proposes a novel measurement for the similarity among subclusters by way of the Jensen–Shannon divergence.

*et al.*, 2014) searches equivalence classes from attribute partitions to form a clustering of objects that shares the greatest possible quantity of information with the attribute partitions. First, MGR selects the clustering attribute whose partition shares the most information with the partitions defined by the other attributes. Then, on the clustering attribute, the equivalence class with the highest intra-class similarity is output as a cluster, and the remaining objects form the new current dataset. These two steps are repeated on the new current dataset until all objects are output. In MGR, because the attributes are selected one by one, the relevance of attributes (or combinations of attributes) is not considered sufficiently. For example, say attribute A1 carries the most information, and A2 is in second place; it is not necessarily true that the combination A1, A2 carries more information than some other combination A3, A4. How to select the subset of attributes which shares the most information with the partitions will be analysed in this paper.

## 3 The Holo-Entropy Theory

*S* is defined as $E(S)=-{\textstyle\sum _{i=1}^{m}}p({s_{i}})\ln (p({s_{i}}))$. The entropy makes it possible to assess the structure of an attribute, i.e. whether its values are distributed compactly or sparsely. Therefore, it can be used as a criterion expressing the degree to which an attribute is or is not characteristic of a cluster.
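As an illustration (not part of the original algorithm, names are ours), the entropy of a single categorical attribute can be computed in a few lines of Python:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy E(S) = -sum p(s_i) ln p(s_i) of a categorical attribute."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

# A compactly distributed attribute has low entropy; a sparse one, high:
compact = ["a", "a", "a", "a"]   # entropy ≈ 0
sparse = ["a", "b", "c", "d"]    # entropy ≈ ln 4 ≈ 1.386
```

This is the building block for all entropy-based measures in the sections that follow.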

*et al.*, 2002; Ganti *et al.*, 1999; Greenacre and Blasius, 2006). Such a hypothesis not only ignores the degree of correlation among attributes, but also fails to consider attribute relevance and heterogeneity in the data. Methods derived from these approaches do not satisfy the requirements of many practical applications. The holo-entropy (Wu and Wang, 2013) is a compactness measure that incorporates not only the distribution of individual attributes but also correlations between attributes. It has been effectively used in evaluating the likelihood of a data object being an outlier. There is as yet no reported work on how holo-entropy can contribute to high-dimensional categorical data clustering. Our hypothesis in this work is that the holo-entropy may contribute to relevant subspace detection and compactness measurement in cluster analysis. Indeed, the goal of subspace detection is to find a set of attributes on which the data of a cluster are distributed compactly; this attribute set expresses the features of the cluster.

*n* samples, where each ${x_{i}}$ has *m* categorical attributes. The *m* attributes ${[{A_{1}},{A_{2}},\dots ,{A_{m}}]^{T}}$ are also represented by the attribute vector *A*. Each attribute ${A_{i}}$ has a value domain defined by $[{a_{i,1}},{a_{i,2}},\dots ,{a_{i,{n_{i}}}}]$ $(1\leqslant i\leqslant m)$, where ${n_{i}}$ is the number of distinct values in attribute ${A_{i}}$. From the information-theoretic perspective, ${A_{i}}$ is considered a random variable, and $A={[{A_{1}},{A_{2}},{A_{3}},\dots ,{A_{m}}]^{T}}$ is considered a random vector. The entropy $E(A)$ of the random vector *A* on the set *X* is defined, according to the chain rule for the entropy (Cover and Thomas, 2012), by:

##### (1)

\[\begin{aligned}{}E(A)=& E({A_{1}},{A_{2}},\dots ,{A_{m}})={\sum \limits_{i=1}^{m}}E({A_{i}}\mid {A_{i-1}},\dots ,{A_{1}})\\ {} =& E({A_{1}})+E({A_{2}}\mid {A_{1}})+\cdots +E({A_{m}}\mid {A_{m-1}},\dots ,{A_{1}}),\end{aligned}\]
where the probabilities are estimated on the set *X*.

##### Definition 3 (*Total correlation*).

*X* is equal to the sum of all mutual information among random variables, where, for $2\leqslant i\leqslant m$, $I({A_{{r_{1}}}},\dots ,{A_{{r_{i}}}})=I({A_{{r_{1}}}},\dots ,{A_{{r_{i-1}}}})-I({A_{{r_{1}}}},\dots ,{A_{{r_{i-1}}}}\mid {A_{{r_{i}}}})$ is the multivariate mutual information of ${A_{{r_{1}}}},\dots ,{A_{{r_{i}}}}$, and $I({A_{{r_{1}}}},\dots ,{A_{{r_{i-1}}}}\mid {A_{{r_{i}}}})=E(I({A_{{r_{1}}}},\dots ,{A_{{r_{i-1}}}}|{A_{{r_{i}}}}))$ is the conditional mutual information. Thus, the total correlation can be used for estimating the interrelationships among the attributes or the shared information of subclusters.

The holo-entropy is defined as the sum of the entropy and the total correlation of the random vector *A*, and can be expressed by the sum of the entropies on all attributes.
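Following this expression (holo-entropy as the sum of the per-attribute entropies), a minimal Python sketch, evaluated on cluster ${C_{1}}$ of Table 1:

```python
import math
from collections import Counter

def attr_entropy(values):
    """Shannon entropy (natural log) of one categorical attribute."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def holo_entropy(objects):
    """Holo-entropy of a set of objects (tuples of categorical values):
    the sum of the entropies of the individual attributes, i.e. the entropy
    of the attribute vector plus its total correlation."""
    return sum(attr_entropy(col) for col in zip(*objects))

# Cluster C1 of Table 1 (attributes A1, A2, A3):
C1 = [(0, 1, 0), (0, 2, 2), (0, 2, 1), (0, 2, 1)]
# per-attribute entropies: 0, 0.5623 and 1.0397, so holo-entropy ≈ 1.6021
```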

##### Table 1

| No. object | ${A_{1}}$ | ${A_{2}}$ | ${A_{3}}$ | Label |
| --- | --- | --- | --- | --- |
| 1 | 0 | 1 | 0 | ${C_{1}}$ |
| 2 | 0 | 2 | 2 | ${C_{1}}$ |
| 3 | 0 | 2 | 1 | ${C_{1}}$ |
| 4 | 0 | 2 | 1 | ${C_{1}}$ |
| 5 | 2 | 3 | 2 | ${C_{2}}$ |
| 6 | 2 | 3 | 3 | ${C_{2}}$ |
| 7 | 1 | 3 | 0 | ${C_{2}}$ |
| 8 | 0 | 3 | 3 | ${C_{2}}$ |

*et al.*, 2004; Gan and Wu, 2004; Nemalhabib and Shiri, 2006) and the properties of holo-entropy for merging subclusters in the hierarchical clustering.

## 4 HPCCD Algorithm

**Algorithm HPCCD**

**Input:** Dataset *X*, threshold *r* and the terminal condition ${K_{\mathit{init}}}$, which is the

**Output:** clusters of Dataset *X*.

**Begin**

*C*);

**For** each subcluster in set *C*

*C*;

*C*;

**End**

*K* desired clusters.

### 4.1 Initialization

*m* is the number of dimensions of object *X*, and ${x_{id}}$, ${x_{jd}}$ are the *d*th attribute values of ${X_{i}}$, ${X_{j}}$, respectively. Drawing inspiration from Nemalhabib and Shiri (2006), we extend this similarity measure to calculate the similarity between an object and a cluster. The definition is as follows:

##### (8)

\[ \mathit{Sim}({X_{i}},C)=\frac{{\textstyle\textstyle\sum _{a=1}^{|C|}}{\textstyle\textstyle\sum _{d=1}^{m}}\| {x_{id}},{x_{ad}}\| }{|C|\ast m},\]
where $|C|$ is the number of objects in *C*, ${X_{a}}$ is one of the objects in *C*, and the first sum in the numerator covers all the objects in *C*. The main phase of initialization is described as follows:

**Initialization**

**Input:** Dataset $X=\{{x_{1}},{x_{2}},\dots ,{x_{n}}\}$, threshold *r*

**Output:** ${K_{\mathit{init}}}$ subclusters $C=\{{C_{1}},{C_{2}},\dots ,{C_{{K_{\mathit{init}}}}}\}$.

**Begin**

**For** each ${x_{i}}$ in *R*

**If** ${S_{l}}=\max \{{S_{1}},\dots ,{S_{k}}\}>r$;

**Else**

**End** if

**End** for

**End**

*r*, then ${x_{i}}$ is assigned to an existing cluster; otherwise, a new cluster ${C_{k+1}}$ will be created. The final number of initial clusters is thus controlled by the similarity threshold *r*, which ranges in $(0,1)$. In fact, a larger *r* results in more subclusters with fewer objects, while a smaller *r* results in fewer subclusters with more objects. In theory, the choice of *r* is application-dependent; however, we found in our experiments that the results are not sensitive to the choice of *r* as long as its value is large enough to ensure the creation of a sufficient number of initial clusters. Our guideline is that if the attribute values of an object are the same as those of a sub-cluster on about three quarters of the dimensions or more, the object can be regarded as coming from the same sub-cluster. To ensure that all the objects in each cluster are sufficiently similar to each other, we recommend choosing *r* between 0.7 and 0.95. In our experiments, we have chosen *r* to be 0.80 for the different datasets.
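The initialization phase can be sketched in Python as follows. We assume the matching function $\| {x_{id}},{x_{ad}}\| $ of Eq. (8) is 1 when the two values are equal and 0 otherwise (the usual simple-matching convention); Eqs. (6) and (7) are not reproduced in this excerpt, so this is only an illustrative sketch, with names of our own choosing:

```python
def sim_obj_cluster(x, cluster):
    """Eq. (8)-style similarity: average per-attribute match between
    object x and every object of cluster C."""
    m = len(x)
    matches = sum(sum(a == b for a, b in zip(x, y)) for y in cluster)
    return matches / (len(cluster) * m)

def initialize(X, r=0.8):
    """One-pass initialization: assign each object to the most similar
    existing subcluster if that similarity exceeds r, otherwise open a
    new subcluster."""
    clusters = []
    for x in X:
        scores = [sim_obj_cluster(x, c) for c in clusters]
        if scores and max(scores) > r:
            clusters[scores.index(max(scores))].append(x)
        else:
            clusters.append([x])
    return clusters

# Two tight groups separated by a low threshold end up as two subclusters:
X = [(0, 0, 0, 0, 0), (0, 0, 0, 0, 1), (1, 1, 1, 1, 1), (1, 1, 1, 1, 0)]
```

Note that one pass over the data gives the order-insensitivity discussed below only when *r* is chosen high enough that near-duplicates always beat the threshold.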

##### Table 2

| No. object | Clusters | ${A_{1}}$ | ${A_{2}}$ | ${A_{3}}$ | ${A_{4}}$ | ${A_{5}}$ | Class label |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | ${C_{1}}$ | 0 | 1 | 1 | 0 | 1 | 1 |
| 2 | ${C_{1}}$ | 0 | 1 | 2 | 1 | 1 | 1 |
| 3 | ${C_{1}}$ | 0 | 1 | 1 | 1 | 1 | 1 |
| 4 | ${C_{1}}$ | 0 | 1 | 2 | 0 | 1 | 1 |
| 5 | ${C_{2}}$ | 0 | 1 | 0 | 2 | 2 | 1 |
| 6 | ${C_{2}}$ | 0 | 1 | 1 | 0 | 2 | 1 |
| 7 | ${C_{2}}$ | 0 | 1 | 0 | 2 | 2 | 1 |
| 8 | ${C_{3}}$ | 1 | 0 | 1 | 4 | 0 | 2 |
| 9 | ${C_{3}}$ | 0 | 0 | 1 | 4 | 0 | 2 |
| 10 | ${C_{3}}$ | 0 | 0 | 1 | 4 | 0 | 2 |
| 11 | ${C_{3}}$ | 1 | 1 | 1 | 4 | 0 | 2 |

*r* makes the initialization procedure significantly less sensitive to the input order. This is very important for the hierarchical clustering proposed in this paper, as the final result depends on the initial clusters. An example is shown in Table 2, in which three clusters can be considered to reside in the two-class dataset with 11 objects. Objects 1 to 7 from class 1 form two clusters: objects 1 to 4 as one cluster and 5 to 7 as another. The rest of the objects, 8 to 11 from class 2, form the third cluster. Each object is represented by five attributes ${A_{1}}$ to ${A_{5}}$, and the corresponding class number is given in the Class label column. Note that significant differences exist between data objects belonging to the same cluster. After the initialization phase, the dataset will be divided into three subclusters ${C_{1}}$, ${C_{2}}$ and ${C_{3}}$, using Eqs. (6), (7), (8): objects 1 to 4, 5 to 7 and 8 to 11 are assigned to subclusters ${C_{1}}$, ${C_{2}}$ and ${C_{3}}$, respectively. Such a result is obtained regardless of the order in which the data objects are presented to the initialization procedure.

### 4.2 Relevant Subspace Selection

*et al.*, 2004; Gan and Wu, 2004) for a given cluster. The idea of our approach is to optimally separate the attributes into two groups, of which one generates a subspace that pertains to the cluster. In our method, the entropy is used for attribute evaluation and holo-entropy is employed as the criterion for subspace detection.

*et al.*, 2004) is extensively applied for clustering high-dimensional data because of the curse of dimensionality (Parsons *et al.*, 2004). In subspace clustering, finding the relevant subspaces of a cluster is of great significance. Informally, a relevant or characteristic attribute of a cluster is an attribute on which the data distribution is concentrated, as compared to a non-characteristic attribute of the cluster. The characteristic attributes of a cluster form its relevant subspace. Generally, different clusters have different characteristic subspaces. The non-characteristic or noise attributes are distributed in a diffuse way; i.e. data should be sparse in the subspace generated by noise attributes. We also call these attributes irrelevant w.r.t. the cluster structure. How to separate the attribute space into relevant and noise subspaces is a key issue in subspace clustering.

*et al.*, 2002; Ganti *et al.*, 1999; Greenacre and Blasius, 2006). Moreover, existing entropy-based algorithms such as Barbar *et al.* (2002) and Qin *et al.* (2014) usually use *ad-hoc* parameter values to determine the separation between relevant and non-relevant attributes, and such methods lack flexibility for practical applications. The proposed strategy makes use of the holo-entropy in Eq. (5), as it provides an efficient and effective way to estimate the quality of the cluster structure of a given data subset on a given set of attributes. It establishes the separation through an automatic process.

Let the attribute space of *D* be *Q*, and let *Q* be separated into two feature subspaces *S* and *N*, where *S* is a candidate for the relevant subspace of *D* and *N* is a candidate for its non-relevant subspace, with $Q=N\cup S$ and $N\cap S=\varnothing $. We want to evaluate the quality of the feature-space separation in order to find an optimal *S* as the relevant subspace of *D*. In fact, by using holo-entropy, the quality of the feature subspace *S* can be evaluated by $\mathit{HL}(D|S)$. This measure can be equivalently written in terms of normalized entropies, where $\mathit{entropy}(i)$, min and max respectively denote the entropy of attribute ${A_{i}}$ and the minimum and maximum values of attribute entropy in the whole space of a subcluster, and $\mathit{std}(i)$ refers to the normalized information entropy of attribute ${A_{i}}$. An advantage of using normalized entropies is that it allows another seemingly close function to be defined to evaluate the quality of *N* as a non-relevant subspace.

*S* (or *N*) as a (non-)relevant subspace. From these two functions, we define a new measure for evaluating the quality of the feature-space separation by

*S* that leads to the optimal value of $AF(D,Q)$. This optimization can be performed by a demarcation detection process. In fact, if we first sort all the dimensions in increasing order of entropy, i.e. supposing that for any pair of dimensions *i* and *j* with $i<j$ we have $\mathit{entropy}(i)\leqslant \mathit{entropy}(j)$, then the optimization of (12) consists in finding a demarcation point *e* such that the subspace *S* composed of dimensions 1 to *e* (and *N* of dimensions $e+1$ to *m*) minimizes $AF(D,Q)$. In this optimization, we use in Eq. (12) the average of the holo-entropy measures in Eqs. (9) and (11) to compute the demarcation point. This choice is made to favour a balanced separation.
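The demarcation detection process can be sketched in Python as follows. Since Eq. (12) is not reproduced in this excerpt, the quality measure is passed in as a function `af`, and the `toy_af` below is only a placeholder, not the paper's $AF(D,Q)$:

```python
def demarcation(entropies, af):
    """Demarcation detection: sort attributes by increasing entropy, then
    scan every split point e; S = the e lowest-entropy attributes, N = the
    rest. af(S, N) stands in for the quality measure AF(D, Q) of Eq. (12);
    the split minimizing it wins. Returns the attribute indices forming S."""
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    best_S, best_val = None, float("inf")
    for e in range(1, len(order)):
        S, N = order[:e], order[e:]
        val = af(S, N)
        if val < best_val:
            best_S, best_val = S, val
    return set(best_S)

# Toy quality function (NOT the paper's AF): entropy kept in S minus
# entropy pushed to N, so low-entropy attributes are favoured for S.
ent = [0.1, 0.9, 0.2, 0.8]
toy_af = lambda S, N: sum(ent[i] for i in S) - sum(ent[i] for i in N)
```

The scan is linear in the number of attributes once they are sorted, which is what makes the separation automatic rather than threshold-based.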

##### Table 3

| Clusters | ${A_{1}}$ | ${A_{2}}$ | ${A_{3}}$ | ${A_{4}}$ | ${A_{5}}$ |
| --- | --- | --- | --- | --- | --- |
| ${C_{1}}$ | 0 | 0 | 1 | 1 | 0 |
| ${C_{2}}$ | 0 | 0 | 0.6365 | 1 | 0 |
| ${C_{3}}$ | 1 | 0.8113 | 0 | 0 | 0 |

*y* indicates that the corresponding attribute is a characteristic attribute, while *n* indicates a noisy attribute. For instance, ${A_{1}},{A_{2}},{A_{5}}$ are characteristic attributes of ${C_{1}}$, while ${A_{3}}$ and ${A_{4}}$ are noisy attributes. In other words, the relevant subspace for ${C_{1}}$ is composed of ${A_{1}},{A_{2}},{A_{5}}$, and the noisy subspace is composed of ${A_{3}}$ and ${A_{4}}$. The relevant subspace vector of ${C_{1}}$ is thus $[1,1,0,0,1]$, ${C_{2}}$’s is $[1,1,0,0,1]$ and ${C_{3}}$’s is $[0,0,1,1,1]$. Finally, the relevant subspace vectors are shown in Table 4. The main phase of relevant subspace detection is described as follows:

**Subspace Detection**

**Input:** *K* subclusters $C=\{{C_{1}},{C_{2}},\dots ,{C_{{K_{\mathit{init}}}}}\}$

**Output:** $en\_ \mathit{mat}$ (attribute entropy matrix for subclusters)

**Begin**

**For** each subcluster ${C_{i}}$ in *C*

**For** each attribute ${A_{j}}$ in ${C_{i}}$

**End** for

**End** for

**End**

### 4.3 Intrinsic Compactness

#### 4.3.1 Attribute Weighting

##### Table 4

| Clusters | ${A_{1}}$ | ${A_{2}}$ | ${A_{3}}$ | ${A_{4}}$ | ${A_{5}}$ |
| --- | --- | --- | --- | --- | --- |
| ${C_{1}}$ | y | y | n | n | y |
| ${C_{2}}$ | y | y | n | n | y |
| ${C_{3}}$ | n | n | y | y | y |

*et al.*, 2004; Gan and Wu, 2004; Parsons *et al.*, 2004; Tan *et al.*, 2005), allows feature selection to be performed using more effective optimization methods for continuous objective functions. Not all attributes are equally important for characterizing the cluster structure of a dataset. Even within the relevant subspace of a cluster, attributes may still differ from each other in importance. In this paper, we consider attribute weighting in order to take both the entropy and the size of the cluster into account. It is not enough to assign weights based solely on the entropy of the attributes: it is also necessary to take the size of the subcluster into consideration. As an example, let a dataset contain subclusters denoted by ${C_{1}}$ (with 11 objects) and ${C_{2}}$ (with 2 objects), and an attribute *A*. Suppose *A* takes the values $[1,1,1,1,1,1,2,2,3,3,4]$ in ${C_{1}}$ and $[1,2]$ in ${C_{2}}$. The entropy of attribute *A* is 1.168 in subcluster ${C_{1}}$ and 0.69 in subcluster ${C_{2}}$. Based on the entropy values alone, attribute *A* would appear more important in ${C_{2}}$ than in ${C_{1}}$. But actually, attribute *A* is more important to ${C_{1}}$ than to ${C_{2}}$, because *A* is concentrated at value 1 in ${C_{1}}$, while its values are dispersed in ${C_{2}}$. An important reason for this is that the number of objects in ${C_{1}}$ is much larger than in ${C_{2}}$, and attribute *A* in ${C_{1}}$ should thus be assigned a greater weight. Clearly, the cluster size should be considered in weight allocation, rather than performing it solely on the basis of entropy.

##### (13)

\[ w(i)=\frac{\frac{|{C_{1}}|}{\mathit{std}({C_{1}},i)+\alpha }+\frac{|{C_{2}}|}{\mathit{std}({C_{2}},i)+\alpha }}{\mathit{Total}({C_{1}})+\mathit{Total}({C_{2}})},\]
where *S* is the union subspace of the two subclusters, $|{C_{1}}|$ and $|{C_{2}}|$ denote the respective sizes of subclusters ${C_{1}}$ and ${C_{2}}$, and $\mathit{std}({C_{1}},i)$ and $\mathit{std}({C_{2}},i)$ denote the normalized entropies of attribute *i* in ${C_{1}}$ and ${C_{2}}$, respectively.
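A hedged Python sketch of the weighting in Eq. (13). $\mathit{Total}(C)$ is not defined in this excerpt, so the sketch simply assumes it is the normalizer that makes the weights sum to 1 over the relevant subspace; the function name and the value of α are our own choices:

```python
def merge_weights(std1, std2, n1, n2, alpha=1e-2):
    """Eq. (13)-style attribute weights for merging subclusters C1 and C2.
    std1/std2: normalized entropies of each relevant-subspace attribute
    in C1 and C2; n1/n2: subcluster sizes; alpha guards against division
    by zero. The denominator Total(C1)+Total(C2) is assumed here to be
    whatever normalizes the weights to sum to 1 (not stated in this
    excerpt)."""
    raw = [n1 / (s1 + alpha) + n2 / (s2 + alpha) for s1, s2 in zip(std1, std2)]
    total = sum(raw)
    return [r / total for r in raw]

# C1 ∪ C2 of Table 6: all three relevant attributes have normalized
# entropy 0 in both subclusters, giving equal weights of 1/3 each.
w = merge_weights([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], 4, 3)
```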

*y* indicates that the corresponding attribute is a characteristic attribute, while *n* indicates a noisy attribute. The relevant subspaces for ${C_{1}}$ and ${C_{2}}$ are both composed of $[{A_{1}},{A_{2}},{A_{5}}]$; their characteristic attribute union set, $[{A_{1}},{A_{2}},{A_{5}}]$, serves as the relevant subspace of ${C_{1}}\cup {C_{2}}$. The relevant subspace weights for each merged subcluster, according to formulas (13), (14) and (15), are given in Table 6 (∗ denotes that the attribute is a noise attribute for the merged subcluster).

##### Table 5

| Clusters | ${A_{1}}$ | ${A_{2}}$ | ${A_{3}}$ | ${A_{4}}$ | ${A_{5}}$ |
| --- | --- | --- | --- | --- | --- |
| ${C_{1}}\cup {C_{2}}$ | y | y | n | n | y |
| ${C_{1}}\cup {C_{3}}$ | y | y | y | y | y |
| ${C_{2}}\cup {C_{3}}$ | y | y | y | y | y |

##### Table 6

| Clusters | ${A_{1}}$ | ${A_{2}}$ | ${A_{3}}$ | ${A_{4}}$ | ${A_{5}}$ |
| --- | --- | --- | --- | --- | --- |
| ${C_{1}}\cup {C_{2}}$ | 0.3333 | 0.3333 | * | * | 0.3333 |
| ${C_{1}}\cup {C_{3}}$ | 0.1667 | 0.1667 | 0.1667 | 0.1667 | 0.3333 |
| ${C_{2}}\cup {C_{3}}$ | 0.1429 | 0.1429 | 0.1905 | 0.1905 | 0.3333 |

**Weight Calculation**

**Input:** *K* subclusters $C=\{{C_{1}},{C_{2}},\dots ,{C_{{K_{\mathit{init}}}}}\}$

**Output:** $w\_ \mathit{mat}$ (weight matrix for merged subclusters)

**Begin**

**For** each subcluster ${C_{i}}$ in *C*

**For** each subcluster ${C_{j}}$ in *C*, $(i<j)$

**For** each attribute *s* in $\mathit{subV}$

**End** for

**End** for

**End** for

**End**

#### 4.3.2 Intrinsic Compactness

*et al.*, 1999; Do and Kim, 2008; Zhang and Fang, 2013) use similarity measures that consider neither correlations between attributes nor variations in the importance of each attribute. For this reason, we propose the concept of intrinsic compactness, defined as a weighted holo-entropy computed on the attributes in the relevant subspace. The intrinsic compactness $\mathit{IC}$ is defined on a potential cluster resulting from the merging of two subclusters, and is used to measure the quality of the merge. Let ${C_{1}}$ and ${C_{2}}$ be the two subclusters and *C* the potential cluster resulting from merging ${C_{1}}$ and ${C_{2}}$. The intrinsic compactness is defined as follows, where $\mathit{RS}(C)$ stands for the relevant subspace of *C*, $w(i)$ is the weight of attribute *i*, and $E(i)$ is the entropy of attribute *i*. The intrinsic compactness $\mathit{IC}({C_{1}},{C_{2}})$ degenerates to a measure equivalent to the holo-entropy if all the weights are equal on the relevant subspace of *C*. Like the holo-entropy, $\mathit{IC}({C_{1}},{C_{2}})$ measures the compactness of *C*, but it considers only the relevant subspace and takes into account the contribution of individual attributes. The smaller $\mathit{IC}({C_{1}},{C_{2}})$ is, the more similar the intrinsic structures of ${C_{1}}$ and ${C_{2}}$ are, and the more likely it is that they originated from a homogeneous class. To continue with the above example, based on information theory, the value of each $E(i)$ is shown in Table 7 (∗ denotes a noise attribute). The intrinsic compactness of the merged subclusters, computed by formula (16), is shown in Table 8.

##### Table 7

| Clusters | ${A_{1}}$ | ${A_{2}}$ | ${A_{3}}$ | ${A_{4}}$ | ${A_{5}}$ |
| --- | --- | --- | --- | --- | --- |
| ${C_{1}}\cup {C_{2}}$ | 0 | 0 | * | * | 0.6829 |
| ${C_{1}}\cup {C_{3}}$ | 0.5623 | 0.6616 | 0.5623 | 1.0397 | 0.6931 |
| ${C_{2}}\cup {C_{3}}$ | 0.5938 | 0.6829 | 0.5938 | 0.9557 | 0.6829 |

##### Table 8

| Clusters | ${C_{1}}$ | ${C_{2}}$ | ${C_{3}}$ |
| --- | --- | --- | --- |
| ${C_{1}}$ | – | 0.2276 | 0.7020 |
| ${C_{2}}$ | – | – | 0.7067 |
| ${C_{3}}$ | – | – | – |

**Intrinsic compactness calculation**

**Input:** $w\_ \mathit{mat}$ (weight matrix for subclusters)

**Output:** $C\_ \mathit{matrix}$ (inter-subcluster compactness matrix)

**Begin**

**End**

**Cluster merging**

**Input:** $C=\{{C_{1}},{C_{2}},\dots ,{C_{k}}\}$ and $C\_ \mathit{matrix}$

**Output:** $C=\{{C_{1}},{C_{2}},\dots ,{C_{k-1}}\}$

**Begin**

*i* and *j* to ${C_{\min (i,j)}}$;

**End**
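The merging loop can be sketched as follows (function names are ours); `toy_ic` is only a placeholder dissimilarity for demonstration, not the IC of Eq. (16):

```python
def merge_until(clusters, ic, K):
    """Agglomerative loop: repeatedly merge the pair of subclusters with
    the smallest value of ic (standing in for the intrinsic compactness
    IC) until only K subclusters remain. The merged objects are kept in
    the lower-indexed subcluster, as in the pseudocode above."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > K:
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda p: ic(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i].extend(clusters[j])   # keep C_min(i, j)
        del clusters[j]
    return clusters

# Toy dissimilarity (NOT the paper's IC): distance between cluster means.
toy_ic = lambda a, b: abs(sum(a) / len(a) - sum(b) / len(b))
```

In the actual algorithm, the compactness matrix would be recomputed for the merged subcluster after each step, since its relevant subspace and weights change.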

## 5 Experiments and Results

### 5.1 Experimental Design

*et al.*, 2014), K-modes (Aggarwal *et al.*, 1999), COOLCAT (Barbar *et al.*, 2002) and LIMBO (Andritsos *et al.*, 2004) for comparison with HPCCD. The choice of these algorithms is based on the following considerations. MGR (Mean Gain Ratio) is a divisive hierarchical clustering algorithm based on information theory, which performs clustering by selecting a clustering attribute based on the mean gain ratio and detecting an equivalence class on the clustering attribute using the entropy of clusters. The partition-based K-modes algorithm is one of the first algorithms for clustering categorical data and is widely considered to be the benchmark algorithm. COOLCAT is an incremental heuristic algorithm based on information theory, which exploits the relationship between clustering and entropy: clusters of similar points yield lower entropy than clusters of dissimilar ones. LIMBO is an Information Bottleneck (IB)-based hierarchical clustering algorithm which quantifies the relevant information preserved when clustering.

##### Table 9

| Dataset name | Number of objects | Number of attributes | Number of classes |
| --- | --- | --- | --- |
| Soybean | 47 | 35 | 4 |
| Votes | 435 | 16 | 2 |
| Mushroom | 8124 | 22 | 2 |
| Nursery | 12960 | 8 | 4 |
| Zoo | 101 | 16 | 7 |
| Chess | 3196 | 36 | 2 |
| Hayes-roth | 132 | 4 | 3 |
| Balance scale | 625 | 4 | 3 |
| Car evaluation | 1728 | 6 | 4 |

### 5.2 Clustering Performance Index

*K* is the number of classes in the test dataset, and ${C_{i}}$ is the maximum number of objects in cluster *i* belonging to the same original class in the test dataset, i.e. the majority class.
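A small Python sketch of this accuracy measure (names are our own):

```python
from collections import Counter

def clustering_accuracy(clusters_labels):
    """Accuracy as defined above: for each cluster, count the objects of
    its majority true class (C_i), then divide the sum over all clusters
    by the total number of objects n. clusters_labels holds one list of
    ground-truth labels per cluster."""
    n = sum(len(c) for c in clusters_labels)
    correct = sum(Counter(c).most_common(1)[0][1] for c in clusters_labels)
    return correct / n

# Two clusters with majority classes of sizes 2 and 3 out of 7 objects:
acc = clustering_accuracy([["a", "a", "b"], ["b", "b", "b", "a"]])  # 5/7
```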

*n* objects, suppose $U=\{{u_{1}},{u_{2}},\dots ,{u_{s}}\}$ and $V=\{{v_{1}},{v_{2}},\dots ,{v_{t}}\}$ represent the original classes and the clustering result, respectively. ${n_{ij}}$ denotes the number of objects that are in both class ${u_{i}}$ and cluster ${v_{j}}$, while ${u_{i}}$ and ${v_{j}}$ also denote the numbers of objects in class ${u_{i}}$ and cluster ${v_{j}}$, respectively.

##### (18)

\[ \mathit{ARI}=\frac{{\textstyle\sum _{ij}}\binom{{n_{ij}}}{2}-\big[{\textstyle\sum _{i}}\binom{{u_{i}}}{2}{\textstyle\sum _{j}}\binom{{v_{j}}}{2}\big]/\binom{n}{2}}{\frac{1}{2}\big[{\textstyle\sum _{i}}\binom{{u_{i}}}{2}+{\textstyle\sum _{j}}\binom{{v_{j}}}{2}\big]-\big[{\textstyle\sum _{i}}\binom{{u_{i}}}{2}{\textstyle\sum _{j}}\binom{{v_{j}}}{2}\big]/\binom{n}{2}}.\]

### 5.3 Analysis of Clustering Accuracy

##### Table 10

| Cluster | Instances | Mam | Bir | Fih | Inv | Ins | Amp | Rep | Accuracy | ARI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 41 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | | |
| 2 | 4 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | | |
| 3 | 10 | 0 | 0 | 0 | 0 | 0 | 8 | 2 | | |
| 4 | 21 | 0 | 20 | 1 | 0 | 0 | 0 | 0 | 0.960 | 0.963 |
| 5 | 13 | 0 | 0 | 0 | 13 | 0 | 0 | 0 | | |
| 6 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | | |
| 7 | 5 | 0 | 0 | 4 | 0 | 0 | 0 | 1 | | |

##### Table 11

| Cluster | Instances | ${D_{1}}$ | ${D_{2}}$ | ${D_{3}}$ | ${D_{4}}$ | Accuracy | ARI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 10 | 10 | 0 | 0 | 0 | | |
| 2 | 10 | 0 | 0 | 10 | 0 | 1.0 | 1.0 |
| 3 | 17 | 0 | 0 | 0 | 17 | | |
| 4 | 10 | 0 | 10 | 0 | 0 | | |

##### Table 12

| Algorithm | Zoo | Vote | Soybean | Mushroom | Chess | Nursery | Car evaluation | Hayes-roth | Balance scale |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HPCCD | 0.9406 | 0.9218 | 1.0 | 0.8641 | 0.5823 | 0.5595 | 0.7 | 0.5152 | 0.652 |
| MGR | 0.931 | 0.828 | * | 0.667 | 0.534 | 0.53 | 0.7 | 0.485 | 0.635 |
| K-modes | 0.6930 | 0.8344 | 0.8510 | 0.5179 | 0.5475 | 0.4287 | 0.5410 | 0.4621 | 0.6336 |
| COOLCAT | 0.8900 | 0.7816 | 0.9362 | 0.5220 | 0.5228 | 0.3775 | 0.7 | 0.4394 | 1.0 |
| LIMBO | 0.9109 | 0.8230 | 1.0 | 0.8902 | 0.5222 | 0.4974 | 0.7 | 0.4621 | 0.6096 |

##### Table 13

| Algorithm | Zoo | Vote | Soybean | Mushroom | Chess | Nursery | Car evaluation | Hayes-roth | Balance scale |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HPCCD | 0.9630 | 0.7109 | 1.0 | 0.5302 | 0.0260 | 0.2181 | 0.0428 | 0.0742 | 0.0923 |
| MGR | 0.9617 | 0.4279 | * | 0.1254 | 0.0036 | 0.1680 | 0.0129 | 0.0392 | 0.1011 |
| K-modes | 0.4782 | 0.4463 | 0.6586 | 0.0533 | 0.0082 | 0.0565 | 0.0540 | 0.0252 | 0.0915 |
| COOLCAT | 0.8586 | 0.3154 | 0.8214 | 0.0018 | 0.0018 | 0.0083 | 0.0500 | 0.00261 | 1.0 |
| LIMBO | 0.8318 | 0.4159 | 1.0 | 0.6090 | 0.0082 | 0.0793 | 0.0285 | 0.0381 | 0.0684 |

### 5.4 Analysis of Feature Relevance

##### Table 14

| Cluster | Relevant subspace |
| --- | --- |
| 1 | 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 14 |
| 2 | 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 13, 15, 16 |
| 3 | 2, 3, 4, 6, 7, 8, 9, 10, 12, 14, 15, 16 |
| 4 | 1, 2, 3, 4, 8, 9, 10, 11, 12, 13, 14 |
| 5 | 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14 |
| 6 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16 |
| 7 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16 |

*et al.*, 2004), the 7 core features include feathers, eggs, milk, toothed, backbone, breathes and fins. This means that these features are relevant for all animals. For the Mammal class (cluster 1), additional features such as hair, airborne, venomous and tail are relevant. In the real world, the core features are necessary to judge an animal; however, to judge whether an animal is a mammal, the core features alone are insufficient, and features such as hair, airborne, venomous and tail are also necessary. On the other hand, the features aquatic, predator, legs, domestic and cat size are not relevant for mammals. For the Bird class (cluster 4), the features hair, venomous, legs, and tail are relevant, whereas the features aquatic, airborne, predator, domestic and cat size are not.

##### Table 15

| Cluster | Dimensions of relevant subspace |
| --- | --- |
| 1 | 2, 3, 4, 5, 7, 11, 12, 20, 21, 22, 23, 24, 26, 27, 28 |
| 2 | 2, 3, 4, 7, 11, 12, 20, 21, 23, 24, 25, 26, 27, 28, 35 |
| 3 | 2, 3, 8, 11, 12, 21, 22, 23, 24, 25, 26, 27, 28, 35 |
| 4 | 2, 3, 7, 11, 12, 20, 22, 23, 25, 26, 27, 28, 35 |

##### Table 16

| Dataset | Full feature space | Principal feature space | Core feature space |
| --- | --- | --- | --- |
| Zoo | 0.6930 | 0.6930 | 0.8811 |
| Votes | 0.8344 | 0.8344 | 0.8851 |
| Soybean | 0.8510 | 1.0 | 0.8085 |
| Chess | 0.5475 | 0.5663 | 0.5222 |
| Nursery | 0.4287 | 0.5301 | 0.342 |
| Car evaluation | 0.5410 | 0.6887 | 0.7 |
| Hayes-roth | 0.4621 | 0.5075 | ${^{\ast }}$ |
| Balance scale | 0.6336 | 0.5056 | ${^{\ast }}$ |