The details of each step will be described in the following sub-sections.
4.1 Initialization
Cluster initialization is a necessary step in our approach: it makes it meaningful to use the information-theoretic method to estimate attribute relevance, and it also reduces the number of cluster merging steps. For the initialization of our agglomerative clustering algorithm, the dataset is first divided into small subclusters. To ensure that objects in the same subcluster are as similar as possible, we use the following similarity measure for categorical data. The similarity between any two objects
${X_{i}}$ and
${X_{j}}$ can be defined as follows:
where
m is the number of dimensions of object
X and
${x_{id}}$,
${x_{jd}}$ are the
dth attribute values of
${X_{i}}$,
${X_{j}}$, respectively. Drawing inspiration from Nemalhabib and Shiri (
2006), we extend this similarity measure to calculate the similarity between an object and a cluster. The definition is as follows:
where
$|C|$ is the size of cluster
C,
${X_{a}}$ is one of the objects in
C and the first sum in the numerator covers all the objects in
C. The main phase of initialization is described as follows:
Initialization
Input: Dataset $X=\{{x_{1}},{x_{2}},\dots ,{x_{n}}\}$, threshold r
Output: ${K_{\mathit{init}}}$ subclusters $C=\{{C_{1}},{C_{2}},\dots ,{C_{{K_{\mathit{init}}}}}\}$.
Begin
Set ${x_{1}}\in {C_{1}}$ and $R=X-\{{x_{1}}\},k=1$;
For each ${x_{i}}$ in R
Use Eq. (
8) to calculate similarities
${S_{1}},\dots ,{S_{k}}$ of
${x_{i}}$ to each of
the clusters $\{{C_{1}},{C_{2}},\dots ,{C_{k}}\}$;
If ${S_{l}}=\max \{{S_{1}},\dots ,{S_{k}}\}>r$;
Allocate ${x_{i}}$ to ${C_{l}}$;
Else
$k=k+1$;
Allocate ${x_{i}}$ to ${C_{k}}$;
End if
End for
${K_{\mathit{init}}}=k$
End
In this algorithm, Eq. (
8) is used to calculate the similarities
${S_{1}},\dots ,{S_{k}}$ of
${x_{i}}$ to each of
${C_{1}}$ to
${C_{k}}$. If at least one of these similarities is larger than
r, then
${x_{i}}$ is assigned to an existing cluster; otherwise, a new cluster
${C_{k+1}}$ will be created. The final number of initial clusters is thus controlled by the similarity threshold
r, which ranges in
$(0,1)$. A larger
r results in more sub-clusters with fewer objects, whereas a smaller
r results in fewer sub-clusters with more objects. In theory, the choice of
r is application dependent; however, we found from our experiments that the results are not sensitive to the choice of
r as long as its value is large enough to ensure creation of a sufficient number of initial clusters. Our guideline is that if the attribute values of an object are the same as those of a sub-cluster at about three quarters of the dimensions or more, the object can be regarded as coming from the same sub-cluster. To ensure that all the objects in each cluster are sufficiently similar to each other, we recommend choosing
r between 0.7 and 0.95. In our experiments, we have chosen
r to be 0.80 for different datasets.
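To make the initialization step concrete, the following Python sketch implements it under stated assumptions: Eqs. (6)–(8) are not reproduced in this section, so the sketch takes Eq. (6) to be the simple-matching similarity (the fraction of attributes on which two objects agree) and Eq. (8) to be its average over all objects of a cluster, as described above. The function names are ours.

from typing import List

def object_similarity(xi: List, xj: List) -> float:
    # Assumed form of Eq. (6): fraction of attributes on which the two objects agree.
    return sum(a == b for a, b in zip(xi, xj)) / len(xi)

def object_cluster_similarity(xi: List, cluster: List[List]) -> float:
    # Assumed form of Eq. (8): average object-to-object similarity over the cluster members.
    return sum(object_similarity(xi, xa) for xa in cluster) / len(cluster)

def initialize(X: List[List], r: float) -> List[List[List]]:
    # Split the dataset into K_init subclusters controlled by the similarity threshold r.
    clusters = [[X[0]]]                       # x1 seeds the first subcluster
    for xi in X[1:]:
        sims = [object_cluster_similarity(xi, c) for c in clusters]
        best = max(range(len(sims)), key=lambda k: sims[k])
        if sims[best] > r:
            clusters[best].append(xi)         # join the most similar existing subcluster
        else:
            clusters.append([xi])             # otherwise open a new subcluster
    return clusters

# Toy usage: with r = 0.6, the first two objects (agreeing on two of three attributes)
# share a subcluster, while the third object opens a new one.
print([len(c) for c in initialize([[0, 0, 0], [0, 0, 1], [1, 1, 1]], r=0.6)])   # [2, 1]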
Table 2
11 samples from the Soybean dataset.
No. object | Clusters | ${A_{1}}$ | ${A_{2}}$ | ${A_{3}}$ | ${A_{4}}$ | ${A_{5}}$ | Class label
1 | ${C_{1}}$ | 0 | 1 | 1 | 0 | 1 | 1
2 | ${C_{1}}$ | 0 | 1 | 2 | 1 | 1 | 1
3 | ${C_{1}}$ | 0 | 1 | 1 | 1 | 1 | 1
4 | ${C_{1}}$ | 0 | 1 | 2 | 0 | 1 | 1
5 | ${C_{2}}$ | 0 | 1 | 0 | 2 | 2 | 1
6 | ${C_{2}}$ | 0 | 1 | 1 | 0 | 2 | 1
7 | ${C_{2}}$ | 0 | 1 | 0 | 2 | 2 | 1
8 | ${C_{3}}$ | 1 | 0 | 1 | 4 | 0 | 2
9 | ${C_{3}}$ | 0 | 0 | 1 | 4 | 0 | 2
10 | ${C_{3}}$ | 0 | 0 | 1 | 4 | 0 | 2
11 | ${C_{3}}$ | 1 | 1 | 1 | 4 | 0 | 2
The similarity measure of Eq. (
8) coupled with a high similarity threshold
r makes the initialization procedure significantly less sensitive to the input order. This is very important for the hierarchical clustering proposed in this paper, as the final result depends on the initial clusters. An example is shown in Table
2 in which three clusters can be considered to reside in the two-class dataset with 11 objects. Objects 1 to 7 from class 1 form two clusters: objects 1 to 4 as one cluster and 5 to 7 as another. The rest of the objects, 8 to 11 from class 2, form the third cluster. Each object is represented by five attributes
${A_{1}}$ to
${A_{5}}$, and the corresponding class number is given in the Class label column. Note that significant differences exist between data objects belonging to the same cluster. After the initialization phase, the dataset will be divided into three subclusters
${C_{1}}$,
${C_{2}}$ and
${C_{3}}$, using Eqs. (
6), (
7), (
8): objects 1 to 4, 5 to 7 and 8 to 11 are assigned to subclusters
${C_{1}},{C_{2}}$ and
${C_{3}}$, respectively. Such a result is obtained regardless of the order in which the data objects are presented to the initialization procedure.
4.2 Relevant Subspace Selection
In this subsection, we address the problem of optimally determining the subspace (Domeniconi
et al.,
2004; Gan and Wu,
2004) for a given cluster. The idea of our approach is to optimally separate the attributes into two groups, of which one generates a subspace that pertains to the cluster. In our method, the entropy is used for attribute evaluation and holo-entropy is employed as the criterion for subspace detection.
Subspace clustering (Gan and Wu,
2004; Parsons
et al.,
2004) is extensively applied for clustering high-dimensional data in the field because of the curse of dimensionality (Parsons
et al.,
2004). In subspace clustering, finding the relevant subspaces in a cluster is of great significance. Informally, a relevant or characteristic attribute of a cluster is an attribute on which the data distribution is concentrated as compared to a non-characteristic attribute of the cluster. Characteristic attributes of a cluster form the relevant subspace of the cluster. Generally, different clusters have different characteristic subspaces. The non-characteristic or noise attributes are distributed in a diffuse way; i.e. data should be sparse in the subspace generated by noise attributes. We also call these attributes irrelevant w.r.t. the cluster structure. How to separate the attribute space into relevant and noise subspaces is a key issue in subspace clustering.
In order to determine the relevant attributes for each cluster, we need to find an optimal division that separates characteristic attributes from noise ones. Although there has been a great deal of work reported on subspace clustering, attribute relevance analysis in the existing approaches is often based on analysing the variance of each individual attribute while assuming that attributes are mutually independent (Barbar
et al.,
2002; Ganti
et al.,
1999; Greenacre and Blasius,
2006). Moreover, existing entropy-based algorithms such as Barbar
et al. (
2002), Qin
et al. (
2014) usually use
ad-hoc parameter values to determine the separation between relevant and non-relevant attributes, and such methods lack flexibility in practical applications. The proposed strategy makes use of the holo-entropy in Eq. (
5), as it provides an efficient and effective way to estimate the quality of the cluster structure of a given data subset on a given set of attributes, and it establishes the separation automatically.
Let the whole feature space of a cluster (or cluster candidate)
D be
Q, and let
Q be separated into two feature subspaces
S and
N, where
S is a candidate for the relevant subspace of
D and
N is a candidate for its non-relevant subspace,
$Q=N\cup S$ and
$(N\cap S=\varnothing )$. We want to evaluate the quality of the feature-space separation in order to find an optimal
S as the relevant subspace of
D. In fact, by using holo-entropy, the quality of the feature subspace
S can be evaluated by
$\mathit{HL}(D|S)$. This measure can be equivalently written as
and
where
$\mathit{entropy}(i)$, min and max respectively denote the entropy of attribute
${A_{i}}$ and the minimum and maximum values of attribute entropy in the whole space of a subcluster.
$\mathit{std}(i)$ refers to the normalized information entropy of attribute
${A_{i}}$. An advantage of using normalized entropies is that a closely related function can then be defined to evaluate the quality of
N as a non-relevant subspace.
In both cases, the smaller the value of
$qcs()$ (or
$qncs()$), the better the quality of
S (or
N) as a (non-)relevant subspace. From these two functions, we define a new measure for evaluating the quality of the feature-space separation by
Obviously, this measure is designed to strike a balance between the sizes of the relevant and non-relevant subspaces.
The optimization (minimization, in fact) of Eq. (
12) aims to find
S that leads to the optimal value of
$AF(D,Q)$. This optimization can be performed by a demarcation detection process. In fact, if we first sort all the dimensions in increasing order of entropies; i.e. supposing that for any pair of dimensions
i and
j with $i<j$, we have
$\mathit{entropy}(i)\leqslant \mathit{entropy}(j)$, then the optimization of (
12) consists in finding a demarcation point
e such that the subspace S composed of dimensions 1 to
e (and N composed of dimensions
$e+1$ to
m) minimizes
$AF(D,Q)$. In this optimization, we use in Eq. (
12) the average of the holo-entropy measures in Eqs. (
9) and (
11) to compute the demarcation point. This choice is made to favour a balanced separation.
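The demarcation-detection process can be sketched in Python as follows. Since Eqs. (9)–(12) are not reproduced in this section, the sketch fixes one concrete reading of them: attribute entropies use the natural logarithm, $\mathit{std}(i)$ is the min-max normalization of Eq. (10), $qcs(S)$ is taken as the sum of $\mathit{std}(i)$ over S, $qncs(N)$ as the sum of $1-\mathit{std}(i)$ over N, and AF as their average. These forms are our assumptions, not a verbatim transcription of the paper's formulas.

import math
from collections import Counter
from typing import List

def attribute_entropy(values: List) -> float:
    # Shannon entropy (natural log) of one categorical attribute within a subcluster.
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def normalized_entropies(cluster: List[List]) -> List[float]:
    # std(i) of Eq. (10): min-max normalization of the attribute entropies of a subcluster.
    ent = [attribute_entropy([row[d] for row in cluster]) for d in range(len(cluster[0]))]
    lo, hi = min(ent), max(ent)
    return [0.0] * len(ent) if hi == lo else [(e - lo) / (hi - lo) for e in ent]

def relevant_subspace(std: List[float]) -> List[int]:
    # Demarcation detection: sort the dimensions by std and choose the prefix S
    # minimizing AF = (qcs(S) + qncs(N)) / 2; returns a 0/1 relevance vector.
    order = sorted(range(len(std)), key=lambda i: std[i])
    best_e, best_af = 1, float("inf")
    for e in range(1, len(std)):                      # candidate demarcation points
        qcs = sum(std[i] for i in order[:e])          # assumed form of Eq. (9)
        qncs = sum(1.0 - std[i] for i in order[e:])   # assumed form of Eq. (11)
        af = (qcs + qncs) / 2.0                       # assumed form of Eq. (12)
        if af < best_af:
            best_e, best_af = e, af
    relevant = set(order[:best_e])
    return [1 if i in relevant else 0 for i in range(len(std))]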
For the Soybean dataset, after initialization, the dataset is divided into three subclusters
${C_{1}}$,
${C_{2}}$, and
${C_{3}}$. Based on the information entropy, Eq. (
10) is used for normalization, and the normalized entropy of each attribute of each subcluster is shown in Table
3. Each entry in Table
3 is the normalized entropy of the corresponding attribute in the corresponding subcluster. For instance, the value 0.8113 is the normalized information entropy of
${C_{3}}$’s attribute
${A_{2}}$.
Table 3
The normalized entropy of attributes for each subcluster.
Clusters | ${A_{1}}$ | ${A_{2}}$ | ${A_{3}}$ | ${A_{4}}$ | ${A_{5}}$
${C_{1}}$ | 0 | 0 | 1 | 1 | 0
${C_{2}}$ | 0 | 0 | 0.6365 | 1 | 0
${C_{3}}$ | 1 | 0.8113 | 0 | 0 | 0
The optimization in Eq. (
12) allows us to determine the relevant subspace for each subcluster. The results are shown in Table
4, where
y indicates that the corresponding attribute is a characteristic attribute while
n indicates a noisy attribute. For instance,
${A_{1}},{A_{2}},{A_{5}}$ are characteristic attributes of
${C_{1}}$, while
${A_{3}}$ and
${A_{4}}$ are noisy attributes. In other words, the relevant subspace for
${C_{1}}$ is composed of
${A_{1}},{A_{2}},{A_{5}}$, and the noisy subspace is composed of
${A_{3}}$, and
${A_{4}}$. The relevant subspace vector of
${C_{1}}$ is thus
$[1,1,0,0,1]$,
${C_{2}}$’s is
$[1,1,0,0,1]$ and
${C_{3}}$’s is
$[0,0,1,1,1]$. These relevant subspace vectors are summarized in Table
4. The main phase of relevant subspace detection is described as follows:
Subspace Detection
Input: ${K_{\mathit{init}}}$ subclusters $C=\{{C_{1}},{C_{2}},\dots ,{C_{{K_{\mathit{init}}}}}\}$
Output: $en\_ \mathit{mat}$ (attribute entropy matrix for subclusters)
$\mathit{sub}\_ \mathit{mat}$ (relevant subspace vector matrix for subclusters)
Begin
For each subcluster ${C_{i}}$ in C
For each attribute ${A_{j}}$ in ${C_{i}}$
Calculate attribute information entropy $\mathit{IE}(j)$ $(1\leqslant j\leqslant m)$
End for
Use Eq. (
10) to normalize
$\mathit{IE}$
$en\_ \mathit{mat}(i)=\mathit{IE}$;
Use Eq. (
12) to determine subspace vector
$V(i)$;
$\mathit{sub}\_ \mathit{mat}(i)=V$
End for
End
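Under the assumed forms above, feeding the normalized entropies of Table 3 to the relevant_subspace function of the earlier sketch recovers the relevance vectors of Table 4:

# Normalized entropies from Table 3, one row per subcluster.
for name, std in [("C1", [0.0, 0.0, 1.0, 1.0, 0.0]),
                  ("C2", [0.0, 0.0, 0.6365, 1.0, 0.0]),
                  ("C3", [1.0, 0.8113, 0.0, 0.0, 0.0])]:
    print(name, relevant_subspace(std))
# C1 [1, 1, 0, 0, 1]
# C2 [1, 1, 0, 0, 1]
# C3 [0, 0, 1, 1, 1]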
4.3 Intrinsic Compactness
To select the subclusters to merge, we introduce a new measure, named intrinsic compactness, to measure the compactness of a cluster. The concept of intrinsic compactness here differs from conventional variance-based compactness measures in that it incorporates the feature relevance in the relevant subspace and the size of the cluster. For the calculation of intrinsic compactness, we assign a weight to each relevant attribute.
4.3.1 Attribute Weighting
Table 4
The relevant subspace vectors of each subcluster.
Clusters | ${A_{1}}$ | ${A_{2}}$ | ${A_{3}}$ | ${A_{4}}$ | ${A_{5}}$
${C_{1}}$ | y | y | n | n | y
${C_{2}}$ | y | y | n | n | y
${C_{3}}$ | n | n | y | y | y
Attribute weighting, widely used in soft subspace clustering (Domeniconi
et al.,
2004; Gan and Wu,
2004; Parsons
et al.,
2004; Tan
et al.,
2005), allows feature selection to be performed using more effective optimization methods for continuous objective functions. Not all attributes are equally important for characterizing the cluster structure of a dataset. Even within the relevant subspace of a cluster, attributes may still differ in importance. In this paper, we consider attribute weighting in order to take both the entropy and the size of the cluster into account. It is not enough to assign weights based solely on the entropy of the attributes: it is also necessary to take the size of the subcluster into consideration. As an example, let a dataset contain subclusters denoted by
${C_{1}}$ (with 11 objects) and
${C_{2}}$ (with 2 objects), and an attribute
A. Suppose
A gets the values
$[1,1,1,1,1,1,2,2,3,3,4]$ in
${C_{1}}$ and
$[1,2]$ in
${C_{2}}$. The entropy of attribute
A is 1.168 in subcluster
${C_{1}}$ and 0.69 in subcluster
${C_{2}}$. Based on the entropy values, it appears that attribute
A in
${C_{2}}$ is more important than in
${C_{1}}$. But actually, attribute
A appears more important to
${C_{1}}$ than
${C_{2}}$ because
A is concentrated at value 1 in
${C_{1}}$, whereas its values are dispersed in
${C_{2}}$. An important reason for this is that the number of objects in
${C_{1}}$ is much larger than in
${C_{2}}$, and attribute
A in
${C_{1}}$ should thus be assigned a greater weight. Obviously, it is reasonable and rational that the cluster size should be considered in weight allocation, rather than performing it solely on the basis of entropy.
In order to take the size of the subcluster into account, we propose a simple approach for assigning weights to the attributes of the relevant subspace. The proposed weighting approach calculates the weights from the entropy of the attributes in the relevant subspace and the number of objects in the corresponding subcluster. This method is motivated by effectiveness in practical applications rather than by theoretical needs, as the attribute weight for a merged subcluster is closely related to the entropy of the attribute and the sizes of the two subclusters. The attribute weight is proportional to the size of the subcluster and inversely proportional to the entropy. In other words, the formula can be written as
$\frac{|C|}{\mathit{entropy}}$, where
$|C|$ is the size of the subcluster and
$\mathit{entropy}$ is the information entropy of the attribute. Since information entropy can take a value of zero, we set the fixed parameter
$\alpha =0.0001$ to guarantee that the denominator of
$\frac{|C|}{\mathit{entropy}}$ is not zero. The formula can be rewritten as
$\frac{|C|}{\mathit{entropy}+\alpha }$. Thus, the weight of attribute
${A_{i}}$ is defined as follows:
where
Here
$w(i)$ refers to the weight of attribute
${A_{i}}$ which is a member of the relevant subspace of
${C_{1}}\cup {C_{2}}$,
S is the union subspace of the two subclusters,
$|{C_{1}}|$ and
$|{C_{2}}|$ denote the respective sizes of subclusters
${C_{1}}$ and
${C_{2}}$, and
$\mathit{std}({C_{1}},i)$ and
$\mathit{std}({C_{2}},i)$ denote the normalized entropies of attribute ${A_{i}}$ in subclusters
${C_{1}}$ and
${C_{2}}$, respectively.
With the above formulas, we consider both the attribute entropy and the size of the subclusters in computing the weight of an attribute in a merged cluster. The initial importance of each attribute is modulated by the size of the subcluster. An important attribute coming from a large subcluster is assigned a larger weight. On the other hand, an important attribute coming from a small subcluster is assigned a proportionally smaller weight. The contribution of the selected relevant subspace of each subcluster to the merged cluster is thus better balanced.
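The weighting scheme can be sketched as follows. Since Eqs. (13)–(15) are not reproduced in this section, the sketch adopts one reading consistent with the text: each subcluster contributes $|C|/(\mathit{std}(C,i)+\alpha )$ for attribute i, and the weights are normalized to sum to one over the union relevant subspace. The function name and the normalization step are ours.

ALPHA = 1e-4   # the fixed parameter alpha guarding against zero entropy

def merged_weights(size1, std1, size2, std2, subspace):
    # Assumed reading of Eqs. (13)-(15): attribute weights for the merged cluster C1 u C2.
    #   size1, size2 : numbers of objects in C1 and C2
    #   std1, std2   : normalized attribute entropies of C1 and C2 (Eq. (10))
    #   subspace     : indices of the attributes in the union relevant subspace
    raw = {i: size1 / (std1[i] + ALPHA) + size2 / (std2[i] + ALPHA) for i in subspace}
    total = sum(raw.values())
    return {i: v / total for i, v in raw.items()}   # normalized so the weights sum to 1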
Continuing with the example given in Section
4.2, the subspaces corresponding to the merged subclusters are shown in Table
5.
y indicates that the corresponding attribute is a characteristic attribute while
n indicates a noisy attribute. The relevant subspaces for
${C_{1}}$ and
${C_{2}}$ are composed of
$[{A_{1}},{A_{2}},{A_{5}}]$ and
$[{A_{1}},{A_{2}},{A_{5}}]$, respectively. Taking the union of these characteristic attribute sets,
$[{A_{1}},{A_{2}},{A_{5}}]$ serves as the relevant subspace of
${C_{1}}\cup {C_{2}}$. The relevant subspace weights for each merged subcluster, according to formulas (
13), (
14) and (
15), are given in Table
6 (∗ denotes that the attribute is a noise attribute for the merged subcluster).
Table 5
Relevant subspace vectors of merged subclusters.
Clusters | ${A_{1}}$ | ${A_{2}}$ | ${A_{3}}$ | ${A_{4}}$ | ${A_{5}}$
${C_{1}}\cup {C_{2}}$ | y | y | n | n | y
${C_{1}}\cup {C_{3}}$ | y | y | y | y | y
${C_{2}}\cup {C_{3}}$ | y | y | y | y | y
Table 6
Relevant subspace weights for each merged subcluster.
Clusters | ${A_{1}}$ | ${A_{2}}$ | ${A_{3}}$ | ${A_{4}}$ | ${A_{5}}$
${C_{1}}\cup {C_{2}}$ | 0.3333 | 0.3333 | * | * | 0.3333
${C_{1}}\cup {C_{3}}$ | 0.1667 | 0.1667 | 0.1667 | 0.1667 | 0.3333
${C_{2}}\cup {C_{3}}$ | 0.1429 | 0.1429 | 0.1905 | 0.1905 | 0.3333
The main phase of weight calculation is described as follows:
Weight Calculation
Input: ${K_{\mathit{init}}}$ subclusters $C=\{{C_{1}},{C_{2}},\dots ,{C_{{K_{\mathit{init}}}}}\}$
$\mathit{en}\_ \mathit{mat}$ (attribute entropy matrix for subclusters)
$\mathit{sub}\_ \mathit{mat}$ (relevant subspace vector matrix for subclusters)
Output: $w\_ \mathit{mat}$ (weight matrix for merged subclusters)
Begin
For each subcluster ${C_{i}}$ in C
Calculate the size of ${C_{i}}$, $|{C_{i}}|$
For each subcluster ${C_{j}}$ in C, $(i<j)$
Calculate the size of ${C_{j}}$, $|{C_{j}}|$
Find the union of $\mathit{sub}\_ \mathit{mat}(i)$ and $\mathit{sub}\_ \mathit{mat}(j)$, $\mathit{subV}$
For each attribute s in $\mathit{subV}$
Use Eqs. (
14), (
15) to get
$S({C_{i}})$ and
$S({C_{j}})$
Use Eq. (
13) to calculate
$w(s)$
$w\_ \mathit{mat}(i,j,s)=w(s)$
End for
End for
End for
End
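Applied to the Soybean subclusters ($|{C_{1}}|=4$, $|{C_{2}}|=3$, $|{C_{3}}|=4$, the normalized entropies of Table 3 and the union subspaces of Table 5), the weighting sketch above reproduces the weights of Table 6 up to rounding:

std = {"C1": [0.0, 0.0, 1.0, 1.0, 0.0],
       "C2": [0.0, 0.0, 0.6365, 1.0, 0.0],
       "C3": [1.0, 0.8113, 0.0, 0.0, 0.0]}
size = {"C1": 4, "C2": 3, "C3": 4}
unions = {("C1", "C2"): [0, 1, 4],                 # attribute indices 0..4 stand for A1..A5
          ("C1", "C3"): [0, 1, 2, 3, 4],
          ("C2", "C3"): [0, 1, 2, 3, 4]}
for (a, b), sub in unions.items():
    w = merged_weights(size[a], std[a], size[b], std[b], sub)
    print(a, b, {i + 1: round(v, 4) for i, v in w.items()})
# C1 C2 {1: 0.3333, 2: 0.3333, 5: 0.3333}
# C1 C3 {1: 0.1667, 2: 0.1667, 3: 0.1667, 4: 0.1667, 5: 0.3333}
# C2 C3 {1: 0.1429, 2: 0.1429, 3: 0.1905, 4: 0.1905, 5: 0.3333}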
4.3.2 Intrinsic Compactness
As the subcluster merging criterion, most existing hierarchical algorithms (Guha
et al.,
1999; Do and Kim,
2008; Zhang and Fang,
2013) use similarity measures that do not consider correlations between attributes or variations in the importance of each attribute. For this reason, we propose the concept of intrinsic compactness, defined as a weighted holo-entropy computed on attributes in the relevant subspace. The intrinsic compactness
$\mathit{IC}$ is defined on a potential cluster resulting from the merging of two subclusters, and will be used to measure the quality of the merge. Let
${C_{1}}$ and
${C_{2}}$ be the two subclusters and
C the potential cluster resulting from merging
${C_{1}}$ and
${C_{2}}$. The intrinsic compactness is defined as follows:
where
$\mathit{RS}(C)$ stands for the relevant subspace of
C,
$w(i)$ is the weight of attribute
i, and
$E(i)$ is the entropy of attribute
i. The above intrinsic compactness
$\mathit{IC}({C_{1}},{C_{2}})$ degenerates to a measure equivalent to the holo-entropy if all the weights are equal in the relevant subspace of
C. Similar to the holo-entropy,
$\mathit{IC}({C_{1}},{C_{2}})$ measures the compactness of
C while considering only the relevant subspace and taking into account the contribution of individual attributes. The smaller
$\mathit{IC}({C_{1}},{C_{2}})$ is, the more similar the intrinsic structures of
${C_{1}}$ and
${C_{2}}$, and the more likely it is that they originated from a homogeneous class. To continue with the above example, based on information theory, the value of each
$E(i)$ is shown in Table
7 (∗ denotes a noise attribute). The intrinsic compactness of the merged subclusters, computed by formula (
16), is shown in Table
8.
Table 7
Information entropy of attributes for each merged subcluster.
Clusters | ${A_{1}}$ | ${A_{2}}$ | ${A_{3}}$ | ${A_{4}}$ | ${A_{5}}$
${C_{1}}\cup {C_{2}}$ | 0 | 0 | * | * | 0.6829
${C_{1}}\cup {C_{3}}$ | 0.5623 | 0.6616 | 0.5623 | 1.0397 | 0.6931
${C_{2}}\cup {C_{3}}$ | 0.5938 | 0.6829 | 0.5938 | 0.9557 | 0.6829
Table 8
Compactness of merged subclusters.
Clusters | ${C_{1}}$ | ${C_{2}}$ | ${C_{3}}$
${C_{1}}$ | – | 0.2276 | 0.7020
${C_{2}}$ | – | – | 0.7067
${C_{3}}$ | – | – | –
Given the symmetry of the intrinsic compactness measure, only the upper triangle of the compactness matrix needs to be stored (see Table
8). The main phase of intrinsic compactness calculation is described as follows:
Intrinsic compactness calculation
Input: $w\_ \mathit{mat}$ (weight matrix for subclusters)
$en\_ \mathit{mat}$ (attribute entropy matrix for subclusters)
Output: $C\_ \mathit{matrix}$ (inter-subcluster compactness matrix)
Begin
Use Eq. (
16) to generate the
$C\_ \mathit{matrix}$
End
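Read as the weighted sum of the attribute entropies over the relevant subspace of the merged cluster, which is our reading of Eq. (16), the intrinsic compactness can be computed as follows; with the weights of Table 6 and the entropies of Table 7 it gives the values of Table 8 up to rounding.

def intrinsic_compactness(weights, entropies):
    # IC(C1, C2): weighted entropy over the relevant subspace of the merged cluster
    # (assumed reading of Eq. (16)); weights and entropies are keyed by attribute index.
    return sum(w * entropies[i] for i, w in weights.items())

# Example: C1 u C2 with relevant subspace {A1, A2, A5} (indices 0, 1, 4).
w_12 = {0: 1 / 3, 1: 1 / 3, 4: 1 / 3}               # weights from Table 6
e_12 = {0: 0.0, 1: 0.0, 4: 0.6829}                  # entropies from Table 7
print(round(intrinsic_compactness(w_12, e_12), 4))  # 0.2276, as in Table 8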
The last step of our algorithm is cluster merging. First, the minimal value in the inter-cluster compactness matrix is found; the row and column of that minimal value give the labels of the corresponding subclusters. This pair of subclusters most likely comes from a homogeneous class and is selected for merging. The entropies, subspaces, weights and the inter-cluster compactness matrix are then updated for the next iteration. The main phase of cluster merging is described as follows:
Cluster merging
Input: $C=\{{C_{1}},{C_{2}},\dots ,{C_{k}}\}$ and $C\_ \mathit{matrix}$
Output: $C=\{{C_{1}},{C_{2}},\dots ,{C_{k-1}}\}$
Begin
Find the minimal compactness $C\_ \mathit{matrix}(i,j)$ in $C\_ \mathit{matrix}$ and
merge subclusters i and j to ${C_{\min (i,j)}}$;
Update $C=\{{C_{1}},{C_{2}},\dots ,{C_{k}}\}-{C_{\max (i,j)}}$ for the next iteration
End
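Putting the pieces together, a sketch of the agglomeration loop looks as follows. It reuses the helper functions from the earlier sketches, recomputes entropies, subspaces, weights and compactness at every iteration, as the pseudocode above prescribes, and takes the target number of clusters K as a plain stopping parameter, which this section leaves open.

def agglomerate(clusters, K):
    # Repeatedly merge the pair of subclusters with the smallest intrinsic compactness
    # until K clusters remain (sketch; helpers are defined in the previous sketches).
    while len(clusters) > K:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                std_i, std_j = normalized_entropies(clusters[i]), normalized_entropies(clusters[j])
                rel_i, rel_j = relevant_subspace(std_i), relevant_subspace(std_j)
                sub = [d for d in range(len(std_i)) if rel_i[d] or rel_j[d]]   # union subspace
                w = merged_weights(len(clusters[i]), std_i, len(clusters[j]), std_j, sub)
                merged = clusters[i] + clusters[j]
                e = {d: attribute_entropy([row[d] for row in merged]) for d in sub}
                ic = intrinsic_compactness(w, e)
                if best is None or ic < best[0]:
                    best = (ic, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge C_j into C_i (the smaller index)
        del clusters[j]                           # drop the absorbed subcluster
    return clusters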