<?xml version="1.0" encoding="utf-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
<journal-title-group><journal-title>Informatica</journal-title></journal-title-group>
<issn pub-type="epub">1822-8844</issn><issn pub-type="ppub">0868-4952</issn><issn-l>0868-4952</issn-l>
<publisher>
<publisher-name>Vilnius University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">INFO1146</article-id>
<article-id pub-id-type="doi">10.15388/Informatica.2017.131</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Research Article</subject></subj-group></article-categories>
<title-group>
<article-title>Holo-Entropy Based Categorical Data Hierarchical Clustering</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Sun</surname><given-names>Haojun</given-names></name><email xlink:href="haojunsun@stu.edu.cn">haojunsun@stu.edu.cn</email><xref ref-type="aff" rid="j_info1146_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref><bio>
<p><bold>H. Sun</bold> is a professor at the Department of Computer Science, Shantou University, China. His main research interests include data mining, machine learning, and pattern recognition.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Chen</surname><given-names>Rongbo</given-names></name><email xlink:href="13rbchen@stu.edu.cn">13rbchen@stu.edu.cn</email><xref ref-type="aff" rid="j_info1146_aff_001">1</xref><bio>
<p><bold>R. Chen</bold> is a master’s degree candidate in computer science at Shantou University. His main research interests include data mining, machine learning, and pattern recognition.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Qin</surname><given-names>Yong</given-names></name><email xlink:href="yqin@bjtu.edu.cn">yqin@bjtu.edu.cn</email><xref ref-type="aff" rid="j_info1146_aff_002">2</xref><bio>
<p><bold>Y. Qin</bold> is a professor at the State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University. Main research interests include intelligent transportation systems, traffic safety engineering, and intelligent control theory.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname><given-names>Shengrui</given-names></name><email xlink:href="shengrui.wang@usherbrooke.ca">shengrui.wang@usherbrooke.ca</email><xref ref-type="aff" rid="j_info1146_aff_003">3</xref><bio>
<p><bold>S. Wang</bold> is a professor at the Department of Computer Science, University of Sherbrooke, Canada. His research interests include pattern recognition, data mining, bio-informatics, neural networks, image processing, remote sensing, and GIS. His current projects include high-dimensional data clustering, categorical data clustering, data stream mining, protein and RNA sequence mining, graph matching and graph clustering, fuzzy clustering and variable selection for data mining, location-based services, bankruptcy prediction, and business intelligence.</p></bio>
</contrib>
<aff id="j_info1146_aff_001"><label>1</label>Department of Computer Science, <institution>Shantou University</institution>, Shantou, <country>China</country></aff>
<aff id="j_info1146_aff_002"><label>2</label>State Key Laboratory of Rail Traffic Control and Safety, <institution>Beijing Jiaotong University</institution>, Beijing, <country>China</country></aff>
<aff id="j_info1146_aff_003"><label>3</label>Department of Computer Science, <institution>University of Sherbrooke</institution>, Sherbrooke, QC, <country>Canada</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2017</year></pub-date><pub-date pub-type="epub"><day>1</day><month>1</month><year>2017</year></pub-date><volume>28</volume><issue>2</issue><fpage>303</fpage><lpage>328</lpage><history><date date-type="received"><month>1</month><year>2016</year></date><date date-type="accepted"><month>3</month><year>2017</year></date></history>
<permissions><copyright-statement>© 2017 Vilnius University</copyright-statement><copyright-year>2017</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Clustering high-dimensional data is a challenging task in data mining, and clustering high-dimensional categorical data is even more challenging because it is more difficult to measure the similarity between categorical objects. Most algorithms assume feature independence when computing similarity between data objects, or make use of computationally demanding techniques such as PCA for numerical data. Hierarchical clustering algorithms are often based on similarity measures computed on a common feature space, which is not effective when clustering high-dimensional data. Subspace clustering algorithms discover feature subspaces for clusters, but are mostly partition-based; i.e. they do not produce a hierarchical structure of clusters. In this paper, we propose a hierarchical algorithm for clustering high-dimensional categorical data, based on a recently proposed information-theoretical concept named holo-entropy. The algorithm proposes new ways of exploring entropy, holo-entropy and attribute weighting in order to determine the feature subspace of a cluster and to merge clusters even though their feature subspaces differ. The algorithm is tested on UCI datasets, and compared with several state-of-the-art algorithms. Experimental results show that the proposed algorithm yields higher efficiency and accuracy than the competing algorithms and allows higher reproducibility.</p>
</abstract>
<kwd-group>
<label>Key words</label>
<kwd>hierarchical clustering</kwd>
<kwd>holo-entropy</kwd>
<kwd>subspace</kwd>
<kwd>categorical data</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="j_info1146_s_001">
<label>1</label>
<title>Introduction and Problem Statement</title>
<p>The aim of cluster analysis is to group objects so that those within a cluster are much more similar to one another than to those in other clusters. Clustering has been studied extensively in the statistics, data-mining and database communities, and numerous algorithms have been proposed (Schwenker and Trentin, <xref ref-type="bibr" rid="j_info1146_ref_041">2014</xref>; Sabit <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_039">2011</xref>; Yu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_053">2012</xref>; Fukunaga, <xref ref-type="bibr" rid="j_info1146_ref_016">2013</xref>; Cover and Hart, <xref ref-type="bibr" rid="j_info1146_ref_010">1967</xref>; Derrac <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_012">2012</xref>; Santos <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_040">2013</xref>). It has been widely used for data analysis in many fields, including anthropology, biology, economics, marketing, and medicine. Typical applications include disease classification, document retrieval, image processing, market segmentation, scene analysis, and web access pattern analysis (Guan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_022">2013</xref>; Li <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_033">2013</xref>; Shrivastava and Tyagi, <xref ref-type="bibr" rid="j_info1146_ref_043">2014</xref>; Hruschka <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_026">2006</xref>; Li <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_032">2006</xref>; Lingras <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_034">2005</xref>; Choi <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_009">2012</xref>).</p>
<p>Hierarchical techniques are often used in cluster analysis. They aim to establish a hierarchy of partitions, using either a bottom-up or a top-down approach. In the bottom-up approach, each object initially forms its own cluster. Clusters are then successively merged based on their similarities until all the objects are grouped into a single cluster, or until some stopping condition is satisfied. In the top-down approach, on the other hand, larger clusters are successively divided to generate smaller and more compact clusters until some stopping condition is satisfied or each cluster contains a single object. Traditionally, in the bottom-up approach, the pairwise similarity (or distance) employed for merging clusters is calculated on a common feature space, and feature relevance or feature selection is addressed prior to the clustering process. In this research, we investigate feature relevance with respect to individual clusters, and cluster merging where each cluster may have its own relevant features. This is a very important issue in hierarchical clustering of high-dimensional data. In this paper the terms ‘feature’, ‘attribute’ and sometimes ‘dimension’ are used interchangeably.</p>
<p>This paper investigates how to determine automatically the relevance of attributes in a categorical cluster. Conventional similarity measures, defined on the whole feature space with the assumption that features are of equal importance, are in many cases not suitable for clustering high-dimensional data. In real-world applications, different clusters may lie in different feature subspaces with different dimensions. This means that the significance or relevance of an attribute is not the same for different clusters. A cluster might be related to only a few dimensions (the most relevant dimensions) while the other dimensions (unimportant dimensions) contain random values. Attribute weighting is employed to deal with these issues, but most weighting methods have been designed solely for numeric data clustering (Huang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_028">2005</xref>; Lu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_035">2011</xref>). For categorical data, the main difficulty is estimating the attribute weights based on the statistics of categories in a cluster. In fact, in the existing methods (Bai <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_004">2011a</xref>; Xiong <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_049">2011</xref>), an attribute is weighted solely according to the mode category for that attribute. Consequently, the weights easily yield a biased indication of the relevance of attributes to clusters. To solve this problem, it is necessary to analyse the relationship between attributes and clusters.</p>
<p>In this paper, we propose a novel algorithm named Hierarchical Projected Clustering for Categorical Data (HPCCD). The HPCCD algorithm has been designed to deal with three main issues. The first is analysing attribute relevance based on the holo-entropy theory (Wu and Wang, <xref ref-type="bibr" rid="j_info1146_ref_048">2013</xref>). The holo-entropy is defined as the sum of the entropy and the total correlation of the random vector, and can be expressed as the sum of the entropies on all attributes. It will be used to analyse the relationship between attributes and cluster structure. The second issue is how to decide whether to merge two clusters in the absence of an effective pairwise similarity measure. This problem arises because different subclusters may have their own relevant subspaces. The third issue is finding the projected clusters based on the intra-class compactness. Our algorithm has two phases, based on the conventional steps in agglomerative clustering. The first is to divide the dataset into <inline-formula id="j_info1146_ineq_001"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">init</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${K_{\mathit{init}}}$]]></tex-math></alternatives></inline-formula> (the initial number of clusters) subclusters in an initialization step. The second is to iteratively merge these small clusters into larger ones, until all the objects are grouped in one cluster or a desired number of clusters is reached. Experimental results on nine real-life UCI datasets show that our algorithm performs effectively compared to state-of-the-art algorithms.</p>
<p>The rest of the paper is organized as follows: Section <xref rid="j_info1146_s_002">2</xref> is a brief review of related work. In Section <xref rid="j_info1146_s_003">3</xref>, we introduce the mutual information, total correlation, and holo-entropy concepts used in the paper. Section <xref rid="j_info1146_s_004">4</xref> describes the details of our algorithm HPCCD, with an illustrative example. In Section <xref rid="j_info1146_s_010">5</xref>, we present experimental results, in comparison with other state-of-the-art algorithms. Finally, we conclude our paper and suggest future work in Section <xref rid="j_info1146_s_015">6</xref>.</p>
</sec>
<sec id="j_info1146_s_002">
<label>2</label>
<title>Related Work</title>
<p>In this section, we briefly review major existing works related to our research. Many clustering algorithms for categorical data are based on the partition approach. The most popular of these is Huang’s K-modes algorithm (Huang, <xref ref-type="bibr" rid="j_info1146_ref_027">1998</xref>), which is an extension of the K-means paradigm to the categorical domain. It replaces the mean of a cluster by the mode and updates the mode based on the maximal frequency of the value of each attribute in a cluster. A number of partition-based algorithms have been developed based on the K-modes approach. Jollois and Nadif (<xref ref-type="bibr" rid="j_info1146_ref_030">2002</xref>) develop the Classification EM algorithm (CEM) to estimate the parameters of a mixture model based on the classification likelihood approach. Gan <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1146_ref_018">2005</xref>) present a genetic K-modes algorithm named GKMODE. It introduces a K-modes operator in place of the normal crossover operator and searches for a globally optimal partition of a given categorical dataset into a specified number of clusters. Other extensions of the K-modes algorithm include fuzzy centroids (Kim <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_031">2004</xref>) using fuzzy logic, more effective initialization methods (Cao <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_008">2009</xref>; Bai <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_005">2011b</xref>) for the K-modes and fuzzy K-modes, attribute weighting (He <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_025">2011</xref>) and the use of genetic programming techniques for fuzzy K-modes (Gan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_019">2009</xref>).</p>
<p>Beyond the K-modes family, k-ANMI (He <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_024">2005</xref>) is a K-means-like clustering algorithm for categorical data that optimizes an objective function based on mutual information sharing. Most of the conventional algorithms (Shrivastava and Tyagi, <xref ref-type="bibr" rid="j_info1146_ref_043">2014</xref>; Hruschka <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_026">2006</xref>; Xiong <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_049">2011</xref>; Wu and Wang, <xref ref-type="bibr" rid="j_info1146_ref_048">2013</xref>) involve optimizing an objective function defined on a pairwise measure of the similarity between objects. Unfortunately, this optimization problem is usually NP-complete and requires the use of heuristic methods in practice. In such solutions, the focus is primarily on the relationship between objects and clusters, while attribute relevance within a cluster is often ignored (Barbará <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_006">2002</xref>; Qin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_038">2014</xref>; Ganti <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_020">1999</xref>; Greenacre and Blasius, <xref ref-type="bibr" rid="j_info1146_ref_021">2006</xref>). Moreover, the lack of an intuitive method for determining the number of clusters and high time complexity are common challenges for this type of algorithm.</p>
<p>Projected clustering is a major technique for high-dimensional data clustering, whose aim is to discover the clusters and their relevant attributes simultaneously. A projected cluster is defined by both its data points and the relevant attributes (Bouguessa and Wang, <xref ref-type="bibr" rid="j_info1146_ref_007">2009</xref>; Domeniconi <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_014">2004</xref>; Parsons <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_037">2004</xref>) forming its feature subspace. For example, HARP (Yip <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_052">2004</xref>) is a hierarchical projected clustering algorithm based on the assumption that two data points are likely to belong to the same cluster if they are very similar to each other along many dimensions. However, when the number of relevant dimensions per cluster is much lower than the dataset dimensionality, such an assumption may not be valid, because relevant information concerning these data points in a large subspace is lost. Some projected clustering algorithms, such as PROCLUS (Aggarwal <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_002">1999</xref>) and ORCLUS (Aggarwal and Yu, <xref ref-type="bibr" rid="j_info1146_ref_001">2002</xref>), require the user to provide the average dimensionality of the subspaces, which is very difficult to establish in real-life applications. PCKA (Bouguessa and Wang, <xref ref-type="bibr" rid="j_info1146_ref_007">2009</xref>), a distance-based projected clustering algorithm, was recently proposed to improve the quality of clustering when the dimensionalities of the clusters are much lower than that of the dataset. However, it requires users to provide values for some input parameters, such as the number of nearest neighbours of a 1D point, which may significantly affect its performance. 
These algorithms are dependent on pairwise similarity measures which do not take into account correlations between attributes.</p>
<p>Many clustering algorithms based on the hierarchical technique have been proposed. ROCK (Guha <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_023">1999</xref>) is an agglomerative hierarchical clustering algorithm based on an extension of the pairwise similarity measure. It extends the Jaccard coefficient similarity measure by exploiting the concept of neighbourhood. Its performance depends heavily on the neighbour threshold and its time complexity depends on the number of neighbours. However, these parameters are difficult to estimate in real applications. Instead of using pairwise similarity measures, the K-modes algorithm (Guha <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_023">1999</xref>; Zhang and Fang, <xref ref-type="bibr" rid="j_info1146_ref_054">2013</xref>) defines a similarity between an individual categorical object and a set of categorical objects. When the clusters are well established, this approach has the advantage of being more meaningful. However, the performance of the K-modes algorithms relies heavily on the initialization of the K modes. DHCC (Xiong <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_049">2011</xref>) is a divisive hierarchical clustering algorithm in which a meaningful object-to-clusters similarity measure is defined. DHCC is capable of discovering clusters embedded in subspaces, and is parameter-free with linear time complexity. However, DHCC’s space complexity is a limitation, as it depends on the square of the number of values of all the categorical attributes.</p>
<p>Information theory is also frequently used in many clustering models (Yao <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_050">2000</xref>; Barbará <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_006">2002</xref>). The goal of these approaches is to seek an optimum grouping of the objects such that the entropy is the smallest. An entropy-based fuzzy clustering algorithm (EFC) is proposed in Yao <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1146_ref_050">2000</xref>). EFC calculates entropy values and implements clustering analysis based on the degree of similarity. It requires the setting of a similarity threshold to control the similarity among the data points in a cluster. This parameter, which affects the number of clusters and the clustering accuracy, is very difficult to determine. The COOLCAT algorithm (Barbará <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_006">2002</xref>) employs the notion of entropy in assigning unclustered objects. A given data object is assigned to a cluster such that the entropy of the resulting clustering is minimal. The incremental assignment terminates when every object has been placed in some cluster. The clustering quality depends heavily on the input order of the data objects. The LIMBO algorithm (Andritsos <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_003">2004</xref>) is a hierarchical clustering algorithm based on the concept of an Information Bottleneck (IB) which quantifies the relevant information preserved in clustering results. It proposes a novel measurement for the similarity among subclusters by way of the Jensen–Shannon divergence.</p>
<p>Finally, the MGR algorithm (Qin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_038">2014</xref>) searches equivalence classes from attribute partitions to form the clustering of objects which can share the greatest possible quantity of information with the attribute partitions. First, MGR selects the clustering attribute whose partition shares the most information with the partitions defined by other attributes. Then, on the clustering attribute, the equivalence class with the highest intra-class similarity is output as a cluster, and the rest of the objects form the new current dataset. The above two steps are repeated on the new current dataset until all objects are output. In MGR, because the attributes are selected one by one, the relevance of attributes (or combinations of attributes) is not considered sufficiently. For example, suppose attribute A1 carries the most information and A2 the second most. It is not necessarily true that the combination A1, A2 carries more information than some other combination A3, A4. How to select the subset of attributes which shares the most information with the partitions will be analysed in this paper.</p>
</sec>
<sec id="j_info1146_s_003">
<label>3</label>
<title>The Holo-Entropy Theory</title>
<p>In this section, we provide a brief description of information entropy, and give a more detailed explanation of the concepts of mutual information, total correlation and holo-entropy (Wu and Wang, <xref ref-type="bibr" rid="j_info1146_ref_048">2013</xref>) used by the algorithm proposed in this paper.</p>
<p>Entropy is a measure of the uncertainty of a system state. As formulated in information theory (Shannon, <xref ref-type="bibr" rid="j_info1146_ref_042">1948</xref>), the concept is often used to measure the degree of disorder or chaos of a dataset, or to describe the uncertainty of a random variable. For a given discrete random variable <inline-formula id="j_info1146_ineq_002"><alternatives><mml:math>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$S=({s_{1}},{s_{2}},\dots ,{s_{m}})$]]></tex-math></alternatives></inline-formula>, let the corresponding probability of appearance be <inline-formula id="j_info1146_ineq_003"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{p({s_{1}}),p({s_{2}}),\dots ,p({s_{m}})\}$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_info1146_ineq_004"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$p({s_{i}})$]]></tex-math></alternatives></inline-formula> satisfies <inline-formula id="j_info1146_ineq_005"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[${\textstyle\sum _{i=1}^{m}}p({s_{i}})=1$]]></tex-math></alternatives></inline-formula>. The entropy of <italic>S</italic> is defined as <inline-formula id="j_info1146_ineq_006"><alternatives><mml:math>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>−</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo movablelimits="false">ln</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$E(S)=-{\textstyle\sum _{i=1}^{m}}p({s_{i}})\ln (p({s_{i}}))$]]></tex-math></alternatives></inline-formula>. The entropy makes it possible to assess the structure of an attribute, i.e. whether its values are distributed compactly or sparsely. It can therefore be used as a criterion that expresses the degree to which an attribute is or is not characteristic of a cluster.</p>
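<p>As an illustration of the entropy definition above, the following minimal sketch estimates the probabilities of the values of one categorical attribute by their relative frequencies and evaluates the entropy with the natural logarithm. The function name <monospace>entropy</monospace> and the toy value lists are assumptions made for this example only; they are not part of the proposed algorithm.</p>
<preformat><![CDATA[
```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy (natural log) of a sequence of categorical values.

    Probabilities are estimated by relative frequencies, which is an
    assumption of this sketch.
    """
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log(c / n) for c in counts.values())


# Hypothetical attribute columns within a cluster.
compact = ["a", "a", "a", "a"]   # a single value: compact distribution
spread = ["a", "b", "c", "d"]    # all values differ: sparse distribution

print(entropy(compact))
print(entropy(spread))
```
]]></preformat>
<p>In this toy example the compact attribute has zero entropy, while the attribute with four equally frequent values reaches the maximum possible for four categories, ln 4. Under the reading given above, the first attribute would be considered characteristic of the cluster and the second would not.</p>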
<p>Many high-dimensional data clustering approaches are based on the attribute independence hypothesis (Barbará <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_006">2002</xref>; Ganti <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_020">1999</xref>; Greenacre and Blasius, <xref ref-type="bibr" rid="j_info1146_ref_021">2006</xref>). Such a hypothesis not only ignores the degree of correlation among attributes, but also fails to consider attribute relevance and heterogeneity in the data. Methods derived from these approaches do not satisfy the requirements of many practical applications. The holo-entropy (Wu and Wang, <xref ref-type="bibr" rid="j_info1146_ref_048">2013</xref>) is a compactness measure that incorporates not only the distribution of individual attributes but also correlations between attributes. It has been effectively used in evaluating the likelihood of a data object being an outlier. There is as yet no reported work on how holo-entropy can contribute to high-dimensional categorical data clustering. Our hypothesis in this work is that the holo-entropy may contribute to relevant subspace detection and compactness measurement in cluster analysis. Indeed, the goal of subspace detection is to find a set of attributes on which the data of a cluster are distributed compactly. This attribute set expresses the features of the cluster.</p>
<p>In this paper, we develop a new method that utilizes the holo-entropy for hierarchical clustering. Based on the analysis of the intrinsic relevance of features and objects in subclusters, we develop a principled approach for selecting clusters to merge and determining the feature subspace of the merged cluster. The holo-entropy is used to measure the intrinsic relevance. The main idea is that the holo-entropy of two subclusters originating from the same class should be much smaller than if the two subclusters originate from different classes. Therefore, the two subclusters with minimal holo-entropy are selected to merge in the hierarchical clustering. In order to describe our algorithm, the following notation and definitions are introduced.</p>
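<p>The merge criterion described above can be sketched as follows. This is a simplified illustration only: it takes the holo-entropy of a cluster to be the unweighted sum of the entropies of all attributes, as stated in the introduction, and ignores the attribute weighting and subspace selection that the full algorithm relies on. The function names <monospace>attribute_entropy</monospace>, <monospace>holo_entropy</monospace> and <monospace>best_merge</monospace>, and the toy subclusters, are assumptions made for this example.</p>
<preformat><![CDATA[
```python
from collections import Counter
from itertools import combinations
from math import log


def attribute_entropy(values):
    """Entropy (natural log) of one categorical attribute."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def holo_entropy(cluster):
    """Unweighted sum of the attribute entropies of a cluster.

    A cluster is a list of equal-length tuples of categorical values.
    """
    return sum(attribute_entropy(column) for column in zip(*cluster))


def best_merge(subclusters):
    """Indices (i, j) of the pair whose union has minimal holo-entropy."""
    def cost(pair):
        i, j = pair
        return holo_entropy(subclusters[i] + subclusters[j])
    return min(combinations(range(len(subclusters)), 2), key=cost)


# Hypothetical subclusters described by two attributes (colour, size).
subclusters = [
    [("red", "small"), ("red", "small")],
    [("red", "small"), ("red", "medium")],
    [("blue", "large"), ("green", "large")],
]
print(best_merge(subclusters))
```
]]></preformat>
<p>In this toy example the first two subclusters agree on the colour attribute and largely on the size attribute, so their union has the smallest holo-entropy and they are the pair selected for merging.</p>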
<p>We use <inline-formula id="j_info1146_ineq_007"><alternatives><mml:math>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$X=\{{x_{1}},{x_{2}},\dots ,{x_{n}}\}$]]></tex-math></alternatives></inline-formula> to represent the dataset with <italic>n</italic> samples, where each <inline-formula id="j_info1146_ineq_008"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{i}}$]]></tex-math></alternatives></inline-formula> has <italic>m</italic> categorical attributes. The <italic>m</italic> attributes <inline-formula id="j_info1146_ineq_009"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${[{A_{1}},{A_{2}},\dots ,{A_{m}}]^{T}}$]]></tex-math></alternatives></inline-formula> are also represented by the attribute vector <italic>A</italic>. Each attribute <inline-formula id="j_info1146_ineq_010"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{i}}$]]></tex-math></alternatives></inline-formula> has a value domain defined by <inline-formula id="j_info1146_ineq_011"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[{a_{i,1}},{a_{i,2}},\dots ,{a_{i,{n_{i}}}}]$]]></tex-math></alternatives></inline-formula> <inline-formula id="j_info1146_ineq_012"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>⩽</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>⩽</mml:mo>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(1\leqslant i\leqslant m)$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_info1146_ineq_013"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${n_{i}}$]]></tex-math></alternatives></inline-formula> is the number of distinct values in attribute <inline-formula id="j_info1146_ineq_014"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{i}}$]]></tex-math></alternatives></inline-formula>. From the information-theoretic perspective, <inline-formula id="j_info1146_ineq_015"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{i}}$]]></tex-math></alternatives></inline-formula> is considered a random variable, and <inline-formula id="j_info1146_ineq_016"><alternatives><mml:math>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$A={[{A_{1}},{A_{2}},{A_{3}},\dots ,{A_{m}}]^{T}}$]]></tex-math></alternatives></inline-formula> is considered a random vector. The entropy <inline-formula id="j_info1146_ineq_017"><alternatives><mml:math>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$E(A)$]]></tex-math></alternatives></inline-formula> of the random vector <italic>A</italic> on the set <italic>X</italic> is defined, according to the chain rule for the entropy (Cover and Thomas, <xref ref-type="bibr" rid="j_info1146_ref_011">2012</xref>), by: 
<disp-formula id="j_info1146_eq_001">
<label>(1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mo>=</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">⋯</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}E(A)=& E({A_{1}},{A_{2}},\dots ,{A_{m}})={\sum \limits_{i=1}^{m}}E({A_{i}}\mid {A_{i-1}},\dots ,{A_{1}})\\ {} =& E({A_{1}})+E({A_{2}}\mid {A_{1}})+\cdots +E({A_{m}}\mid {A_{m-1}},\dots ,{A_{1}}),\end{aligned}\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_info1146_ineq_018"><alternatives><mml:math>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo movablelimits="false">ln</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo movablelimits="false">…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$E({A_{i}}\mid {A_{i-1}},\dots ,{A_{1}})=-{\textstyle\sum _{{A_{i}},{A_{i-1}},\dots ,{A_{1}}}}p({A_{i}},{A_{i-1}},\dots ,{A_{1}})\ln p({A_{i}}\mid {A_{i-1}},\dots ,{A_{1}})$]]></tex-math></alternatives></inline-formula> and the probability functions <inline-formula id="j_info1146_ineq_019"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$p()$]]></tex-math></alternatives></inline-formula> are estimated from <italic>X</italic>.</p><statement id="j_info1146_stat_001"><label>Definition 1</label>
<title>(<italic>Mutual information</italic>).</title>
<p>The mutual information (He <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_024">2005</xref>; Srinivasa, <xref ref-type="bibr" rid="j_info1146_ref_044">2005</xref>) of random variables <inline-formula id="j_info1146_ineq_020"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_021"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{2}}$]]></tex-math></alternatives></inline-formula> is: 
<disp-formula id="j_info1146_eq_002">
<label>(2)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:munder>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo movablelimits="false">ln</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ I({A_{1}},{A_{2}})=\sum \limits_{{A_{1}};{A_{2}}}p({A_{1}},{A_{2}})\ln \frac{p({A_{1}},{A_{2}})}{p({A_{1}})\ast p({A_{2}})}=E({A_{1}})-E({A_{1}}\mid {A_{2}}).\]]]></tex-math></alternatives>
</disp-formula>
</p></statement><statement id="j_info1146_stat_002"><label>Definition 2</label>
<title>(<italic>Conditional mutual information</italic>).</title>
<p>The conditional mutual information (Watanabe, <xref ref-type="bibr" rid="j_info1146_ref_047">1960</xref>; Filippone and Sanguinetti, <xref ref-type="bibr" rid="j_info1146_ref_015">2010</xref>) between two random variables <inline-formula id="j_info1146_ineq_022"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_023"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{2}}$]]></tex-math></alternatives></inline-formula> conditioned on <inline-formula id="j_info1146_ineq_024"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{3}}$]]></tex-math></alternatives></inline-formula> is: 
<disp-formula id="j_info1146_eq_003">
<label>(3)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ I({A_{1}},{A_{2}}\mid {A_{3}})=E({A_{1}}\mid {A_{3}})-E({A_{1}}\mid {A_{2}},{A_{3}}).\]]]></tex-math></alternatives>
</disp-formula>
</p></statement><statement id="j_info1146_stat_003"><label>Definition 3</label>
<title>(<italic>Total correlation</italic>).</title>
<p>According to Watanabe (<xref ref-type="bibr" rid="j_info1146_ref_047">1960</xref>), the total correlation <inline-formula id="j_info1146_ineq_025"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$C(A)$]]></tex-math></alternatives></inline-formula> of the random vector <italic>A</italic> on the set <italic>X</italic> equals the sum of all mutual information among its component random variables and is given by: 
<disp-formula id="j_info1146_eq_004">
<label>(4)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ C(A)={\sum \limits_{i=1}^{m}}E({A_{i}})-E(A),\]]]></tex-math></alternatives>
</disp-formula> 
where 
<disp-formula id="j_info1146_eq_005">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mo fence="true" stretchy="false">}</mml:mo>
</mml:mrow>
</mml:munder>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mo>=</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mo fence="true" stretchy="false">}</mml:mo>
</mml:mrow>
</mml:munder>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">⋯</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}C(A)=& {\sum \limits_{i=2}^{m}}\sum \limits_{\{{r_{1}},{r_{2}},\dots ,{r_{i}}\}\in \{1,2,\dots ,m\}}I({A_{{r_{1}}}},\dots ,{A_{{r_{i}}}})\\ {} =& \sum \limits_{\{{r_{1}},{r_{2}},\dots ,{r_{i}}\}\in \{1,2,\dots ,m\}}I({A_{{r_{1}}}},{A_{{r_{2}}}})+\cdots +I({A_{{r_{1}}}},\dots ,{A_{{r_{i}}}}),\end{aligned}\]]]></tex-math></alternatives>
</disp-formula> 
<inline-formula id="j_info1146_ineq_026"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${r_{1}},{r_{2}},\dots ,{r_{i}}$]]></tex-math></alternatives></inline-formula> are attribute numbers varying from 1 to <italic>m</italic>, <inline-formula id="j_info1146_ineq_027"><alternatives><mml:math>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo></mml:math><tex-math><![CDATA[$I({A_{{r_{1}}}},\dots ,{A_{{r_{i}}}})=I({A_{{r_{1}}}},\dots $]]></tex-math></alternatives></inline-formula> , <inline-formula id="j_info1146_ineq_028"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${A_{{r_{i-1}}}})-I({A_{{r_{1}}}},\dots ,{A_{{r_{i-1}}}}\mid {A_{{r_{i}}}})$]]></tex-math></alternatives></inline-formula> is the multivariate mutual information of <inline-formula id="j_info1146_ineq_029"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{{r_{1}}}},\dots ,{A_{{r_{i}}}}$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_info1146_ineq_030"><alternatives><mml:math>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$I({A_{{r_{1}}}},\dots ,{A_{{r_{i-1}}}}\mid {A_{{r_{i}}}})=E(I({A_{{r_{1}}}},\dots ,{A_{{r_{i-1}}}}|{A_{{r_{i}}}}))$]]></tex-math></alternatives></inline-formula> is the conditional mutual information. Thus, the total correlation can be used for estimating the interrelationships among the attributes or shared information of subclusters.</p></statement>
<p>Following Wu and Wang (<xref ref-type="bibr" rid="j_info1146_ref_048">2013</xref>), the holo-entropy <inline-formula id="j_info1146_ineq_031"><alternatives><mml:math>
<mml:mi mathvariant="italic">H</mml:mi>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$HL(A)$]]></tex-math></alternatives></inline-formula> is defined as the sum of the entropy and the total correlation of the random vector <italic>A</italic>, and it equals the sum of the entropies of the individual attributes: 
<disp-formula id="j_info1146_eq_006">
<label>(5)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">H</mml:mi>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ HL(A)=E(A)+C(A)={\sum \limits_{i=1}^{m}}E({A_{i}}).\]]]></tex-math></alternatives>
</disp-formula>
</p>
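<p>The identity in Eq. (5) can be checked numerically. The following Python sketch is illustrative only and is not part of the original method: it assumes natural logarithms and uses Watanabe's form of the total correlation, <italic>C</italic>(<italic>A</italic>) = ∑ <italic>E</italic>(<italic>A</italic><sub><italic>i</italic></sub>) − <italic>E</italic>(<italic>A</italic>), which is taken here to be equivalent to Eq. (4). The function names and the toy data are hypothetical.</p>
<preformat>
```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy (natural log) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def holo_entropy(rows):
    """HL(A): sum of the marginal entropies of all attributes."""
    return sum(entropy(col) for col in zip(*rows))


def total_correlation(rows):
    """C(A) = sum_i E(A_i) - E(A) (assumed equivalent to Eq. (4))."""
    return holo_entropy(rows) - entropy([tuple(r) for r in rows])


# Hypothetical toy data: the second attribute copies the first,
# the third attribute is independent of both.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
joint = entropy([tuple(r) for r in rows])  # E(A)
corr = total_correlation(rows)             # C(A)
print(joint + corr, holo_entropy(rows))    # the two sides of Eq. (5)
```
</preformat>
<p>In this toy example the duplicated attribute contributes a total correlation of ln 2, the joint entropy is ln 4, and their sum equals the sum of the three marginal entropies, 3 ln 2.</p>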
<table-wrap id="j_info1146_tab_001">
<label>Table 1</label>
<caption>
<p>An example dataset (Dataset1).</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Object no.</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_032"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_033"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_034"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Label</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_035"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_036"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_037"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_038"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_039"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_040"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">7</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_041"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">8</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">3</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">3</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_042"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Moreover, <inline-formula id="j_info1146_ineq_043"><alternatives><mml:math>
<mml:mi mathvariant="italic">H</mml:mi>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$HL(A)=E(A)$]]></tex-math></alternatives></inline-formula> holds if and only if all the attributes are mutually independent. The holo-entropy can therefore measure the compactness of a dataset or a cluster more effectively than the entropy alone, since it captures not only the disorder of the objects in the dataset but also the correlation between attributes. In fact, the values of <inline-formula id="j_info1146_ineq_044"><alternatives><mml:math>
<mml:mi mathvariant="italic">H</mml:mi>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$HL()$]]></tex-math></alternatives></inline-formula> calculated on subsets of attributes and/or on subsets (groups) of data reveal cluster structures hidden in the data. As a simple example, let us look at a dataset (Dataset1) shown in Table <xref rid="j_info1146_tab_001">1</xref>. Intuitively, this dataset has two classes (or clusters), one comprising objects <inline-formula id="j_info1146_ineq_045"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{1,2,3,4\}$]]></tex-math></alternatives></inline-formula> and the other objects <inline-formula id="j_info1146_ineq_046"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>6</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>7</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>8</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{5,6,7,8\}$]]></tex-math></alternatives></inline-formula>. We can also easily observe that the subspace <inline-formula id="j_info1146_ineq_047"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{{A_{1}},{A_{2}}\}$]]></tex-math></alternatives></inline-formula> more strongly reflects an intrinsic cluster structure with objects <inline-formula id="j_info1146_ineq_048"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{1,2,3,4\}$]]></tex-math></alternatives></inline-formula> than do other attribute combinations. Indeed, computed on this cluster, the holo-entropies of the four multi-attribute subspaces are, respectively, <inline-formula id="j_info1146_ineq_049"><alternatives><mml:math>
<mml:mi mathvariant="italic">HL</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>0.5623</mml:mn></mml:math><tex-math><![CDATA[$\mathit{HL}({A_{1}},{A_{2}})=0.5623$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_050"><alternatives><mml:math>
<mml:mi mathvariant="italic">HL</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1.0397</mml:mn></mml:math><tex-math><![CDATA[$\mathit{HL}({A_{1}},{A_{3}})=1.0397$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_051"><alternatives><mml:math>
<mml:mi mathvariant="italic">HL</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1.6021</mml:mn></mml:math><tex-math><![CDATA[$\mathit{HL}({A_{2}},{A_{3}})=1.6021$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_052"><alternatives><mml:math>
<mml:mi mathvariant="italic">HL</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1.6021</mml:mn></mml:math><tex-math><![CDATA[$\mathit{HL}({A_{1}},{A_{2}},{A_{3}})=1.6021$]]></tex-math></alternatives></inline-formula>. These values clearly indicate that <inline-formula id="j_info1146_ineq_053"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{{A_{1}},{A_{2}}\}$]]></tex-math></alternatives></inline-formula> is the subspace of choice, since its holo-entropy is the smallest. Likewise, if we calculate the holo-entropy on different data subsets, such as <inline-formula id="j_info1146_ineq_054"><alternatives><mml:math>
<mml:mi mathvariant="italic">H</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1.6021</mml:mn></mml:math><tex-math><![CDATA[$H{L_{\{1,2,3,4\}}}({A_{1}},{A_{2}},{A_{3}})=1.6021$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_055"><alternatives><mml:math>
<mml:mi mathvariant="italic">H</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>6</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>2.0794</mml:mn></mml:math><tex-math><![CDATA[$H{L_{\{3,4,5,6\}}}({A_{1}},{A_{2}},{A_{3}})=2.0794$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_056"><alternatives><mml:math>
<mml:mi mathvariant="italic">H</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>6</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>2.9776</mml:mn></mml:math><tex-math><![CDATA[$H{L_{\{1,2,3,4,5,6\}}}({A_{1}},{A_{2}},{A_{3}})=2.9776$]]></tex-math></alternatives></inline-formula>, we also observe that the subset <inline-formula id="j_info1146_ineq_057"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{1,2,3,4\}$]]></tex-math></alternatives></inline-formula> is clearly a much better cluster candidate than other subsets.</p>
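<p>Several of the values quoted for Dataset1 can be reproduced with a few lines of code. The Python sketch below is an illustration under stated assumptions rather than the authors' implementation: it assumes natural logarithms, assumes that the subspace values are computed on objects {1, 2, 3, 4}, and uses the hypothetical names <monospace>DATASET1</monospace> and <monospace>holo_entropy</monospace>. It evaluates the four multi-attribute subspaces on objects {1, 2, 3, 4} and the full attribute set on objects {1, …, 6}.</p>
<preformat>
```python
from collections import Counter
from math import log

# Rows of Table 1 as (A1, A2, A3); object k is DATASET1[k - 1].
DATASET1 = [
    (0, 1, 0), (0, 2, 2), (0, 2, 1), (0, 2, 1),
    (2, 3, 2), (2, 3, 3), (1, 3, 0), (0, 3, 3),
]


def holo_entropy(objects, attributes):
    """Sum of attribute entropies over the given objects (1-based ids),
    restricted to the given attributes (1-based indices)."""
    n = len(objects)
    total = 0.0
    for a in attributes:
        counts = Counter(DATASET1[k - 1][a - 1] for k in objects)
        total -= sum((c / n) * log(c / n) for c in counts.values())
    return total


cluster1 = [1, 2, 3, 4]
for subspace in [(1, 2), (1, 3), (2, 3), (1, 2, 3)]:
    print(subspace, round(holo_entropy(cluster1, subspace), 4))
print(round(holo_entropy([1, 2, 3, 4, 5, 6], (1, 2, 3)), 4))
```
</preformat>
<p>Under these assumptions the sketch yields 0.5623, 1.0397, 1.6021 and 1.6021 for the four subspaces, and 2.9776 for objects {1, …, 6}, in agreement with the values given above.</p>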
<p>From this example, we can see that the holo-entropy of two merged subclusters from the same (homogeneous) class is much smaller than that of merged subclusters from different (heterogeneous) classes. This indicates that holo-entropy is an effective measure of the compactness of a subspace in a cluster. In what follows, we use the holo-entropy for subspace detection. Moreover, we employ the soft clustering method (Domeniconi <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_014">2004</xref>; Gan and Wu, <xref ref-type="bibr" rid="j_info1146_ref_017">2004</xref>; Nemalhabib and Shiri, <xref ref-type="bibr" rid="j_info1146_ref_036">2006</xref>) and the properties of holo-entropy to merge subclusters during hierarchical clustering.</p>
</sec>
<sec id="j_info1146_s_004">
<label>4</label>
<title>HPCCD Algorithm</title>
<p>In this section, we present our Hierarchical Projected Clustering algorithm for Categorical Data (HPCCD) in detail. To illustrate the ideas underlying the algorithm, we also provide a worked example with 11 objects from the Soybean dataset, as shown in Table <xref rid="j_info1146_tab_002">2</xref>. The major steps of HPCCD follow a conventional hierarchical clustering approach: the data are initially grouped into small clusters, and the algorithm then iterates between searching for the closest pair of subclusters and merging that pair. Our contribution is the design of new methods based on holo-entropy for detecting the relevant subspace of each subcluster and for evaluating the structural compactness of a pair of subclusters in order to select the most similar subclusters to merge. Our algorithm is described as follows:</p>
<p><bold>Algorithm HPCCD</bold></p>
<p><bold>Input:</bold> Dataset <italic>X</italic>, threshold <italic>r</italic> and the termination condition <inline-formula id="j_info1146_ineq_058"><alternatives><mml:math>
<mml:mi mathvariant="italic">K</mml:mi></mml:math><tex-math><![CDATA[$K$]]></tex-math></alternatives></inline-formula>, which is the desired number of clusters;</p>
<p><bold>Output:</bold> clusters of Dataset <italic>X</italic>.</p>
<p><bold>Begin</bold></p>
<p>Initialization (Grouping the data into subclusters <italic>C</italic>);</p>
<p><bold>For</bold> each subcluster in set <italic>C</italic></p>
<p> A: relevant subspace selection</p>
<p>  A1: Detect the relevant subspace of each subcluster of <italic>C</italic>;</p>
<p>  A2: Assign weights to the attributes in the relevant subspace;</p>
<p> B: compactness calculation (calculate compactness by combining the attribute weights with holo-entropy);</p>
<p>Choose the most compact pair of subclusters to merge and update set <italic>C</italic>;</p>
<p>Repeat steps A, B and the merging step until the termination condition of <italic>K</italic> desired clusters is satisfied.</p>
<p><bold>End</bold></p>
<p>The details of each step will be described in the following sub-sections.</p>
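<p>The outer merging loop can be sketched in Python as follows. This is only an illustration of the control flow under stated assumptions: the names <monospace>merge_until_k</monospace>, <monospace>pair_cost</monospace> and <monospace>entropy_sum</monospace> are hypothetical, and the stand-in cost is simply the unweighted sum of attribute entropies of the merged pair over all attributes. It is used only so that the example runs and is not the weighted, subspace-restricted compactness measure of HPCCD.</p>
<preformat preformat-type="code">
```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy_sum(a, b):
    """Stand-in cost (an assumption): sum of attribute entropies of a + b."""
    merged = a + b
    n = len(merged)
    total = 0.0
    for column in zip(*merged):              # one column per attribute
        for count in Counter(column).values():
            p = count / n
            total -= p * log2(p)
    return total


def merge_until_k(clusters, k, pair_cost):
    """Merge the lowest-cost pair of subclusters until k clusters remain."""
    clusters = [list(c) for c in clusters]   # work on a copy
    while len(clusters) > max(k, 1):
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: pair_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # i is always smaller than j
        del clusters[j]
    return clusters


# Toy subclusters invented for this example.
subclusters = [[("a", "x")], [("a", "x")], [("b", "y")]]
print(merge_until_k(subclusters, 2, entropy_sum))
```
</preformat>
<p>In this toy run the two identical singleton subclusters are merged first, because their merged entropy is zero. Any other cost function with the same two-argument signature can be passed as <monospace>pair_cost</monospace>.</p>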
<sec id="j_info1146_s_005">
<label>4.1</label>
<title>Initialization</title>
<p>Cluster initialization is a necessary step in our approach: it gives each subcluster enough objects for the information-theoretic estimate of attribute relevance to be meaningful, and it also reduces the number of cluster merging steps. For the initialization of our agglomerative clustering algorithm, the dataset is first divided into small subclusters. To ensure that objects in the same subcluster are as similar as possible, we use the following similarity measure for categorical data. The similarity between any two objects <inline-formula id="j_info1146_ineq_059"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{i}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_060"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{j}}$]]></tex-math></alternatives></inline-formula> can be defined as follows: <disp-formula-group id="j_info1146_dg_001">
<disp-formula id="j_info1146_eq_007">
<label>(6)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="left">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mi mathvariant="italic">sim</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo stretchy="false">‖</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">‖</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathit{sim}({X_{i}},{X_{j}})=\frac{{\textstyle\textstyle\sum _{d=1}^{m}}\| {x_{id}},{x_{jd}}\| }{m},\]]]></tex-math></alternatives>
</disp-formula>
<disp-formula id="j_info1146_eq_008">
<label>(7)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="left">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mo stretchy="false">‖</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">‖</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfenced separators="" open="{" close="">
<mml:mrow>
<mml:mtable columnspacing="4.0pt" equalrows="false" columnlines="none" equalcolumns="false" columnalign="left left">
<mml:mtr>
<mml:mtd class="array">
<mml:mn>1</mml:mn>
<mml:mspace width="1em"/>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mn>0</mml:mn>
<mml:mspace width="1em"/>
</mml:mtd>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">≠</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \| {x_{id}},{x_{jd}}\| =\left\{\begin{array}{l@{\hskip4.0pt}l}1\hspace{1em}& {x_{id}}={x_{jd}},\\ {} 0\hspace{1em}& {x_{id}}\ne {x_{jd}},\end{array}\right.\]]]></tex-math></alternatives>
</disp-formula>
</disp-formula-group> where <italic>m</italic> is the number of dimensions of object <italic>X</italic> and <inline-formula id="j_info1146_ineq_061"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{id}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_062"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{jd}}$]]></tex-math></alternatives></inline-formula> are the <italic>d</italic>th attribute values of <inline-formula id="j_info1146_ineq_063"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{i}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_064"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{j}}$]]></tex-math></alternatives></inline-formula>, respectively. Drawing inspiration from Nemalhabib and Shiri (<xref ref-type="bibr" rid="j_info1146_ref_036">2006</xref>), we extend this similarity measure to calculate the similarity between an object and a cluster. The definition is as follows: 
<disp-formula id="j_info1146_eq_009">
<label>(8)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">Sim</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo stretchy="false">‖</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">‖</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mi mathvariant="italic">m</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathit{Sim}({X_{i}},C)=\frac{{\textstyle\textstyle\sum _{a=1}^{|C|}}{\textstyle\textstyle\sum _{d=1}^{m}}\| {x_{id}},{x_{ad}}\| }{|C|\ast m},\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_info1146_ineq_065"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|C|$]]></tex-math></alternatives></inline-formula> is the size of cluster <italic>C</italic>, <inline-formula id="j_info1146_ineq_066"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{a}}$]]></tex-math></alternatives></inline-formula> is one of the objects in <italic>C</italic> and the first sum in the numerator covers all the objects in <italic>C</italic>. The main phase of initialization is described as follows:</p>
<p><bold>Initialization</bold></p>
<p><bold>Input:</bold> Dataset <inline-formula id="j_info1146_ineq_067"><alternatives><mml:math>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$X=\{{x_{1}},{x_{2}},\dots ,{x_{n}}\}$]]></tex-math></alternatives></inline-formula>, threshold <italic>r</italic></p>
<p><bold>Output:</bold> <inline-formula id="j_info1146_ineq_068"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">init</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${K_{\mathit{init}}}$]]></tex-math></alternatives></inline-formula> subclusters <inline-formula id="j_info1146_ineq_069"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">init</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$C=\{{C_{1}},{C_{2}},\dots ,{C_{{K_{\mathit{init}}}}}\}$]]></tex-math></alternatives></inline-formula>.</p>
<p><bold>Begin</bold></p>
<p>Set <inline-formula id="j_info1146_ineq_070"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{1}}\in {C_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_071"><alternatives><mml:math>
<mml:mi mathvariant="italic">R</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo>−</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$R=X-\{{x_{1}}\},k=1$]]></tex-math></alternatives></inline-formula>;</p>
<p> <bold>For</bold> each <inline-formula id="j_info1146_ineq_072"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{i}}$]]></tex-math></alternatives></inline-formula> in <italic>R</italic></p>
<p> Use Eq. (<xref rid="j_info1146_eq_009">8</xref>) to calculate similarities <inline-formula id="j_info1146_ineq_073"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{1}},\dots ,{S_{k}}$]]></tex-math></alternatives></inline-formula> of <inline-formula id="j_info1146_ineq_074"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{i}}$]]></tex-math></alternatives></inline-formula> to each of</p>
<p> the clusters <inline-formula id="j_info1146_ineq_075"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{{C_{1}},{C_{2}},\dots ,{C_{k}}\}$]]></tex-math></alternatives></inline-formula>;</p>
<p>  <bold>If</bold> <inline-formula id="j_info1146_ineq_076"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mo movablelimits="false">max</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo movablelimits="false">…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:mi mathvariant="italic">r</mml:mi></mml:math><tex-math><![CDATA[${S_{l}}=\max \{{S_{1}},\dots ,{S_{k}}\}>r$]]></tex-math></alternatives></inline-formula>;</p>
<p>   Allocate <inline-formula id="j_info1146_ineq_077"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{i}}$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_info1146_ineq_078"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{l}}$]]></tex-math></alternatives></inline-formula>;</p>
<p>  <bold>Else</bold></p>
<p>   <inline-formula id="j_info1146_ineq_079"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$k=k+1$]]></tex-math></alternatives></inline-formula>;</p>
<p>   Allocate <inline-formula id="j_info1146_ineq_080"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{i}}$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_info1146_ineq_081"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{k}}$]]></tex-math></alternatives></inline-formula>;</p>
<p>  <bold>End</bold> if</p>
<p> <bold>End</bold> for</p>
<p> <inline-formula id="j_info1146_ineq_082"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">init</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">k</mml:mi></mml:math><tex-math><![CDATA[${K_{\mathit{init}}}=k$]]></tex-math></alternatives></inline-formula></p>
<p><bold>End</bold></p>
<p>In this algorithm, Eq. (<xref rid="j_info1146_eq_009">8</xref>) is used to calculate the similarities <inline-formula id="j_info1146_ineq_083"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${S_{1}},\dots ,{S_{k}}$]]></tex-math></alternatives></inline-formula> of <inline-formula id="j_info1146_ineq_084"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{i}}$]]></tex-math></alternatives></inline-formula> to each of <inline-formula id="j_info1146_ineq_085"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_info1146_ineq_086"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{k}}$]]></tex-math></alternatives></inline-formula>. If at least one of these similarities is larger than <italic>r</italic>, then <inline-formula id="j_info1146_ineq_087"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{i}}$]]></tex-math></alternatives></inline-formula> is assigned to the most similar existing cluster; otherwise, a new cluster <inline-formula id="j_info1146_ineq_088"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{k+1}}$]]></tex-math></alternatives></inline-formula> is created. The number of initial clusters is thus controlled by the similarity threshold <italic>r</italic>, which takes values in <inline-formula id="j_info1146_ineq_089"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(0,1)$]]></tex-math></alternatives></inline-formula>. A larger <italic>r</italic> results in more subclusters with fewer objects each, whereas a smaller <italic>r</italic> results in fewer subclusters with more objects each. In theory, the choice of <italic>r</italic> is application dependent; however, we found in our experiments that the results are not sensitive to <italic>r</italic> as long as its value is large enough to ensure the creation of a sufficient number of initial clusters. Our guideline is that if the attribute values of an object match those of a subcluster on about three quarters of the dimensions or more, the object can be regarded as belonging to that subcluster. To ensure that all the objects in each cluster are sufficiently similar to each other, we recommend choosing <italic>r</italic> between 0.7 and 0.95. In our experiments, we set <italic>r</italic> to 0.80 across the different datasets.</p>
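<p>The initialization procedure and the similarity measures of Eqs. (6)–(8) can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names <monospace>sim_objects</monospace>, <monospace>sim_object_cluster</monospace> and <monospace>initialize</monospace>, the representation of objects as tuples of category labels, and the toy data are assumptions made for the example.</p>
<preformat preformat-type="code">
```python
from typing import List, Sequence

Obj = Sequence[str]  # one categorical object (hypothetical representation)


def sim_objects(x: Obj, y: Obj) -> float:
    """Eqs. (6)-(7): fraction of attributes on which x and y agree."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)


def sim_object_cluster(x: Obj, cluster: List[Obj]) -> float:
    """Eq. (8), written as the mean of Eq. (6) over the members of the cluster."""
    return sum(sim_objects(x, member) for member in cluster) / len(cluster)


def initialize(data: List[Obj], r: float) -> List[List[Obj]]:
    """Single pass over the data: join the most similar subcluster if its
    similarity exceeds r, otherwise open a new subcluster."""
    clusters = [[data[0]]]            # x_1 seeds C_1
    for x in data[1:]:
        sims = [sim_object_cluster(x, c) for c in clusters]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] > r:
            clusters[best].append(x)
        else:
            clusters.append([x])      # k = k + 1
    return clusters                   # K_init = len(clusters)


# Toy data invented for this example (not taken from Table 2).
toy = [
    ("a", "x", "p", "u"),
    ("a", "x", "p", "v"),
    ("b", "y", "q", "w"),
    ("b", "y", "q", "u"),
    ("a", "x", "q", "u"),
]
print([len(c) for c in initialize(toy, r=0.7)])
print([len(c) for c in initialize(toy, r=0.6)])
```
</preformat>
<p>On the toy data, <italic>r</italic> = 0.7 yields subclusters of sizes 2, 2 and 1, whereas <italic>r</italic> = 0.6 yields sizes 3 and 2, which illustrates that a larger threshold produces more and smaller subclusters. Because the procedure makes a single pass and the first object seeds the first subcluster, the result can depend on the order of the objects.</p>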
<table-wrap id="j_info1146_tab_002">
<label>Table 2</label>
<caption>
<p>11 samples from the Soybean dataset.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Object No.</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Clusters</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_090"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_091"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_092"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_093"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{4}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_094"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{5}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Class label</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_095"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_096"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_097"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_098"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_099"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">1</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_100"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">1</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">7</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_101"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">1</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">8</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_102"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">9</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_103"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">10</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_104"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">11</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_105"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">4</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The similarity measure of Eq. (<xref rid="j_info1146_eq_009">8</xref>) coupled with a high similarity threshold <italic>r</italic> makes the initialization procedure significantly less sensitive to the input order. This is very important for the hierarchical clustering proposed in this paper, as the final result depends on the initial clusters. An example is shown in Table <xref rid="j_info1146_tab_002">2</xref> in which three clusters can be considered to reside in the two-class dataset with 11 objects. Objects 1 to 7 from class 1 form two clusters: objects 1 to 4 as one cluster and 5 to 7 as another. The rest of the objects, 8 to 11 from class 2, form the third cluster. Each object is represented by five attributes <inline-formula id="j_info1146_ineq_106"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{1}}$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_info1146_ineq_107"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{5}}$]]></tex-math></alternatives></inline-formula>, and the corresponding class number is given in the Label column. Note that significant differences exist between data objects belonging to the same cluster. After the initialization phase, the dataset is divided into three subclusters <inline-formula id="j_info1146_ineq_108"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_109"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_110"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{3}}$]]></tex-math></alternatives></inline-formula>, using Eqs. (<xref rid="j_info1146_eq_007">6</xref>), (<xref rid="j_info1146_eq_008">7</xref>), (<xref rid="j_info1146_eq_009">8</xref>): objects 1 to 4, 5 to 7 and 8 to 11 are assigned to subclusters <inline-formula id="j_info1146_ineq_111"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}},{C_{2}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_112"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{3}}$]]></tex-math></alternatives></inline-formula>, respectively. Such a result is obtained regardless of the order in which the data objects are presented to the initialization procedure.</p>
</sec>
<sec id="j_info1146_s_006">
<label>4.2</label>
<title>Relevant Subspace Selection</title>
<p>In this subsection, we address the problem of optimally determining the subspace (Domeniconi <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_014">2004</xref>; Gan and Wu, <xref ref-type="bibr" rid="j_info1146_ref_017">2004</xref>) for a given cluster. The idea of our approach is to optimally separate the attributes into two groups, of which one generates a subspace that pertains to the cluster. In our method, the entropy is used for attribute evaluation and holo-entropy is employed as the criterion for subspace detection.</p>
<p>Subspace clustering (Gan and Wu, <xref ref-type="bibr" rid="j_info1146_ref_017">2004</xref>; Parsons <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_037">2004</xref>) is widely used for clustering high-dimensional data because of the curse of dimensionality (Parsons <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_037">2004</xref>). In subspace clustering, finding the relevant subspace of each cluster is of great significance. Informally, a relevant or characteristic attribute of a cluster is an attribute on which the data distribution is concentrated compared with a non-characteristic attribute of the cluster. Characteristic attributes of a cluster form the relevant subspace of the cluster. Generally, different clusters have different characteristic subspaces. The non-characteristic or noise attributes are distributed in a diffuse way, i.e. data should be sparse in the subspace generated by noise attributes. We also call these attributes irrelevant with respect to the cluster structure. How to separate the attribute space into relevant and noise subspaces is a key issue in subspace clustering.</p>
<p>In order to determine the relevant attributes for each cluster, we need to find an optimal division that separates characteristic attributes from noise ones. Although a great deal of work has been reported on subspace clustering, attribute relevance analysis in the existing approaches is often based on analysing the variance of each individual attribute while assuming that attributes are mutually independent (Barbar <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_006">2002</xref>; Ganti <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_020">1999</xref>; Greenacre and Blasius, <xref ref-type="bibr" rid="j_info1146_ref_021">2006</xref>). Moreover, existing entropy-based algorithms such as those of Barbar <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1146_ref_006">2002</xref>) and Qin <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1146_ref_038">2014</xref>) usually rely on <italic>ad-hoc</italic> parameter values to determine the separation between relevant and non-relevant attributes, and such methods lack flexibility for practical applications. The proposed strategy makes use of the holo-entropy in Eq. (<xref rid="j_info1146_eq_006">5</xref>), as it provides an efficient and effective way to estimate the quality of the cluster structure of a given data subset on a given set of attributes, and it establishes the separation automatically.</p>
<p>Let the whole feature space of a cluster (or cluster candidate) <italic>D</italic> be <italic>Q</italic>, and let <italic>Q</italic> be separated into two feature subspaces <italic>S</italic> and <italic>N</italic>, where <italic>S</italic> is a candidate for the relevant subspace of <italic>D</italic> and <italic>N</italic> is a candidate for its non-relevant subspace, <inline-formula id="j_info1146_ineq_113"><alternatives><mml:math>
<mml:mi mathvariant="italic">Q</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo>∪</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi></mml:math><tex-math><![CDATA[$Q=N\cup S$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_114"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo>∩</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>∅</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(N\cap S=\varnothing )$]]></tex-math></alternatives></inline-formula>. We want to evaluate the quality of the feature-space separation in order to find an optimal <italic>S</italic> as the relevant subspace of <italic>D</italic>. In fact, by using holo-entropy, the quality of the feature subspace <italic>S</italic> can be evaluated by <inline-formula id="j_info1146_ineq_115"><alternatives><mml:math>
<mml:mi mathvariant="italic">HL</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{HL}(D|S)$]]></tex-math></alternatives></inline-formula>. This measure can be equivalently written as 
<disp-formula id="j_info1146_eq_010">
<label>(9)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mi mathvariant="italic">std</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ qcs(D|S)=\sum \limits_{i\in S}\mathit{std}(i)\]]]></tex-math></alternatives>
</disp-formula> 
and 
<disp-formula id="j_info1146_eq_011">
<label>(10)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">std</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">entropy</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mo movablelimits="false">min</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo movablelimits="false">max</mml:mo>
<mml:mo>−</mml:mo>
<mml:mo movablelimits="false">min</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathit{std}(i)=\frac{\mathit{entropy}(i)-\min }{\max -\min }\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_info1146_ineq_116"><alternatives><mml:math>
<mml:mi mathvariant="italic">entropy</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{entropy}(i)$]]></tex-math></alternatives></inline-formula>, min and max respectively denote the entropy of attribute <inline-formula id="j_info1146_ineq_117"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{i}}$]]></tex-math></alternatives></inline-formula> and the minimum and maximum values of attribute entropy in the whole space of a subcluster. <inline-formula id="j_info1146_ineq_118"><alternatives><mml:math>
<mml:mi mathvariant="italic">std</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{std}(i)$]]></tex-math></alternatives></inline-formula> refers to the normalized information entropy of attribute <inline-formula id="j_info1146_ineq_119"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{i}}$]]></tex-math></alternatives></inline-formula>. An advantage of using normalized entropies is that they allow a closely related function to be defined for evaluating the quality of <italic>N</italic> as a non-relevant subspace: 
<disp-formula id="j_info1146_eq_012">
<label>(11)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">std</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ qncs(D|N)=\sum \limits_{i\in N}\big(1-\mathit{std}(i)\big).\]]]></tex-math></alternatives>
</disp-formula>
</p>
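<p>Eqs. (<xref rid="j_info1146_eq_010">9</xref>) to (<xref rid="j_info1146_eq_012">11</xref>) can be illustrated with a minimal Python sketch. It assumes that the attribute entropy is the base-2 Shannon entropy estimated from value frequencies (the entropy definition itself is given outside this subsection), and that all normalized entropies are set to zero when every attribute has the same entropy, a case Eq. (<xref rid="j_info1146_eq_011">10</xref>) leaves undefined. The function names <monospace>attribute_entropy</monospace>, <monospace>normalized_entropies</monospace>, <monospace>qcs</monospace> and <monospace>qncs</monospace> are illustrative and are not taken from the authors' implementation. The input data are objects 8 to 11 of Table <xref rid="j_info1146_tab_002">2</xref>.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter
from typing import List, Sequence


def attribute_entropy(values: Sequence[int]) -> float:
    """Shannon entropy of one categorical attribute (base 2 is an assumption)."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())


def normalized_entropies(columns: Sequence[Sequence[int]]) -> List[float]:
    """Eq. (10): min-max normalization of the attribute entropies of a subcluster."""
    ent = [attribute_entropy(col) for col in columns]
    lo, hi = min(ent), max(ent)
    if hi == lo:
        # Assumption: Eq. (10) is undefined when all entropies coincide.
        return [0.0 for _ in ent]
    return [(h - lo) / (hi - lo) for h in ent]


def qcs(std: Sequence[float], S: Sequence[int]) -> float:
    """Eq. (9): sum of normalized entropies over the candidate relevant subspace S."""
    return sum(std[i] for i in S)


def qncs(std: Sequence[float], N: Sequence[int]) -> float:
    """Eq. (11): sum of (1 - normalized entropy) over the candidate noise subspace N."""
    return sum(1.0 - std[i] for i in N)


# Objects 8 to 11 (subcluster C3) of Table 2; one row per object, attributes A1..A5.
c3_rows = [
    [1, 0, 1, 4, 0],
    [0, 0, 1, 4, 0],
    [0, 0, 1, 4, 0],
    [1, 1, 1, 4, 0],
]
c3_columns = list(zip(*c3_rows))  # transpose: one tuple of values per attribute
std_c3 = normalized_entropies(c3_columns)
print([round(v, 4) for v in std_c3])  # [1.0, 0.8113, 0.0, 0.0, 0.0]
```
]]></preformat>
<p>Under these assumptions, the printed values agree with the row of subcluster <italic>C</italic><sub>3</sub> in Table <xref rid="j_info1146_tab_003">3</xref>.</p>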
<p>In both cases, the smaller the value of <inline-formula id="j_info1146_ineq_120"><alternatives><mml:math>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$qcs()$]]></tex-math></alternatives></inline-formula> (or <inline-formula id="j_info1146_ineq_121"><alternatives><mml:math>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$qncs()$]]></tex-math></alternatives></inline-formula>), the better the quality of <italic>S</italic> (or <italic>N</italic>) as a (non-)relevant subspace. From these two functions, we define a new measure for evaluating the quality of the feature-space separation by 
<disp-formula id="j_info1146_eq_013">
<label>(12)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">AF</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">Q</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">b</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">dims</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>+</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">b</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">dims</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathit{AF}(D,Q)=\frac{qcs(D|S)}{nb\_ \mathit{dims}(S)}+\frac{qncs(D|N)}{nb\_ \mathit{dims}(N)}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>This measure is designed to strike a balance between the sizes of the relevant and non-relevant subspaces.</p>
<p>The optimization (minimization, in fact) of Eq. (<xref rid="j_info1146_eq_013">12</xref>) aims to find <italic>S</italic> that leads to the optimal value of <inline-formula id="j_info1146_ineq_122"><alternatives><mml:math>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mi mathvariant="italic">F</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">Q</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$AF(D,Q)$]]></tex-math></alternatives></inline-formula>. This optimization can be performed by a demarcation detection process. Suppose the dimensions are first sorted in increasing order of entropy, i.e. for any pair of dimensions <italic>i</italic> &lt; <italic>j</italic> we have <inline-formula id="j_info1146_ineq_123"><alternatives><mml:math>
<mml:mi mathvariant="italic">entropy</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>≤</mml:mo>
<mml:mi mathvariant="italic">entropy</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{entropy}(i)\le \mathit{entropy}(j)$]]></tex-math></alternatives></inline-formula>. The optimization of Eq. (<xref rid="j_info1146_eq_013">12</xref>) then consists in finding a demarcation point <italic>e</italic> such that the subspace <italic>S</italic> composed of dimensions 1 to <italic>e</italic> (with <italic>N</italic> composed of dimensions <inline-formula id="j_info1146_ineq_124"><alternatives><mml:math>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$e+1$]]></tex-math></alternatives></inline-formula> to <italic>m</italic>) minimizes <inline-formula id="j_info1146_ineq_125"><alternatives><mml:math>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mi mathvariant="italic">F</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">Q</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$AF(D,Q)$]]></tex-math></alternatives></inline-formula>. Note that Eq. (<xref rid="j_info1146_eq_013">12</xref>) uses the averages of the holo-entropy measures in Eqs. (<xref rid="j_info1146_eq_010">9</xref>) and (<xref rid="j_info1146_eq_012">11</xref>) over the dimensions of <italic>S</italic> and <italic>N</italic>, rather than their sums, to compute the demarcation point. This choice is made to favour a balanced separation.</p>
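<p>The demarcation detection described above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function name <monospace>find_demarcation</monospace> is hypothetical, the demarcation point is restricted to values that keep both <italic>S</italic> and <italic>N</italic> non-empty so that both denominators in Eq. (<xref rid="j_info1146_eq_013">12</xref>) are positive, and ties are resolved in favour of the smallest demarcation point. Both choices are assumptions, since the text does not specify them. The input is the row of subcluster <italic>C</italic><sub>1</sub> in Table <xref rid="j_info1146_tab_003">3</xref>.</p>
<preformat preformat-type="code"><![CDATA[
```python
from typing import List, Sequence, Tuple


def find_demarcation(std: Sequence[float]) -> Tuple[List[int], List[int], float]:
    """Return (S, N, AF) minimizing Eq. (12) over all demarcation points e.

    std holds the normalized entropies of Eq. (10); S and N are lists of
    0-based attribute indices.
    """
    m = len(std)
    if m < 2:
        raise ValueError("need at least two attributes to split into S and N")
    # Sort attribute indices by increasing normalized entropy.
    order = sorted(range(m), key=lambda i: std[i])
    best_e, best_af = 0, float("inf")
    # Assumption: 1 <= e <= m - 1, so that neither S nor N is empty.
    for e in range(1, m):
        S, N = order[:e], order[e:]
        qcs = sum(std[i] for i in S)          # Eq. (9)
        qncs = sum(1.0 - std[i] for i in N)   # Eq. (11)
        af = qcs / len(S) + qncs / len(N)     # Eq. (12)
        if af < best_af:                      # assumption: first minimum wins ties
            best_e, best_af = e, af
    return sorted(order[:best_e]), sorted(order[best_e:]), best_af


# Normalized entropies of subcluster C1 (attributes A1..A5) from Table 3.
std_c1 = [0.0, 0.0, 1.0, 1.0, 0.0]
S, N, af = find_demarcation(std_c1)
print([f"A{i + 1}" for i in S], [f"A{i + 1}" for i in N], af)
# ['A1', 'A2', 'A5'] ['A3', 'A4'] 0.0
```
]]></preformat>
<p>Because the attributes are sorted first, only <italic>m</italic> − 1 candidate separations have to be evaluated instead of all possible subsets of attributes.</p>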
<p>For the Soybean dataset, after initialization, the dataset is divided into three subclusters <inline-formula id="j_info1146_ineq_126"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_127"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_info1146_ineq_128"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{3}}$]]></tex-math></alternatives></inline-formula>. The entropy of each attribute within each subcluster is normalized using Eq. (<xref rid="j_info1146_eq_011">10</xref>), and the resulting values are shown in Table <xref rid="j_info1146_tab_003">3</xref>. For instance, the value 0.8113 is the normalized information entropy of <inline-formula id="j_info1146_ineq_129"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{3}}$]]></tex-math></alternatives></inline-formula>’s attribute <inline-formula id="j_info1146_ineq_130"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{2}}$]]></tex-math></alternatives></inline-formula>.</p>
<table-wrap id="j_info1146_tab_003">
<label>Table 3</label>
<caption>
<p>The normalized entropy of attributes for each subcluster.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Clusters</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_131"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_132"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_133"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_134"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{4}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_135"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{5}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_136"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_137"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0.6365</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_138"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.8113</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The optimization in Eq. (<xref rid="j_info1146_eq_013">12</xref>) allows us to determine the relevant subspace for each subcluster. The results are shown in Table <xref rid="j_info1146_tab_004">4</xref>, where <italic>y</italic> indicates that the corresponding attribute is a characteristic attribute while <italic>n</italic> indicates a noisy attribute. For instance, <inline-formula id="j_info1146_ineq_139"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{1}},{A_{2}},{A_{5}}$]]></tex-math></alternatives></inline-formula> are characteristic attributes of <inline-formula id="j_info1146_ineq_140"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula>, while <inline-formula id="j_info1146_ineq_141"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{3}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_142"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{4}}$]]></tex-math></alternatives></inline-formula> are noisy attributes. In other words, the relevant subspace for <inline-formula id="j_info1146_ineq_143"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> is composed of <inline-formula id="j_info1146_ineq_144"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{1}},{A_{2}},{A_{5}}$]]></tex-math></alternatives></inline-formula>, and the noisy subspace is composed of <inline-formula id="j_info1146_ineq_145"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{3}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_146"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{4}}$]]></tex-math></alternatives></inline-formula>. The relevant subspace vector of <inline-formula id="j_info1146_ineq_147"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> is thus <inline-formula id="j_info1146_ineq_148"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[1,1,0,0,1]$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_149"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula>’s is <inline-formula id="j_info1146_ineq_150"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[1,1,0,0,1]$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_151"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{3}}$]]></tex-math></alternatives></inline-formula>’s is <inline-formula id="j_info1146_ineq_152"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[0,0,1,1,1]$]]></tex-math></alternatives></inline-formula>. The relevant subspace vectors of all three subclusters are summarized in Table <xref rid="j_info1146_tab_004">4</xref>. The main phase of relevant subspace detection is described as follows:</p>
<p><bold>Subspace Detection</bold></p>
<p><bold>Input:</bold> <italic>K</italic> subclusters <inline-formula id="j_info1146_ineq_153"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">init</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$C={C_{1}},{C_{2}},\dots ,{C_{{K_{\mathit{init}}}}}$]]></tex-math></alternatives></inline-formula></p>
<p><bold>Output:</bold> <inline-formula id="j_info1146_ineq_154"><alternatives><mml:math>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">mat</mml:mi></mml:math><tex-math><![CDATA[$en\_ \mathit{mat}$]]></tex-math></alternatives></inline-formula> (attribute entropy matrix for subclusters)</p>
<p> <inline-formula id="j_info1146_ineq_155"><alternatives><mml:math>
<mml:mi mathvariant="italic">sub</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">mat</mml:mi></mml:math><tex-math><![CDATA[$\mathit{sub}\_ \mathit{mat}$]]></tex-math></alternatives></inline-formula> (relevant subspace vector matrix for subclusters)</p>
<p><bold>Begin</bold></p>
<p> <bold>For</bold> each subcluster <inline-formula id="j_info1146_ineq_156"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{i}}$]]></tex-math></alternatives></inline-formula> in <italic>C</italic></p>
<p>  <bold>For</bold> each attribute <inline-formula id="j_info1146_ineq_157"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{j}}$]]></tex-math></alternatives></inline-formula> in <inline-formula id="j_info1146_ineq_158"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{i}}$]]></tex-math></alternatives></inline-formula></p>
<p>   Calculate attribute information entropy <inline-formula id="j_info1146_ineq_159"><alternatives><mml:math>
<mml:mi mathvariant="italic">IE</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{IE}(j)$]]></tex-math></alternatives></inline-formula> <inline-formula id="j_info1146_ineq_160"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>⩽</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo>⩽</mml:mo>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(1\leqslant j\leqslant m)$]]></tex-math></alternatives></inline-formula></p>
<p>  <bold>End</bold> for</p>
<p>  Use Eq. (<xref rid="j_info1146_eq_011">10</xref>) to normalize <inline-formula id="j_info1146_ineq_161"><alternatives><mml:math>
<mml:mi mathvariant="italic">IE</mml:mi></mml:math><tex-math><![CDATA[$\mathit{IE}$]]></tex-math></alternatives></inline-formula></p>
<p>  <inline-formula id="j_info1146_ineq_162"><alternatives><mml:math>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">mat</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">IE</mml:mi></mml:math><tex-math><![CDATA[$en\_ \mathit{mat}(i)=\mathit{IE}$]]></tex-math></alternatives></inline-formula>;</p>
<p>  Use Eq. (<xref rid="j_info1146_eq_013">12</xref>) to determine subspace vector <inline-formula id="j_info1146_ineq_163"><alternatives><mml:math>
<mml:mi mathvariant="italic">V</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$V(i)$]]></tex-math></alternatives></inline-formula>;</p>
<p>  <inline-formula id="j_info1146_ineq_164"><alternatives><mml:math>
<mml:mi mathvariant="italic">sub</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">mat</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">V</mml:mi></mml:math><tex-math><![CDATA[$\mathit{sub}\_ \mathit{mat}(i)=V$]]></tex-math></alternatives></inline-formula></p>
<p> <bold>End</bold> for</p>
<p><bold>End</bold></p>
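<p>As an illustration only, the loop structure of the Subspace Detection procedure above can be sketched in Python. All function names (<monospace>attribute_entropy</monospace>, <monospace>min_max_normalize</monospace>, <monospace>threshold_select</monospace>, <monospace>subspace_detection</monospace>) are hypothetical. Because Eq. (10) and Eq. (12) are not restated in this section, the sketch substitutes a min-max normalization and a fixed 0.5 threshold for them; both substitutions are assumptions and not the formulas of the paper.</p>
<preformat>
```python
import math
from collections import Counter


def attribute_entropy(values):
    """Shannon entropy (natural logarithm) of one categorical attribute."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def min_max_normalize(entropies):
    # ASSUMPTION: stand-in for Eq. (10), which is not reproduced here.
    lo, hi = min(entropies), max(entropies)
    if hi == lo:
        return [0.0 for _ in entropies]
    return [(e - lo) / (hi - lo) for e in entropies]


def threshold_select(normalized, threshold=0.5):
    # ASSUMPTION: stand-in for the optimization in Eq. (12).
    # 1 marks a characteristic attribute, 0 a noisy one.
    return [0 if e > threshold else 1 for e in normalized]


def subspace_detection(subclusters, normalize=min_max_normalize,
                       select=threshold_select):
    """Return (en_mat, sub_mat) for a list of subclusters.

    Each subcluster is a list of objects; each object is a tuple of
    categorical attribute values of the same length m.
    """
    en_mat, sub_mat = [], []
    for cluster in subclusters:
        columns = list(zip(*cluster))  # one tuple of values per attribute
        ie = normalize([attribute_entropy(col) for col in columns])
        en_mat.append(ie)              # en_mat(i) = IE
        sub_mat.append(select(ie))     # sub_mat(i) = V
    return en_mat, sub_mat


if __name__ == "__main__":
    toy = [[("a", "x"), ("a", "y"), ("a", "x"), ("a", "y")]]
    print(subspace_detection(toy))
```
</preformat>
<p>With the 0.5 threshold, the stand-in maps the normalized entropy rows listed above for <italic>C</italic><sub>2</sub> and <italic>C</italic><sub>3</sub>, namely (0, 0, 0.6365, 1, 0) and (1, 0.8113, 0, 0, 0), to [1, 1, 0, 0, 1] and [0, 0, 1, 1, 1], which agrees with the vectors quoted in the text. This agreement does not imply that Eq. (12) reduces to a simple threshold.</p>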
</sec>
<sec id="j_info1146_s_007">
<label>4.3</label>
<title>Intrinsic Compactness</title>
<p>To select the subclusters to merge, we introduce a new measure, named intrinsic compactness, that quantifies the compactness of a cluster. Intrinsic compactness differs from conventional variance-based compactness measures in that it incorporates both the feature relevance in the relevant subspace and the size of the cluster. To calculate intrinsic compactness, we assign a weight to each relevant attribute.</p>
<sec id="j_info1146_s_008">
<label>4.3.1</label>
<title>Attribute Weighting</title>
<table-wrap id="j_info1146_tab_004">
<label>Table 4</label>
<caption>
<p>The relevant subspace vectors of each subcluster.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Clusters</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_165"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_166"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_167"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_168"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{4}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_169"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{5}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_170"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">y</td>
<td style="vertical-align: top; text-align: left">y</td>
<td style="vertical-align: top; text-align: left">n</td>
<td style="vertical-align: top; text-align: left">n</td>
<td style="vertical-align: top; text-align: left">y</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_171"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">y</td>
<td style="vertical-align: top; text-align: left">y</td>
<td style="vertical-align: top; text-align: left">n</td>
<td style="vertical-align: top; text-align: left">n</td>
<td style="vertical-align: top; text-align: left">y</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_172"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">n</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">n</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">y</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">y</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">y</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Attribute weighting, widely used in soft subspace clustering (Domeniconi <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_014">2004</xref>; Gan and Wu, <xref ref-type="bibr" rid="j_info1146_ref_017">2004</xref>; Parsons <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_037">2004</xref>; Tan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_045">2005</xref>), allows feature selection to be performed using more effective optimization methods for continuous objective functions. Not all attributes are equally important for characterizing the cluster structure of a dataset. Even within the relevant subspace of a cluster, attributes may still differ from each other in importance. In this paper, we weight attributes by taking both the entropy and the size of the cluster into account, because assigning weights based solely on the entropy of the attributes is not enough. As an example, let a dataset contain subclusters denoted by <inline-formula id="j_info1146_ineq_173"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> (with 11 objects) and <inline-formula id="j_info1146_ineq_174"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula> (with 2 objects), and an attribute <italic>A</italic>. Suppose <italic>A</italic> takes the values <inline-formula id="j_info1146_ineq_175"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[1,1,1,1,1,1,2,2,3,3,4]$]]></tex-math></alternatives></inline-formula> in <inline-formula id="j_info1146_ineq_176"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_177"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[1,2]$]]></tex-math></alternatives></inline-formula> in <inline-formula id="j_info1146_ineq_178"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula>. The entropy of attribute <italic>A</italic> is 1.168 in subcluster <inline-formula id="j_info1146_ineq_179"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> and 0.69 in subcluster <inline-formula id="j_info1146_ineq_180"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula>. Based on the entropy values, it appears that attribute <italic>A</italic> in <inline-formula id="j_info1146_ineq_181"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula> is more important than in <inline-formula id="j_info1146_ineq_182"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula>. In fact, however, attribute <italic>A</italic> appears more important to <inline-formula id="j_info1146_ineq_183"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> than <inline-formula id="j_info1146_ineq_184"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula> because <italic>A</italic> expresses more centrality at value 1 in <inline-formula id="j_info1146_ineq_185"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula>, while this value expresses dispersity in <inline-formula id="j_info1146_ineq_186"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula>. An important reason for this is that the number of objects in <inline-formula id="j_info1146_ineq_187"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> is much larger than in <inline-formula id="j_info1146_ineq_188"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula>, and attribute <italic>A</italic> in <inline-formula id="j_info1146_ineq_189"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> should thus be assigned a greater weight. Cluster size should therefore be taken into account when allocating weights, rather than relying on entropy alone.</p>
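<p>The two entropy values in this example can be checked with a few lines of Python. This is a minimal sketch; the helper name <monospace>entropy</monospace> is hypothetical, and the use of the natural logarithm is an inference from the quoted numbers rather than something stated in this section.</p>
<preformat>
```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy of a categorical sample (natural logarithm)."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


# Values of attribute A in the two subclusters of the example.
a_in_c1 = [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4]  # C1, 11 objects
a_in_c2 = [1, 2]                              # C2, 2 objects

h1 = entropy(a_in_c1)
h2 = entropy(a_in_c2)

if __name__ == "__main__":
    # Entropy alone ranks A as "purer" in C2, even though C2 has only 2 objects.
    print(len(a_in_c1), h1)
    print(len(a_in_c2), h2)
```
</preformat>
<p>Under the natural-logarithm assumption the sketch gives approximately 1.1685 for the first subcluster and ln 2 (about 0.693) for the second, consistent with the values 1.168 and 0.69 quoted above.</p>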
<p>To take the size of the subcluster into account, we propose a simple approach for weighting the attributes of the relevant subspace. The approach calculates the weights from the entropy of the attributes in the relevant subspace and the number of objects in the corresponding subcluster. This method is motivated by effectiveness in practical applications rather than by theoretical needs, as the attribute weight for a merged subcluster is closely related to the entropy of the attribute and the sizes of the two subclusters. The weight of an attribute is proportional to the size of the subcluster and inversely proportional to the entropy. In other words, the formula can be written as <inline-formula id="j_info1146_ineq_190"><alternatives><mml:math><mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">entropy</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle></mml:math><tex-math><![CDATA[$\frac{|C|}{\mathit{entropy}}$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_info1146_ineq_191"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|C|$]]></tex-math></alternatives></inline-formula> is the size of the subcluster and <inline-formula id="j_info1146_ineq_192"><alternatives><mml:math>
<mml:mi mathvariant="italic">entropy</mml:mi></mml:math><tex-math><![CDATA[$\mathit{entropy}$]]></tex-math></alternatives></inline-formula> is the information entropy of the attribute. Since information entropy can take a value of zero, we set the fixed parameter <inline-formula id="j_info1146_ineq_193"><alternatives><mml:math>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0.0001</mml:mn></mml:math><tex-math><![CDATA[$\alpha =0.0001$]]></tex-math></alternatives></inline-formula> to guarantee that the denominator of <inline-formula id="j_info1146_ineq_194"><alternatives><mml:math><mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">entropy</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle></mml:math><tex-math><![CDATA[$\frac{|C|}{\mathit{entropy}}$]]></tex-math></alternatives></inline-formula> is not zero. The formula can be rewritten as <inline-formula id="j_info1146_ineq_195"><alternatives><mml:math><mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">entropy</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle></mml:math><tex-math><![CDATA[$\frac{|C|}{\mathit{entropy}+\alpha }$]]></tex-math></alternatives></inline-formula>. Thus, the weight of attribute <inline-formula id="j_info1146_ineq_196"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{i}}$]]></tex-math></alternatives></inline-formula> is defined as follows: 
<disp-formula id="j_info1146_eq_014">
<label>(13)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">std</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>+</mml:mo><mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">std</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">Total</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">Total</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ w(i)=\frac{\frac{|{C_{1}}|}{(\mathit{std}({C_{1}},i)+\alpha )}+\frac{|{C_{2}}|}{(\mathit{std}({C_{2}},i)+\alpha )}}{\mathit{Total}({C_{1}})+\mathit{Total}({C_{2}})}\]]]></tex-math></alternatives>
</disp-formula> 
where <disp-formula-group id="j_info1146_dg_002">
<disp-formula id="j_info1146_eq_015">
<label>(14)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="left">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mi mathvariant="italic">Total</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:munder><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">std</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathit{Total}({C_{1}})=\sum \limits_{m\in S}\frac{|{C_{1}}|}{(\mathit{std}(m)+\alpha )},\]]]></tex-math></alternatives>
</disp-formula>
<disp-formula id="j_info1146_eq_016">
<label>(15)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="left">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mi mathvariant="italic">Total</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:munder><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">std</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathit{Total}({C_{2}})=\sum \limits_{n\in S}\frac{|{C_{2}}|}{(\mathit{std}(n)+\alpha )}.\]]]></tex-math></alternatives>
</disp-formula>
</disp-formula-group> Here <inline-formula id="j_info1146_ineq_197"><alternatives><mml:math>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$w(i)$]]></tex-math></alternatives></inline-formula> refers to the weight of attribute <inline-formula id="j_info1146_ineq_198"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{i}}$]]></tex-math></alternatives></inline-formula> which is a member of the relevant subspace of <inline-formula id="j_info1146_ineq_199"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
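</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}\cup {C_{2}}$]]></tex-math></alternatives></inline-formula>, <italic>S</italic> is the union subspace of the two subclusters (in the sums over <italic>S</italic>, <italic>std</italic>(<italic>m</italic>) and <italic>std</italic>(<italic>n</italic>) abbreviate <italic>std</italic>(<italic>C</italic><sub>1</sub>, <italic>m</italic>) and <italic>std</italic>(<italic>C</italic><sub>2</sub>, <italic>n</italic>)), <inline-formula id="j_info1146_ineq_200"><alternatives><mml:math>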
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|{C_{1}}|$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_201"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|{C_{2}}|$]]></tex-math></alternatives></inline-formula> denote the respective sizes of subclusters <inline-formula id="j_info1146_ineq_202"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_203"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_info1146_ineq_204"><alternatives><mml:math>
<mml:mi mathvariant="italic">std</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{std}({C_{1}},i)$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_205"><alternatives><mml:math>
<mml:mi mathvariant="italic">std</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{std}({C_{2}},i)$]]></tex-math></alternatives></inline-formula> denote the normalized entropies of subclusters <inline-formula id="j_info1146_ineq_206"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_207"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula>, respectively.</p>
<p>With the above formulas, we consider both the attribute entropy and the size of the subclusters in computing the weight of an attribute in a merged cluster. The initial importance of each attribute is modulated by the size of the subcluster. An important attribute coming from a large subcluster is assigned a larger weight. On the other hand, an important attribute coming from a small subcluster is assigned a proportionally smaller weight. The contribution of the selected relevant subspace of each subcluster to the merged cluster is thus better balanced.</p>
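<p>The weighting scheme of Eqs. (<xref rid="j_info1146_eq_014">13</xref>), (<xref rid="j_info1146_eq_015">14</xref>) and (<xref rid="j_info1146_eq_016">15</xref>) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name <monospace>merged_weights</monospace>, the dictionary-based inputs and the numeric values are hypothetical, and <italic>std</italic>(<italic>m</italic>) in the totals is read as the normalized entropy of attribute <italic>m</italic> in the corresponding subcluster. The constant <italic>α</italic> is passed in explicitly because no value is assumed for it here.</p>
<preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Iterable


def merged_weights(size1: int, size2: int,
                   std1: Dict[int, float], std2: Dict[int, float],
                   subspace: Iterable[int], alpha: float) -> Dict[int, float]:
    """Attribute weights of Eq. (13) over the union subspace S.

    std1[i] and std2[i] are assumed to hold std(C1, i) and std(C2, i);
    alpha must be positive so that zero entropies do not divide by zero.
    """
    attrs = list(subspace)
    # Per-attribute contribution of each subcluster: |C| / (std(C, i) + alpha).
    part1 = {i: size1 / (std1[i] + alpha) for i in attrs}
    part2 = {i: size2 / (std2[i] + alpha) for i in attrs}
    # Eqs. (14) and (15): Total(C1) + Total(C2), summed over S.
    total = sum(part1.values()) + sum(part2.values())
    return {i: (part1[i] + part2[i]) / total for i in attrs}


# Hypothetical example: two attributes, subcluster sizes 4 and 2.
weights = merged_weights(4, 2,
                         std1={1: 0.0, 2: 0.5},
                         std2={1: 0.0, 2: 0.5},
                         subspace=[1, 2], alpha=0.5)
print(weights)
```
]]></preformat>
<p>In this hypothetical example, attribute 1 (lower entropy in both subclusters) receives weight 2/3 and attribute 2 receives weight 1/3, and the weights sum to 1 by construction.</p>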
<p>Continuing with the example given in Section <xref rid="j_info1146_s_006">4.2</xref>, the subspaces corresponding to the merged subclusters are shown in Table <xref rid="j_info1146_tab_005">5</xref>, where <italic>y</italic> marks a characteristic attribute and <italic>n</italic> marks a noisy attribute. The relevant subspaces for <inline-formula id="j_info1146_ineq_208"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_209"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula> are composed of <inline-formula id="j_info1146_ineq_210"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[{A_{1}},{A_{2}},{A_{5}}]$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_211"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[{A_{1}},{A_{2}},{A_{5}}]$]]></tex-math></alternatives></inline-formula>, respectively. Choosing the characteristic attribute union set, <inline-formula id="j_info1146_ineq_212"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[{A_{1}},{A_{2}},{A_{5}}]$]]></tex-math></alternatives></inline-formula> serves as the relevant subspace of <inline-formula id="j_info1146_ineq_213"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}},{C_{2}}$]]></tex-math></alternatives></inline-formula>. The relevant subspace weights for each merged subcluster, according to formulas (<xref rid="j_info1146_eq_014">13</xref>), (<xref rid="j_info1146_eq_015">14</xref>) and (<xref rid="j_info1146_eq_016">15</xref>), are given in Table <xref rid="j_info1146_tab_006">6</xref> (∗ denotes that the attribute is a noise attribute for the merged subcluster).</p>
<table-wrap id="j_info1146_tab_005">
<label>Table 5</label>
<caption>
<p>Relevant subspace vectors of merged subclusters.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Clusters</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_214"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_215"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_216"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_217"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{4}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_218"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{5}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_219"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}\cup {C_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">y</td>
<td style="vertical-align: top; text-align: left">y</td>
<td style="vertical-align: top; text-align: left">n</td>
<td style="vertical-align: top; text-align: left">n</td>
<td style="vertical-align: top; text-align: left">y</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_220"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}\cup {C_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">y</td>
<td style="vertical-align: top; text-align: left">y</td>
<td style="vertical-align: top; text-align: left">y</td>
<td style="vertical-align: top; text-align: left">y</td>
<td style="vertical-align: top; text-align: left">y</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_221"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}\cup {C_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">y</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">y</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">y</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">y</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">y</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_info1146_tab_006">
<label>Table 6</label>
<caption>
<p>Relevant subspace weights for each merged subcluster.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Clusters</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_222"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_223"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_224"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_225"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{4}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_226"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{5}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_227"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}\cup {C_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0.3333</td>
<td style="vertical-align: top; text-align: left">0.3333</td>
<td style="vertical-align: top; text-align: left">*</td>
<td style="vertical-align: top; text-align: left">*</td>
<td style="vertical-align: top; text-align: left">0.3333</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_228"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}\cup {C_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0.1667</td>
<td style="vertical-align: top; text-align: left">0.1667</td>
<td style="vertical-align: top; text-align: left">0.1667</td>
<td style="vertical-align: top; text-align: left">0.1667</td>
<td style="vertical-align: top; text-align: left">0.3333</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_229"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}\cup {C_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.1429</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.1429</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.1905</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.1905</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.3333</td>
</tr>
</tbody>
</table>
</table-wrap>
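<p>Because the denominator of Eq. (<xref rid="j_info1146_eq_014">13</xref>) is the sum of its numerators over the relevant subspace (assuming, as a reading of the notation, that the totals in Eqs. (<xref rid="j_info1146_eq_015">14</xref>) and (<xref rid="j_info1146_eq_016">15</xref>) use the per-subcluster normalized entropies), the weights of each merged subcluster should sum to 1. The short check below applies this to the values printed in Table <xref rid="j_info1146_tab_006">6</xref>; the variable and function names are illustrative only.</p>
<preformat preformat-type="code"><![CDATA[
```python
# Weights copied from Table 6; None marks a noisy attribute (printed as *).
table6 = {
    "C1 u C2": [0.3333, 0.3333, None, None, 0.3333],
    "C1 u C3": [0.1667, 0.1667, 0.1667, 0.1667, 0.3333],
    "C2 u C3": [0.1429, 0.1429, 0.1905, 0.1905, 0.3333],
}


def row_sum(row):
    """Sum the weights of the relevant (non-noisy) attributes in one row."""
    return sum(w for w in row if w is not None)


row_sums = {name: round(row_sum(row), 4) for name, row in table6.items()}
print(row_sums)
```
]]></preformat>
<p>Each row sums to 1 up to the four-decimal rounding used in the table.</p>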
<p>The main phase of weight calculation is described as follows:</p>
<p><bold>Weight Calculation</bold></p>
<p><bold>Input:</bold> <italic>K</italic><sub><italic>init</italic></sub> subclusters <inline-formula id="j_info1146_ineq_230"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">init</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$C=\{{C_{1}},{C_{2}},\dots ,{C_{{K_{\mathit{init}}}}}\}$]]></tex-math></alternatives></inline-formula></p>
<p> <inline-formula id="j_info1146_ineq_231"><alternatives><mml:math>
<mml:mi mathvariant="italic">en</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">mat</mml:mi></mml:math><tex-math><![CDATA[$\mathit{en}\_ \mathit{mat}$]]></tex-math></alternatives></inline-formula> (attribute entropy matrix for subclusters)</p>
<p> <inline-formula id="j_info1146_ineq_232"><alternatives><mml:math>
<mml:mi mathvariant="italic">sub</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">mat</mml:mi></mml:math><tex-math><![CDATA[$\mathit{sub}\_ \mathit{mat}$]]></tex-math></alternatives></inline-formula> (relevant subspace vector matrix for subclusters)</p>
<p><bold>Output:</bold> <inline-formula id="j_info1146_ineq_233"><alternatives><mml:math>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">mat</mml:mi></mml:math><tex-math><![CDATA[$w\_ \mathit{mat}$]]></tex-math></alternatives></inline-formula> (weight matrix for merged subclusters)</p>
<p><bold>Begin</bold></p>
<p> <bold>For</bold> each subcluster <inline-formula id="j_info1146_ineq_234"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{i}}$]]></tex-math></alternatives></inline-formula> in <italic>C</italic></p>
<p>  Calculate the size of <inline-formula id="j_info1146_ineq_235"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{i}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_236"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|{C_{i}}|$]]></tex-math></alternatives></inline-formula></p>
<p>  <bold>For</bold> each subcluster <inline-formula id="j_info1146_ineq_237"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{j}}$]]></tex-math></alternatives></inline-formula> in <italic>C</italic>, <inline-formula id="j_info1146_ineq_238"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(i<j)$]]></tex-math></alternatives></inline-formula></p>
<p>   Calculate the size of <inline-formula id="j_info1146_ineq_239"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{j}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_240"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|{C_{j}}|$]]></tex-math></alternatives></inline-formula></p>
<p>   Find the union of <inline-formula id="j_info1146_ineq_241"><alternatives><mml:math>
<mml:mi mathvariant="italic">sub</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">mat</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{sub}\_ \mathit{mat}(i)$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_242"><alternatives><mml:math>
<mml:mi mathvariant="italic">sub</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">mat</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{sub}\_ \mathit{mat}(j)$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_243"><alternatives><mml:math>
<mml:mi mathvariant="italic">subV</mml:mi></mml:math><tex-math><![CDATA[$\mathit{subV}$]]></tex-math></alternatives></inline-formula></p>
<p>   <bold>For</bold> each attribute <italic>s</italic> in <inline-formula id="j_info1146_ineq_244"><alternatives><mml:math>
<mml:mi mathvariant="italic">subV</mml:mi></mml:math><tex-math><![CDATA[$\mathit{subV}$]]></tex-math></alternatives></inline-formula></p>
<p>    Use Eqs. (<xref rid="j_info1146_eq_015">14</xref>), (<xref rid="j_info1146_eq_016">15</xref>) to get <inline-formula id="j_info1146_ineq_245"><alternatives><mml:math>
<mml:mi mathvariant="italic">Total</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{Total}({C_{i}})$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_246"><alternatives><mml:math>
<mml:mi mathvariant="italic">Total</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{Total}({C_{j}})$]]></tex-math></alternatives></inline-formula></p>
<p>    Use Eq. (<xref rid="j_info1146_eq_014">13</xref>) to calculate <inline-formula id="j_info1146_ineq_247"><alternatives><mml:math>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$w(s)$]]></tex-math></alternatives></inline-formula></p>
<p>    <inline-formula id="j_info1146_ineq_248"><alternatives><mml:math>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">mat</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$w\_ \mathit{mat}(i,j,s)=w(s)$]]></tex-math></alternatives></inline-formula></p>
<p>   <bold>End</bold> for</p>
<p>  <bold>End</bold> for</p>
<p> <bold>End</bold> for</p>
<p><bold>End</bold></p>
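<p>The procedure above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function name <monospace>weight_matrix</monospace>, the list-based data layout and the example values are assumptions, the entropy matrix is assumed to already hold the normalized entropies <italic>std</italic>(<italic>C</italic><sub><italic>k</italic></sub>, <italic>s</italic>), the subcluster sizes are supplied directly, and <italic>α</italic> is assumed to be positive. The two totals do not depend on the attribute <italic>s</italic>, so the sketch computes their sum once per pair of subclusters.</p>
<preformat preformat-type="code"><![CDATA[
```python
from itertools import combinations
from typing import Dict, List, Tuple


def weight_matrix(sizes: List[int],
                  en_mat: List[List[float]],
                  sub_mat: List[List[bool]],
                  alpha: float) -> Dict[Tuple[int, int], Dict[int, float]]:
    """Return w_mat[(i, j)][s]: weight of attribute s if C_i and C_j merge."""
    w_mat: Dict[Tuple[int, int], Dict[int, float]] = {}
    n_attrs = len(sub_mat[0]) if sub_mat else 0
    # Visit every unordered pair of subclusters once (i < j).
    for i, j in combinations(range(len(sizes)), 2):
        # subV: union of the two relevant subspace vectors.
        sub_v = [s for s in range(n_attrs) if sub_mat[i][s] or sub_mat[j][s]]
        # Numerator of Eq. (13) for each attribute of the union subspace.
        contrib = {s: sizes[i] / (en_mat[i][s] + alpha)
                      + sizes[j] / (en_mat[j][s] + alpha)
                   for s in sub_v}
        # Total(C_i) + Total(C_j) from Eqs. (14) and (15).
        total = sum(contrib.values())
        w_mat[(i, j)] = {s: c / total for s, c in contrib.items()}
    return w_mat


# Hypothetical input: two subclusters described by three attributes.
w_mat = weight_matrix(sizes=[4, 2],
                      en_mat=[[0.0, 0.5, 0.2], [0.0, 0.5, 0.9]],
                      sub_mat=[[True, False, False], [False, True, False]],
                      alpha=0.5)
print(w_mat)
```
]]></preformat>
<p>In this hypothetical input the union subspace contains attributes 0 and 1 (zero-based), which receive weights 2/3 and 1/3; attribute 2 is noisy for both subclusters and gets no entry, matching the ∗ cells of Table <xref rid="j_info1146_tab_006">6</xref>.</p>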
</sec>
<sec id="j_info1146_s_009">
<label>4.3.2</label>
<title>Intrinsic Compactness</title>
<p>As a subcluster merging criterion, most existing hierarchical algorithms (Guha <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_023">1999</xref>; Do and Kim, <xref ref-type="bibr" rid="j_info1146_ref_013">2008</xref>; Zhang and Fang, <xref ref-type="bibr" rid="j_info1146_ref_054">2013</xref>) use similarity measures that do not consider correlations between attributes or variations in the importance of each attribute. For this reason, we propose the concept of intrinsic compactness, defined as a weighted holo-entropy computed on attributes in the relevant subspace. The intrinsic compactness <inline-formula id="j_info1146_ineq_249"><alternatives><mml:math>
<mml:mi mathvariant="italic">IC</mml:mi></mml:math><tex-math><![CDATA[$\mathit{IC}$]]></tex-math></alternatives></inline-formula> is defined on a potential cluster resulting from the merging of two subclusters, and will be used to measure the quality of the merge. Let <inline-formula id="j_info1146_ineq_250"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_251"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula> be the two subclusters and <italic>C</italic> the potential cluster resulting from merging <inline-formula id="j_info1146_ineq_252"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_253"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula>. The intrinsic compactness is defined as follows: 
<disp-formula id="j_info1146_eq_017">
<label>(16)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">IC</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">RS</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:munder>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathit{IC}({C_{1}},{C_{2}})=\sum \limits_{i\in \mathit{RS}(C)}w(i)\ast E(i)\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_info1146_ineq_254"><alternatives><mml:math>
<mml:mi mathvariant="italic">RS</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{RS}(C)$]]></tex-math></alternatives></inline-formula> stands for the relevant subspace of <italic>C</italic>, <inline-formula id="j_info1146_ineq_255"><alternatives><mml:math>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$w(i)$]]></tex-math></alternatives></inline-formula> is the weight of attribute <italic>i</italic>, and <inline-formula id="j_info1146_ineq_256"><alternatives><mml:math>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$E(i)$]]></tex-math></alternatives></inline-formula> is the entropy of attribute <italic>i</italic>. The above intrinsic compactness <inline-formula id="j_info1146_ineq_257"><alternatives><mml:math>
<mml:mi mathvariant="italic">IC</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{IC}({C_{1}},{C_{2}})$]]></tex-math></alternatives></inline-formula> reduces to a measure equivalent to the holo-entropy if all attributes in the relevant subspace of <italic>C</italic> carry equal weight. Like the holo-entropy, <inline-formula id="j_info1146_ineq_258"><alternatives><mml:math>
<mml:mi mathvariant="italic">IC</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{IC}({C_{1}},{C_{2}})$]]></tex-math></alternatives></inline-formula> measures the compactness of <italic>C</italic> while considering only the relevant subspace and taking into account the contribution of individual attributes. The smaller <inline-formula id="j_info1146_ineq_259"><alternatives><mml:math>
<mml:mi mathvariant="italic">IC</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{IC}({C_{1}},{C_{2}})$]]></tex-math></alternatives></inline-formula> is, the more similar the intrinsic structures of <inline-formula id="j_info1146_ineq_260"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_261"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula>, and the more likely it is that they originated from a homogeneous class. To continue with the above example, based on information theory, the value of each <inline-formula id="j_info1146_ineq_262"><alternatives><mml:math>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$E(i)$]]></tex-math></alternatives></inline-formula> is shown in Table <xref rid="j_info1146_tab_007">7</xref> (∗ denotes a noise attribute). The intrinsic compactness of the merged subclusters, computed by Eq. (<xref rid="j_info1146_eq_017">16</xref>), is shown in Table <xref rid="j_info1146_tab_008">8</xref>.</p>
<table-wrap id="j_info1146_tab_007">
<label>Table 7</label>
<caption>
<p>Information entropy of attributes for each merged subcluster.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Clusters</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_263"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_264"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_265"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_266"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{4}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_267"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${A_{5}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_268"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}\cup {C_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">*</td>
<td style="vertical-align: top; text-align: left">*</td>
<td style="vertical-align: top; text-align: left">0.6829</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_269"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}\cup {C_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">0.5623</td>
<td style="vertical-align: top; text-align: left">0.6616</td>
<td style="vertical-align: top; text-align: left">0.5623</td>
<td style="vertical-align: top; text-align: left">1.0397</td>
<td style="vertical-align: top; text-align: left">0.6931</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_270"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}\cup {C_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.5938</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.6829</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.5938</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.9557</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.6829</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_info1146_tab_008">
<label>Table 8</label>
<caption>
<p>Compactness of merged subclusters.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Clusters</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_271"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_272"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_273"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{3}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_274"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">–</td>
<td style="vertical-align: top; text-align: left">0.2276</td>
<td style="vertical-align: top; text-align: left">0.7020</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_275"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">–</td>
<td style="vertical-align: top; text-align: left">–</td>
<td style="vertical-align: top; text-align: left">0.7067</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_276"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">–</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">–</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">–</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Because the intrinsic compactness measure is symmetric, only the upper triangle of the compactness matrix needs to be filled (see Table <xref rid="j_info1146_tab_008">8</xref>). The main phase of intrinsic compactness calculation is described as follows:</p>
<p><bold>Intrinsic compactness calculation</bold></p>
<p><bold>Input:</bold> <inline-formula id="j_info1146_ineq_277"><alternatives><mml:math>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">mat</mml:mi></mml:math><tex-math><![CDATA[$w\_ \mathit{mat}$]]></tex-math></alternatives></inline-formula> (weight matrix for subclusters)</p>
<p> <inline-formula id="j_info1146_ineq_278"><alternatives><mml:math>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">mat</mml:mi></mml:math><tex-math><![CDATA[$en\_ \mathit{mat}$]]></tex-math></alternatives></inline-formula> (attribute entropy matrix for subclusters)</p>
<p><bold>Output:</bold> <inline-formula id="j_info1146_ineq_279"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">matrix</mml:mi></mml:math><tex-math><![CDATA[$C\_ \mathit{matrix}$]]></tex-math></alternatives></inline-formula> (inter-subcluster compactness matrix)</p>
<p><bold>Begin</bold></p>
<p> Use Eq. (<xref rid="j_info1146_eq_017">16</xref>) to generate the <inline-formula id="j_info1146_ineq_280"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">matrix</mml:mi></mml:math><tex-math><![CDATA[$C\_ \mathit{matrix}$]]></tex-math></alternatives></inline-formula></p>
<p><bold>End</bold></p>
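<p>The calculation above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the nested-list layout of the weight and entropy matrices, and the use of None to mark attributes outside the relevant subspace (the ∗ entries of Table 7) are all hypothetical. The example weights of 1/3 are likewise made up for illustration; only the entropies are taken from the first row of Table 7.</p>
<preformat><![CDATA[
```python
def intrinsic_compactness(weights, entropies):
    """Eq. (16): sum of w(i) * E(i) over attributes in the relevant subspace.

    Assumption of this sketch: an attribute outside the relevant
    subspace is marked with None in both sequences.
    """
    return sum(
        w * e
        for w, e in zip(weights, entropies)
        if w is not None and e is not None
    )


def compactness_matrix(w_mat, en_mat):
    """Fill the upper triangle of the k x k inter-subcluster compactness matrix.

    w_mat[i][j] and en_mat[i][j] (for i < j) hold the attribute weights and
    attribute entropies of the cluster obtained by merging subclusters i and j.
    Entries on or below the diagonal are left as None.
    """
    k = len(w_mat)
    c_matrix = [[None] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            c_matrix[i][j] = intrinsic_compactness(w_mat[i][j], en_mat[i][j])
    return c_matrix


# Entropies of the merged subcluster C1 u C2 from Table 7 (None marks '*').
entropies_c1_c2 = [0.0, 0.0, None, None, 0.6829]
# Hypothetical equal weights on the three relevant attributes.
weights_c1_c2 = [1 / 3, 1 / 3, None, None, 1 / 3]
ic_c1_c2 = intrinsic_compactness(weights_c1_c2, entropies_c1_c2)
```
]]></preformat>
<p>With equal weights the result is simply the mean entropy over the relevant attributes, which illustrates how the measure reduces to a holo-entropy-like quantity when no attribute is favoured.</p>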
<p>The last step of our algorithm is cluster merging. First, the minimal value in the inter-cluster compactness matrix is found; the row and column of that value give the labels of the corresponding subclusters. This pair of subclusters is the most likely to come from a homogeneous class and is selected for merging. The entropies, relevant subspaces, weights and the inter-cluster compactness matrix are then updated for the next iteration. The main phase of cluster merging is described as follows:</p>
<p><bold>Cluster merging</bold></p>
<p><bold>Input:</bold> <inline-formula id="j_info1146_ineq_281"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$C=\{{C_{1}},{C_{2}},\dots ,{C_{k}}\}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_282"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">matrix</mml:mi></mml:math><tex-math><![CDATA[$C\_ \mathit{matrix}$]]></tex-math></alternatives></inline-formula></p>
<p><bold>Output:</bold> <inline-formula id="j_info1146_ineq_283"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$C=\{{C_{1}},{C_{2}},\dots ,{C_{k-1}}\}$]]></tex-math></alternatives></inline-formula></p>
<p><bold>Begin</bold></p>
<p> Find minimal compactness <inline-formula id="j_info1146_ineq_284"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$C(i,j)$]]></tex-math></alternatives></inline-formula> in the <inline-formula id="j_info1146_ineq_285"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mi mathvariant="normal">ˍ</mml:mi>
<mml:mi mathvariant="italic">matrix</mml:mi></mml:math><tex-math><![CDATA[$C\_ \mathit{matrix}$]]></tex-math></alternatives></inline-formula> and</p>
<p>  merge subclusters <italic>i</italic> and <italic>j</italic> into <inline-formula id="j_info1146_ineq_286"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo movablelimits="false">min</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{\min (i,j)}}$]]></tex-math></alternatives></inline-formula>;</p>
<p> Update <inline-formula id="j_info1146_ineq_287"><alternatives><mml:math>
<mml:mi mathvariant="italic">C</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo movablelimits="false">max</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$C=\{{C_{1}},{C_{2}},\dots ,{C_{k}}\}-{C_{\max (i,j)}}$]]></tex-math></alternatives></inline-formula> for the next iteration</p>
<p><bold>End</bold></p>
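<p>A single merging iteration as described above might look as follows in Python. This is a minimal sketch, not the authors' code: the function name, the representation of subclusters as lists of object identifiers, and the zero-based indexing are assumptions made for illustration. The compactness values in the example are those of Table 8; the object identifiers are made up.</p>
<preformat><![CDATA[
```python
def merge_step(clusters, c_matrix):
    """Merge the pair of subclusters with minimal intrinsic compactness.

    clusters : list of k subclusters, each a list of object ids (hypothetical)
    c_matrix : k x k nested list; only the upper triangle (i < j) is read
    Returns the k - 1 remaining subclusters and the merged index pair (i, j).
    """
    k = len(clusters)
    if k < 2:
        raise ValueError("at least two subclusters are needed for a merge")

    # Scan the upper triangle for the smallest compactness value.
    _, i, j = min(
        ((c_matrix[a][b], a, b) for a in range(k) for b in range(a + 1, k)),
        key=lambda entry: entry[0],
    )

    # Since i < j, the merged cluster keeps label min(i, j) = i and the
    # subcluster with label max(i, j) = j is removed from the list.
    merged = clusters[i] + clusters[j]
    remaining = [
        merged if idx == i else members
        for idx, members in enumerate(clusters)
        if idx != j
    ]
    return remaining, (i, j)


# Upper-triangular compactness values from Table 8 (indices 0, 1, 2 stand
# for C1, C2, C3); the object ids in each subcluster are hypothetical.
table8 = [
    [None, 0.2276, 0.7020],
    [None, None, 0.7067],
    [None, None, None],
]
subclusters = [[0, 1], [2, 3], [4, 5]]
new_clusters, pair = merge_step(subclusters, table8)  # pair == (0, 1)
```
]]></preformat>
<p>In this sketch the caller remains responsible for recomputing the entropies, relevant subspaces, weights and compactness matrix of the reduced cluster set before the next call.</p>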
</sec>
</sec>
</sec>
<sec id="j_info1146_s_010">
<label>5</label>
<title>Experiments and Results</title>
<p>In this section, we report experimental results for HPCCD on nine datasets from the UCI Machine Learning Repository and compare it with four state-of-the-art algorithms. We first describe the experimental design (Section <xref rid="j_info1146_s_011">5.1</xref>) and the evaluation criteria (Section <xref rid="j_info1146_s_012">5.2</xref>). We then present the performance results in terms of clustering accuracy (Section <xref rid="j_info1146_s_013">5.3</xref>) and analyse the feature relevance (Section <xref rid="j_info1146_s_014">5.4</xref>).</p>
<sec id="j_info1146_s_011">
<label>5.1</label>
<title>Experimental Design</title>
<p>In addition to HPCCD, we tested four state-of-the-art algorithms for comparison: MGR (Qin <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_038">2014</xref>), K-modes (Aggarwal <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_002">1999</xref>), COOLCAT (Barbará <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_006">2002</xref>) and LIMBO (Andritsos <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_003">2004</xref>). These algorithms were chosen for the following reasons. MGR (Mean Gain Ratio) is a divisive hierarchical clustering algorithm based on information theory; it selects a clustering attribute by its mean gain ratio and then detects an equivalence class on that attribute using the entropy of clusters. The partition-based K-modes algorithm is one of the first algorithms for clustering categorical data and is widely regarded as a benchmark. COOLCAT is an incremental heuristic algorithm based on information theory; it explores the relationship between the dataset and entropy, exploiting the fact that clusters of similar POIs (Points Of Interest) yield lower entropy than clusters of dissimilar ones. LIMBO is a hierarchical clustering algorithm based on the Information Bottleneck (IB) framework, which quantifies the relevant information preserved when clustering.</p>
<p>Nine real-life datasets obtained from the UCI Machine Learning Repository (UCI Machine Learning Repository, <xref ref-type="bibr" rid="j_info1146_ref_046">2011</xref>) were used to evaluate the clustering performance: Zoo, Congressional Votes (Votes), Chess, Nursery, Soybean, Mushroom, Balance Scale, Car Evaluation and Hayes-Roth. Information about the datasets is tabulated in Table <xref rid="j_info1146_tab_009">9</xref>.</p>
<table-wrap id="j_info1146_tab_009">
<label>Table 9</label>
<caption>
<p>UCI dataset description.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Dataset name</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Number of objects</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Number of attributes</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Number of classes</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Soybean</td>
<td style="vertical-align: top; text-align: left">47</td>
<td style="vertical-align: top; text-align: left">35</td>
<td style="vertical-align: top; text-align: left">4</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Votes</td>
<td style="vertical-align: top; text-align: left">435</td>
<td style="vertical-align: top; text-align: left">16</td>
<td style="vertical-align: top; text-align: left">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Mushroom</td>
<td style="vertical-align: top; text-align: left">8124</td>
<td style="vertical-align: top; text-align: left">22</td>
<td style="vertical-align: top; text-align: left">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Nursery</td>
<td style="vertical-align: top; text-align: left">12960</td>
<td style="vertical-align: top; text-align: left">8</td>
<td style="vertical-align: top; text-align: left">4</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Zoo</td>
<td style="vertical-align: top; text-align: left">101</td>
<td style="vertical-align: top; text-align: left">16</td>
<td style="vertical-align: top; text-align: left">7</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Chess</td>
<td style="vertical-align: top; text-align: left">3196</td>
<td style="vertical-align: top; text-align: left">36</td>
<td style="vertical-align: top; text-align: left">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Hayes-Roth</td>
<td style="vertical-align: top; text-align: left">132</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">3</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Balance scale</td>
<td style="vertical-align: top; text-align: left">625</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">3</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Car evaluation</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1728</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">6</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">4</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Our implementation runs on a desktop computer with a 2.4 GHz processor and 4 GB of memory. To reduce the effect of random factors, we ran each algorithm 10 times (with random initialization) on every dataset; all reported results are averages over these runs.</p>
</sec>
<sec id="j_info1146_s_012">
<label>5.2</label>
<title>Clustering Performance Index</title>
<p>We use two criteria, clustering accuracy and <inline-formula id="j_info1146_ineq_288"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula>, as performance indices for comparing HPCCD with the other algorithms. The clustering accuracy (<inline-formula id="j_info1146_ineq_289"><alternatives><mml:math>
<mml:mi mathvariant="italic">AC</mml:mi></mml:math><tex-math><![CDATA[$\mathit{AC}$]]></tex-math></alternatives></inline-formula>) is defined as follows: 
<disp-formula id="j_info1146_eq_018">
<label>(17)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">AC</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">K</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathit{AC}(D)=\frac{{\textstyle\sum _{i=1}^{K}}{C_{i}}}{|D|}\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_info1146_ineq_290"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">D</mml:mi>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|D|$]]></tex-math></alternatives></inline-formula> is the number of objects in the dataset, <italic>K</italic> is the number of classes in the test dataset, and <inline-formula id="j_info1146_ineq_291"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${C_{i}}$]]></tex-math></alternatives></inline-formula> is the number of objects in cluster <italic>i</italic> that belong to its majority class, i.e. the original class most represented in that cluster.</p>
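<p>As an illustration of Eq. (17), the accuracy can be computed from two label sequences as sketched below. The function name and the label encoding are assumptions made for this example. The sketch sums over the clusters actually produced, which matches Eq. (17) when the number of clusters equals the number of classes <italic>K</italic>.</p>
<preformat><![CDATA[
```python
from collections import Counter


def clustering_accuracy(cluster_labels, class_labels):
    """Eq. (17): sum of per-cluster majority-class counts divided by |D|.

    cluster_labels[n] is the cluster assigned to object n and
    class_labels[n] is its original class (both hypothetical encodings).
    """
    if len(cluster_labels) != len(class_labels):
        raise ValueError("label sequences must have the same length")
    if len(class_labels) == 0:
        raise ValueError("the dataset must not be empty")

    # Count, for every cluster, how many objects of each class it contains.
    class_counts = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        class_counts.setdefault(cluster, Counter())[cls] += 1

    # C_i is the size of the majority class inside cluster i.
    majority_total = sum(max(c.values()) for c in class_counts.values())
    return majority_total / len(class_labels)


# Toy example: cluster 0 holds two 'a' and one 'b'; cluster 1 holds three 'b'.
example_ac = clustering_accuracy([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"])
```
]]></preformat>
<p>In the toy example the majority counts are 2 and 3, so the accuracy is 5/6.</p>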
<p>To compare clustering results against an external criterion, we also use the Adjusted Rand Index (<inline-formula id="j_info1146_ineq_292"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula>) (Hubert and Arabie, <xref ref-type="bibr" rid="j_info1146_ref_029">1985</xref>; Yeung and Ruzzo, <xref ref-type="bibr" rid="j_info1146_ref_051">2001</xref>). <inline-formula id="j_info1146_ineq_293"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula> is a measure of agreement between two partitions, here the clustering result and the original classes. Given a dataset with <italic>n</italic> objects, suppose <inline-formula id="j_info1146_ineq_294"><alternatives><mml:math>
<mml:mi mathvariant="italic">U</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$U=\{{u_{1}},{u_{2}},\dots ,{u_{s}}\}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_295"><alternatives><mml:math>
<mml:mi mathvariant="italic">V</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$V=\{{v_{1}},{v_{2}},\dots ,{v_{t}}\}$]]></tex-math></alternatives></inline-formula> represent the original classes and the clustering result, respectively. <inline-formula id="j_info1146_ineq_296"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${n_{ij}}$]]></tex-math></alternatives></inline-formula> denotes the number of objects that are in both class <inline-formula id="j_info1146_ineq_297"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${u_{i}}$]]></tex-math></alternatives></inline-formula> and cluster <inline-formula id="j_info1146_ineq_298"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${v_{j}}$]]></tex-math></alternatives></inline-formula>, while <inline-formula id="j_info1146_ineq_299"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${u_{i}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_300"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${v_{j}}$]]></tex-math></alternatives></inline-formula> are the numbers of objects in class <inline-formula id="j_info1146_ineq_301"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${u_{i}}$]]></tex-math></alternatives></inline-formula> and cluster <inline-formula id="j_info1146_ineq_302"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${v_{j}}$]]></tex-math></alternatives></inline-formula>, respectively. 
<disp-formula id="j_info1146_eq_019">
<label>(18)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">ARI</mml:mi>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mtable equalrows="false" equalcolumns="false" columnalign="center">
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mn>2</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mtable equalrows="false" equalcolumns="false" columnalign="center">
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mn>2</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mtable equalrows="false" equalcolumns="false" columnalign="center">
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mn>2</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true">]</mml:mo>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mtable equalrows="false" equalcolumns="false" columnalign="center">
<mml:mtr>
<mml:mtd class="array">
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mn>2</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mstyle displaystyle="false">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>∗</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mtable equalrows="false" equalcolumns="false" columnalign="center">
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mn>2</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mtable equalrows="false" equalcolumns="false" columnalign="center">
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mn>2</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true">]</mml:mo>
<mml:mo>−</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mtable equalrows="false" equalcolumns="false" columnalign="center">
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mn>2</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mtable equalrows="false" equalcolumns="false" columnalign="center">
<mml:mtr>
<mml:mtd class="array">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mn>2</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true">]</mml:mo>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mtable equalrows="false" equalcolumns="false" columnalign="center">
<mml:mtr>
<mml:mtd class="array">
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mn>2</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathit{ARI}=\frac{{\textstyle\sum _{ij}}\binom{{n_{ij}}}{2}-\big[{\textstyle\sum _{i}}\binom{{u_{i}}}{2}{\textstyle\sum _{j}}\binom{{v_{j}}}{2}\big]/\binom{n}{2}}{\frac{1}{2}\ast \big[{\textstyle\sum _{i}}\binom{{u_{i}}}{2}+{\textstyle\sum _{j}}\binom{{v_{j}}}{2}\big]-\big[{\textstyle\sum _{i}}\binom{{u_{i}}}{2}{\textstyle\sum _{j}}\binom{{v_{j}}}{2}\big]/\binom{n}{2}}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
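<p>Equation (18) can be evaluated from the same kind of cluster-by-class count table. The Python sketch below is an illustration under stated assumptions rather than the authors' code: the names <monospace>adjusted_rand_index</monospace> and <monospace>ZOO_TABLE</monospace> are hypothetical, rows are taken to be clusters and columns to be classes, and the counts are those reported for the Zoo dataset in Table <xref rid="j_info1146_tab_010">10</xref>.</p>
<preformat><![CDATA[
```python
from math import comb
from typing import Sequence


def adjusted_rand_index(contingency: Sequence[Sequence[int]]) -> float:
    """Evaluate the ARI of Eq. (18) from a cluster-by-class count table."""
    row_sums = [sum(row) for row in contingency]        # cluster sizes
    col_sums = [sum(col) for col in zip(*contingency)]  # class sizes
    n = sum(row_sums)
    if n < 2:
        raise ValueError("at least two objects are required")

    sum_cells = sum(comb(c, 2) for row in contingency for c in row)
    sum_rows = sum(comb(r, 2) for r in row_sums)
    sum_cols = sum(comb(c, 2) for c in col_sums)

    expected = sum_rows * sum_cols / comb(n, 2)  # chance-level term
    maximum = 0.5 * (sum_rows + sum_cols)        # upper bound of the index
    if maximum == expected:
        # Happens, for example, when both partitions put all objects in one group.
        raise ValueError("ARI is undefined for these partitions")
    return (sum_cells - expected) / (maximum - expected)


# Rows are clusters 1..7; columns are the classes
# Mammal, Bird, Fish, Invertebrate, Insect, Amphibian, Reptile.
ZOO_TABLE = [
    [41, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 4, 0, 0],
    [0, 0, 0, 0, 0, 8, 2],
    [0, 20, 1, 0, 0, 0, 0],
    [0, 0, 0, 13, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 7],
    [0, 0, 4, 0, 0, 0, 1],
]

print(round(adjusted_rand_index(ZOO_TABLE), 4))  # prints 0.963
```
]]></preformat>
<p>For these counts the sketch gives approximately 0.9630, which agrees with the value reported for the Zoo dataset. The index is symmetric in the two partitions, so swapping rows and columns does not change the result.</p>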
<p>The closer the clustering result is to the real classes, the larger the value of the corresponding <inline-formula id="j_info1146_ineq_303"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula>. Based on these two evaluation standards, we analyse the performance of HPCCD and compare it with the other algorithms on nine real datasets from UCI. Finally, we also analyse the relationship between clusters and their relevant subspaces. By introducing the concepts of principal features and core features, we demonstrate the effectiveness of the relevant subspaces.</p>
</sec>
<sec id="j_info1146_s_013">
<label>5.3</label>
<title>Analysis of Clustering Accuracy</title>
<p>In this subsection, we report and analyse the clustering results of HPCCD on the datasets mentioned above. Tables <xref rid="j_info1146_tab_010">10</xref> and <xref rid="j_info1146_tab_011">11</xref> show the clustering results, accuracies and <inline-formula id="j_info1146_ineq_304"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula> values for HPCCD on the Zoo and Soybean datasets. Tables <xref rid="j_info1146_tab_012">12</xref> and <xref rid="j_info1146_tab_013">13</xref> show the accuracies and <inline-formula id="j_info1146_ineq_305"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula> values of the five algorithms on the nine datasets.</p>
<p>The Zoo dataset contains 101 objects in seven classes: Mammal, Bird, Fish, Invertebrate, Insect, Amphibian and Reptile. HPCCD obtains four pure clusters, each containing objects of a single class (Mammal, Invertebrate, Insect and Reptile), and three other clusters in which the majority class accounts for at least 80% of the objects. The distribution of the data points in each cluster is given in Table <xref rid="j_info1146_tab_010">10</xref>. The accuracy measure <inline-formula id="j_info1146_ineq_306"><alternatives><mml:math>
<mml:mi mathvariant="italic">AC</mml:mi></mml:math><tex-math><![CDATA[$\mathit{AC}$]]></tex-math></alternatives></inline-formula> is 0.9604, while the <inline-formula id="j_info1146_ineq_307"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula> is 0.9630.</p>
<table-wrap id="j_info1146_tab_010">
<label>Table 10</label>
<caption>
<p>Results of HPCCD on the Zoo dataset.</p>
</caption>
<table>
<thead>
<tr>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Cluster</td>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Instances</td>
<td colspan="7" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Classes</td>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Accuracy</td>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">ARI</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Mam</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Bir</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Fis</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Inv</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Ins</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Amp</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Rep</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">41</td>
<td style="vertical-align: top; text-align: left">41</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td rowspan="7" style="vertical-align: middle; text-align: left; border-bottom: solid thin">0.960</td>
<td rowspan="7" style="vertical-align: middle; text-align: left; border-bottom: solid thin">0.963</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">10</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">8</td>
<td style="vertical-align: top; text-align: left">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">21</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">20</td>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left">13</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">13</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">7</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">7</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">7</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">5</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">4</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_info1146_tab_011">
<label>Table 11</label>
<caption>
<p>Results of HPCCD on the Soybean dataset.</p>
</caption>
<table>
<thead>
<tr>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Cluster</td>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Instances</td>
<td colspan="4" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Classes</td>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Accuracy</td>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">ARI</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_308"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${D_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_309"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${D_{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_310"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${D_{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_311"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${D_{4}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">10</td>
<td style="vertical-align: top; text-align: left">10</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td rowspan="4" style="vertical-align: middle; text-align: left; border-bottom: solid thin">1.0</td>
<td rowspan="4" style="vertical-align: middle; text-align: left; border-bottom: solid thin">1.0</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">10</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">10</td>
<td style="vertical-align: top; text-align: left">0</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">17</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">0</td>
<td style="vertical-align: top; text-align: left">17</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">4</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">10</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">10</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_info1146_tab_012">
<label>Table 12</label>
<caption>
<p>Clustering accuracy of five algorithms on nine datasets.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Algorithm</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Zoo</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Vote</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Soybean</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Mushroom</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Chess</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Nursery</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Car-Evaluation</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Hayes-Roth</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Balance-Scale</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">HPCCD</td>
<td style="vertical-align: top; text-align: left"><bold>0.9604</bold></td>
<td style="vertical-align: top; text-align: left"><bold>0.9218</bold></td>
<td style="vertical-align: top; text-align: left"><bold>1.0</bold></td>
<td style="vertical-align: top; text-align: left">0.8641</td>
<td style="vertical-align: top; text-align: left"><bold>0.5823</bold></td>
<td style="vertical-align: top; text-align: left"><bold>0.5595</bold></td>
<td style="vertical-align: top; text-align: left"><bold>0.7</bold></td>
<td style="vertical-align: top; text-align: left"><bold>0.5152</bold></td>
<td style="vertical-align: top; text-align: left">0.652</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">MGR</td>
<td style="vertical-align: top; text-align: left">0.931</td>
<td style="vertical-align: top; text-align: left">0.828</td>
<td style="vertical-align: top; text-align: left">*</td>
<td style="vertical-align: top; text-align: left">0.667</td>
<td style="vertical-align: top; text-align: left">0.534</td>
<td style="vertical-align: top; text-align: left">0.53</td>
<td style="vertical-align: top; text-align: left"><bold>0.7</bold></td>
<td style="vertical-align: top; text-align: left">0.485</td>
<td style="vertical-align: top; text-align: left">0.635</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">K-modes</td>
<td style="vertical-align: top; text-align: left">0.6930</td>
<td style="vertical-align: top; text-align: left">0.8344</td>
<td style="vertical-align: top; text-align: left">0.8510</td>
<td style="vertical-align: top; text-align: left">0.5179</td>
<td style="vertical-align: top; text-align: left">0.5475</td>
<td style="vertical-align: top; text-align: left">0.4287</td>
<td style="vertical-align: top; text-align: left">0.5410</td>
<td style="vertical-align: top; text-align: left">0.4621</td>
<td style="vertical-align: top; text-align: left">0.6336</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">COOLCAT</td>
<td style="vertical-align: top; text-align: left">0.8900</td>
<td style="vertical-align: top; text-align: left">0.7816</td>
<td style="vertical-align: top; text-align: left">0.9362</td>
<td style="vertical-align: top; text-align: left">0.5220</td>
<td style="vertical-align: top; text-align: left">0.5228</td>
<td style="vertical-align: top; text-align: left">0.3775</td>
<td style="vertical-align: top; text-align: left"><bold>0.7</bold></td>
<td style="vertical-align: top; text-align: left">0.4394</td>
<td style="vertical-align: top; text-align: left"><bold>1.0</bold></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">LIMBO</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.9109</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.8230</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>1.0</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>0.8902</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.5222</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.4974</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>0.7</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.4621</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.6096</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_info1146_tab_013">
<label>Table 13</label>
<caption>
<p>Clustering ARI of five algorithms on nine datasets.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Algorithm</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Zoo</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Vote</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Soybean</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Mushroom</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Chess</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Nursery</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Car-Evaluation</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Hayes-Roth</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Balance-Scale</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">HPCCD</td>
<td style="vertical-align: top; text-align: left"><bold>0.9630</bold></td>
<td style="vertical-align: top; text-align: left"><bold>0.7109</bold></td>
<td style="vertical-align: top; text-align: left"><bold>1.0</bold></td>
<td style="vertical-align: top; text-align: left">0.5302</td>
<td style="vertical-align: top; text-align: left"><bold>0.0260</bold></td>
<td style="vertical-align: top; text-align: left"><bold>0.2181</bold></td>
<td style="vertical-align: top; text-align: left">0.0428</td>
<td style="vertical-align: top; text-align: left"><bold>0.0742</bold></td>
<td style="vertical-align: top; text-align: left">0.0923</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">MGR</td>
<td style="vertical-align: top; text-align: left">0.9617</td>
<td style="vertical-align: top; text-align: left">0.4279</td>
<td style="vertical-align: top; text-align: left">*</td>
<td style="vertical-align: top; text-align: left">0.1254</td>
<td style="vertical-align: top; text-align: left">0.0036</td>
<td style="vertical-align: top; text-align: left">0.1680</td>
<td style="vertical-align: top; text-align: left">0.0129</td>
<td style="vertical-align: top; text-align: left">0.0392</td>
<td style="vertical-align: top; text-align: left">0.1011</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">K-modes</td>
<td style="vertical-align: top; text-align: left">0.4782</td>
<td style="vertical-align: top; text-align: left">0.4463</td>
<td style="vertical-align: top; text-align: left">0.6586</td>
<td style="vertical-align: top; text-align: left">0.0533</td>
<td style="vertical-align: top; text-align: left">0.0082</td>
<td style="vertical-align: top; text-align: left">0.0565</td>
<td style="vertical-align: top; text-align: left"><bold>0.0540</bold></td>
<td style="vertical-align: top; text-align: left">0.0252</td>
<td style="vertical-align: top; text-align: left">0.0915</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">COOLCAT</td>
<td style="vertical-align: top; text-align: left">0.8586</td>
<td style="vertical-align: top; text-align: left">0.3154</td>
<td style="vertical-align: top; text-align: left">0.8214</td>
<td style="vertical-align: top; text-align: left">0.0018</td>
<td style="vertical-align: top; text-align: left">0.0018</td>
<td style="vertical-align: top; text-align: left">0.0083</td>
<td style="vertical-align: top; text-align: left">0.0500</td>
<td style="vertical-align: top; text-align: left">0.0261</td>
<td style="vertical-align: top; text-align: left"><bold>1.0</bold></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">LIMBO</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.8318</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.4159</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>1.0</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>0.6090</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.0082</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.0793</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.0285</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.0381</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.0684</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>For the Soybean dataset, our algorithm achieves an accuracy of 1 and an <inline-formula id="j_info1146_ineq_312"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula> of 1. Table <xref rid="j_info1146_tab_011">11</xref> shows the clustering results.</p>
<p>Tables <xref rid="j_info1146_tab_012">12</xref> and <xref rid="j_info1146_tab_013">13</xref> show the comparisons between HPCCD and the four comparison algorithms (MGR, K-modes, COOLCAT and LIMBO) in terms of clustering accuracy and <inline-formula id="j_info1146_ineq_313"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula> on the nine datasets. We can see that HPCCD achieves better results for both accuracy and <inline-formula id="j_info1146_ineq_314"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula> on most of the datasets. On the Zoo, Soybean, Vote, Chess, Nursery and Hayes-Roth datasets in particular, HPCCD obtains the best results among the compared algorithms in terms of both accuracy and <inline-formula id="j_info1146_ineq_315"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula>. On the dataset Mushroom, LIMBO gets the highest accuracy and <inline-formula id="j_info1146_ineq_316"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula> while HPCCD achieves higher accuracy and <inline-formula id="j_info1146_ineq_317"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula> than the other three algorithms. However, HPCCD, K-modes, MGR and COOLCAT share the same accuracy of 0.7 on the dataset Car-Evaluation, while K-modes gets the best <inline-formula id="j_info1146_ineq_318"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula> of all the algorithms. Finally, on the dataset Balance Scale, it is COOLCAT which obtains the best accuracy and <inline-formula id="j_info1146_ineq_319"><alternatives><mml:math>
<mml:mi mathvariant="italic">ARI</mml:mi></mml:math><tex-math><![CDATA[$\mathit{ARI}$]]></tex-math></alternatives></inline-formula>, and clusters the data perfectly. Overall, these results indicate that the proposed HPCCD is effective in terms of clustering quality on most of the datasets.</p>
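<p>As a point of reference for the <italic>ARI</italic> values in Table <xref rid="j_info1146_tab_013">13</xref>, the following Python sketch computes the adjusted Rand index from two label sequences. It assumes the standard Hubert and Arabie formulation; the function name <monospace>adjusted_rand_index</monospace> and the toy labels are illustrative and are not taken from the HPCCD implementation.</p>
<preformat><![CDATA[
```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two labelings (assumed Hubert and Arabie form)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)

    # Contingency counts n_ij and the two marginals (classes, clusters).
    pair_counts = Counter(zip(labels_true, labels_pred))
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)

    sum_ij = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in class_counts.values())
    sum_b = sum(comb(c, 2) for c in cluster_counts.values())
    total = comb(n, 2)

    expected = sum_a * sum_b / total if total else 0.0
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate partitions (e.g. a single cluster on both sides):
        # treated here as perfect agreement, which is a convention.
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Toy labels (hypothetical): the same partition under renamed cluster ids.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Toy labels (hypothetical): a partially matching partition.
partial_match = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
print(same_partition, round(partial_match, 4))
```
]]></preformat>
<p>Because the index is corrected for chance, it is invariant to the naming of the clusters and can be negative when agreement is below the chance level.</p>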
</sec>
<sec id="j_info1146_s_014">
<label>5.4</label>
<title>Analysis of Feature Relevance</title>
<p>The good performance of HPCCD shown in the previous section is due largely to the effective selection of the relevant subspace for each cluster. In this section, we provide some preliminary analysis of the benefit that these relevant subspaces bring to the characterization of the clusters. For this purpose, we build global feature subspaces from the individual feature subspaces and perform clustering using a different clustering algorithm, to test whether the feature subspaces obtained by HPCCD improve the results of other clustering algorithms. Without loss of generality, we report clustering results obtained by the K-modes algorithm.</p>
<p>For simplicity, we consider the intersection and the union of all the clusters' relevant subspaces. The features in the intersection set, named the core features, are relevant to all the clusters. Features in the union set are named principal features. Principal features thus include core features. Principal features that are not core features may be relevant only to some clusters and contribute to the structure of those clusters. The set of principal features thus corresponds to global feature selection, and the set of core features, if not empty, provides common features relevant to the structure of all clusters. It is expected that using the set of principal features will improve the clustering results compared to using the full feature space, because it allows one to avoid the noise features in the full feature space. It is also expected that using only the set of core features will still generate high-quality clusters, as these features are essential to all clusters.</p>
<p>The following experiment largely supports these expectations. We tested the K-modes algorithm on the principal feature subspace and the core feature subspace of the nine datasets. For the Zoo data, Table <xref rid="j_info1146_tab_014">14</xref> shows the relevant subspace for each cluster. For example, the relevant subspace of Cluster 1 comprises 11 features: <inline-formula id="j_info1146_ineq_320"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>8</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>9</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>10</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>11</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>12</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>14</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{1,2,3,4,5,8,9,10,11,12,14\}$]]></tex-math></alternatives></inline-formula>. The core feature space comprises features <inline-formula id="j_info1146_ineq_321"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>8</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>9</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>10</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>12</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{2,3,4,8,9,10,12\}$]]></tex-math></alternatives></inline-formula>. Cluster 1 also has four principal features <inline-formula id="j_info1146_ineq_322"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>11</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>14</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{1,5,11,14\}$]]></tex-math></alternatives></inline-formula> besides the core features. The principal feature space of the 7 clusters is the full feature space. If we cluster the dataset on the core feature space, the clustering accuracy is 0.8811. This clustering accuracy is higher than the result based on the full feature space (0.6930), as can be seen in Table <xref rid="j_info1146_tab_016">16</xref>.</p>
<table-wrap id="j_info1146_tab_014">
<label>Table 14</label>
<caption>
<p>Relevant subspace of each cluster for the Zoo dataset.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Cluster</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Relevant subspace</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 14</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 13, 15, 16</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">2, 3, 4, 6, 7, 8, 9, 10, 12, 14, 15, 16</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left">1, 2, 3, 4, 8, 9, 10, 11, 12, 13, 14</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left">1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">6</td>
<td style="vertical-align: top; text-align: left">1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">7</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16</td>
</tr>
</tbody>
</table>
</table-wrap>
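<p>The core and principal feature sets defined above can be derived mechanically from the per-cluster subspaces. The following Python sketch does so for the Zoo subspaces listed in Table <xref rid="j_info1146_tab_014">14</xref>; the helper name <monospace>core_and_principal</monospace> is an illustrative assumption, not part of HPCCD itself.</p>
<preformat><![CDATA[
```python
def core_and_principal(subspaces):
    """Return (core, principal): the intersection and the union of the
    per-cluster relevant feature sets."""
    feature_sets = [set(features) for features in subspaces.values()]
    if not feature_sets:
        return set(), set()
    core = set.intersection(*feature_sets)
    principal = set.union(*feature_sets)
    return core, principal


# Relevant subspace of each Zoo cluster, copied from Table 14
# (feature indices are 1-based, as in the table).
zoo_subspaces = {
    1: [1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 14],
    2: [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 13, 15, 16],
    3: [2, 3, 4, 6, 7, 8, 9, 10, 12, 14, 15, 16],
    4: [1, 2, 3, 4, 8, 9, 10, 11, 12, 13, 14],
    5: [1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14],
    6: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16],
    7: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16],
}

core, principal = core_and_principal(zoo_subspaces)
print(sorted(core))                            # [2, 3, 4, 8, 9, 10, 12]
print(sorted(principal) == list(range(1, 17)))  # True: the full feature space
print(sorted(set(zoo_subspaces[1]) - core))    # [1, 5, 11, 14]
```
]]></preformat>
<p>An empty intersection corresponds to the case in which a dataset has no core features.</p>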
<p>This result also seems compatible with our a priori knowledge of the data. Based on the details of the Zoo data from the UCI Machine Learning Repository (Andritsos <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1146_ref_003">2004</xref>), the 7 core features are feathers, eggs, milk, toothed, backbone, breathes and fins. This means that these features are relevant for all animals. For the Mammal class (Cluster 1), the additional features hair, airborne, venomous and tail are relevant. In practice, the core features are needed to characterize any animal. However, to decide whether an animal is a mammal, the core features alone are insufficient, and the features hair, airborne, venomous and tail are also needed. On the other hand, the features aquatic, predator, legs, domestic and cat size are not relevant for mammals. For the Bird class (Cluster 4), the features hair, venomous, legs and tail are relevant, whereas the features aquatic, airborne, predator, domestic and cat size are not.</p>
<p>Table <xref rid="j_info1146_tab_015">15</xref> gives the relevant subspace for each cluster of the Soybean dataset. Clusters 1 to 4 have <inline-formula id="j_info1146_ineq_323"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>7</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>11</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>12</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>20</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>21</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>22</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>23</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>24</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>26</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>27</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>28</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{2,3,4,5,7,11,12,20,21,22,23,24,26,27,28\}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_324"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>7</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>11</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>12</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>20</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>21</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>23</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>24</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>25</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>26</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>27</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>28</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>35</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{2,3,4,7,11,12,20,21,23,24,25,26,27,28,35\}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_info1146_ineq_325"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>8</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>11</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>12</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>21</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>22</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>23</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>24</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>25</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>26</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>27</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>28</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>35</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{2,3,8,11,12,21,22,23,24,25,26,27,28,35\}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_info1146_ineq_326"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>7</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>11</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>12</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>20</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>22</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>23</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>25</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>26</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>27</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>28</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>35</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{2,3,7,11,12,20,22,23,25,26,27,28,35\}$]]></tex-math></alternatives></inline-formula> as their respective relevant subspaces. The core feature space comprises <inline-formula id="j_info1146_ineq_327"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>11</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>12</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>23</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>26</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>27</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>28</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{2,3,11,12,23,26,27,28\}$]]></tex-math></alternatives></inline-formula>. This indicates that these features are relevant to all clusters, while <inline-formula id="j_info1146_ineq_328"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>7</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>20</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>21</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>22</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>24</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{4,5,7,20,21,22,24\}$]]></tex-math></alternatives></inline-formula> are also relevant to Cluster 1, and so on, as shown in Table <xref rid="j_info1146_tab_016">16</xref> (<inline-formula id="j_info1146_ineq_329"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast }}$]]></tex-math></alternatives></inline-formula> indicates that the core feature set is empty). The principal feature space comprises <inline-formula id="j_info1146_ineq_330"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>7</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>8</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>11</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>12</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>20</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>21</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>22</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>23</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>24</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>25</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>26</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>27</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>28</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>35</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{2,3,4,5,7,8,11,12,20,21,22,23,24,25,26,27,28,35\}$]]></tex-math></alternatives></inline-formula>. If we cluster the dataset on the principal feature space, the clustering accuracy is 1.0. This is higher than the result based on the full feature space (0.8510), as shown in Table <xref rid="j_info1146_tab_016">16</xref>. If we cluster the dataset on the core feature space, the clustering accuracy is 0.8085. This is lower than the result based on the full feature space (0.8510), because some important features, e.g. <inline-formula id="j_info1146_ineq_331"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>7</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{4,5,7\}$]]></tex-math></alternatives></inline-formula> have been removed.</p>
<table-wrap id="j_info1146_tab_015">
<label>Table 15</label>
<caption>
<p>Relevant subspace of each cluster for the Soybean dataset.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Cluster</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Dimensions of relevant subspace</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left">2, 3, 4, 5, 7, 11, 12, 20, 21, 22, 23, 24, 26, 27, 28</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left">2, 3, 4, 7, 11, 12, 20, 21, 23, 24, 25, 26, 27, 28, 35</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left">2, 3, 8, 11, 12, 21, 22, 23, 24, 25, 26, 27, 28, 35</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">4</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">2, 3, 7, 11, 12, 20, 22, 23, 25, 26, 27, 28, 35</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_info1146_tab_016">
<label>Table 16</label>
<caption>
<p>Clustering accuracy on full feature space, principal feature space and core feature space.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Dataset</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Full feature space</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Principal feature space</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Core feature space</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Zoo</td>
<td style="vertical-align: top; text-align: left">0.6930</td>
<td style="vertical-align: top; text-align: left">0.6930</td>
<td style="vertical-align: top; text-align: left">0.8811</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Vote</td>
<td style="vertical-align: top; text-align: left">0.8344</td>
<td style="vertical-align: top; text-align: left">0.8344</td>
<td style="vertical-align: top; text-align: left">0.8851</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Soybean</td>
<td style="vertical-align: top; text-align: left">0.8510</td>
<td style="vertical-align: top; text-align: left">1.0</td>
<td style="vertical-align: top; text-align: left">0.8085</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Chess</td>
<td style="vertical-align: top; text-align: left">0.5475</td>
<td style="vertical-align: top; text-align: left">0.5663</td>
<td style="vertical-align: top; text-align: left">0.5222</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Nursery</td>
<td style="vertical-align: top; text-align: left">0.4287</td>
<td style="vertical-align: top; text-align: left">0.5301</td>
<td style="vertical-align: top; text-align: left">0.342</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Car evaluation</td>
<td style="vertical-align: top; text-align: left">0.5410</td>
<td style="vertical-align: top; text-align: left">0.6887</td>
<td style="vertical-align: top; text-align: left">0.7</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Hayes-Roth</td>
<td style="vertical-align: top; text-align: left">0.4621</td>
<td style="vertical-align: top; text-align: left">0.5075</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_info1146_ineq_332"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast }}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Balance scale</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.6336</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.5056</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_info1146_ineq_333"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast }}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
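<p>The evaluation summarized in Table <xref rid="j_info1146_tab_016">16</xref> amounts to restricting each record to a chosen feature subset and then running K-modes on the reduced records. The following Python sketch illustrates this procedure under stated assumptions: it uses the simple matching dissimilarity, initializes the modes with the first <italic>k</italic> distinct records, and runs on hypothetical toy records. The names <monospace>project</monospace> and <monospace>k_modes</monospace> are illustrative, and the sketch is not the implementation used in the experiments.</p>
<preformat><![CDATA[
```python
from collections import Counter


def project(records, features):
    """Keep only the given 1-based feature indices of each record."""
    columns = sorted(features)
    return [tuple(record[f - 1] for f in columns) for record in records]


def mismatches(a, b):
    """Simple matching dissimilarity: number of differing attributes."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(records, k, max_iter=100):
    """Basic K-modes. Assumption: the initial modes are the first k
    distinct records, which keeps the sketch deterministic."""
    records = [tuple(record) for record in records]
    modes = list(dict.fromkeys(records))[:k]
    if len(modes) < k:
        raise ValueError("fewer than k distinct records")
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode, ties broken by lowest index.
        new_labels = [
            min(range(k), key=lambda j: mismatches(record, modes[j]))
            for record in records
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: per-attribute most frequent value of each cluster.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = tuple(
                    Counter(column).most_common(1)[0][0]
                    for column in zip(*members)
                )
    return labels, modes


# Hypothetical toy records: features 1 and 2 carry the group structure,
# features 3 and 4 behave like noise.
toy_records = [
    ("a", "x", "n1", "p"),
    ("a", "x", "n2", "q"),
    ("a", "x", "n3", "p"),
    ("b", "y", "n1", "q"),
    ("b", "y", "n2", "p"),
    ("b", "y", "n3", "q"),
]

subspace_labels, subspace_modes = k_modes(project(toy_records, {1, 2}), k=2)
print(subspace_labels)
```
]]></preformat>
<p>On these toy records, restricting the data to features 1 and 2 yields the labels [0, 0, 0, 1, 1, 1], which separates the two groups. The same call on the full records is sensitive to the noise features, which is the effect that the principal and core feature spaces are intended to reduce.</p>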
<p>Based on the discussion above, our approach not only detects the clusters and their relevant features, but also identifies the principal feature space and the core feature space. The core feature space captures the key cluster structure, because it is relevant to all clusters. This is important for knowledge mining from high-dimensional data.</p>
</sec>
</sec>
<sec id="j_info1146_s_015">
<label>6</label>
<title>Conclusions</title>
<p>Most hierarchical clustering algorithms rely on similarity measures computed on a common feature space, which is not effective for clustering high-dimensional data. In this paper, we have proposed a new way of exploiting information entropy, holo-entropy, attribute selection and attribute weighting to extract the feature subspace of each cluster and to merge clusters that have different feature subspaces. The new algorithm differs from existing mainstream hierarchical clustering algorithms in its use of a weighted holo-entropy to replace the pairwise-similarity-based measures for merging two subclusters. The advantages of our algorithm are as follows. First, it takes the interrelationships of the attributes into account and avoids the conditional independence hypothesis, which is implicitly made by most existing hierarchical clustering algorithms. Second, it employs entropy and holo-entropy to detect the relevant subspace of each subcluster, and finds the principal feature space and the core feature space within the whole feature space. Third, it uses intra-class compactness, rather than a traditional similarity measure, as the criterion for merging subclusters. Our experiments demonstrate the effectiveness of the new algorithm in terms of clustering accuracy and of the analysis of the relevant subspaces obtained.</p>
</sec>
</body>
<back>
<ack id="j_info1146_ack_001">
<title>Acknowledgements</title>
<p>This project is supported by the National Natural Science Foundation of China (61170130) and the State Key Laboratory of Rail Traffic Control and Safety (Contract No. RCS2012K005), Beijing Jiaotong University.</p></ack>
<ref-list id="j_info1146_reflist_001">
<title>References</title>
<ref id="j_info1146_ref_001">
<mixed-citation publication-type="journal"><string-name><surname>Aggarwal</surname>, <given-names>C.C.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>P.S.</given-names></string-name> (<year>2002</year>). <article-title>Redefining clustering for high dimensional applications</article-title>. <source>IEEE Transactions on Knowledge and Data Engineering</source>, <volume>14</volume>(<issue>2</issue>), <fpage>210</fpage>–<lpage>225</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_002">
<mixed-citation publication-type="chapter"><string-name><surname>Aggarwal</surname>, <given-names>C.C.</given-names></string-name>, <string-name><surname>Procopiuc</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Wolf</surname>, <given-names>J.L.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>P.S.</given-names></string-name>, <string-name><surname>Park</surname>, <given-names>J.S.</given-names></string-name> (<year>1999</year>). <chapter-title>Fast algorithms for projected clustering</chapter-title>. In: <source>Proceedings of the ACM SIGMOD</source>, Vol. <volume>99</volume>, pp. <fpage>61</fpage>–<lpage>72</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_003">
<mixed-citation publication-type="journal"><string-name><surname>Andritsos</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Tsaparas</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Miller</surname>, <given-names>R.J.</given-names></string-name>, <string-name><surname>Sevcik</surname>, <given-names>K.C.</given-names></string-name> (<year>2004</year>). <article-title>LIMBO: scalable clustering of categorical data</article-title>. <source>Lecture Notes in Computer Science</source>, <volume>2992</volume>, <fpage>123</fpage>–<lpage>146</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_004">
<mixed-citation publication-type="journal"><string-name><surname>Bai</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Liang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Dang</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Cao</surname>, <given-names>F.</given-names></string-name> (<year>2011</year>a). <article-title>A novel attribute weighting algorithm for clustering high-dimensional categorical data</article-title>. <source>Pattern Recognition</source>, <volume>44</volume>(<issue>12</issue>), <fpage>2843</fpage>–<lpage>2861</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_005">
<mixed-citation publication-type="journal"><string-name><surname>Bai</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Liang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Dang</surname>, <given-names>C.</given-names></string-name> (<year>2011</year>b). <article-title>An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data</article-title>. <source>Knowledge-Based Systems</source>, <volume>24</volume>, <fpage>785</fpage>–<lpage>795</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_006">
<mixed-citation publication-type="chapter"><string-name><surname>Barbará</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Couto</surname>, <given-names>J.</given-names></string-name> (<year>2002</year>). <chapter-title>COOLCAT: an entropy-based algorithm for categorical clustering</chapter-title>. In: <source>Proceedings of the Eleventh International Conference on Information and Knowledge Management</source>. <publisher-name>ACM</publisher-name>, pp. <fpage>582</fpage>–<lpage>589</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_007">
<mixed-citation publication-type="journal"><string-name><surname>Bouguessa</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>S.</given-names></string-name> (<year>2009</year>). <article-title>Mining projected clusters in high-dimensional spaces</article-title>. <source>IEEE Transactions on Knowledge and Data Engineering</source>, <volume>21</volume>(<issue>4</issue>), <fpage>507</fpage>–<lpage>522</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_008">
<mixed-citation publication-type="journal"><string-name><surname>Cao</surname>, <given-names>F.Y.</given-names></string-name>, <string-name><surname>Liang</surname>, <given-names>J.Y.</given-names></string-name>, <string-name><surname>Bai</surname>, <given-names>L.</given-names></string-name> (<year>2009</year>). <article-title>A new initialization method for categorical data clustering</article-title>. <source>Expert Systems with Applications</source>, <volume>36</volume>(<issue>7</issue>), <fpage>10223</fpage>–<lpage>10228</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_009">
<mixed-citation publication-type="journal"><string-name><surname>Choi</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Ryu</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Yoo</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Choi</surname>, <given-names>J.</given-names></string-name> (<year>2012</year>). <article-title>Combining relevancy and methodological quality into a single ranking for evidence-based medicine</article-title>. <source>Information Sciences</source>, <volume>214</volume>, <fpage>76</fpage>–<lpage>90</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_010">
<mixed-citation publication-type="journal"><string-name><surname>Cover</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Hart</surname>, <given-names>P.</given-names></string-name> (<year>1967</year>). <article-title>Nearest neighbor pattern classification</article-title>. <source>IEEE Transactions on Information Theory</source>, <volume>13</volume>, <fpage>21</fpage>–<lpage>27</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_011">
<mixed-citation publication-type="book"><string-name><surname>Cover</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Thomas</surname>, <given-names>J.</given-names></string-name> (<year>2012</year>). <source>Elements of Information Theory</source>. <publisher-name>John Wiley and Sons</publisher-name>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_012">
<mixed-citation publication-type="journal"><string-name><surname>Derrac</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Cornelis</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>García</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Herrera</surname>, <given-names>F.</given-names></string-name> (<year>2012</year>). <article-title>Enhancing evolutionary instance selection algorithms by means of fuzzy rough set based feature selection</article-title>. <source>Information Sciences</source>, <volume>186</volume>, <fpage>73</fpage>–<lpage>92</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_013">
<mixed-citation publication-type="chapter"><string-name><surname>Do</surname>, <given-names>H.J.</given-names></string-name>, <string-name><surname>Kim</surname>, <given-names>J.Y.</given-names></string-name> (<year>2008</year>). <chapter-title>Categorical data clustering using the combinations of attribute values</chapter-title>. In: <source>Computational Science and Its Applications – ICCSA 2008</source>, <series>Lecture Notes in Computer Science</series>, Vol. <volume>5073</volume>, pp. <fpage>220</fpage>–<lpage>231</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_014">
<mixed-citation publication-type="chapter"><string-name><surname>Domeniconi</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Papadopoulos</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Gunopulos</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Ma</surname>, <given-names>D.</given-names></string-name> (<year>2004</year>). <chapter-title>Subspace clustering of high dimensional data</chapter-title>. In: <source>SDM 2004</source>, pp. <fpage>73</fpage>–<lpage>93</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_015">
<mixed-citation publication-type="journal"><string-name><surname>Filippone</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Sanguinetti</surname>, <given-names>G.</given-names></string-name> (<year>2010</year>). <article-title>Information theoretic novelty detection</article-title>. <source>Pattern Recognition</source>, <volume>43</volume>, <fpage>805</fpage>–<lpage>814</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_016">
<mixed-citation publication-type="book"><string-name><surname>Fukunaga</surname>, <given-names>K.</given-names></string-name> (<year>2013</year>). <source>Introduction to Statistical Pattern Recognition</source>. <publisher-name>Academic Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_017">
<mixed-citation publication-type="journal"><string-name><surname>Gan</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>J.</given-names></string-name> (<year>2004</year>). <article-title>Subspace clustering for high dimensional categorical data</article-title>. <source>ACM SIGKDD Explorations Newsletter</source>, <volume>6</volume>(<issue>2</issue>), <fpage>87</fpage>–<lpage>94</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_018">
<mixed-citation publication-type="journal"><string-name><surname>Gan</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>J.</given-names></string-name> (<year>2005</year>). <article-title>A genetic <italic>k</italic>-modes algorithm for clustering categorical data</article-title>. <source>Lecture Notes in Computer Science</source>, <volume>3584</volume>, <fpage>195</fpage>–<lpage>202</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_019">
<mixed-citation publication-type="journal"><string-name><surname>Gan</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>Z.</given-names></string-name> (<year>2009</year>). <article-title>A genetic fuzzy <italic>k</italic>-modes algorithm for clustering categorical data</article-title>. <source>Expert Systems with Applications</source>, <volume>36</volume>, <fpage>1615</fpage>–<lpage>1620</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_020">
<mixed-citation publication-type="chapter"><string-name><surname>Ganti</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Gehrke</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Ramakrishnan</surname>, <given-names>R.</given-names></string-name> (<year>1999</year>). <chapter-title>CACTUS: clustering categorical data using summaries</chapter-title>. In: <source>Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>, <conf-loc>San Diego, CA, USA</conf-loc>, pp. <fpage>73</fpage>–<lpage>83</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_021">
<mixed-citation publication-type="book"><string-name><surname>Greenacre</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Blasius</surname>, <given-names>J.</given-names></string-name> (<year>2006</year>). <source>Multiple Correspondence Analysis and Related Methods</source>. <publisher-name>CRC Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_022">
<mixed-citation publication-type="journal"><string-name><surname>Guan</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Zhou</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Xiao</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Guo</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>T.</given-names></string-name> (<year>2013</year>). <article-title>Fast dimension reduction for document classification based on imprecise spectrum analysis</article-title>. <source>Information Sciences</source>, <volume>222</volume>, <fpage>147</fpage>–<lpage>162</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_023">
<mixed-citation publication-type="chapter"><string-name><surname>Guha</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Rastogi</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Shim</surname>, <given-names>K.</given-names></string-name> (<year>1999</year>). <chapter-title>ROCK: a robust clustering algorithm for categorical attributes</chapter-title>. In: <source>Proceedings of the 15th International Conference on Data Engineering</source>. <publisher-name>IEEE</publisher-name>, pp. <fpage>512</fpage>–<lpage>521</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_024">
<mixed-citation publication-type="journal"><string-name><surname>He</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Xu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Deng</surname>, <given-names>S.</given-names></string-name> (<year>2005</year>). <article-title>K-ANMI: a mutual information based clustering algorithm for categorical data</article-title>. <source>Information Fusion</source>, <volume>9</volume>(<issue>2</issue>), <fpage>223</fpage>–<lpage>233</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_025">
<mixed-citation publication-type="journal"><string-name><surname>He</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Xu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Deng</surname>, <given-names>S.</given-names></string-name> (<year>2011</year>). <article-title>Attribute value weighting in <italic>k</italic>-modes clustering</article-title>. <source>Expert Systems with Applications</source>, <volume>38</volume>, <fpage>15365</fpage>–<lpage>15369</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_026">
<mixed-citation publication-type="journal"><string-name><surname>Hruschka</surname>, <given-names>E.R.</given-names></string-name>, <string-name><surname>Campello</surname>, <given-names>R.J.G.B.</given-names></string-name>, <string-name><surname>de Castro</surname>, <given-names>L.N.</given-names></string-name> (<year>2006</year>). <article-title>Evolving clusters in gene-expression data</article-title>. <source>Information Sciences</source>, <volume>176</volume>(<issue>13</issue>), <fpage>1898</fpage>–<lpage>1927</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_027">
<mixed-citation publication-type="journal"><string-name><surname>Huang</surname>, <given-names>Z.</given-names></string-name> (<year>1998</year>). <article-title>Extensions to the <italic>k</italic>-means algorithm for clustering large data sets with categorical values</article-title>. <source>Data Mining and Knowledge Discovery</source>, <volume>2</volume>(<issue>3</issue>), <fpage>283</fpage>–<lpage>304</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_028">
<mixed-citation publication-type="journal"><string-name><surname>Huang</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Rong</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>Z.</given-names></string-name> (<year>2005</year>). <article-title>Automated variable weighting in <italic>k</italic>-means type clustering</article-title>. <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, <volume>27</volume>(<issue>5</issue>), <fpage>657</fpage>–<lpage>668</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_029">
<mixed-citation publication-type="journal"><string-name><surname>Hubert</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Arabie</surname>, <given-names>P.</given-names></string-name> (<year>1985</year>). <article-title>Comparing partitions</article-title>. <source>Journal of Classification</source>, <volume>2</volume>(<issue>1</issue>), <fpage>193</fpage>–<lpage>218</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_030">
<mixed-citation publication-type="chapter"><string-name><surname>Jollois</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Nadif</surname>, <given-names>M.</given-names></string-name> (<year>2002</year>). <chapter-title>Clustering large categorical data</chapter-title>. In: <source>Proceedings of Pacific Asia Conference on Knowledge Discovery in Databases (PAKDD02)</source>, pp. <fpage>257</fpage>–<lpage>263</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_031">
<mixed-citation publication-type="journal"><string-name><surname>Kim</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>D.</given-names></string-name> (<year>2004</year>). <article-title>Fuzzy clustering of categorical data using fuzzy centroids</article-title>. <source>Pattern Recognition Letters</source>, <volume>25</volume>(<issue>11</issue>), <fpage>1263</fpage>–<lpage>1271</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_032">
<mixed-citation publication-type="journal"><string-name><surname>Li</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Jajodia</surname>, <given-names>S.</given-names></string-name> (<year>2006</year>). <article-title>Looking into the seeds of time: discovering temporal patterns in large transaction sets</article-title>. <source>Information Sciences</source>, <volume>176</volume>(<issue>8</issue>), <fpage>1003</fpage>–<lpage>1031</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_033">
<mixed-citation publication-type="journal"><string-name><surname>Li</surname>, <given-names>H.X.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>J.-L.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Fan</surname>, <given-names>B.</given-names></string-name> (<year>2013</year>). <article-title>Probabilistic support vector machines for classification of noise affected data</article-title>. <source>Information Sciences</source>, <volume>221</volume>, <fpage>60</fpage>–<lpage>71</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_034">
<mixed-citation publication-type="journal"><string-name><surname>Lingras</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Hogo</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Snorek</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>West</surname>, <given-names>C.</given-names></string-name> (<year>2005</year>). <article-title>Temporal analysis of clusters of supermarket customers: conventional versus interval set approach</article-title>. <source>Information Sciences</source>, <volume>172</volume>(<issue>1–2</issue>), <fpage>215</fpage>–<lpage>240</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_035">
<mixed-citation publication-type="journal"><string-name><surname>Lu</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Zhou</surname>, <given-names>C.</given-names></string-name> (<year>2011</year>). <article-title>Particle swarm optimizer for variable weighting in clustering high dimensional data</article-title>. <source>Machine Learning</source>, <volume>82</volume>(<issue>1</issue>), <fpage>43</fpage>–<lpage>70</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_036">
<mixed-citation publication-type="chapter"><string-name><surname>Nemalhabib</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Shiri</surname>, <given-names>N.</given-names></string-name> (<year>2006</year>). In: <source>ACM Symposium on Applied Computing</source>, pp. <fpage>637</fpage>–<lpage>638</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_037">
<mixed-citation publication-type="journal"><string-name><surname>Parsons</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Haque</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>H.</given-names></string-name> (<year>2004</year>). <article-title>Subspace clustering for high dimensional data: a review</article-title>. <source>ACM SIGKDD Explorations Newsletter</source>, <volume>6</volume>(<issue>1</issue>), <fpage>90</fpage>–<lpage>105</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_038">
<mixed-citation publication-type="journal"><string-name><surname>Qin</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Ma</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Herawan</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Zain</surname>, <given-names>J.M.</given-names></string-name> (<year>2014</year>). <article-title>MGR: an information theory based hierarchical divisive clustering algorithm for categorical data</article-title>. <source>Knowledge-Based Systems</source>, <volume>67</volume>, <fpage>401</fpage>–<lpage>411</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_039">
<mixed-citation publication-type="journal"><string-name><surname>Sabit</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Al-Anbuky</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Gholamhosseini</surname>, <given-names>H.</given-names></string-name> (<year>2011</year>). <article-title>Data stream mining for wireless sensor networks environment: energy efficient fuzzy clustering algorithm</article-title>. <source>International Journal of Autonomous and Adaptive Communications Systems</source>, <volume>4</volume>, <fpage>383</fpage>–<lpage>397</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_040">
<mixed-citation publication-type="journal"><string-name><surname>Santos</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Brezo</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Ugarte-Pedrero</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Bringas</surname>, <given-names>P.G.</given-names></string-name> (<year>2013</year>). <article-title>Opcode sequences as representation of executables for data-mining-based unknown malware detection</article-title>. <source>Information Sciences</source>, <volume>231</volume>, <fpage>64</fpage>–<lpage>82</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_041">
<mixed-citation publication-type="journal"><string-name><surname>Schwenker</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Trentin</surname>, <given-names>E.</given-names></string-name> (<year>2014</year>). <article-title>Pattern classification and clustering: a review of partially supervised learning approaches</article-title>. <source>Pattern Recognition Letters</source>, <volume>37</volume>, <fpage>4</fpage>–<lpage>14</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_042">
<mixed-citation publication-type="journal"><string-name><surname>Shannon</surname>, <given-names>C.E.</given-names></string-name> (<year>1948</year>). <article-title>A mathematical theory of communication</article-title>. <source>The Bell System Technical Journal</source>, <volume>XXVII</volume>(<issue>3</issue>), <fpage>379</fpage>–<lpage>423</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_043">
<mixed-citation publication-type="journal"><string-name><surname>Shrivastava</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Tyagi</surname>, <given-names>V.</given-names></string-name> (<year>2014</year>). <article-title>Content based image retrieval based on relative locations of multiple regions of interest using selective regions matching</article-title>. <source>Information Sciences</source>, <volume>259</volume>, <fpage>212</fpage>–<lpage>224</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_044">
<mixed-citation publication-type="other"><string-name><surname>Srinivasa</surname>, <given-names>S.</given-names></string-name> (<year>2005</year>). <article-title>A review on multivariate mutual information</article-title>. University of Notre Dame, Notre Dame, Indiana, 2, 1–6.</mixed-citation>
</ref>
<ref id="j_info1146_ref_045">
<mixed-citation publication-type="chapter"><string-name><surname>Tan</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Cheng</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Ghanem</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Xu</surname>, <given-names>H.</given-names></string-name> (<year>2005</year>). <chapter-title>A novel refinement approach for text categorization</chapter-title>. In: <source>Proceedings of the ACM 14th Conference on Information and Knowledge Management</source>, pp. <fpage>469</fpage>–<lpage>476</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_046">
<mixed-citation publication-type="other"><string-name><surname>UCI Machine Learning Repository</surname></string-name> (<year>2011</year>). <uri>http://www.ics.uci.edu/~mlearn/MLRepository.html</uri>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_047">
<mixed-citation publication-type="journal"><string-name><surname>Watanabe</surname>, <given-names>S.</given-names></string-name> (<year>1960</year>). <article-title>Information theoretical analysis of multivariate correlation</article-title>. <source>IBM Journal of Research and Development</source>, <volume>4</volume>, <fpage>66</fpage>–<lpage>82</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_048">
<mixed-citation publication-type="journal"><string-name><surname>Wu</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>S.</given-names></string-name> (<year>2013</year>). <article-title>Information-theoretic outlier detection for large-scale categorical data</article-title>. <source>IEEE Transactions on Knowledge and Data Engineering</source>, <volume>25</volume>(<issue>3</issue>), <fpage>589</fpage>–<lpage>601</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_049">
<mixed-citation publication-type="journal"><string-name><surname>Xiong</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Mayers</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Monga</surname>, <given-names>E.</given-names></string-name> (<year>2011</year>). <article-title>DHCC: divisive hierarchical clustering of categorical data</article-title>. <source>Data Mining and Knowledge Discovery</source>, <volume>24</volume>(<issue>1</issue>), <fpage>103</fpage>–<lpage>135</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_050">
<mixed-citation publication-type="journal"><string-name><surname>Yao</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Dash</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Tan</surname>, <given-names>S.T.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>H.</given-names></string-name> (<year>2000</year>). <article-title>Entropy-based fuzzy clustering and fuzzy modeling</article-title>. <source>Fuzzy Sets and Systems</source>, <volume>113</volume>(<issue>3</issue>), <fpage>381</fpage>–<lpage>388</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_051">
<mixed-citation publication-type="journal"><string-name><surname>Yeung</surname>, <given-names>K.Y.</given-names></string-name>, <string-name><surname>Ruzzo</surname>, <given-names>W.L.</given-names></string-name> (<year>2001</year>). <article-title>Details of the adjusted Rand index and clustering algorithms, supplement to the paper “An empirical study on principal component analysis for clustering gene expression data”</article-title>. <source>Bioinformatics</source>, <volume>17</volume>(<issue>9</issue>), <fpage>763</fpage>–<lpage>774</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_052">
<mixed-citation publication-type="journal"><string-name><surname>Yip</surname>, <given-names>K.Y.L.</given-names></string-name>, <string-name><surname>Cheung</surname>, <given-names>D.W.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>M.K.</given-names></string-name> (<year>2004</year>). <article-title>HARP: a practical projected clustering algorithm</article-title>. <source>IEEE Transactions on Knowledge and Data Engineering</source>, <volume>16</volume>(<issue>11</issue>), <fpage>1387</fpage>–<lpage>1397</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_053">
<mixed-citation publication-type="journal"><string-name><surname>Yu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>S.H.</given-names></string-name>, <string-name><surname>Jeon</surname>, <given-names>M.</given-names></string-name> (<year>2012</year>). <article-title>An adaptive ACO-based fuzzy clustering algorithm for noisy image segmentation</article-title>. <source>International Journal of Innovative Computing, Information and Control</source>, <volume>8</volume>(<issue>6</issue>), <fpage>3907</fpage>–<lpage>3918</lpage>.</mixed-citation>
</ref>
<ref id="j_info1146_ref_054">
<mixed-citation publication-type="journal"><string-name><surname>Zhang</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Fang</surname>, <given-names>Z.</given-names></string-name> (<year>2013</year>). <article-title>An improved K-means clustering algorithm</article-title>. <source>Journal of Information and Computational Science</source>, <volume>10</volume>(<issue>1</issue>), <fpage>193</fpage>–<lpage>199</lpage>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>