Holo-Entropy Based Categorical Data Hierarchical Clustering

Sun, Haojun; Chen, Rongbo; Qin, Yong; Wang, Shengrui

doi:10.15388/Informatica.2017.131

Informatica

Holo-Entropy Based Categorical Data Hierarchical Clustering

Volume 28, Issue 2 (2017), pp. 303–328

Haojun Sun Rongbo Chen Yong Qin Shengrui Wang

https://doi.org/10.15388/Informatica.2017.131

Pub. online: 1 January 2017 Type: Research Article

Open Access

Received
1 January 2016

Accepted
1 March 2017

Published
1 January 2017

Abstract

Clustering high-dimensional data is a challenging task in data mining, and clustering high-dimensional categorical data is even more challenging because it is more difficult to measure the similarity between categorical objects. Most algorithms assume feature independence when computing similarity between data objects, or make use of computationally demanding techniques such as PCA for numerical data. Hierarchical clustering algorithms are often based on similarity measures computed on a common feature space, which is not effective when clustering high-dimensional data. Subspace clustering algorithms discover feature subspaces for clusters, but are mostly partition-based; i.e. they do not produce a hierarchical structure of clusters. In this paper, we propose a hierarchical algorithm for clustering high-dimensional categorical data, based on a recently proposed information-theoretical concept named holo-entropy. The algorithm proposes new ways of exploring entropy, holo-entropy and attribute weighting in order to determine the feature subspace of a cluster and to merge clusters even though their feature subspaces differ. The algorithm is tested on UCI datasets, and compared with several state-of-the-art algorithms. Experimental results show that the proposed algorithm yields higher efficiency and accuracy than the competing algorithms and allows higher reproducibility.

References

Aggarwal, C.C., Yu, P.S. (2002). Redefining clustering for high dimensional applications. IEEE Transactions on Knowledge and Data Engineering, 14(2), 210–225.

Aggarwal, C.C., Procopiuc, C., Wolf, J.L., Yu, P.S., Park, J.S. (1999). Fast algorithms for projected clustering. In: Proceedings of the ACM SIGMOD, Vol. 99, pp. 61–72.

Andritsos, P., Tsaparas, P., Miller, R.J., Sevcik, K.C. (2004). LIMBO: scalable clustering of categorical data. Lecture Notes in Computer Science, 2992, 123–146.

Bai, L., Liang, J., Dang, C., Cao, F. (2011a). A novel attribute weighting algorithm for clustering high-dimensional categorical data. Pattern Recognition, 44(12), 2843–2861.

Bai, L., Liang, J., Dang, C. (2011b). An initialization method to simultaneously find initial cluster and the number of clusters for clustering categorical data. Knowledge and Information Systems, 24, 785–795.

Barbar, D., Li, Y., Couto, J. (2002). COOLCAT: an entropy-based algorithm for categorical clustering. In: Proceedings of the eleventh International Conference on Information and Knowledge Management. ACM, pp. 582–589.

Bouguessa, M., Wang, S. (2009). Mining projected clusters in high-dimensional spaces. IEEE Transactions on Knowledge and Data Engineering, 21(4), 507–522.

Cao, F.Y., Liang, J.Y., Bai, L. (2009). A new initialization method for categorical data clustering. Expert Systems with Applications, 33(7), 10223–10228.

Choi, S., Ryu, B., Yoo, S., Choi, J. (2012). Combining relevancy and methodological quality into a single ranking for evidence-based medicine. Information Sciences, 214, 76–90.

Cover, T., Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 21–27.

Cover, T., Thomas, J. (2012). Elements of Information Theory. John Wiley and Sons.

Derrac, J., Cornelis, C., Garca, S., Herrera, F. (2012). Enhancing evolutionary instance selection algorithms by means of fuzzy rough set based feature selection. Information Sciences, 186, 73–92.

Do, H.J., Kim, J.Y. (2008). Categorical data clustering using the combinations of attribute values. In Series Lecture Notes in Computer Science: Vol. 5073. Computational Science and Its Applications – ICCSA 2008, pp. 220–231.

Domeniconi, C., Papadopoulos, D., Gunopulos, D., Ma, D. (2004). Subspace clustering of high dimensional data. In: SDM 2004, pp. 73–93.

Filippone, M., Sanguinetti, G. (2010). Information theoretic novelty detection. Pattern Recognition, 43, 805–814.

Fukunaga, K. (2013). Introduction to Statistical Pattern Recognition. Academic Press.

Gan, G., Wu, J. (2004). Subspace clustering for high dimensional categorical data. ACM SIGKDD Explorations Newsletter, 6(2), 87–94.

Gan, G., Yang, Z., Wu, J. (2005). A genetic k-modes algorithm for clustering categorical data. Lecture Notes in Computer Science, 3584, 195–202.

Gan, G., Wu, J., Yang, Z. (2009). A genetic fuzzy k-modes algorithm for clustering categorical data. Expert Systems with Applications, 36, 1615–1620.

Ganti, V., Gehrke, J., Ramakrishnan, R. (1999). CACTUS: clustering categorical data using summaries. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Datamining, San Diego, CA, USA, pp. 73–83.

Greenacre, M., Blasius, J. (2006). Multiple Correspondence Analysis and Related Methods. CRC Press.

Guan, H., Zhou, J., Xiao, B., Guo, M., Yang, T. (2013). Fast dimension reduction for document classification based on imprecise spectrum analysis. Information Sciences, 222, 147–162.

Guha, S., Rastogi, R., Shim, K. (1999). ROCK: a robust clustering algorithm for categorical attributes. In: Data Engineering, 1999. Proceedings, 15th International Conference on IEEE, pp. 512–521.

He, Z., Xu, X., Deng, S. (2005). K-ANMI: a mutual information based clustering algorithm for categorical data. Information Fusion, 9(2), 223–233.

He, Z., Xu, X., Deng, S. (2011). Attribute value weighting in k-modes clustering. Expert Systems with Applications, 38, 15365–15369.

Hruschka, E.R., Campello, R.J.G.B., de Castro, L.N. (2006). Evolving clusters in gene-expression data. Information Sciences, 176(13), 1898–1927.

Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 283–304.

Huang, Z., Ng, M., Rong, H., Li, Z. (2005). Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 657–668.

Hubert, L., Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.

Jollois, F., Nadif, M. (2002). Clustering large categorical data. In: Proceedings of Pacific Asia Conference on Knowledge Discovery in Databases (PAKDD02), pp. 257–263.

Kim, D., Lee, K., Lee, D. (2004). Fuzzy clustering of categorical data using fuzzy centroids. Pattern Recognition Letters, 25(11), 1263–1271.

Li, Y., Zhu, S., Wang, X., Jajodia, S. (2006). Looking into the seeds of time: discovering temporal patterns in large transaction sets. Information Sciences, 176(8), 1003–1031.

Li, H.X., Yang, J.-L., Zhang, G., Fan, B. (2013). Probabilistic support vector machines for classification of noise affected data. Information Sciences, 221, 60–71.

Lingras, P., Hogo, M., Snorek, M., West, C. (2005). Temporal analysis of clusters of supermarket customers: conventional versus interval set approach. Information Sciences, 172(1–2), 215–240.

Lu, Y., Wang, S., Li, S., Zhou, C. (2011). Particle swarm optimizer for variable weighting in clustering high dimensional data. Machine Learning, 82(1), 43–70.

Nemalhabib, A., Shiri, N. (2006). In: ACM Symposium on Applied Computing, pp. 637–638.

Parsons, L., Haque, E., Liu, H. (2004). Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter, 6(1), 90–105.

Qin, H., Ma, X., Herawan, T., Zain, J.M. (2014). MGR: an information theory based hierarchical divisive clustering algorithm for categorical data. Knowledge-Based Systems, 67, 401–411.

Sabit, H., Al-Anbuky, A., Gholamhosseini, H. (2011). Data stream mining for wireless sensor networks environment: energy efficient fuzzy clustering algorithm. International Journal of Autonomous and Adaptive Communications Systems, 4, 383–397.

Santos, F., Brezo, X., Ugarte-Pedrero, P., Bringas, G. (2013). Opcode sequences as representation of executables for data-mining-based unknown malware detection. Information Sciences, 231, 64–82.

Schwenkera, F., Trentin, E. (2014). Pattern classification and clustering: a review of partially supervised learning approaches. Pattern Recognition Letters, 37, 4–14.

Shannon, C.E. (1948). A mathematical theory of communication. The Bell System Technical Journal, XXVII(3), 379–423.

Shrivastava, N., Tyagi, V. (2014). Content based image retrieval based on relative locations of multiple regions of interest using selective regions matching. Information Sciences, 259, 212–224.

Srinivasa, S. (2005). A review on multivariate mutual information. University of Notre Dame, Notre Dame, Indiana, 2, 1–6.

Tan, S., Cheng, X., Ghanem, M., Wang, B., Xu, H. (2005). A novel refinement approach for text categorization. In: Proceedings of the ACM 14th Conference on Information and Knowledge Management, pp. 469–476.

UCI Machine Learning Repository (2011). http://www.ics.uci.edu/mlearn/MLRepository.html.

Watanabe, S. (1960). Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4, 66–82.

Wu, S., Wang, S. (2013). Information-theoretic outlier detection for large-scale categorical data. IEEE Transactions on Knowledge and Data Engineering, 25(3), 589–601.

Xiong, T., Wang, S., Mayers, A., Monga, E. (2011). DHCC: divisive hierarchical clustering of categorical data. Data Mining and Knowledge Discovery, 24(1), 103–135.

Yao, J., Dash, M., Tan, S.T., Liu, H. (2000). Entropy-based fuzzy clustering and fuzzy modeling. Fuzzy Sets and Systems, 113(3), 381–388.

Yeung, K.Y., Ruzzo, W.L. (2001). Details of the adjusted Rand index and clustering algorithms, supplement to the paper. An empirical study on principal component analysis for clustering gene expression data. Bioinformatics, 17(9), 763–774.

Yip, K.Y.L., Cheng, D.W., Ng, M.K. (2004). HARP: a practical projected clustering algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(11), 1387–1397.

Yu, J., Lee, S.H., Jeon, M. (2012). An adaptive ACO-based fuzzy clustering algorithm for noisy image segmentation. International Journal of Innovative Computing, Information and Control, 8(6), 3907–3918.

Zhang, C., Fang, Z. (2013). An improved K-means clustering algorithm. Journal of Information and Computational Science, 10(1), 193–199.

Biographies

Sun Haojun

haojunsun@stu.edu.cn

H. Sun is a professor at the Department of Computer, Shantou University, China. His main research interests are in data mining, machine learning, pattern recognition, etc.

Chen Rongbo

13rbchen@stu.edu.cn

R. Chen was awarded the candidate of master’s degree at computer science, Shantou University, His main research interests include data mining, machine learning, pattern recognition, etc.

Qin Yong

yqin@bjtu.edu.cn

Y. Qin is a professor in State Key Laboratory of Rail Traffic Control and Safety Beijing Jiaotong University. The main research interest is intelligent transportation system, traffic safety engineering, intelligent control theory.

Wang Shengrui

shengrui.wang@usherbrooke.ca

S. Wang is a professor in Department of Computer Science University of Sherbrooke, Canada. In research, he is interested in pattern recognition, data mining, bio-informatics, neural networks, image processing, remote sensing, GIS. His current projects include high-dimensional data clustering, categorical data clustering, data streams mining, protein and RNA sequences mining, graph matching and graph clustering, fuzzy clustering and variable selection for data mining, location-based services, bankruptcy prediction, business intelligence.

Full article Cited by

Open access article under the CC BY license.

Keywords

hierarchical clustering holo-entropy subspace categorical data

Metrics

since January 2020

1203

Article info
views

581

Full article
views

566

PDF
downloads

222

XML
downloads

RSS

Authors

Abstract

References

Biographies

Export citation

Copy and paste formatted citation

Download citation in file