Approach for Multi-Label Text Data Class Verification and Adjustment Based on Self-Organizing Map and Latent Semantic Analysis

Stefanovič, Pavel; Kurasova, Olga

doi:10.15388/22-INFOR473

Informatica

Volume 33, Issue 1 (2022), pp. 109–130

Pub. online: 10 January 2022

Type: Research Article

Open Access

Received: 1 June 2021

Accepted: 1 January 2022

Published: 10 January 2022

Abstract

This paper proposes a new approach for verifying and adjusting the classes of multi-label text data. The approach supports semi-automated revision of class assignments in order to improve data quality, which significantly influences the accuracy of models built on the data, for example in classification tasks. It can also be useful for other data analysis tasks. The approach combines a text similarity measure with two methods: latent semantic analysis and the self-organizing map. First, the text data are pre-processed by applying various filters that remove unnecessary and irrelevant information. Latent semantic analysis is then used to reduce the dimensionality of the vectors that represent each text in the analysed data. The cosine similarity measure is used to determine which classes of a multi-label text should be changed or adjusted. The self-organizing map is the key method used to detect similarity between texts and to decide on new class assignments. The experimental investigation was performed on newly collected multi-label text data: financial news in the Lithuanian language, collected from four public websites and manually classified by experts into ten classes. Various parameters of the methods were analysed, and their influence on the final results was estimated. The final results were validated by experts. The research showed that the proposed approach can help to verify and adjust multi-label text data classes: 82% of the assignments are correct when the data dimensionality is reduced to 40 using latent semantic analysis and the self-organizing map size is reduced from 40 to 5 in steps of 5.
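The abstract names the building blocks but not how they are wired together, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the texts have already been pre-processed and reduced by latent semantic analysis (the toy 2-D vectors stand in for that output). The rule "flag a label when the cosine similarity between a text and the mean vector of that class falls below a threshold", the threshold value 0.5, the class names, and all function names are hypothetical choices made for illustration. The self-organizing map is a plain rectangular grid with a Gaussian neighbourhood and linearly decaying learning rate, which may differ from the configuration used in the paper.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def class_centroids(vectors, labels):
    """Mean vector per class; a multi-label text counts towards each of its classes."""
    sums, counts = {}, {}
    for vec, labs in zip(vectors, labels):
        for lab in labs:
            acc = sums.setdefault(lab, [0.0] * len(vec))
            for i, x in enumerate(vec):
                acc[i] += x
            counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}


def flag_suspect_labels(vectors, labels, threshold=0.5):
    """List (text index, label, similarity) for labels below the assumed threshold."""
    centroids = class_centroids(vectors, labels)
    suspects = []
    for idx, (vec, labs) in enumerate(zip(vectors, labels)):
        for lab in sorted(labs):
            sim = cosine_similarity(vec, centroids[lab])
            if sim < threshold:
                suspects.append((idx, lab, sim))
    return suspects


def best_matching_unit(weights, vec):
    """Grid position of the neuron whose weights are closest to vec (Euclidean)."""
    return min(weights, key=lambda pos: sum((w - x) ** 2 for w, x in zip(weights[pos], vec)))


def train_som(vectors, rows, cols, epochs=50, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood and linear decay."""
    rng = random.Random(seed)
    weights = {(r, c): list(rng.choice(vectors)) for r in range(rows) for c in range(cols)}
    start_radius = max(rows, cols) / 2.0
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs              # shrinks from 1.0 towards 0
        rate = 0.5 * decay                        # learning rate
        radius = max(start_radius * decay, 0.5)   # neighbourhood radius on the grid
        for vec in vectors:
            win_r, win_c = best_matching_unit(weights, vec)
            for (r, c), w in weights.items():
                grid_dist2 = (r - win_r) ** 2 + (c - win_c) ** 2
                influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                for i, x in enumerate(vec):
                    w[i] += rate * influence * (x - w[i])
    return weights


# Toy 2-D vectors standing in for LSA-reduced texts; class names are made up.
vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]
labels = [{"stocks"}, {"stocks"}, {"stocks", "taxes"}, {"taxes"}, {"taxes"}, {"taxes"}]

suspects = flag_suspect_labels(vectors, labels)   # text 2's "taxes" label is flagged
som = train_som(vectors, rows=1, cols=2)
placement = [best_matching_unit(som, vec) for vec in vectors]
```

In this toy data, text 2 carries the label "taxes" although its vector points in the same direction as the "stocks" texts, so that label is the only one flagged. The `placement` list gives the map neuron each text lands on; in a workflow of this kind, texts that share a neuron with a flagged text could suggest a replacement class, with the final decision left to a human reviewer, in line with the semi-automated revision the abstract describes.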

Biographies

Stefanovič Pavel

pavel.stefanovic@vilniustech.lt

P. Stefanovič received a PhD degree in computer science from the Institute of Mathematics and Informatics, Vilnius University, Lithuania, in 2015. He is currently employed as a researcher and associate professor at the Faculty of Fundamental Sciences, Vilnius Gediminas Technical University. His research interests include data mining methods, natural language pre-processing, machine learning methods, visualization of multidimensional data, and data clustering methods. He is the author of 12 publications.

Kurasova Olga

olga.kurasova@mif.vu.lt

O. Kurasova received a PhD degree in computer science from the Institute of Mathematics and Informatics, Vytautas Magnus University, Lithuania, in 2005. She is currently employed as a principal researcher and a professor at the Institute of Data Science and Digital Technologies, Vilnius University. Her research interests include data mining methods, optimization theory and applications, artificial intelligence, neural networks, visualization of multidimensional data, multiple criteria decision support, parallel computing, and image processing. She is the author of more than 80 scientific publications.

Open access article under the CC BY license.

Keywords

multi-label text data; clustering; self-organizing map; latent semantic analysis; Lithuanian language
