Informatica


Approach for Multi-Label Text Data Class Verification and Adjustment Based on Self-Organizing Map and Latent Semantic Analysis
Volume 33, Issue 1 (2022), pp. 109–130
Pavel Stefanovič   Olga Kurasova  

https://doi.org/10.15388/22-INFOR473
Pub. online: 10 January 2022      Type: Research Article      Open Access

Received: 1 June 2021
Accepted: 1 January 2022
Published: 10 January 2022

Abstract

This paper proposes a new approach for verifying and adjusting the classes of multi-label text data. The approach supports semi-automated revision of class assignments to improve data quality. Data quality significantly influences the accuracy of the resulting models, for example in classification tasks, and the approach can also be useful for other data analysis tasks. It combines a text similarity measure with two methods: latent semantic analysis and the self-organizing map. First, the text data are pre-processed by applying various filters that remove unnecessary and irrelevant information. Latent semantic analysis is then used to reduce the dimensionality of the vectors that represent each text in the analysed data. Cosine similarity is used to determine which class assignments of the multi-label text data should be changed or adjusted. The self-organizing map serves as the key method for detecting similarity between texts and deciding on a new class assignment. The experimental investigation was performed on newly collected multi-label text data: financial news in the Lithuanian language, collected from four public websites and manually classified by experts into ten classes. Various parameters of the methods were analysed, and their influence on the final results was estimated. The final results were validated by experts. The research showed that the proposed approach can help to verify and adjust multi-label text data classes: 82% of the assignments are correct when the data dimensionality is reduced to 40 using latent semantic analysis and the self-organizing map size is reduced from 40 to 5 in steps of 5.
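To make the role of cosine similarity in the verification step more concrete, here is a minimal sketch in Python. It is not the authors' procedure: it assumes each text is already represented by a low-dimensional vector (for example, after latent semantic analysis), compares every text with the mean vector of each class it is assigned to, and flags assignments whose similarity falls below a threshold. The function names, the centroid comparison, the threshold value, and the toy vectors and class names are all hypothetical, and the self-organizing map stage is omitted entirely.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def class_centroids(vectors, labels):
    """Mean vector of every class; a multi-label text contributes to each of its classes."""
    sums = {}
    counts = defaultdict(int)
    for vec, classes in zip(vectors, labels):
        for c in classes:
            if c not in sums:
                sums[c] = [0.0] * len(vec)
            sums[c] = [s + v for s, v in zip(sums[c], vec)]
            counts[c] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def flag_for_review(vectors, labels, threshold=0.5):
    """Return (text index, class) pairs whose similarity to the class centroid is below threshold.

    The threshold is an illustrative assumption, not a value taken from the paper.
    """
    centroids = class_centroids(vectors, labels)
    flagged = []
    for i, (vec, classes) in enumerate(zip(vectors, labels)):
        for c in classes:
            if cosine_similarity(vec, centroids[c]) < threshold:
                flagged.append((i, c))
    return flagged


# Toy 2-dimensional vectors standing in for dimensionality-reduced texts (made-up values).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0]]
labels = [{"stocks"}, {"stocks"}, {"banking"}, {"banking"}, {"stocks", "banking"}]

flagged = flag_for_review(vectors, labels, threshold=0.8)
print(flagged)
```

In this toy example the last text carries two classes but its vector matches only one of them, so only its weaker assignment is flagged as a candidate for expert review.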


Biographies

Stefanovič Pavel
pavel.stefanovic@vilniustech.lt

P. Stefanovič received a PhD degree in computer science from the Institute of Mathematics and Informatics, Vilnius University, Lithuania, in 2015. He is currently employed as a researcher and associate professor at the Faculty of Fundamental Sciences, Vilnius Gediminas Technical University. His research interests include data mining methods, natural language pre-processing, machine learning methods, visualization of multidimensional data, and data clustering methods. He is the author of 12 publications.

Kurasova Olga
olga.kurasova@mif.vu.lt

O. Kurasova received a PhD degree in computer science from the Institute of Mathematics and Informatics, Vytautas Magnus University, Lithuania, in 2005. She is currently employed as a principal researcher and a professor at the Institute of Data Science and Digital Technologies, Vilnius University. Her research interests include data mining methods, optimization theory and applications, artificial intelligence, neural networks, visualization of multidimensional data, multiple criteria decision support, parallel computing, and image processing. She is the author of more than 80 scientific publications.



Copyright
© 2022 Vilnius University
Open access article under the CC BY license.

Keywords
multi-label text data, clustering, self-organizing map, latent semantic analysis, Lithuanian language

