In this paper, a new approach is proposed for verifying and adjusting the classes of multi-label text data. The approach supports semi-automated revision of class assignments to improve data quality. Data quality significantly influences the accuracy of the resulting models, for example, in classification tasks, and the approach can also be useful for other data analysis tasks. The proposed approach combines a text similarity measure with two methods: latent semantic analysis and the self-organizing map. First, the text data are pre-processed by applying various filters that remove unnecessary and irrelevant information. Latent semantic analysis is used to reduce the dimensionality of the vectors that correspond to each text in the analysed data. The cosine similarity measure is used to determine which classes of the multi-label text data should be changed or adjusted. The self-organizing map is the key method for detecting similarity between texts and deciding on a new class assignment. The experimental investigation has been performed using newly collected multi-label text data. Financial news in the Lithuanian language has been collected from four public websites and manually classified by experts into ten classes. Various parameters of the methods have been analysed, and their influence on the final results has been estimated. The final results are validated by experts. The research shows that the proposed approach can help to verify and adjust multi-label text data classes. When the data dimensionality is reduced to 40 using latent semantic analysis and the self-organizing map size is reduced from 40 to 5 in steps of 5, 82% of the assignments are correct.
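The role of cosine similarity in flagging doubtful class assignments can be illustrated with a minimal sketch. It assumes that each text has already been reduced to a short vector (for example, by latent semantic analysis) and that a label is suspect when the text lies far from the mean vector of that class. The centroid comparison, the threshold of 0.5, the function names, and the toy class names are all assumptions made for illustration; the abstract does not specify how the measure is applied, and the self-organizing map step is not shown.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def class_centroids(vectors, labels):
    """Mean vector per class; a multi-label text contributes to each of its classes."""
    sums, counts = {}, {}
    for vec, doc_labels in zip(vectors, labels):
        for label in doc_labels:
            if label not in sums:
                sums[label] = [0.0] * len(vec)
                counts[label] = 0
            sums[label] = [s + x for s, x in zip(sums[label], vec)]
            counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def flag_suspect_labels(vectors, labels, threshold=0.5):
    """Return (text index, label, similarity) for assignments below the threshold.

    The threshold is a hypothetical parameter, not a value from the paper.
    """
    centroids = class_centroids(vectors, labels)
    suspects = []
    for i, (vec, doc_labels) in enumerate(zip(vectors, labels)):
        for label in doc_labels:
            sim = cosine_similarity(vec, centroids[label])
            if sim < threshold:
                suspects.append((i, label, sim))
    return suspects

# Toy 2-D vectors standing in for dimensionality-reduced texts (hypothetical data).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0]]
labels = [{"banking"}, {"banking"}, {"banking"}, {"markets"}, {"markets"}, {"banking"}]
suspects = flag_suspect_labels(vectors, labels)
for index, label, sim in suspects:
    print(f"text {index}: label '{label}' looks doubtful (similarity {sim:.2f})")
```

In this toy data the last text is labelled "banking" but points in the same direction as the "markets" texts, so it is the only assignment flagged for expert review.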
Pub. online: 1 Jan 2017. Type: Research Article. Open Access.
Volume 28, Issue 2 (2017), pp. 359–374
In recent years, the growth of marine traffic in ports and their surroundings has raised traffic and security control problems and increased the workload of traffic control operators. The automated identification system of vessel movement generates huge amounts of data that need to be analysed to make proper decisions. Thus, rapid self-learning algorithms for decision support systems have to be developed to detect abnormal vessel movement in areas of intense marine traffic. The paper presents a new self-learning adaptive classification algorithm based on the combination of a self-organizing map (SOM) and a virtual pheromone for abnormal vessel movement detection in maritime traffic. To improve the quality of the classification results, the Mexican hat function has been used as the SOM neighbourhood function. To evaluate the classification results of the proposed algorithm, an experimental investigation has been performed using a real data set provided by the Klaipėda seaport and obtained from the automated identification system. The results of the research show that the proposed algorithm provides rapid self-learning characteristics and classification.
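To make the neighbourhood function concrete, the following minimal sketch shows one SOM update step with a Mexican hat neighbourhood. It assumes the common textbook form h(d) = (1 - d²/σ²)·exp(-d²/(2σ²)); the exact formula and parameters used in the paper are not given in the abstract, the virtual pheromone component is not modelled, and the function names and toy values are hypothetical.

```python
import math

def mexican_hat(distance, sigma):
    """Mexican hat neighbourhood: positive near the winner, negative further away."""
    ratio_sq = (distance / sigma) ** 2
    return (1.0 - ratio_sq) * math.exp(-ratio_sq / 2.0)

def grid_distance(p, q):
    """Euclidean distance between two node positions on the map grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def som_update(weights, grid, x, learning_rate, sigma):
    """One SOM training step; updates weights in place and returns the winner index."""
    # The winner (best matching unit) is the node whose weight vector is closest to x.
    bmu = min(
        range(len(weights)),
        key=lambda i: sum((w - xi) ** 2 for w, xi in zip(weights[i], x)),
    )
    for i in range(len(weights)):
        h = mexican_hat(grid_distance(grid[i], grid[bmu]), sigma)
        # Positive h pulls the node towards x; negative h pushes it away.
        weights[i] = [w + learning_rate * h * (xi - w) for w, xi in zip(weights[i], x)]
    return bmu

# Three nodes on a 1 x 3 grid with one-dimensional weights (toy values).
grid = [(0, 0), (0, 1), (0, 2)]
weights = [[0.0], [0.5], [1.0]]
winner = som_update(weights, grid, x=[0.2], learning_rate=0.5, sigma=1.0)
print(winner, weights)
```

Unlike a Gaussian neighbourhood, the Mexican hat takes negative values beyond distance σ, so nodes far from the winner on the grid are moved away from the input rather than towards it.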
Pub. online: 1 Jan 2011. Type: Research Article. Open Access.
Volume 22, Issue 1 (2011), pp. 115–134
In this paper, the quality of quantization and visualization of vectors obtained by vector quantization methods (self-organizing map and neural gas) is investigated. Multidimensional scaling is used for the visualization of multidimensional vectors. The quality of quantization is measured by the quantization error. Two numerical measures of proximity preservation (König's topology preservation measure and Spearman's correlation coefficient) are applied to estimate the quality of visualization. Results of visualization (mapping images) are also presented.
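Two of the measures named above can be sketched briefly. The example assumes that the quantization error is the mean Euclidean distance from each data vector to its nearest codebook vector, and that Spearman's coefficient is computed between the pairwise distances in the original space and those in the projection. Both definitions are common conventions rather than details confirmed by the abstract, König's measure is not shown, and all function names and data are hypothetical.

```python
import math
from itertools import combinations

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def quantization_error(data, codebook):
    """Mean distance from each data vector to its nearest codebook vector."""
    return sum(min(euclidean(x, m) for m in codebook) for x in data) / len(data)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # correlation is undefined for constant input; 0.0 is a convention
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman's coefficient: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(a), average_ranks(b))

def distance_rank_correlation(original, projected):
    """Spearman's coefficient between pairwise distances before and after projection."""
    pairs = list(combinations(range(len(original)), 2))
    d_orig = [euclidean(original[i], original[j]) for i, j in pairs]
    d_proj = [euclidean(projected[i], projected[j]) for i, j in pairs]
    return spearman(d_orig, d_proj)

# Toy data: four vectors quantized by two codebook vectors.
data = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
codebook = [[0.0, 1.0], [4.0, 1.0]]
qe = quantization_error(data, codebook)

# Toy projection that preserves the order of all pairwise distances.
original = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
projected = [[0.0, 0.0], [2.0, 0.0], [6.0, 0.0]]
rho = distance_rank_correlation(original, projected)
print(qe, rho)
```

A coefficient close to 1 indicates that the projection keeps the ordering of pairwise distances, while a smaller quantization error indicates that the codebook vectors represent the data more closely.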