4.1 Text Data Analysed
To perform the experimental investigation, the newly collected data has been used (LFND,
2021). The data is collected from public Lithuanian financial news websites and stored in a database as texts. The analysed data is a set of texts
${X_{1}},{X_{2}},\dots ,{X_{N}}$, where
$N=12484$. In cooperation with a company whose main field is accounting and business management software development, with more than 30 years of experience, five experts from the financial department manually assigned all text data to 10 classes (Collective, Development, Finance, Industry, Innovation, International, Law enforcement, Pandemic, Politics, and Reliability). During class assignment, the rule was that each text could be assigned to no more than two classes. As mentioned before, the problem with manual class assignment is that every expert can interpret the text differently (the human factor). Another problem is that some texts could be assigned to more than two classes, so it is difficult to decide which classes should be the main ones. This imbalance can lead to inaccurate results in later steps, so it is important to find ways to solve this problem. For instance, consider the text “The pandemic has had many financial consequences around the world”. This sentence could be assigned to the classes “International”, “Pandemic”, and “Finance”.
Fig. 3
The token number distribution of unpre-processed text data.
The token number distribution of the unpre-processed text data is presented in Fig. 3. As we can see, the majority of texts are no longer than 54 tokens. There are 2490 texts whose number of tokens is between 54 and 105, and only 202 texts are longer than 105 tokens.
In this research, multi-label text data is analysed, so each text belongs to one or two classes. Suppose we have a text that belongs to the “Pandemic” and “Collective” classes; this text is then considered as “Pandemic” and “Collective” at the same time. The data class distribution is presented in Fig. 4 (if a data item has more than one class, it is counted in both classes). Because the data is collected from a financial news website, the majority of texts belong to the class “Finance”. The numbers of data items from the other classes are similar, except that the classes “Industry”, “Development”, and “Collective” are larger. The smallest number of data items is from the class “Politics”.
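To make the counting explicit: in a distribution of this kind, each item contributes to every class it belongs to. A minimal counting sketch with hypothetical labels:

```python
from collections import Counter

# One list of one or two class names per text (hypothetical examples).
labels = [["Finance"], ["Pandemic", "Collective"], ["Finance", "International"]]

# An item with two classes is counted in both, as in Fig. 4.
distribution = Counter(c for item in labels for c in item)
print(distribution)  # Counter({'Finance': 2, 'Pandemic': 1, 'Collective': 1, ...})
```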
There are 6025 texts that have just one class assigned, and the remaining 6459 data items are assigned to two classes. The total number of tokens over all the data is equal to 438730 before pre-processing (59148 unique tokens); after pre-processing (the filters are described in Subsection 3.4), the overall number of tokens is equal to 254615 (22730 unique tokens). The most frequent words of each data class are presented in Fig. 5, which allows us to find the words that represent each class best.
Fig. 4
Distribution of data classes.
Fig. 5
Word clouds of each class.
4.2 Experimental Research Results and Validation
As mentioned before, first of all, the text data has to be pre-processed. In our experimental investigation, we used the following pre-processing filters: numbers were removed, tokens were converted to lower case, the Lithuanian language Snowball stemming algorithm was applied (Jocas, 2020), punctuation was erased, tokens shorter than three characters were removed, and the Lithuanian language stop word list was used.
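A minimal sketch of these filters in Python follows; it assumes the snowballstemmer package ships the Lithuanian algorithm of Jocas (2020) and uses a tiny placeholder stop word list instead of the full Lithuanian one.

```python
import re
import snowballstemmer  # assumes the package includes the Lithuanian algorithm

STOP_WORDS = {"ir", "bet", "kad", "kaip", "tai"}  # placeholder; a full list was used

stemmer = snowballstemmer.stemmer("lithuanian")

def preprocess(text: str) -> list[str]:
    """Apply the pre-processing filters listed above."""
    text = text.lower()                    # convert tokens to lower case
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)   # erase punctuation
    tokens = [t for t in text.split()
              if len(t) >= 3               # drop tokens shorter than three characters
              and t not in STOP_WORDS]     # drop stop words
    return stemmer.stemWords(tokens)       # apply the Snowball stemmer
```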
An example of SOM using the analysed data is presented in Fig. 6. The Orange data mining tool has been used for the visual presentation of the SOM (Demšar et al., 2013). In this type of visualization, the circles show only the class label of the majority of data items which fall into one SOM cell, so some data items from other classes can be in the same cell as well. In this example, the selected size of the SOM is equal to $10\times 10$ (it can vary depending on the researcher's selection), but because of the u-matrix visualization, additional cells are included in the SOM, which are used to represent the distance between clusters. Darker colours mean that the distance is larger than in the light-coloured cells. As we can see, on the left side and in the top left corner of the SOM, the blue colour dominates, indicating that the majority of the data is from the class “Collective”. On the right side and in the top right corner, “Finance” class data items are placed. At the top of the SOM, the light blue circles represent the data items that belong to the “Law enforcement” class. All other class members are spread over the whole SOM, forming small clusters.
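Although Fig. 6 was produced with Orange, a u-matrix of this kind can also be computed directly, for example with the MiniSom library; the sketch below uses random stand-in data in place of the real document vectors.

```python
import numpy as np
from minisom import MiniSom  # one possible SOM implementation

# Stand-in for the real document vectors (e.g. LSA output), shape (n_docs, n_features).
docs = np.random.rand(500, 40)

som = MiniSom(10, 10, docs.shape[1],
              neighborhood_function="gaussian",  # Gaussian neighbourhood, as in Section 4.2
              random_seed=42)
som.random_weights_init(docs)   # random initial neurons
som.train_random(docs, 100)     # 100 training iterations

# u-matrix: each neuron's mean distance to its neighbours, normalized to [0, 1];
# darker (higher) values mark larger distances between clusters.
u_matrix = som.distance_map()
```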
Fig. 6
Data presented in $10\times 10$ SOM using u-matrix: a) coloured by the first class; b) coloured by the second class.
Many options can be selected in the approach, so according to our previous research (Stefanovic and Kurasova, 2014), we chose the following SOM parameters by default: the SOM size is equal to $40\times 40$ and is reduced by 5 until it is equal to $5\times 5$; the neighbouring function is Gaussian, and the learning rate is linear; the number of iterations equals 100; the initial SOM neurons are generated at random. In the class assignment part, we used the cosine similarity, and the most frequent word list of each class has 15 words. Primary research showed that words from different classes overlap when a high number of the most frequent words is selected, so the final results can be worse. To determine which dimensionality has to be selected as the output of the LSA model, research has been conducted in which the reduced dimensionality D varies from 10 to 50 in steps of 10, and the limit percent of new class assignments is $L=90\% $. The process described in Algorithm 1 has been performed, and after all the steps, each data item has been assigned to adjusted classes; a compact sketch of this loop is given below. As mentioned before, the experts have been asked to review the new data class assignments and mark one of the tags described in Table 2. All assigned tags have been counted and used to evaluate the proposed approach.
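The following is a minimal, illustrative sketch of this loop, not the exact Algorithm 1: TruncatedSVD stands in for the LSA step, MiniSom for the SOM step, and the majority rule, the helper inputs (tfidf, class_vectors, labels), and the way the limit L is applied are all assumptions made for the example.

```python
from minisom import MiniSom                       # illustrative SOM implementation
from sklearn.decomposition import TruncatedSVD    # LSA via truncated SVD
from sklearn.metrics.pairwise import cosine_similarity

D = 40    # reduced LSA dimensionality
L = 0.90  # limit for accepting a new class assignment (assumed use of L)

def adjust_classes(tfidf, class_vectors, labels):
    """tfidf: (n_docs, n_terms) matrix; class_vectors: dict mapping a class name
    to a (1, D) vector built from its 15 most frequent words (assumed projected
    into the same LSA space); labels: list of 1-2 class names per document."""
    docs = TruncatedSVD(n_components=D).fit_transform(tfidf)   # LSA projection
    for size in range(40, 4, -5):                              # 40x40 down to 5x5
        som = MiniSom(size, size, D, neighborhood_function="gaussian")
        som.random_weights_init(docs)                          # random initial neurons
        som.train_random(docs, 100)                            # 100 iterations
        cells = {}                                             # group docs by winning cell
        for i, vec in enumerate(docs):
            cells.setdefault(som.winner(vec), []).append(i)
        for members in cells.values():
            # Majority class among the cell members (illustrative rule).
            candidates = {c for i in members for c in labels[i]}
            majority = max(candidates,
                           key=lambda c: sum(c in labels[i] for i in members))
            for i in members:
                if majority not in labels[i]:
                    sim = cosine_similarity(docs[i:i + 1],
                                            class_vectors[majority])[0, 0]
                    if sim >= L:  # accept only if close enough to the class words
                        labels[i] = (labels[i][:1] + [majority])[:2]
    return labels
```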
Table 2
Class assignment tags and their descriptions.
Tag name | Tag description
Accept | The “Accept” tag is used if the new class is unambiguously assigned correctly. For example, the text primarily belongs to one class, and the approach finds that an additional class has to be assigned; if the new class assignment is correct, the expert marks it as “Accept”. Likewise, if the text is primarily assigned to two classes and the approach changes one of the classes correctly, this tag is also used.
Decline | “Decline” is marked if the approach assigns the class obviously incorrectly. For example, the text primarily belongs to the classes “Finance” and “Politics”; the proposed approach assigns the new class “Development” instead of the class “Finance”, but this class is incorrect, so the expert marks the tag “Decline”.
Possible | Suppose the text is primarily assigned to two classes, “Industry” and “Finance”. The proposed approach makes a new assignment, and as a result, the “Finance” class has been changed to the class “Innovation”. If the analysed text can have more than two classes and the “Innovation” class is correct, the tag “Possible” has to be used. In such a way, an artificial balancing of the assigned classes can be done, where the class depends not on the human point of view but is based only on the words in the text.
First of all, primary research has been performed to find out how the size of the SOM influences the number of new assignments. Thus, for simplicity, when using LSA, the dimensionality is reduced to $D=10$, and the results are presented in Fig. 7. As we can see, in the beginning, when the SOM size is $40\times 40$, the number of “Accept” assignments is equal to 35. When the SOM size is reduced, the “Accept” number also decreases. The “Decline” and “Possible” curves are quite similar. When the SOM size is $30\times 30$ and $25\times 25$, the number of “Decline” assignments is slightly larger.
Fig. 7
Class assignments reviewed by experts, $D=10$, $L=90\% $.
In this experimental investigation, we assume that if a new class assignment reviewed by the experts is tagged as “Accept” or “Possible”, it is considered a correct new class assignment, and the data is not corrupted. In such a way, the correct assignment ratio can be calculated using the simple formula (5) and expressed as a percentage:
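If we denote by ${N_{\text{Accept}}}$, ${N_{\text{Decline}}}$, and ${N_{\text{Possible}}}$ the numbers of assignments marked with the corresponding tags, then, based on this definition, formula (5) presumably takes the form
\[ \text{ratio}=\frac{{N_{\text{Accept}}}+{N_{\text{Possible}}}}{{N_{\text{Accept}}}+{N_{\text{Decline}}}+{N_{\text{Possible}}}}\cdot 100\% .\]
This reconstruction is consistent with the numbers reported below (e.g. $(619+563)/1546\approx 76.46\% $).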
Fig. 8
Class assignments reviewed by experts, $L=90\% $.
The same calculations have been performed using the approach with different reduced dimensionalities, and the overall correct assignment values are presented in Fig. 8. As we can see, using $L=90\% $, the highest number of assignments (296 assignments) is obtained when $D=40$. The highest correct assignment ratio is $89.61\% $, obtained when $D=50$, but the results are not significantly different from the case when $D=40$. The worst result, considering both the number of assignments and the correct assignment values, is when the dimensionality equals 10. According to the obtained results, we select $D=40$ in the following research because it gives the highest number of assignments and almost the highest correct assignment ratio. To find out which limit L should be chosen, an experiment has been performed. The limit L has been changed from $95\% $ to $70\% $ in steps of $5\% $. The total number of new class assignments has been calculated. In addition, a counter has been used to determine how many times the same text changed its class (once, twice, three times, or four times) over all steps of the SOM size reduction; a minimal sketch of this counter is given below. The results are presented in Fig. 9.
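The change counter mentioned above can be implemented, as a minimal sketch, with two nested counts; change_log is an assumed record of document identifiers collected each time a class changes during the reduction steps.

```python
from collections import Counter

# One entry (document id) per class change recorded over all SOM reduction steps.
change_log = [17, 42, 17, 255, 42, 17]  # assumed, for illustration

changes_per_text = Counter(change_log)               # doc id -> number of changes
times_changed = Counter(changes_per_text.values())   # {3: 1, 2: 1, 1: 1} here
print(times_changed)  # how many texts changed class once, twice, three, four times
```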
Fig. 9
Dependence of the number of new class assignments on the SOM size, $D=40$.
As we can see, when the limit is equal to $90\% $, $85\% $, and $80\% $, the number of assignments tends to decrease, so the selected limit is suitable. If the highest limit of $95\% $ is selected and the SOM size is $5\times 5$, the data is clustered too much, so the number of assignments increases. Obviously, the limits $L=75\% $ and $L=70\% $ are not suitable for the analysed data, because when the SOM size is $5\times 5$, the number of new assignments increases sharply. Moreover, over all steps of the SOM size reduction, some texts changed their class as many as four times. A high number of new assignments indicates that data items are assigned to a new class in each step of the SOM size reduction; such reassignment becomes pointless, because the class of the text is changed continuously. A deeper analysis, for the limits from $95\% $ to $80\% $, has been performed, and the results are presented in Fig. 10. The correct assignment ratio in each step has shown that when the limit is $L=95\% $, only 5 of 70 assignments were “Decline”; the other assignments were “Accept” and “Possible”. If the limit is $L=90\% $, there are 152 “Accept”, 31 “Decline”, and 113 texts tagged as “Possible”. Almost all correct assignment ratios are higher than $85\% $, and they are lower only when the SOM size is equal to $35\times 35$ and $30\times 30$. As we can see, when the limits are equal to $L=85\% $ and $L=80\% $, the correct assignment ratio is near $80\% $; only when the SOM size is equal to $5\times 5$ does the ratio decrease to $39.13\% $ and $20.22\% $, respectively.
Fig. 10
Class assignments reviewed by experts, $D=40$, where L is from $95\% $ to $80\% $.
A deeper analysis of the $5\times 5$ SOM has been performed to find out why the correct assignment ratio decreases so significantly. As we can see (Fig. 11), the highest number of “Decline” tags occurs when the approach tries to assign the “Law enforcement” class. The analysis of the texts where a class is assigned incorrectly showed that one of the common reasons influencing the results is certain specific words in the Lithuanian texts. For example, the word “research” can often be found both in law enforcement and in innovation contexts: it can indicate criminal situations and law enforcement investigations, but it can also refer to scientific research in the class “Innovation”. One possible solution is to include such words in the stop word list, but the word can still be useful in some situations.
Fig. 11
The distribution of new class assignments, when the SOM size is $5\times 5$, $D=40$.
The correct assignment ratio over all steps of the SOM size reduction is presented in Fig. 12. As we can see, with each reduction of the SOM size, the correct assignment ratio gradually decreases, but the number of assignments increases. When the limit percent $L=80\% $ is selected, the correct assignment ratio is equal to $76.46\% $, and the number of assignments is 1546, which means that the class of approximately $13\% $ of the data items has been changed. In this case, 364 of the 1546 multi-label texts have had a class assigned incorrectly; the rest of the assigned classes have been tagged as “Accept” (619) and “Possible” (563). The highest correct assignment ratio is obtained when the limit is equal to $95\% $, but then only 70 text data classes have been changed. The ratio value is high, so these class assignments are indisputably correct.
Fig. 12
The correct assignment ratio over all steps of the proposed approach, $D=40$.
The experimental investigation has shown that the optimal limit percent is equal to $85\% $, because the correct assignment ratio is more than $82\% $, with just 155 of 865 assignments being incorrect. The best way to improve the results of the proposed approach could be manually prepared keyword lists for each class of the analysed data. By selecting words that do not overlap between different classes, the new class assignment ratio could be higher; a minimal sketch of such keyword selection is given below.
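As a sketch of such keyword preparation, assuming a per-class word-frequency Counter is already available, the words shared by several classes could be dropped before taking the 15 most frequent ones:

```python
from collections import Counter

def class_keywords(freqs: dict[str, Counter], top_n: int = 15) -> dict[str, list[str]]:
    """Top frequent words per class, excluding words that occur in the
    frequency lists of more than one class (to avoid overlap)."""
    shared = {w for name, c in freqs.items() for w in c
              if any(w in other for o, other in freqs.items() if o != name)}
    return {name: [w for w, _ in c.most_common() if w not in shared][:top_n]
            for name, c in freqs.items()}
```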
4.3 Discussion
A comprehensive experimental investigation has been performed using one Lithuanian multi-label text dataset, and the usability of the proposed approach has been experimentally proved. The analysed data has been chosen for the following reasons: the data size (to test the proposed approach on a larger amount of text data); the text data is not in the English language (English is usually well supported by various methods and is structurally simpler); the data must be multi-label (one data item belongs to more than one class). An experimental investigation of the same depth using other multi-label text data has not been performed, because data with similar properties have not been found: usually, the datasets in freely accessible databases are small or artificially made. Primary research has shown that the chosen language does not significantly influence the obtained results, and the concept of the model remains the same. Therefore, the proposed approach can be used to adjust multi-label text data classes in any language. There is also a limitation on how many classes one item of the analysed data can have.
When analysing data with other properties, for example, text data that has more classes, a different language, or different text lengths, the parameters used in the proposed approach should be tuned according to the specifics of the text data. For example, an English stemming algorithm should be used to analyse English texts, and if the texts are much longer, the frequency threshold for the included words may also be set higher than three, etc. Differently pre-processed data may affect the parameters of the LSA and SOM algorithms (the reduced data dimensionality may be higher, while the SOM size reduction may start from a smaller/larger SOM size). Each newly proposed approach has its limitations and threats, but the results of the experimental study are promising and will be used in future work on classification tasks.