Journal: Informatica
Volume 25, Issue 1 (2014), pp. 95–111
Abstract
Nowadays, data mining algorithms are successfully applied to analyze real-life data and provide useful suggestions. Since some of the available real data is multi-valued and multi-labeled, researchers have in recent years focused their attention on developing approaches to mine multi-valued and multi-labeled data. Unfortunately, there are no algorithms that can discretize multi-valued and multi-labeled data to improve the performance of data mining. In this paper, we propose a novel approach to solve this problem. Our approach is based on a statistics-based discretization metric and the simulated annealing search algorithm. Experimental results show that our approach can effectively improve the performance of the state-of-the-art multi-valued and multi-labeled classification algorithm.
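As a rough illustration of the search component only (not the paper's metric or algorithm), the following Python sketch runs a simulated-annealing search over candidate cut points for one numeric attribute of multi-labeled records. The scoring function contingency_score and all data values are hypothetical stand-ins for the statistical discretization metric and the real data.

# Minimal sketch of a simulated-annealing search over candidate cut points
# for discretizing a numeric attribute of multi-labeled records.
# `contingency_score` is a hypothetical stand-in for a statistical metric.
import math
import random

def contingency_score(values, labels, cuts):
    """Toy score: rewards cut points whose intervals are label-homogeneous."""
    bins = {}
    for v, labs in zip(values, labels):
        b = sum(v > c for c in cuts)          # index of the interval v falls into
        bins.setdefault(b, []).append(frozenset(labs))
    # Higher (less negative) score when each interval holds few distinct label sets
    return -sum(len(set(groups)) for groups in bins.values())

def anneal_cuts(values, labels, n_cuts=3, t0=1.0, cooling=0.95, steps=500):
    lo, hi = min(values), max(values)
    cuts = sorted(random.uniform(lo, hi) for _ in range(n_cuts))
    best, best_score = cuts[:], contingency_score(values, labels, cuts)
    score, t = best_score, t0
    for _ in range(steps):
        cand = cuts[:]
        i = random.randrange(n_cuts)
        cand[i] = min(hi, max(lo, cand[i] + random.gauss(0, (hi - lo) * 0.05)))
        cand.sort()
        cand_score = contingency_score(values, labels, cand)
        # Accept improvements always, worse moves with Boltzmann probability
        if cand_score >= score or random.random() < math.exp((cand_score - score) / t):
            cuts, score = cand, cand_score
            if score > best_score:
                best, best_score = cuts[:], score
        t *= cooling
    return best

# Illustrative multi-labeled attribute: each record carries a set of labels
values = [1.2, 1.5, 2.8, 3.1, 4.9, 5.2, 6.0]
labels = [{"a"}, {"a"}, {"a", "b"}, {"b"}, {"b"}, {"c"}, {"c"}]
print(anneal_cuts(values, labels, n_cuts=2))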
Journal: Informatica
Volume 23, Issue 4 (2012), pp. 521–536
Abstract
In supervised learning, the relationship between the available data and the performance (what is learnt) is not well understood. How much data to use, or when to stop the learning process, are the key questions.
In this paper, we present an approach for an early assessment of the extracted knowledge (classification models) in terms of performance (accuracy). The key questions are answered by detecting the point of convergence, i.e., the point where the classification model's performance no longer improves even when more data items are added to the learning set. As the termination criterion for the learning process, we developed a set of equations for detecting the convergence that follow the basic principles of the learning curve. The developed solution was evaluated on real datasets. The results of the experiment show that the solution is well designed: the stopping criteria for the learning process are not subject to local variance, and the convergence is detected where it has actually occurred.
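The sketch below illustrates, under illustrative assumptions only (a sliding-window test with made-up accuracy values, not the paper's equations), the kind of convergence detection described above: termination is declared when the average gain over the last few measurements drops below a threshold, so a single noisy measurement does not stop or prolong the learning process.

# Minimal sketch of detecting the convergence point of a learning curve:
# stop when the average improvement over a sliding window drops below a
# threshold. The window smooths out local variance. The accuracy values
# below are illustrative, not results from the paper.
def convergence_point(accuracies, window=3, min_gain=0.002):
    """Return the index at which the curve is considered converged, or None."""
    for i in range(window, len(accuracies)):
        # Average per-step gain across the last `window` measurements
        gain = (accuracies[i] - accuracies[i - window]) / window
        if gain < min_gain:
            return i
    return None

# Accuracy measured after training on 10%, 20%, ... of the data (illustrative)
curve = [0.62, 0.71, 0.76, 0.79, 0.805, 0.812, 0.814, 0.815, 0.815]
idx = convergence_point(curve)
print(f"convergence detected at measurement {idx}, accuracy {curve[idx]:.3f}")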
Journal: Informatica
Volume 20, Issue 1 (2009), pp. 35–50
Abstract
We tested the ability of humans and machines (data mining techniques) to assign stress to Slovene words. This is a challenging comparison for machines, since humans accomplish the task outstandingly well even on unknown words and without any context. We set out to find good machine-built models for stress assignment by applying new methods and by making use of a known theory of rules for stress assignment in Slovene. The upgraded data mining methods outperformed expert-defined rules on practically all subtasks, showing that data mining can more than compete with humans where constructing formal knowledge about stress assignment is concerned. Compared with humans directly, however, the data mining methods still failed to achieve results as good as those of humans when assigning stress to unknown words.
Journal: Informatica
Volume 18, Issue 3 (2007), pp. 343–362
Abstract
One of the tasks of data mining is classification, which provides a mapping from attributes (observations) to pre-specified classes. Classification models are built from underlying data. In principle, models built with more data yield better results. However, the relationship between the available data and the performance is not well understood, except that the accuracy of a classification model shows diminishing improvements as a function of data size. In this paper, we present an approach for an early assessment of the extracted knowledge (classification models) in terms of performance (accuracy), based on the amount of data used. The assessment is based on observing the performance on smaller sample sizes. The solution is formally defined and used in an experiment. The experiments show the correctness and utility of the approach.
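A minimal sketch of the underlying idea, under illustrative assumptions that need not match the paper's formulation: accuracies observed on small samples are fitted with an inverse power-law learning curve (via a simple grid search over the exponent plus least squares), which is then extrapolated to the full data size. All numbers and the fitting scheme are assumptions for the example.

# Minimal sketch of early performance assessment: fit acc(n) = a - b * n**(-c)
# to accuracies measured on small samples, then extrapolate to a larger data
# size. The fitting scheme and numbers are illustrative assumptions.
import numpy as np

def fit_learning_curve(sizes, accs):
    sizes, accs = np.asarray(sizes, float), np.asarray(accs, float)
    best = None
    for c in np.linspace(0.1, 2.0, 40):          # grid over the decay exponent
        X = np.column_stack([np.ones_like(sizes), -sizes ** (-c)])
        coef, *_ = np.linalg.lstsq(X, accs, rcond=None)   # least squares for a, b
        err = np.sum((X @ coef - accs) ** 2)
        if best is None or err < best[0]:
            best = (err, coef[0], coef[1], c)
    _, a, b, c = best
    return lambda n: a - b * n ** (-c)

# Accuracies observed on small training samples (illustrative)
sizes = [100, 200, 400, 800, 1600]
accs  = [0.71, 0.76, 0.80, 0.83, 0.85]
curve = fit_learning_curve(sizes, accs)
print(f"predicted accuracy at 50000 examples: {curve(50000):.3f}")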
Journal: Informatica
Volume 11, Issue 2 (2000), pp. 115–124
Abstract
The influence of projection pursuit on classification errors and on sample-based estimates of a posteriori probabilities is considered. The observed random variable is assumed to follow a multidimensional Gaussian mixture model. The presented computer simulation results show that, for comparatively small sample sizes, classification using the projection pursuit algorithm yields more accurate estimates of a posteriori probabilities and a lower classification error.
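The following sketch illustrates the general idea only, not the paper's algorithm: a one-dimensional projection of the sample is chosen by maximizing a simple separation index (a Fisher-style ratio stands in for the projection index), and a posteriori probabilities are then estimated from the projected sample. The mixture parameters, random search, and two-class setting are assumptions for the example.

# Minimal sketch of projection pursuit for classification: search for a 1-D
# projection maximizing a separation index (Fisher-style ratio as a stand-in),
# then estimate a posteriori probabilities from the projected sample using
# 1-D Gaussian class-conditional densities. Data and index are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def fisher_index(w, X, y):
    """Between-class over within-class variance of the projected sample."""
    z = X @ w
    z0, z1 = z[y == 0], z[y == 1]
    return (z0.mean() - z1.mean()) ** 2 / (z0.var() + z1.var() + 1e-12)

def pursue_projection(X, y, trials=2000):
    """Random search over unit directions; keep the one with the best index."""
    best_w, best_val = None, -np.inf
    for _ in range(trials):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)
        val = fisher_index(w, X, y)
        if val > best_val:
            best_w, best_val = w, val
    return best_w

def posterior_1d(z_train, y, z_new, prior1=0.5):
    """Posterior P(class 1 | z) from 1-D Gaussian class-conditional densities."""
    def pdf(z, mu, sd):
        return np.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    p0 = pdf(z_new, z_train[y == 0].mean(), z_train[y == 0].std())
    p1 = pdf(z_new, z_train[y == 1].mean(), z_train[y == 1].std())
    return prior1 * p1 / (prior1 * p1 + (1 - prior1) * p0)

# Small sample from a two-component Gaussian mixture in 5 dimensions (illustrative)
n, d = 40, 5
X = np.vstack([rng.normal(0.0, 1.0, (n, d)), rng.normal(0.8, 1.0, (n, d))])
y = np.array([0] * n + [1] * n)
w = pursue_projection(X, y)
print(posterior_1d(X @ w, y, X[:3] @ w))   # posteriors for three training points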