Pub. online: 5 Aug 2022 | Type: Research Article | Open Access
Journal:Informatica
Volume 16, Issue 1 (2005), pp. 61–74
Abstract
This paper discusses a soft sample clustering problem for multivariate independent random data satisfying a mixture model of Gaussian distributions. The theory recommends estimating the model parameters by the maximum likelihood method and using the "plug-in" approach for data clustering. Unfortunately, the problem of computing the maximum likelihood estimate is not completely solved in the multivariate case. This work proposes a new constructive multi-stage procedure to solve this task. The procedure includes statistical distribution analysis of a large number of univariate projections of the observations, geometric clustering of the multivariate sample, and application of the EM algorithm. The accuracy of the proposed methods is analysed by means of Monte Carlo simulation.
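The plug-in step can be illustrated with a short sketch, assuming scikit-learn's GaussianMixture as the maximum likelihood (EM) estimator and synthetic placeholder data; the authors' multi-stage initialisation (projection analysis and geometric clustering) is not shown.

```python
# Minimal sketch: fit a Gaussian mixture by maximum likelihood (EM) and
# "plug in" the estimated parameters to assign each observation to the
# component with the highest posterior probability.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic three-dimensional data from two Gaussian components (placeholder).
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 3)),
    rng.normal(loc=4.0, scale=0.7, size=(150, 3)),
])

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      n_init=5, random_state=0).fit(X)   # EM estimation
posteriors = gmm.predict_proba(X)   # soft ("plug-in") cluster memberships
labels = gmm.predict(X)             # hard assignment by the largest posterior
print(labels[:10])
```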
Pub. online: 10 Jan 2022 | Type: Research Article | Open Access
Journal:Informatica
Volume 33, Issue 1 (2022), pp. 109–130
Abstract
In this paper, a new approach is proposed for the verification and adjustment of multi-label text data classes. The approach helps to make semi-automated revisions of class assignments in order to improve data quality. Data quality significantly influences the accuracy of the created models, for example, in classification tasks, and it can also be important for other data analysis tasks. The proposed approach is based on a combination of a text similarity measure and two methods: latent semantic analysis and the self-organizing map. First, the text data are pre-processed by applying various filters to clean the data of unnecessary and irrelevant information. Latent semantic analysis is used to reduce the dimensionality of the vectors that correspond to each text in the analysed data. Cosine similarity is used to determine which of the multi-label text data classes should be changed or adjusted. The self-organizing map is selected as the key method to detect similarity between texts and to make decisions about new class assignments. The experimental investigation has been performed using newly collected multi-label text data: financial news in the Lithuanian language collected from four public websites and classified manually by experts into ten classes. Various parameters of the methods have been analysed, and their influence on the final results has been estimated. The final results are validated by experts. The research showed that the proposed approach can help to verify and adjust multi-label text data classes: 82% of the assignments are correct when the data dimensionality is reduced to 40 using latent semantic analysis and the self-organizing map size is reduced from 40 to 5 in steps of 5.
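As a rough sketch of the dimensionality-reduction and similarity steps, the snippet below combines TF-IDF vectors, truncated SVD (latent semantic analysis) and cosine similarity using scikit-learn; the texts are hypothetical placeholders and the SOM-based class-adjustment step is omitted.

```python
# Sketch: LSA dimensionality reduction followed by pairwise cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

texts = ["bank raises interest rates",        # placeholder documents, not the
         "central bank policy update",        # Lithuanian financial news corpus
         "stock market falls sharply"]
tfidf = TfidfVectorizer().fit_transform(texts)
lsa = TruncatedSVD(n_components=2, random_state=0)   # 40 dimensions in the paper
vectors = lsa.fit_transform(tfidf)
similarity = cosine_similarity(vectors)              # text-to-text similarity
print(similarity.round(2))
```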
Pub. online: 1 Jan 2018 | Type: Research Article | Open Access
Journal:Informatica
Volume 29, Issue 4 (2018), pp. 633–650
Abstract
In recent years, Wireless Sensor Networks (WSNs) have received great attention because of their important applications in many areas. Consequently, improving their performance and efficiency, especially with respect to energy awareness, is of great interest. In this paper, we propose a lifetime-improving, fixed-clustering, energy-aware routing protocol for WSNs named Load Balancing Cluster Head (LBCH). LBCH mainly aims at reducing the energy consumption in the network and balancing the workload over all nodes. A novel method for selecting the initial cluster heads (CHs) is proposed. In addition, the network nodes are evenly distributed into clusters to build clusters of balanced size. Finally, a novel scheme is proposed to rotate the role of CH depending on the energy and location information of each node in each cluster. A multihop technique is used to minimize the communication distance between CHs and the base station (BS), thus saving node energy. To evaluate the performance of LBCH, a thorough simulation has been conducted and the results are compared with related protocols (ACBEC-WSNs-CD, Adaptive LEACH-F, LEACH-F, and RRCH). The simulations show that LBCH outperforms the compared protocols for both continuous-data and event-based data models at different network densities. LBCH achieves average improvements of 2–172%, 18–145.5%, 10.18–62%, and 63–82.5% over the compared protocols in terms of the number of alive nodes, first node died (FND), network throughput, and load balancing, respectively.
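The energy- and location-aware rotation of the cluster-head role can be illustrated by a toy scoring rule, sketched below; the score combining residual energy and distance to the cluster centre is a hypothetical simplification for illustration, not the LBCH selection scheme itself.

```python
# Toy sketch: pick the next cluster head as a node with high residual
# energy that lies close to the cluster centre.
import numpy as np

rng = np.random.default_rng(1)
positions = rng.uniform(0, 100, size=(10, 2))   # node coordinates within a cluster
energy = rng.uniform(0.2, 1.0, size=10)         # residual energy per node

centre = positions.mean(axis=0)
dist = np.linalg.norm(positions - centre, axis=1)

# Hypothetical score: favour energy-rich nodes near the cluster centre.
score = energy / (1.0 + dist / dist.max())
cluster_head = int(np.argmax(score))
print("next cluster head:", cluster_head)
```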
Pub. online: 1 Jan 2017 | Type: Research Article | Open Access
Journal:Informatica
Volume 28, Issue 1 (2017), pp. 105–130
Abstract
Analysing massive amounts of data and extracting value from it has become key across different disciplines. As the amounts of data grow rapidly, current approaches for data analysis are no longer efficient. This is particularly true for clustering algorithms where distance calculations between pairs of points dominate overall time: the more data points are in the dataset, the bigger the share of time needed for distance calculations.
Crucially, however, the data analysis and clustering process is rarely straightforward: parameters first need to be determined and tuned. Entirely accurate results are thus rarely needed, and we can instead sacrifice a little precision in the final result to accelerate the computation. In this paper we develop ADvaNCE, a new approach based on approximating DBSCAN. More specifically, we propose two measures to reduce the distance-calculation overhead and thus approximate DBSCAN: (1) locality sensitive hashing to approximate and speed up distance calculations and (2) representative point selection to reduce the number of distance calculations.
The experiments show that the resulting clustering algorithm is more scalable than the state of the art as the datasets become bigger. Compared with the most recent approximation technique for DBSCAN, our approach is in general one order of magnitude faster (at most 30× in our experiments) as the size of the datasets increases.
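A minimal sketch of measure (1) is given below: points are bucketed with Euclidean-style locality sensitive hashing (random projections quantized into bins), and candidate eps-neighbours are then searched only within a point's bucket. This is an illustrative simplification, not the ADvaNCE implementation.

```python
# Sketch: LSH bucketing so DBSCAN-style eps-queries scan a bucket, not the
# whole dataset.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
eps = 0.3

# Hash each point by quantized random projections (E2LSH-style).
w = 4 * eps
a = rng.normal(size=(2, 3))
b = rng.uniform(0, w, size=3)
keys = np.floor((X @ a + b) / w).astype(int)

buckets = defaultdict(list)
for i, key in enumerate(map(tuple, keys)):
    buckets[key].append(i)

def approximate_neighbours(i):
    """Candidate eps-neighbours of point i, restricted to its LSH bucket."""
    candidates = buckets[tuple(keys[i])]
    d = np.linalg.norm(X[candidates] - X[i], axis=1)
    return [c for c, dc in zip(candidates, d) if dc <= eps and c != i]

print(len(approximate_neighbours(0)))
```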
Journal:Informatica
Volume 25, Issue 4 (2014), pp. 563–580
Abstract
Clustering is one of the best-known unsupervised learning methods, with the aim of discovering structure in the data. This paper presents a distance-based Sweep-Hyperplane Clustering Algorithm (SHCA), which uses sweep-hyperplanes to quickly locate each point's approximate nearest neighbourhood. Furthermore, a new distance-based dynamic model, based on -tree hierarchical space partitioning, extends SHCA's capability to find clusters that are not well separated and have arbitrary shape and density. Experimental results on different large and noisy synthetic and real multidimensional datasets demonstrate the effectiveness of the proposed algorithm.
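The sweep idea can be illustrated roughly as follows, under the simplifying assumption that a single projection direction is swept and each point's approximate neighbourhood is taken from a window of adjacent points in the sweep order; this is only a sketch of the general principle, not the SHCA algorithm or its hierarchical space partitioning.

```python
# Sketch: order points along a hyperplane normal and use nearby positions in
# that ordering as an approximate neighbourhood.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))

direction = rng.normal(size=4)
direction /= np.linalg.norm(direction)
order = np.argsort(X @ direction)          # sweep order along the direction
rank = np.empty(len(X), dtype=int)
rank[order] = np.arange(len(X))            # position of each point in the sweep

def approximate_neighbourhood(i, window=10):
    """Points adjacent to point i in the sweep order (approximate neighbours)."""
    lo = max(0, rank[i] - window)
    hi = min(len(X), rank[i] + window + 1)
    return order[lo:hi]

print(approximate_neighbourhood(0))
```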
Journal:Informatica
Volume 25, Issue 2 (2014), pp. 265–282
Abstract
The aim of this study is to predict the energy generated by a solar thermal system. To achieve this, a hybrid intelligent system was developed based on local regression models with low complexity and high accuracy. The input data are divided into clusters by using a self-organizing map; a local model is then created for each cluster. Different regression techniques were tested and the best one was chosen. The novel hybrid regression system based on local models is empirically verified on a real dataset obtained from the solar thermal system of a bioclimatic house.
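The local-model idea can be sketched as below, using k-means in place of the self-organizing map purely to keep the example short, and synthetic placeholder data rather than the bioclimatic-house measurements; one regression model is fitted per cluster and each new sample is routed to its cluster's model.

```python
# Sketch: cluster the inputs, fit a local regression model per cluster,
# and predict with the model of the cluster a new sample falls into.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))                     # placeholder input features
y = 2.0 * X[:, 0] + np.sin(6 * X[:, 1]) + 0.1 * rng.normal(size=300)

clusterer = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
local_models = {
    c: LinearRegression().fit(X[clusterer.labels_ == c], y[clusterer.labels_ == c])
    for c in range(4)
}

def predict(x_new):
    """Route the sample to its cluster and use that cluster's local model."""
    c = int(clusterer.predict(x_new.reshape(1, -1))[0])
    return float(local_models[c].predict(x_new.reshape(1, -1))[0])

print(predict(np.array([0.5, 0.2, 0.7])))
```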
Journal:Informatica
Volume 22, Issue 1 (2011), pp. 1–10
Abstract
Estimation and modelling problems as they arise in many data analysis areas often turn out to be unstable and/or intractable by standard numerical methods. Such problems frequently occur in fitting large data sets to a certain model and in predictive learning. Heuristics are general recommendations based on practical statistical evidence, in contrast to a fixed set of rules that cannot vary but are guaranteed to give the correct answer. Although the use of such methods has become more standard in several fields of science, their use for estimation and modelling in statistics still appears to be limited. This paper surveys a set of problem-solving strategies, guided by heuristic information, that are expected to be used more frequently. The use of recent advances in different fields of large-scale data analysis is promoted, focusing on applications in medicine, biology and technology.
Journal:Informatica
Volume 21, Issue 3 (2010), pp. 455–470
Abstract
In this article, a method is proposed for analysing thermovision-based video data that characterize the dynamics of the temperature anisotropy of heart tissue in the spatial domain. At present, many cardiac rhythm disturbances are treated by applying destructive energy sources, the most common being the radio-frequency ablation procedure. However, the risk of complications, including arrhythmia recurrence, remains rather high. The drawback of the methodology used is that such a destruction procedure cannot be monitored in the visible spectrum, which makes it impossible to control the ablation efficiency. To understand the nature of possible complications and to control the treatment process, thermovision can be used. The aim of the study was to analyse possible mechanisms of these complications and to measure and determine optimal radio-frequency ablation parameters according to the analysis of video data acquired using thermovision.
Journal:Informatica
Volume 20, Issue 2 (2009), pp. 187–202
Abstract
In this paper, a method for the study of cluster stability is proposed. We draw pairs of samples from the data according to two sampling distributions. The first distribution corresponds to the high-density zones of the data-element distribution and is thus associated with the cluster cores. The second, associated with the cluster margins, is related to the low-density zones. The samples are clustered and the two obtained partitions are compared. The partitions are considered consistent if the obtained clusters are similar. The resemblance is measured by the total number of edges in the clusters' minimal spanning trees that connect points from different samples, using the Friedman and Rafsky two-sample test statistic. Under the homogeneity hypothesis, this statistic is normally distributed. Thus, it can be expected that the true number of clusters corresponds to the empirical distribution of the statistic that is closest to normal. Numerical experiments demonstrate the ability of the approach to detect the true number of clusters.
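The resemblance measure can be sketched as follows, assuming SciPy's minimum spanning tree over the pooled pair of samples; the density-based drawing of the two samples and the clustering step are not shown.

```python
# Sketch: count MST edges that connect points from the two different samples
# (the cross-sample edge count underlying the Friedman-Rafsky statistic).
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.0, size=(50, 2))   # e.g. the "core" sample
sample_b = rng.normal(loc=0.5, size=(50, 2))   # e.g. the "margin" sample

pooled = np.vstack([sample_a, sample_b])
labels = np.array([0] * len(sample_a) + [1] * len(sample_b))

mst = minimum_spanning_tree(cdist(pooled, pooled)).tocoo()
cross_edges = int(np.sum(labels[mst.row] != labels[mst.col]))
print("edges connecting the two samples:", cross_edges)
```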
Journal:Informatica
Volume 19, Issue 3 (2008), pp. 377–390
Abstract
We investigate the applicability of quantitative methods to discovering the most fundamental structural properties of the most reliable political data in Lithuania, namely, the voting data of the Lithuanian Parliament. The two most widely used techniques of structural data analysis, clustering and multidimensional scaling, are compared. We draw some technical conclusions which can serve as recommendations for a more purposeful application of these methods.
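A rough sketch of applying the two compared techniques to a member-by-vote matrix is shown below; the vote matrix is random placeholder data, not the Lithuanian Parliament records, and scikit-learn's agglomerative clustering and MDS stand in for whichever specific variants the paper compares.

```python
# Sketch: hierarchical clustering and multidimensional scaling of a 0/1
# member-by-roll-call voting matrix.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
votes = rng.integers(0, 2, size=(60, 40)).astype(float)   # 60 members, 40 votes

clusters = AgglomerativeClustering(n_clusters=3).fit_predict(votes)
embedding = MDS(n_components=2, random_state=0).fit_transform(votes)

print(clusters[:10])      # cluster label per member
print(embedding[:3])      # 2-D coordinates for plotting
```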