Pub. online: 5 Aug 2022
Type: Research Article
Open Access
Journal:Informatica
Volume 16, Issue 1 (2005), pp. 61–74
Abstract
This paper discusses the soft clustering of a sample of multivariate independent random data that follow a Gaussian mixture model. Theory recommends estimating the model parameters by the maximum likelihood method and using a “plug-in” approach for data clustering. Unfortunately, the problem of computing the maximum likelihood estimate is not completely solved in the multivariate case. This work proposes a new constructive multi-stage procedure for this task. The procedure includes statistical analysis of the distributions of a large number of univariate projections of the observations, geometric clustering of the multivariate sample, and application of the EM algorithm. The accuracy of the proposed methods is analysed by means of Monte Carlo simulation.
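The EM step and the “plug-in” soft clustering mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes univariate data, a fixed number of components and user-supplied starting values, and it omits the projection analysis and geometric clustering stages. The function names `normal_pdf`, `em_gaussian_mixture` and `soft_cluster` are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def soft_cluster(x, means, variances, weights):
    """Plug-in soft clustering: posterior probability of each component given x."""
    dens = [w * normal_pdf(x, m, v) for m, v, w in zip(means, variances, weights)]
    total = sum(dens)
    return [d / total for d in dens]


def em_gaussian_mixture(data, means, variances, weights, n_iter=100):
    """Run EM for a univariate Gaussian mixture from the given starting values.

    Illustrative simplification (assumption): univariate data, fixed number of
    components, fixed number of iterations instead of a convergence check.
    """
    means, variances, weights = list(means), list(variances), list(weights)
    n = len(data)
    for _ in range(n_iter):
        # E-step: responsibilities (posterior component probabilities).
        resp = [soft_cluster(x, means, variances, weights) for x in data]
        # M-step: weighted updates of the weights, means and variances.
        for j in range(len(means)):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # guard against a collapsing component
    return means, variances, weights
```

Once the parameters are estimated, `soft_cluster(x, means, variances, weights)` returns the estimated posterior probabilities for an observation `x`, which is the plug-in rule with estimates substituted for the unknown parameters. The result of EM depends on the starting values, which is why the choice of a good starting point matters in the multivariate case.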
Journal:Informatica
Volume 24, Issue 3 (2013), pp. 447–460
Abstract
The paper is devoted to goodness-of-fit tests based on probability density estimates generated by kernel functions. The test statistic is taken to be the maximum of the normalized deviation of the estimate from its expected value or from a hypothesized distribution density function. A comparative Monte Carlo power study of the investigated criterion is provided. Simulation results show that the proposed test is a powerful competitor to the existing classical goodness-of-fit criteria against a specific type of alternative hypothesis. An analytical approach to establishing the asymptotic distribution of the test statistic is proposed, using the theory of high excursions of close-to-Gaussian random processes and fields introduced by Rudzkis (1992, 2012).
Journal:Informatica
Volume 21, Issue 4 (2010), pp. 471–486
Abstract
The problem of automatic classification of scientific texts is considered. Methods based on statistical analysis of the probability distributions of scientific terms in texts are discussed. Procedures for selecting the most informative terms and a method for using auxiliary information about the terms' positions are presented. The results of an experimental evaluation of the proposed algorithms and procedures on real-world data are reported.
Journal:Informatica
Volume 9, Issue 4 (1998), pp. 479–490
Abstract
This article gives ideas for developing statistical software that can work without user intervention. Some popular methods of bandwidth selection for kernel density estimation (nearest neighbour, least squares cross-validation, and the “plug-in” technique) are discussed. Modifications of the cross-validation criterion are proposed. Two-stage estimators combining these methods with multiplicative bias correction are investigated by simulation.
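As an illustration of one of the methods named in the abstract, the following is a minimal sketch of the standard least squares cross-validation criterion for a Gaussian kernel. It does not reproduce the modifications or the two-stage estimators proposed in the article. The function names and the grid search over candidate bandwidths are assumptions made for the example. For a Gaussian kernel the criterion has the closed form

LSCV(h) = (1/n^2) Σ_i Σ_j φ_{√2·h}(X_i − X_j) − (2/(n(n−1))) Σ_{i≠j} φ_h(X_i − X_j),

where φ_s is the density of N(0, s^2).

```python
import math


def gauss(u, s):
    """Density of N(0, s^2) evaluated at u."""
    return math.exp(-u * u / (2.0 * s * s)) / (s * math.sqrt(2.0 * math.pi))


def lscv_score(data, h):
    """Least squares cross-validation criterion for a Gaussian kernel estimate."""
    n = len(data)
    sq_term = 0.0   # double sum giving the integral of the squared estimate
    loo_term = 0.0  # double sum for the leave-one-out estimates at the data points
    for i in range(n):
        for j in range(n):
            d = data[i] - data[j]
            sq_term += gauss(d, math.sqrt(2.0) * h)
            if i != j:
                loo_term += gauss(d, h)
    return sq_term / n ** 2 - 2.0 * loo_term / (n * (n - 1))


def select_bandwidth(data, candidates):
    """Pick the candidate bandwidth with the smallest criterion value (grid search)."""
    return min(candidates, key=lambda h: lscv_score(data, h))


def kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    return sum(gauss(x - xi, h) for xi in data) / len(data)
```

A call such as `select_bandwidth(sample, [0.05 * k for k in range(1, 31)])` returns the bandwidth from the grid that minimises the criterion, with no user tuning beyond the choice of grid. The double loop costs O(n^2) per candidate, which is acceptable for a sketch but slow for large samples.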