Pub. online: 5 Aug 2022 Type: Research Article Open Access
Journal:Informatica
Volume 16, Issue 1 (2005), pp. 61–74
Abstract
This paper discusses a soft clustering problem for a sample of multivariate independent random data that follow a Gaussian mixture model. Theory recommends estimating the model parameters by the maximum likelihood method and using the “plug-in” approach for data clustering. Unfortunately, the problem of computing the maximum likelihood estimate is not completely solved in the multivariate case. This work proposes a new constructive multi-stage procedure for this task. The procedure includes statistical analysis of the distributions of a large number of univariate projections of the observations, geometric clustering of the multivariate sample, and application of the EM algorithm. The accuracy of the proposed methods is analysed by means of Monte Carlo simulation.
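To make the “plug-in” idea concrete, the following minimal sketch fits a two-component univariate Gaussian mixture with a plain EM iteration and then uses the fitted parameters to compute soft cluster memberships (posterior probabilities). It is an illustration only, not the multi-stage procedure proposed in the paper: the univariate setting, the synthetic data, the initialisation at the sample minimum and maximum, and all function names are assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posteriors(x, weights, means, variances):
    """Plug-in soft clustering: P(component k | x) under the given mixture."""
    joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def em_gaussian_mixture(data, init_means, n_iter=200):
    """Plain EM for a univariate Gaussian mixture (illustrative, no safeguards
    beyond a small variance floor)."""
    k, n = len(init_means), len(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    weights = [1.0 / k] * k
    means = list(init_means)
    variances = [grand_var] * k
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each observation.
        resp = [posteriors(x, weights, means, variances) for x in data]
        # M-step: weighted maximum likelihood updates.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)
    return weights, means, variances


# Synthetic data (an assumption for the example): two well-separated components.
rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]

weights, means, variances = em_gaussian_mixture(data, [min(data), max(data)])
soft_labels = posteriors(-2.0, weights, means, variances)
print(weights, means, variances, soft_labels)
```

In the multivariate case the same E-step and M-step apply with mean vectors and covariance matrices; the difficulty the abstract refers to is that EM needs good starting values, which is what the earlier stages of the proposed procedure are meant to supply.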
Pub. online: 1 Jan 2018 Type: Research Article Open Access
Journal:Informatica
Volume 29, Issue 4 (2018), pp. 675–692
Abstract
The main purpose of this article was to compare traditional binary logistic regression analysis with decision tree analysis for evaluating the risk of cardiovascular diseases in adult men living in the city. Patients and methods. In our study, we used data from the Multifactorial Ischemic Heart Disease Prevention Study (MIHDPS). In the MIHDPS, a random sample of male inhabitants of Kaunas city (Lithuania) aged 40–59 years was examined between 1977 and 1980. We analysed a sample of 5626 men. Taking blood-pressure-lowering medicine, disability, intermittent claudication, regular smoking, a higher body mass index, systolic blood pressure, age, total serum cholesterol, and walking in winter were associated with a higher probability of ischemic heart disease or cardiovascular diseases. Having more siblings and drinking alcohol were associated with a lower probability of these diseases. The binary logistic regression method showed a very slightly lower level of errors than the decision tree did (the difference between the two methods was 2.04% for ischemic heart disease (IHD) and 2.86% for cardiovascular disease (CVD)), but for consumers the results of the decision tree are easier to understand and interpret. Both methods are appropriate for analysing cardiovascular disease data.
Journal:Informatica
Volume 9, Issue 4 (1998), pp. 479–490
Abstract
This article gives ideas for developing statistical software that can work without user intervention. Some popular methods of bandwidth selection for kernel density estimation (the nearest neighbour, least squares cross-validation, and “plug-in” techniques) are discussed. Modifications of the cross-validation criterion are proposed. Two-stage estimators combining these methods with multiplicative bias correction are investigated by means of simulation.
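As an illustration of automatic bandwidth selection, the sketch below implements the standard least squares cross-validation (LSCV) criterion for a Gaussian kernel and picks the bandwidth that minimises it over a grid. It shows the textbook criterion only, not the modified criteria or the two-stage estimators studied in the article; the Gaussian kernel, the grid, the synthetic sample, and the function names are assumptions made for this example.

```python
import math
import random


def gauss(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def gauss_conv(u):
    """Convolution of the Gaussian kernel with itself: the N(0, 2) density."""
    return math.exp(-0.25 * u * u) / (2.0 * math.sqrt(math.pi))


def lscv_score(data, h):
    """LSCV(h) = integral of the squared estimate
    minus (2/n) * sum of leave-one-out estimates at the data points."""
    n = len(data)
    sq_term = 0.0   # accumulates (K*K)((x_i - x_j) / h) over all pairs
    loo_term = 0.0  # accumulates K((x_i - x_j) / h) over pairs with i != j
    for i in range(n):
        for j in range(n):
            u = (data[i] - data[j]) / h
            sq_term += gauss_conv(u)
            if i != j:
                loo_term += gauss(u)
    return sq_term / (n * n * h) - 2.0 * loo_term / (n * (n - 1) * h)


def select_bandwidth(data, grid):
    """Return the grid value with the smallest LSCV score."""
    return min(grid, key=lambda h: lscv_score(data, h))


# Synthetic sample and bandwidth grid (assumptions for the example).
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
grid = [0.05 * k for k in range(1, 31)]

h_best = select_bandwidth(sample, grid)
print(h_best)
```

A grid search is used here because the LSCV criterion can have several local minima, so a local optimiser may stop at a poor bandwidth; the cost is quadratic in the sample size for every grid point.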