Journal:Informatica
Volume 22, Issue 1 (2011), pp. 1–10
Abstract
Estimation and modelling problems, as they arise in many data analysis areas, often turn out to be unstable and/or intractable by standard numerical methods. Such problems frequently occur in the fitting of large data sets to a certain model and in predictive learning. Heuristics are general recommendations based on practical statistical evidence, in contrast to a fixed set of rules that cannot vary but are guaranteed to give the correct answer. Although the use of these methods has become more standard in several fields of science, their use for estimation and modelling in statistics still appears to be limited. This paper surveys a set of problem-solving strategies, guided by heuristic information, that are expected to be used more frequently. The use of recent advances in different fields of large-scale data analysis is promoted, focusing on applications in medicine, biology and technology.
Journal:Informatica
Volume 22, Issue 1 (2011), pp. 11–26
Abstract
Multiblock Partial Least Squares (PLS) is a widely used regression technique for exploring and modelling the relationships between one dataset and several other datasets. It is designed as an extension of PLS, which links two datasets. In the same vein, we propose an extension of Redundancy Analysis to the multiblock setting. We show that multiblock PLS and multiblock Redundancy Analysis maximize the same criterion, but under different constraints. From the solutions of both approaches, it turns out that they are the two end points of a continuum approach that we propose to investigate.
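As a minimal sketch of the kind of criterion involved (illustrative only, not the authors' multiblock method): the first PLS component maximizes the covariance cov(Xw, Yc) over unit-norm weight vectors w and c, which is solved by the leading singular pair of the cross-covariance matrix X^T Y. The function name is an assumption of this sketch.

```python
import numpy as np

def pls_first_component(X, Y):
    """First PLS component: maximize cov(Xw, Yc) subject to ||w|| = ||c|| = 1.
    The solution is the leading singular pair of the cross-covariance X^T Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc.T @ Yc)
    return U[:, 0], Vt[0, :]  # weight vectors w (for X) and c (for Y)
```

Different constraint sets on w and c (as in Redundancy Analysis, which constrains the variance of the X-component rather than the norm of w) change the solution while keeping the same covariance criterion, which is the continuum the abstract refers to.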
Journal:Informatica
Volume 22, Issue 1 (2011), pp. 27–42
Abstract
This paper offers an analysis of HIV/AIDS dynamics, described by CD4 levels and viral load, carried out from a macroscopic point of view by means of a general stochastic model. A first model focuses on the patient's age as a relevant factor in forecasting the transitions among the different levels of seriousness of the disease, a second on chronological time, and a third considers the two features simultaneously. In this way it is possible to quantify the medical and scientific progress due to advances in the treatment of HIV. The analyses have been performed through non-homogeneous semi-Markov processes. The models have been implemented using real data provided by the ISS (Istituto Superiore di Sanità, Rome, Italy), referring to 2159 subjects enrolled in Italian public structures from September 1983 to January 2006. The relevant results also include the survival analysis of the infected patients. The computed conditional probabilities show the different responses of the subjects depending on their age and on the time elapsed.
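The semi-Markov machinery rests on transition probabilities estimated from observed state changes. A minimal, pooled sketch of the empirical estimator is below (function name and setup are illustrative, not the paper's implementation); in the non-homogeneous setting of the paper, one such matrix would be estimated per age group and calendar period rather than pooled.

```python
import numpy as np

def empirical_transition_matrix(transitions, n_states):
    """Estimate transition probabilities P[i, j] from observed (from, to) pairs
    by normalising transition counts row by row."""
    counts = np.zeros((n_states, n_states))
    for i, j in transitions:
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        # rows with no observed departures get an all-zero row
        P = np.where(row_sums > 0, counts / row_sums, 0.0)
    return P
```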
Journal:Informatica
Volume 22, Issue 1 (2011), pp. 43–56
Abstract
It is well known that, in situations involving large datasets where influential observations or outliers may be present, regression models based on the Maximum Likelihood criterion are likely to be unstable. In this paper we investigate the use of the Minimum Density Power Divergence criterion as a practical tool for building parametric regression models. More precisely, we suggest a procedure relying on an index of similarity between estimated regression models and on a Monte Carlo significance test that allows one to check for the existence of outliers in the data and therefore to choose the best tuning constant for the Minimum Density Power Divergence estimators. The theory is outlined, numerical examples featuring several experimental scenarios are provided, and the main results of a simulation study assessing the performance of the procedure are reported.
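A minimal sketch of a Minimum Density Power Divergence estimator for the simplest case, a normal mean with unit variance (the choice of model, the grid search, and the function name are assumptions of this sketch, not the paper's procedure): the objective trades the model term \int f_mu^{1+alpha} dx, constant in mu for fixed variance, against an empirical term that downweights observations with small density, which yields robustness to outliers.

```python
import numpy as np

def mdpd_normal_mean(x, alpha=0.5, grid=None):
    """Minimum Density Power Divergence estimate of a normal mean (sigma = 1).
    Objective: int f^(1+a) dx - (1 + 1/a) * mean(f(x_i)^a), minimised over mu
    by a simple grid search for illustration."""
    if grid is None:
        grid = np.linspace(x.min(), x.max(), 1001)
    # closed form of int f^(1+alpha) dx for the N(mu, 1) density
    const = (2 * np.pi) ** (-alpha / 2) / np.sqrt(1 + alpha)
    def objective(mu):
        f = np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)
        return const - (1 + 1 / alpha) * np.mean(f ** alpha)
    return grid[np.argmin([objective(mu) for mu in grid])]
```

As alpha approaches 0 the estimator approaches maximum likelihood; larger alpha trades efficiency for robustness, which is why the choice of the tuning constant studied in the paper matters.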
Journal:Informatica
Volume 22, Issue 1 (2011), pp. 57–72
Abstract
Clinical investigators, health professionals and managers are often interested in developing criteria for clustering patients into clinically meaningful groups according to their expected length of stay. In this paper, we propose two novel types of survival trees: phase-type survival trees and mixed distribution survival trees, which extend previous work on exponential survival trees. The trees are used to cluster the patients with respect to length of stay, where partitioning is based on covariates such as gender, age at the time of admission and primary diagnosis code. Likelihood ratio tests are used to determine optimal partitions. The approach is illustrated using nationwide data available from the English Hospital Episode Statistics (HES) database on stroke-related patients, aged 65 years and over, who were discharged from English hospitals over a 1-year period.
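To illustrate the splitting principle, here is a sketch of the exponential baseline that the paper extends (not the phase-type or mixed-distribution versions; function names are illustrative): a node is split on a binary covariate when the likelihood-ratio statistic comparing the pooled exponential fit with the two subgroup fits is large.

```python
import numpy as np

def exp_loglik(times, events):
    """Maximised exponential log-likelihood with censoring:
    lambda_hat = (number of events) / (total observed time)."""
    d, T = events.sum(), times.sum()
    return d * np.log(d / T) - d if d > 0 else 0.0

def lr_split_statistic(times, events, mask):
    """Likelihood-ratio statistic for splitting a node by a binary covariate
    (mask = True for one subgroup): 2 * (split loglik - pooled loglik)."""
    full = exp_loglik(times, events)
    left = exp_loglik(times[mask], events[mask])
    right = exp_loglik(times[~mask], events[~mask])
    return 2.0 * (left + right - full)
```

The statistic is compared against a chi-squared reference to decide whether the partition improves the fit enough to keep.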
Journal:Informatica
Volume 22, Issue 1 (2011), pp. 73–96
Abstract
To set the values of the hyperparameters of a support vector machine (SVM), the method of choice is cross-validation. Several upper bounds on the leave-one-out error of the pattern recognition SVM have been derived. One of the most popular is the radius–margin bound. It applies to the hard margin machine, and, by extension, to the 2-norm SVM. In this article, we introduce the first quadratic loss multi-class SVM: the M-SVM2. It can be seen as a direct extension of the 2-norm SVM to the multi-class case, which we establish by deriving the corresponding generalized radius–margin bound.
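A rough sketch of a radius-margin style computation for a hard-margin linear SVM in the binary case (the paper's bound concerns the multi-class M-SVM2; the function names, the tiny QP solver, and the crude radius approximation are assumptions of this sketch): the bound scales as R^2 / gamma^2, where gamma = 1/||w|| is the geometric margin and R the radius of a ball enclosing the data.

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm(X, y):
    """Solve the hard-margin primal: min 1/2 ||w||^2 s.t. y_i (w.x_i + b) >= 1,
    with labels y_i in {-1, +1}, via SLSQP (fine for tiny toy problems)."""
    d = X.shape[1]
    cons = [{"type": "ineq",
             "fun": lambda wb, i=i: y[i] * (X[i] @ wb[:d] + wb[d]) - 1.0}
            for i in range(len(y))]
    res = minimize(lambda wb: 0.5 * wb[:d] @ wb[:d],
                   np.zeros(d + 1), constraints=cons, method="SLSQP")
    return res.x[:d], res.x[d]

def radius_margin_bound(X, y):
    """Loose radius-margin style quantity R^2 / (m * gamma^2); R is taken as
    the radius of a ball centred at the data mean, which encloses the data
    and upper-bounds the minimal enclosing radius."""
    w, b = hard_margin_svm(X, y)
    gamma = 1.0 / np.linalg.norm(w)                         # geometric margin
    R = np.max(np.linalg.norm(X - X.mean(axis=0), axis=1))  # enclosing radius
    return R ** 2 / (len(y) * gamma ** 2)
```

Minimising such a bound over hyperparameters is the usual alternative to cross-validation that motivates deriving a generalized bound for the multi-class case.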
Journal:Informatica
Volume 22, Issue 1 (2011), pp. 97–114
Abstract
This paper presents a study of the Hurst index estimation in the case of fractional Ornstein–Uhlenbeck and geometric Brownian motion models. The performance of the estimators is studied both with respect to the value of the Hurst index and the length of sample paths.
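One standard estimator in this family (a sketch under the assumption of self-similar increments, not necessarily one of the paper's estimators) reads H off the variance scaling of increments at two lags, since Var[X(t + k d) - X(t)] scales as k^{2H}.

```python
import numpy as np

def hurst_from_increments(path):
    """Estimate H from the variance scaling of increments:
    Var[lag-2 increments] / Var[lag-1 increments] = 2^(2H)
    for a self-similar process with stationary increments."""
    d1 = np.diff(path)           # lag-1 increments
    d2 = path[2:] - path[:-2]    # lag-2 increments
    return 0.5 * np.log2(d2.var() / d1.var())
```

For standard Brownian motion the estimator returns values close to H = 0.5; its accuracy as a function of H and of the path length is exactly the kind of behaviour the paper studies.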
Journal:Informatica
Volume 22, Issue 1 (2011), pp. 115–134
Abstract
In this paper, the quality of quantization and visualization of vectors obtained by vector quantization methods (self-organizing map and neural gas) is investigated. Multidimensional scaling is used for the visualization of multidimensional vectors. The quality of quantization is measured by the quantization error. Two numerical measures of proximity preservation (König's topology preservation measure and Spearman's correlation coefficient) are applied to estimate the quality of visualization. Results of visualization (mapping images) are also presented.
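Two of the quality measures mentioned are easy to sketch (function names are illustrative; König's measure is omitted, and only the quantization error and a Spearman-based proximity preservation measure are shown):

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist

def quantization_error(X, codebook):
    """Mean distance from each data vector to its nearest codebook vector."""
    d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def proximity_preservation(X_high, X_low):
    """Spearman rank correlation between pairwise distances in the original
    space and in the low-dimensional mapping (1 = ranks fully preserved)."""
    rho, _ = spearmanr(pdist(X_high), pdist(X_low))
    return rho
```

A low quantization error says the codebook represents the data well; a Spearman coefficient close to 1 says the mapping preserves the rank order of pairwise distances.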
Journal:Informatica
Volume 22, Issue 1 (2011), pp. 135–148
Abstract
Detecting communities in real-world networks is an important problem for data analysis in science and engineering. A recursive algorithm is designed to detect communities by clustering nodes intelligently. Since relabeling the nodes does not alter the topology of the network, community detection corresponds to finding a good labeling of nodes so that the adjacency matrix forms blocks. By introducing a fictitious interaction between the labels of nodes, the relabeling problem becomes one of energy minimization: clustering nodes that are in the same community decreases the total energy of the network. A greedy method is used to compute the minimum energy. The method efficiently detects communities in artificial as well as real-world networks. The result is illustrated as a tree showing the hierarchical structure of communities on the basis of sub-matrix density. Applications of the method to weighted and directed networks are discussed.
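A minimal sketch of the energy-minimization idea (the Potts-like interaction and the greedy rule below are generic assumptions, not necessarily the authors' exact definitions): same-label edges lower the energy, same-label non-edges raise it, and each node greedily adopts the label that minimizes its local energy until no move improves.

```python
import numpy as np

def greedy_communities(A, gamma=0.5, n_iter=20):
    """Greedy minimisation of a Potts-like energy on adjacency matrix A:
    each same-label edge contributes -1, each same-label non-edge +gamma."""
    n = A.shape[0]
    labels = np.arange(n)              # every node starts in its own community

    def local_energy(i, lab):
        same = (labels == lab)
        same[i] = False                # exclude the node itself
        links = int(A[i, same].sum())
        return -links + gamma * (int(same.sum()) - links)

    for _ in range(n_iter):
        changed = False
        for i in range(n):
            current = local_energy(i, labels[i])
            for lab in sorted(set(labels[np.flatnonzero(A[i])])):
                e = local_energy(i, lab)
                if e < current:        # move only on strict energy decrease
                    labels[i], current, changed = lab, e, True
        if not changed:
            break
    return labels
```

The repulsive gamma term is what prevents the trivial minimum in which all nodes collapse into a single community.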
Journal:Informatica
Volume 22, Issue 1 (2011), pp. 149–164
Abstract
This contribution focuses on change point detection in a one-dimensional stochastic process via sparse parameter estimation in an overparametrized model. A stochastic process with a change in the mean is estimated using a dictionary consisting of Heaviside functions, and the basis pursuit algorithm is used to obtain sparse parameter estimates. The proposed method of change point detection is compared with several standard statistical methods by simulation.
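A sketch of the Heaviside-dictionary idea using the Lasso, a penalized variant of basis pursuit denoising (the penalty value, threshold, and function name are illustrative assumptions): the signal is regressed on step functions, one per candidate change point, and the non-zero coefficients of the sparse solution mark the detected changes.

```python
import numpy as np
from sklearn.linear_model import Lasso

def heaviside_changepoints(y, alpha=0.1):
    """Fit y ~ H beta with an L1 penalty, where column k of H is a Heaviside
    step switching on at time k; indices of non-zero coefficients are the
    estimated change points."""
    n = len(y)
    H = np.tril(np.ones((n, n)))      # H[t, k] = 1 for t >= k, else 0
    model = Lasso(alpha=alpha, fit_intercept=True, max_iter=10000).fit(H, y)
    return np.flatnonzero(np.abs(model.coef_) > 1e-6)
```

The L1 penalty is what keeps the overparametrized model (as many dictionary atoms as observations) from overfitting, selecting only a few active steps.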