Journal: Informatica
Volume 22, Issue 2 (2011), pp. 203–224
Abstract
In this paper, we describe a model for aligning books and documents from a bilingual corpus with the goal of creating a “perfectly” aligned bilingual corpus at the word-to-word level. The presented algorithms differ from existing ones in that they take into account the presence of a human translator, whose involvement we aim to minimize. We treat the human translator as an oracle who knows the exact alignments, and the goal of the system is to optimize (minimize) the use of this oracle. The effectiveness of the oracle is measured by the speed at which a “perfectly” aligned bilingual corpus can be created. By a “perfectly” aligned corpus we mean a zero-entropy corpus, since the oracle makes alignments without any probabilistic interpretation, i.e., with 100% confidence. Sentence-level alignments and word-to-word alignments, although treated separately in this paper, are integrated into a single framework. For sentence-level alignment we provide a dynamic programming algorithm that achieves low precision and recall error rates. For word-to-word alignment, an Expectation–Maximization algorithm that integrates linguistic dictionaries is suggested as the main tool for the oracle to build the “perfectly” aligned bilingual corpus. We show empirically that the suggested pre-aligned corpus requires little interaction from the oracle and that a perfectly aligned corpus can be created almost at the speed of human reading. The presented algorithms are language independent, but in this paper we verify them on the English–Lithuanian language pair with two types of text: legal documents and fiction.
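A minimal sketch of a sentence-level dynamic programming aligner in the spirit of length-based methods such as Gale-Church is given below; the cost function and the restriction to 1-1, 1-0, 0-1, 1-2 and 2-1 bead types are illustrative assumptions, not necessarily the paper's exact formulation.

```python
# Hypothetical sketch: align sentences by minimizing a cumulative cost
# over standard bead types with dynamic programming.
def align_sentences(src_lens, tgt_lens, cost):
    """src_lens/tgt_lens: character lengths of source/target sentences.
    cost(s, t): cost of aligning a source span of total length s
    with a target span of total length t."""
    INF = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    moves = [(1, 1), (1, 0), (0, 1), (1, 2), (2, 1)]  # allowed bead types
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in moves:
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or D[pi][pj] == INF:
                    continue
                c = D[pi][pj] + cost(sum(src_lens[pi:i]), sum(tgt_lens[pj:j]))
                if c < D[i][j]:
                    D[i][j], back[i][j] = c, (di, dj)
    # Trace back the optimal alignment path.
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        path.append(((i - di, i), (j - dj, j)))
        i, j = i - di, j - dj
    return path[::-1]
```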
Journal: Informatica
Volume 22, Issue 2 (2011), pp. 189–201
Abstract
Batch cryptography has developed into two main branches: batch verification and batch identification. Batch verification is a method to determine whether a set of signatures contains invalid signatures, and batch identification is a method to find the bad signatures when it does. Recently, significant developments have appeared in this field, notably by Lee et al., Ferrara et al. and Law et al. In this paper, we address some weaknesses of Lee et al.'s earlier work and propose an identification method for an RSA-type signature scheme. Our method is more efficient than the well-known divide-and-conquer method for this signature scheme. We conclude the paper by providing a method to choose optimal divide-and-conquer verifiers.
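For reference, here is a minimal sketch of the generic divide-and-conquer identification baseline that such methods are compared against; `batch_verify` is a hypothetical predicate standing in for an RSA-type batch verification equation.

```python
# Illustrative divide-and-conquer identification: batch-verify a set and
# recurse on halves only when the batch test fails.
def find_bad(sigs, batch_verify):
    if batch_verify(sigs):
        return []           # whole batch valid: no bad signatures here
    if len(sigs) == 1:
        return list(sigs)   # a failing singleton is a bad signature
    mid = len(sigs) // 2
    return find_bad(sigs[:mid], batch_verify) + find_bad(sigs[mid:], batch_verify)
```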
Journal: Informatica
Volume 22, Issue 2 (2011), pp. 177–188
Abstract
The paper presents a novel method for improving the estimates of closely spaced frequencies of a short signal in additive Gaussian noise, based on the Burg algorithm with extrapolation. The proposed method is implemented in two consecutive steps. In the first step, the Burg algorithm is used to estimate the parameters of the predictive filter; in the second step, the signal is extrapolated with this filter to improve the frequency estimates. The experimental results demonstrate that the frequency estimates of the short signal obtained using the Burg algorithm with extrapolation are more accurate than those obtained using the Burg algorithm without extrapolation.
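A hedged sketch of the two-step idea follows: fit an autoregressive predictor with the Burg recursion, then extrapolate the signal with it before spectral analysis. The model order and extrapolation length are illustrative choices, not the paper's settings.

```python
import numpy as np

def burg(x, p):
    """Burg estimate of AR(p) coefficients a, so that the prediction
    error is e[n] = x[n] + sum_k a[k] * x[n-1-k]."""
    a = np.zeros(p)
    f, b = x[1:].astype(float), x[:-1].astype(float)  # forward/backward errors
    for k in range(p):
        rc = -2.0 * np.dot(b, f) / (np.dot(f, f) + np.dot(b, b))
        a_prev = a[:k].copy()
        a[k] = rc
        a[:k] = a_prev + rc * a_prev[::-1]            # Levinson-style update
        f, b = f[1:] + rc * b[1:], b[:-1] + rc * f[:-1]
    return a

def extrapolate(x, a, n_extra):
    """Extend x by n_extra samples using the fitted AR predictor."""
    y = list(x.astype(float))
    p = len(a)
    for _ in range(n_extra):
        y.append(-np.dot(a, y[-1:-p - 1:-1]))         # x_hat[n] = -sum a[k] x[n-1-k]
    return np.array(y)
```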
Journal: Informatica
Volume 22, Issue 2 (2011), pp. 165–176
Abstract
The instrumental variable (IV) method is one of the most renowned methods for parameter estimation. Its major advantage is that it is applicable to open-loop as well as closed-loop systems. The main difficulty in closed-loop identification is the correlation between the disturbances and the control signal induced by the loop. To overcome this problem, an additional excitation signal is introduced. Non-recursive modifications of the instrumental variable method for closed-loop system identification, based on a generalized IV method, have been developed (Atanasov and Ichtev, 2009; Gilson and Van den Hof, 2001; Gilson and Van den Hof, 2003). In this paper, recursive algorithms for these modifications are proposed and investigated. A simulation is carried out to illustrate the obtained results.
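A minimal sketch of a recursive IV update of the usual RLS-like form is given below, assuming regressor vectors phi built from input-output data and instrument vectors zeta built from the external excitation; the exact instrument construction in the generalized IV modifications differs.

```python
import numpy as np

# Hedged sketch of a recursive instrumental variable (RIV) estimator.
# Initialization (p0) and vector construction are illustrative.
class RecursiveIV:
    def __init__(self, n_params, p0=1e3):
        self.theta = np.zeros(n_params)     # parameter estimate
        self.P = p0 * np.eye(n_params)      # "covariance" matrix

    def update(self, phi, zeta, y):
        Pz = self.P @ zeta
        gain = Pz / (1.0 + phi @ Pz)        # RIV gain vector
        err = y - phi @ self.theta          # prediction error
        self.theta = self.theta + gain * err
        self.P = self.P - np.outer(gain, phi @ self.P)
        return self.theta
```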
Journal: Informatica
Volume 22, Issue 1 (2011), pp. 149–164
Abstract
The contribution focuses on change-point detection in a one-dimensional stochastic process by sparse parameter estimation from an overparametrized model. A stochastic process with a change in the mean is estimated using a dictionary consisting of Heaviside functions. The basis pursuit algorithm is used to obtain sparse parameter estimates. This method of change-point detection is compared with several standard statistical methods in simulations.
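A hedged sketch of the idea, assuming an l1-penalized solver (scikit-learn's Lasso) as a readily available stand-in for basis pursuit: regress the series on an overparametrized dictionary of Heaviside step functions and read change points off the nonzero coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso

def detect_change_points(y, alpha=0.1):
    n = len(y)
    # Column j is a Heaviside step switching from 0 to 1 at index j + 1
    # (the all-zero last column is dropped).
    H = np.tril(np.ones((n, n)), k=-1)[:, :-1]
    fit = Lasso(alpha=alpha, fit_intercept=True).fit(H, y)
    return np.nonzero(fit.coef_)[0] + 1   # indices of estimated change points

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
print(detect_change_points(y))            # expect a point near index 100
```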
Journal: Informatica
Volume 22, Issue 1 (2011), pp. 135–148
Abstract
Detecting communities in real-world networks is an important problem for data analysis in science and engineering. By clustering nodes intelligently, a recursive algorithm is designed to detect communities. Since relabeling nodes does not alter the topology of the network, community detection corresponds to finding a good labeling of nodes such that the adjacency matrix forms blocks. By introducing a fictitious interaction between nodes, the relabeling problem becomes one of energy minimization: the total energy of the network is defined through interactions between node labels, so that grouping nodes of the same community decreases the total energy. A greedy method is used to compute the minimum energy. The method efficiently detects communities in artificial as well as real-world networks. The result is illustrated as a tree showing the hierarchical structure of communities on the basis of sub-matrix density. Applications of the method to weighted and directed networks are discussed.
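An illustrative sketch of a greedy relabeling scheme that lowers a Potts-like energy in which linked nodes sharing a label contribute negative energy; this energy definition is an assumption for illustration, not the paper's exact one.

```python
import numpy as np

# Greedy energy minimization: repeatedly move each node to the label
# most common among its neighbours, which lowers the assumed energy
# -sum over edges (i, j) of [label_i == label_j].
def greedy_communities(A, n_iter=50):
    n = A.shape[0]                    # A: adjacency matrix, zero diagonal
    labels = np.arange(n)             # start with every node in its own community
    for _ in range(n_iter):
        changed = False
        for i in range(n):
            nbr_labels = labels[A[i] > 0]
            if nbr_labels.size == 0:
                continue
            vals, counts = np.unique(nbr_labels, return_counts=True)
            best = vals[np.argmax(counts)]
            if best != labels[i]:
                labels[i], changed = best, True
        if not changed:               # local energy minimum reached
            break
    return labels
```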
Journal: Informatica
Volume 22, Issue 1 (2011), pp. 115–134
Abstract
In this paper, the quality of quantization and visualization of vectors obtained by vector quantization methods (the self-organizing map and neural gas) is investigated. Multidimensional scaling is used for the visualization of multidimensional vectors. The quality of quantization is measured by the quantization error. Two numerical measures of proximity preservation (König's topology preservation measure and Spearman's correlation coefficient) are applied to estimate the quality of visualization. Results of visualization (mapping images) are also presented.
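A hedged sketch of two of the quality measures discussed: the quantization error of a codebook, and Spearman's correlation between inter-point distances before and after mapping as a proximity-preservation score (König's measure is omitted for brevity).

```python
import numpy as np
from scipy.stats import spearmanr

def quantization_error(X, codebook):
    """Mean distance from each vector to its nearest codebook vector."""
    d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def proximity_preservation(X, Y):
    """Spearman correlation between pairwise distances in the original
    space (X) and in the low-dimensional visualization (Y)."""
    iu = np.triu_indices(len(X), k=1)
    dx = np.linalg.norm(X[:, None] - X[None, :], axis=2)[iu]
    dy = np.linalg.norm(Y[:, None] - Y[None, :], axis=2)[iu]
    return spearmanr(dx, dy).correlation
```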
Journal: Informatica
Volume 22, Issue 1 (2011), pp. 97–114
Abstract
This paper presents a study of Hurst index estimation in the case of fractional Ornstein–Uhlenbeck and geometric Brownian motion models. The performance of the estimators is studied with respect to both the value of the Hurst index and the length of the sample paths.
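As one standard example of the kind of estimator under study, here is a hedged sketch of a discrete-variations Hurst estimator: for fractional Brownian motion the second-order variations scale as k^{2H} with the lag k, so H can be read off a log-log regression.

```python
import numpy as np

def hurst_variations(x, max_lag=20):
    """Estimate H from the scaling E|x[t+k] - x[t]|^2 ~ c * k^(2H)."""
    lags = np.arange(1, max_lag + 1)
    v = [np.mean((x[k:] - x[:-k]) ** 2) for k in lags]
    slope, _ = np.polyfit(np.log(lags), np.log(v), 1)  # slope = 2H
    return slope / 2.0
```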
Journal: Informatica
Volume 22, Issue 1 (2011), pp. 73–96
Abstract
To set the values of the hyperparameters of a support vector machine (SVM), the method of choice is cross-validation. Several upper bounds on the leave-one-out error of the pattern recognition SVM have been derived. One of the most popular is the radius–margin bound. It applies to the hard margin machine, and, by extension, to the 2-norm SVM. In this article, we introduce the first quadratic loss multi-class SVM: the M-SVM2. It can be seen as a direct extension of the 2-norm SVM to the multi-class case, which we establish by deriving the corresponding generalized radius–margin bound.
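For reference, a commonly cited form of the radius–margin bound for the hard-margin SVM (following Vapnik and Chapelle) is sketched below, with R the radius of the smallest sphere enclosing the training data, gamma the geometric margin and m the sample size; the paper's multi-class generalization differs in detail.

```latex
% Classical radius--margin bound on the leave-one-out (LOO) error of a
% hard-margin SVM trained on m examples, with gamma = 1 / \|w\|:
\[
  \mathrm{LOO} \;\leqslant\; \frac{1}{m}\, R^{2}\, \lVert \mathbf{w} \rVert^{2}
  \;=\; \frac{1}{m} \left( \frac{R}{\gamma} \right)^{2}.
\]
```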
Journal: Informatica
Volume 22, Issue 1 (2011), pp. 57–72
Abstract
Clinical investigators, health professionals and managers are often interested in developing criteria for clustering patients into clinically meaningful groups according to their expected length of stay. In this paper, we propose two novel types of survival trees: phase-type survival trees and mixed-distribution survival trees, which extend previous work on exponential survival trees. The trees are used to cluster patients with respect to length of stay, with partitioning based on covariates such as gender, age at the time of admission and primary diagnosis code. Likelihood ratio tests are used to determine the optimal partitions. The approach is illustrated using nationwide data from the English Hospital Episode Statistics (HES) database on stroke-related patients, aged 65 years and over, who were discharged from English hospitals over a one-year period.
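A hedged sketch of the splitting criterion for the exponential baseline that the proposed trees extend: a likelihood ratio test comparing a single exponential fit on a node against separate fits on the two children induced by a covariate (censoring is ignored for brevity, and node samples are assumed to be NumPy arrays).

```python
import numpy as np
from scipy.stats import chi2

def exp_loglik(stays):
    """Maximized exponential log-likelihood of a set of lengths of stay."""
    lam = 1.0 / np.mean(stays)          # MLE of the exponential rate
    return len(stays) * np.log(lam) - lam * np.sum(stays)

def lr_test_split(stays, in_left):
    """p-value of the likelihood ratio test for a candidate split,
    where in_left is a boolean mask given by a covariate."""
    left, right = stays[in_left], stays[~in_left]
    lr = 2.0 * (exp_loglik(left) + exp_loglik(right) - exp_loglik(stays))
    return chi2.sf(lr, df=1)            # one extra rate parameter
```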