Pub. online: 1 Jan 2019
Type: Research Article
Open Access
Journal: Informatica
Volume 30, Issue 3 (2019), pp. 573–593
Abstract
Conventional large-vocabulary automatic speech recognition (ASR) systems require a mapping from words to sub-word units in order to generalize to words absent from the training data and to enable robust estimation of acoustic model parameters. This paper surveys the research done during the last 15 years on word-to-sub-word mappings for Lithuanian ASR systems. It also compares various phoneme- and grapheme-based mappings across a broad range of acoustic modelling techniques, including monophone- and triphone-based hidden Markov models (HMM), speaker-adaptively trained HMMs, subspace Gaussian mixture models (SGMM), feed-forward time-delay neural networks (TDNN), and a state-of-the-art low-frame-rate bidirectional long short-term memory (LFR BLSTM) recurrent deep neural network. Experimental comparisons are based on a 50-hour speech corpus. This paper shows that the best phone-based mapping significantly outperforms a grapheme-based one. It also shows that the lowest phone error rate is achieved by a phoneme-based lexicon that explicitly models syllable stress and represents diphthongs as single phonetic units.
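The grapheme-versus-phoneme distinction can be sketched in a few lines of Python. The entries, the "3" stress marker, and the treatment of "ie" as one diphthong unit below are illustrative assumptions echoing the mapping properties the abstract describes, not the paper's actual lexicon:

```python
# Sketch of the two lexicon styles compared in the paper: a grapheme-based
# mapping (every letter becomes a modelling unit) versus a phoneme-based one
# (hand-crafted pronunciations). All entries here are hypothetical examples.

def grapheme_lexicon(word):
    """Trivial grapheme mapping: each letter is its own sub-word unit."""
    return list(word.lower())

# Hypothetical phoneme entries: "ie" is kept as a single diphthong unit and
# "3" marks a stressed vowel, mirroring (loosely) the best-performing
# mapping described in the abstract.
PHONEME_LEXICON = {
    "laba": ["l", "a3", "b", "a"],
    "diena": ["d", "ie3", "n", "a"],
}

def phoneme_lexicon(word):
    """Look up a hand-written pronunciation; None if out of vocabulary."""
    return PHONEME_LEXICON.get(word.lower())

if __name__ == "__main__":
    print(grapheme_lexicon("Laba"))   # grapheme units
    print(phoneme_lexicon("diena"))   # phoneme units with stress/diphthong
```

The grapheme mapping needs no expert knowledge, which is its main appeal; the paper's finding is that the extra linguistic information in the phoneme mapping still pays off in error rate.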
Journal: Informatica
Volume 27, Issue 3 (2016), pp. 673–688
Abstract
This paper presents a corpus-driven approach to building a computational model of fundamental frequency, or F0, for the Lithuanian language. The model was obtained by training the HMM-based speech synthesis system HTS on six hours of speech from multiple speakers. Several gender-specific models, using different parameters and different contextual factors, were investigated. The models were evaluated by synthesizing F0 contours and comparing them to the original contours using the criteria of root mean square error (RMSE) and voicing classification error. The HMM-based models showed an improvement in RMSE over the mean-based model, which predicted the F0 of a vowel on the basis of its average normalized pitch.
Journal: Informatica
Volume 19, Issue 2 (2008), pp. 271–284
Abstract
A classification and regression tree (CART) approach was used in this research to model phone durations of Lithuanian. 300 thousand samples of vowels and 400 thousand samples of consonants, extracted from the VDU-AB20 corpus, were used in the experimental part of the research. A set of 15 parameters characterizing a phone and its context was selected for duration prediction. The most significant of them were: the identifier (ID) of the phone being predicted, the IDs of the adjacent phones, and the number of phones in the syllable. Models were built using two different data sets: one speaker and 20 speakers. The influence of cost-complexity pruning and of different pre-pruning values was investigated. Prediction by average leaf duration was also compared with prediction by median leaf duration. The most prominent errors were investigated, speech-rate normalization and basic noise reduction were applied, and their influence on model evaluation parameters is discussed. The achieved results, correlations of 0.8 and 0.75 for vowels and consonants respectively, and an RMSE of ~18 ms, are comparable with those reported for Czech, Hindi, Telugu, and Korean.
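The core CART mechanics, growing a tree by squared-error splits and predicting a duration from either the leaf mean or the leaf median, can be sketched compactly. This toy implementation with numeric features is only an illustration of the technique; the paper's actual feature encoding, pruning, and tree-growing configuration are not reproduced here:

```python
from statistics import mean, median

def best_split(X, y):
    """Return (sse, feature, threshold) for the squared-error-minimizing
    binary split, or None if no valid split exists."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            sse = (sum((v - mean(left)) ** 2 for v in left)
                   + sum((v - mean(right)) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t)
    return best

def build_tree(X, y, min_leaf=2):
    """Grow a regression tree; each leaf stores both mean and median,
    so either prediction rule (as compared in the paper) can be used."""
    split = best_split(X, y)
    if len(y) <= min_leaf or split is None:
        return {"mean": mean(y), "median": median(y)}
    _, j, t = split
    L = [(r, v) for r, v in zip(X, y) if r[j] <= t]
    R = [(r, v) for r, v in zip(X, y) if r[j] > t]
    return {"j": j, "t": t,
            "left": build_tree([r for r, _ in L], [v for _, v in L], min_leaf),
            "right": build_tree([r for r, _ in R], [v for _, v in R], min_leaf)}

def predict(node, x, stat="mean"):
    """Descend to a leaf and read off the chosen duration statistic."""
    while "j" in node:
        node = node["left"] if x[node["j"]] <= node["t"] else node["right"]
    return node[stat]

if __name__ == "__main__":
    # Toy data: one contextual feature, durations in milliseconds.
    X, y = [[0], [0], [1], [1]], [50, 70, 100, 120]
    tree = build_tree(X, y)
    print(predict(tree, [0]), predict(tree, [1], stat="median"))
```

Real phone-duration CARTs use categorical features such as phone IDs (typically handled by subset splits rather than thresholds), but the mean-versus-median leaf prediction comparison carries over unchanged.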
Journal: Informatica
Volume 17, Issue 1 (2006), pp. 111–124
Abstract
This paper investigates a variety of statistical cache-based language models built on three corpora: English, Lithuanian, and Lithuanian base forms. The impact on language-model perplexity of the cache size, the type of decay function (including custom corpus-derived functions), and the interpolation technique (static vs. dynamic) is studied. The best results are achieved by models consisting of three components: a standard 3-gram, a decaying cache 1-gram, and a decaying cache 2-gram, joined by means of linear interpolation with dynamic weight updates. Such a model yielded up to 36% and 43% perplexity improvements over the 3-gram baseline for Lithuanian words and Lithuanian word base forms, respectively. The best English language model yielded up to a 16% perplexity improvement. This suggests that cache-based modeling is of greater utility for free-word-order, highly inflected languages.
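The two building blocks of such models, a decaying unigram cache and static linear interpolation with a base n-gram, can be sketched as follows. The exponential decay and the weight value are illustrative choices; the paper compares several decay functions and also a dynamic weighting scheme not shown here:

```python
class DecayingCache:
    """Unigram cache with exponential decay: recently seen words
    contribute more probability mass than older ones."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.counts = {}
        self.total = 0.0

    def observe(self, word):
        # Age all accumulated mass, then add the fresh observation.
        for w in self.counts:
            self.counts[w] *= self.decay
        self.total = self.total * self.decay + 1.0
        self.counts[word] = self.counts.get(word, 0.0) + 1.0

    def prob(self, word):
        """Relative decayed frequency of the word in the cache."""
        return self.counts.get(word, 0.0) / self.total if self.total else 0.0

def interpolate(p_ngram, p_cache, lam=0.9):
    """Static linear interpolation of the base n-gram probability
    with the cache probability (lam is an assumed fixed weight)."""
    return lam * p_ngram + (1 - lam) * p_cache

if __name__ == "__main__":
    cache = DecayingCache(decay=0.99)
    for w in ["namas", "didelis"]:
        cache.observe(w)
    print(cache.prob("didelis"), interpolate(0.1, cache.prob("didelis")))
```

Dynamic interpolation would re-estimate `lam` on the fly from recent prediction performance, which is where the paper reports its best results.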
Journal: Informatica
Volume 15, Issue 4 (2004), pp. 565–580
Abstract
This paper describes our research on statistical language modeling of Lithuanian. The idea of improving sparse n-gram models of the highly inflected Lithuanian language by interpolating them with complex n-gram models based on word clustering and morphological word decomposition was investigated. Words, word base forms, and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based language models were built and evaluated on a Lithuanian text corpus containing 85 million words. Class-based models linearly interpolated with the 3-gram model yielded up to a 13% reduction in perplexity compared with the baseline 3-gram model. Morphological models decreased the out-of-vocabulary word rate from 1.5% to 1.02%.
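The class-based decomposition behind such models factors a word probability through its class, p(w | v) ≈ p(C(w) | C(v)) · p(w | C(w)), and the result is then linearly interpolated with the word n-gram. A minimal bigram-order sketch with toy probability tables (all numbers hypothetical):

```python
def class_ngram_prob(word, prev_word, word2class, p_class_bigram,
                     p_word_given_class):
    """Class-based bigram: p(w | v) ~= p(C(w) | C(v)) * p(w | C(w))."""
    cw, cv = word2class[word], word2class[prev_word]
    return (p_class_bigram.get((cv, cw), 0.0)
            * p_word_given_class.get((cw, word), 0.0))

def interpolated_prob(p_word_model, p_class_model, lam):
    """Linear interpolation of the word n-gram with the class model."""
    return lam * p_word_model + (1 - lam) * p_class_model

if __name__ == "__main__":
    # Toy tables: two automatically induced classes "A" and "N".
    word2class = {"didelis": "A", "namas": "N"}
    p_class_bigram = {("A", "N"): 0.5}          # p(N | A)
    p_word_given_class = {("N", "namas"): 0.2}  # p(namas | N)
    p_cls = class_ngram_prob("namas", "didelis", word2class,
                             p_class_bigram, p_word_given_class)
    print(p_cls, interpolated_prob(0.3, p_cls, lam=0.7))
```

The sparsity benefit is that the class tables have at most (number of classes)² entries, so they stay well estimated even when the word 3-gram is sparse, which is exactly the motivation given in the abstract.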
Journal: Informatica
Volume 15, Issue 2 (2004), pp. 231–242
Abstract
This paper describes a preliminary experiment in designing a hidden Markov model (HMM)-based part-of-speech tagger for the Lithuanian language. Part-of-speech tagging is the problem of assigning each word of a text the proper tag for its context of appearance. It is accomplished in two basic steps: morphological analysis and disambiguation. In this paper, we focus on the problem of disambiguation, i.e., on choosing the correct tag for each word from a set of possible tags. We constructed a stochastic disambiguation algorithm, based on supervised learning techniques, that learns the hidden Markov model's parameters from hand-annotated corpora. The Viterbi algorithm is then used to assign the most probable tag to each word in the text.
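The Viterbi decoding step named above is standard and small enough to sketch in full. The toy tag set and probability tables below are invented for illustration; in the paper they would come from supervised estimation on the hand-annotated corpus:

```python
def viterbi(words, tags, pi, trans, emit):
    """Most probable tag sequence for `words` under a first-order HMM.

    pi[t]        -- initial probability of tag t
    trans[(u,t)] -- probability of tag t following tag u
    emit[(t,w)]  -- probability of word w given tag t
    """
    # Initialize with the first word's scores.
    V = [{t: pi.get(t, 0.0) * emit.get((t, words[0]), 0.0) for t in tags}]
    back = []
    for w in words[1:]:
        scores, ptr = {}, {}
        for t in tags:
            # Best predecessor tag for t, by score * transition.
            best = max(tags, key=lambda u: V[-1][u] * trans.get((u, t), 0.0))
            scores[t] = (V[-1][best] * trans.get((best, t), 0.0)
                         * emit.get((t, w), 0.0))
            ptr[t] = best
        V.append(scores)
        back.append(ptr)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: V[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

if __name__ == "__main__":
    tags = ["N", "V"]                     # toy tag set: noun, verb
    pi = {"N": 0.6, "V": 0.4}
    trans = {("N", "N"): 0.3, ("N", "V"): 0.7,
             ("V", "N"): 0.8, ("V", "V"): 0.2}
    emit = {("N", "fish"): 0.8, ("V", "fish"): 0.2,
            ("N", "swim"): 0.2, ("V", "swim"): 0.8}
    print(viterbi(["fish", "swim"], tags, pi, trans, emit))
```

A production tagger would work in log space to avoid underflow on long sentences and would restrict each word's candidate tags to those proposed by the morphological analyzer, but the dynamic programming is the same.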
Journal: Informatica
Volume 14, Issue 1 (2003), pp. 75–84
Abstract
In this paper, initial work on the development of a Lithuanian HMM speech recognition system is described. A triphone single-Gaussian HMM speech recognition system based on Mel-frequency cepstral coefficients (MFCC) was developed using the HTK toolkit. The hidden Markov models' parameters were estimated from a phone-level hand-annotated Lithuanian speech corpus. The system was evaluated on a speaker-independent, ∼750 distinct isolated-word recognition task. Although speaker adaptation and language modeling techniques were not used, the system achieved a 20% word error rate.
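The per-state acoustic scoring in a single-Gaussian system like this reduces to a diagonal-covariance Gaussian log-density over the MFCC vector. A minimal illustration of that density, not HTK's actual implementation:

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a feature vector under a diagonal-covariance
    single Gaussian, as used per HMM state in such a system."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

if __name__ == "__main__":
    # Toy 3-dimensional "MFCC" vector scored against a state's Gaussian.
    print(log_gauss([0.1, -0.2, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]))
```

During recognition, these state log-densities are combined with transition log-probabilities by Viterbi decoding to score each candidate word's HMM against the observed MFCC sequence.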
Journal: Informatica
Volume 9, Issue 3 (1998), pp. 343–364
Abstract
This paper describes a preliminary algorithm for mapping sound to a music score. Our procedure is built on signal-extracted energy and fundamental-frequency traces alone. The algorithm is tested on real songs of average complexity. Although the results seem promising, their detailed examination reveals some shortcomings of our approach, as well as a set of application-specific problems. It appears that musical analysis cannot be entirely dissociated from phonetic processing; further work should also be oriented towards integrating musical knowledge.
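One elementary step in any sound-to-score mapping is quantizing a fundamental-frequency estimate to the nearest equal-tempered note. This sketch shows only that quantization step, not the paper's full segmentation procedure:

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def hz_to_note(f0, a4=440.0):
    """Quantize an F0 estimate (Hz) to the nearest equal-tempered note,
    using the MIDI convention that A4 = 440 Hz = note number 69."""
    midi = round(69 + 12 * math.log2(f0 / a4))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

if __name__ == "__main__":
    for f0 in (261.63, 440.0, 880.0):
        print(f0, "->", hz_to_note(f0))
```

The hard problems the paper reports lie elsewhere: deciding where one note ends and the next begins from energy and F0 traces alone, which is exactly where phonetic processing turns out to matter.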