Journal: Informatica
Volume 17, Issue 1 (2006), pp. 111–124
Abstract
This paper investigates a variety of statistical cache-based language models built upon three corpora: English, Lithuanian, and Lithuanian base forms. The impact of the cache size, the type of decay function (including custom, corpus-derived functions), and the interpolation technique (static vs. dynamic) on the perplexity of a language model is studied. The best results are achieved by models consisting of three components: a standard 3-gram model, a decaying cache 1-gram, and a decaying cache 2-gram, joined by linear interpolation with dynamic weight updates. Such a model yielded perplexity improvements of up to 36% and 43% over the 3-gram baseline for Lithuanian words and Lithuanian word base forms, respectively. The best language model of English yielded up to a 16% perplexity improvement. This suggests that cache-based modeling is of greater utility for free-word-order, highly inflected languages.
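The decaying-cache interpolation described above can be sketched as follows; the decay rate, cache size, and interpolation weight are illustrative assumptions, not values taken from the paper, and only the unigram cache component is shown:

```python
import math
from collections import deque

def cache_prob(word, cache, decay=0.005):
    """Exponentially decaying unigram cache estimate: recent occurrences of
    `word` in the history weigh more than old ones. The decay constant is a
    hypothetical value for illustration."""
    n = len(cache)
    num = den = 0.0
    for i, w in enumerate(cache):  # cache holds recent words, oldest first
        weight = math.exp(-decay * (n - 1 - i))  # distance from current position
        den += weight
        if w == word:
            num += weight
    return num / den if den > 0.0 else 0.0

def interpolated_prob(word, trigram_p, cache, lam=0.1):
    """Static linear interpolation of a standard trigram estimate with the
    cache estimate; a dynamic scheme would re-estimate `lam` as text is read."""
    return (1.0 - lam) * trigram_p + lam * cache_prob(word, cache)

# Toy history; a bounded deque plays the role of the fixed-size cache.
history = deque(["the", "cat", "sat", "the", "cat"], maxlen=1000)
p = interpolated_prob("cat", trigram_p=0.2, cache=history)
```

Because "cat" occurs twice and more recently than "sat", its decayed cache estimate is higher, which is exactly the self-trigger effect cache models exploit.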
Journal: Informatica
Volume 15, Issue 4 (2004), pp. 565–580
Abstract
This paper describes our research on statistical language modeling of Lithuanian. We investigated the idea of improving sparse n-gram models of the highly inflected Lithuanian language by interpolating them with complex n-gram models based on word clustering and morphological word decomposition. Words, word base forms, and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based language models were built and evaluated on a Lithuanian text corpus of 85 million words. Class-based models linearly interpolated with the 3-gram model yielded up to a 13% perplexity reduction compared with the baseline 3-gram model. Morphological models decreased the out-of-vocabulary word rate from 1.5% to 1.02%.
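A class-based n-gram combined with a word n-gram by linear interpolation can be sketched as below; the class inventory, probability tables, and interpolation weight are toy assumptions for illustration, not the paper's estimates:

```python
def class_bigram_prob(w, prev_w, word2class, p_class, p_word_in_class):
    """Class-based bigram estimate: P(w | prev_w) ~= P(c(w) | c(prev_w)) * P(w | c(w)).
    `word2class` maps words to automatically derived class ids; both
    distribution tables here hold hypothetical toy values."""
    c_prev, c = word2class[prev_w], word2class[w]
    return p_class.get((c_prev, c), 0.0) * p_word_in_class.get((c, w), 0.0)

def interpolated_prob(word_ngram_p, class_p, lam=0.5):
    """Linear interpolation of a word n-gram estimate with the class-based
    estimate; in practice `lam` is tuned on held-out text."""
    return (1.0 - lam) * word_ngram_p + lam * class_p

# Toy example with two classes (0 = noun-like, 1 = verb-like).
word2class = {"cat": 0, "dog": 0, "runs": 1}
p_class = {(0, 1): 0.6}            # P(verb class | noun class)
p_word_in_class = {(1, "runs"): 1.0}  # P("runs" | verb class)

cp = class_bigram_prob("runs", "cat", word2class, p_class, p_word_in_class)
p = interpolated_prob(word_ngram_p=0.1, class_p=cp, lam=0.5)
```

The class component generalizes across words that share a class, which is what lets it back off gracefully for sparse inflected word forms the word n-gram has rarely or never seen.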