INFORMATICA

ISSN 0868-4952

DOI: 10.15388/Informatica.2004.079

Research article

Statistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition

Airenas Vaičiūnas (airenas@freemail.lt), Vytautas Kaminskas (V.Kaminskas@if.vdu.lt)
Department of Applied Informatics, Vytautas Magnus University, Vileikos 8, LT‐3035 Kaunas, Lithuania

Gailius Raškinis (idgara@vdu.lt)
Center of Computational Linguistics, Vytautas Magnus University, Donelaičio 52, LT‐3000 Kaunas, Lithuania

01 01 2004

Vol. 15, No. 4, pp. 565–580

01 03 2004

This paper describes our research on statistical language modeling of Lithuanian. We investigated whether sparse n‐gram models of the highly inflected Lithuanian language can be improved by interpolating them with complex n‐gram models based on word clustering and morphological word decomposition. Words, word base forms, and part‐of‐speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3‐gram and 4‐gram class‐based language models were built and evaluated on a Lithuanian text corpus of 85 million words. Class‐based models linearly interpolated with the 3‐gram model reduced perplexity by up to 13% compared with the baseline 3‐gram model. Morphological models decreased the out‐of‐vocabulary word rate from 1.5% to 1.02%.
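
To make the two quantities named in the abstract concrete, the following is a minimal sketch of linear interpolation between a word‐based model and a class‐based model, together with the perplexity computation. It is not the authors' implementation: the function names, the interpolation weight of 0.6, and the per‐word probabilities are all assumptions chosen only for illustration.

```python
import math


def interpolate(p_word, p_class, lam):
    """Linear interpolation: lam * P_word + (1 - lam) * P_class.

    `lam` is a hypothetical interpolation weight in [0, 1]; in practice it
    would be tuned on held-out data.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_word + (1.0 - lam) * p_class


def perplexity(probs):
    """Perplexity of a text given the model probability of each word.

    Computed as exp(-(1/N) * sum(log p_i)) over the N words.
    """
    if not probs:
        raise ValueError("need at least one probability")
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


# Made-up per-word probabilities for a four-word test text (illustration only).
# The word model assigns very low probability to rare word forms, while the
# class model spreads probability mass more evenly.
p_word = [0.10, 0.001, 0.05, 0.0005]
p_class = [0.05, 0.01, 0.04, 0.008]

mixed = [interpolate(w, c, lam=0.6) for w, c in zip(p_word, p_class)]

print("word model perplexity:  ", perplexity(p_word))
print("interpolated perplexity:", perplexity(mixed))
```

On these toy numbers the interpolated model has lower perplexity than the word model alone, because the class‐based component raises the probability of the rare word forms. The size of the effect here says nothing about the 13% figure reported above, which comes from the authors' experiments.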

Keywords: language models, n‐grams, class‐based models, morphology, inflections, interpolation, perplexity reduction, out‐of‐vocabulary words