<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
	<front>
		<journal-meta>
			<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
			<journal-title-group>
				<journal-title>Informatica</journal-title>
			</journal-title-group>
			<issn pub-type="epub">0868-4952</issn>
			<issn pub-type="ppub">0868-4952</issn>
			<publisher>
				<publisher-name>VU</publisher-name>
			</publisher>
		</journal-meta>
		<article-meta>
			<article-id pub-id-type="publisher-id">inf15409</article-id>
			<article-id pub-id-type="doi">10.15388/Informatica.2004.079</article-id>
			<article-categories>
				<subj-group subj-group-type="heading">
					<subject>Research article</subject>
				</subj-group>
			</article-categories>
			<title-group>
				<article-title>Statistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition</article-title>
			</title-group>
			<contrib-group>
				<contrib contrib-type="Author">
					<name>
						<surname>Vaičiūnas</surname>
						<given-names>Airenas</given-names>
					</name>
					<email xlink:href="mailto:airenas@freemail.lt">airenas@freemail.lt</email>
					<xref ref-type="aff" rid="j_INFORMATICA_aff_000"/>
				</contrib>
				<contrib contrib-type="Author">
					<name>
						<surname>Kaminskas</surname>
						<given-names>Vytautas</given-names>
					</name>
					<email xlink:href="mailto:V.Kaminskas@if.vdu.lt">V.Kaminskas@if.vdu.lt</email>
					<xref ref-type="aff" rid="j_INFORMATICA_aff_000"/>
				</contrib>
				<aff id="j_INFORMATICA_aff_000">Department of Applied Informatics, Vytautas Magnus University, Vileikos 8, LT‐3035 Kaunas, Lithuania</aff>
			</contrib-group>
			<contrib-group>
				<contrib contrib-type="Author">
					<name>
						<surname>Raškinis</surname>
						<given-names>Gailius</given-names>
					</name>
					<email xlink:href="mailto:idgara@vdu.lt">idgara@vdu.lt</email>
					<xref ref-type="aff" rid="j_INFORMATICA_aff_001"/>
				</contrib>
				<aff id="j_INFORMATICA_aff_001">Center of Computational Linguistics, Vytautas Magnus University, Donelaičio 52, LT‐3000 Kaunas, Lithuania</aff>
			</contrib-group>
			<pub-date pub-type="epub">
				<day>01</day>
				<month>01</month>
				<year>2004</year>
			</pub-date>
			<volume>15</volume>
			<issue>4</issue>
			<fpage>565</fpage>
			<lpage>580</lpage>
			<history>
				<date date-type="received">
					<day>01</day>
					<month>03</month>
					<year>2004</year>
				</date>
			</history>
			<abstract>
				<p>This paper describes our research on statistical language modeling of Lithuanian. We investigated whether sparse n‐gram models of Lithuanian, a highly inflected language, can be improved by interpolating them with more complex n‐gram models based on word clustering and morphological word decomposition. Words, word base forms and part‐of‐speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3‐gram and 4‐gram class‐based language models were built and evaluated on a Lithuanian text corpus of 85 million words. Class‐based models linearly interpolated with the 3‐gram model reduced perplexity by up to 13% compared with the baseline 3‐gram model. Morphological models decreased the out‐of‐vocabulary word rate from 1.5% to 1.02%.</p>
			</abstract>
			<kwd-group>
				<label>Keywords</label>
				<kwd>language models</kwd>
				<kwd>n‐grams</kwd>
				<kwd>class‐based models</kwd>
				<kwd>morphology</kwd>
				<kwd>inflections</kwd>
				<kwd>interpolation</kwd>
				<kwd>perplexity reduction</kwd>
				<kwd>out‐of‐vocabulary words</kwd>
			</kwd-group>
		</article-meta>
	</front>
</article>