<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
	<front>
		<journal-meta>
			<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
			<journal-title-group>
				<journal-title>Informatica</journal-title>
			</journal-title-group>
			<issn pub-type="epub">0868-4952</issn>
			<issn pub-type="ppub">0868-4952</issn>
			<publisher>
				<publisher-name>VU</publisher-name>
			</publisher>
		</journal-meta>
		<article-meta>
			<article-id pub-id-type="publisher-id">inf17109</article-id>
			<article-id pub-id-type="doi">10.15388/Informatica.2006.127</article-id>
			<article-categories>
				<subj-group subj-group-type="heading">
					<subject>Research article</subject>
				</subj-group>
			</article-categories>
			<title-group>
				<article-title>Cache-based Statistical Language Models of English and Highly Inflected Lithuanian</article-title>
			</title-group>
			<contrib-group>
				<contrib contrib-type="Author">
					<name>
						<surname>Vaičiūnas</surname>
						<given-names>Airenas</given-names>
					</name>
					<email xlink:href="mailto:airenas@freemail.lt">airenas@freemail.lt</email>
					<xref ref-type="aff" rid="j_INFORMATICA_aff_000"/>
				</contrib>
				<contrib contrib-type="Author">
					<name>
						<surname>Raškinis</surname>
						<given-names>Gailius</given-names>
					</name>
					<email xlink:href="mailto:g.raskinis@if.vdu.lt">g.raskinis@if.vdu.lt</email>
					<xref ref-type="aff" rid="j_INFORMATICA_aff_000"/>
				</contrib>
				<aff id="j_INFORMATICA_aff_000">Department of Applied Informatics, Vytautas Magnus University, Vileikos 8, LT-44404 Kaunas, Lithuania</aff>
			</contrib-group>
			<pub-date pub-type="epub">
				<day>01</day>
				<month>01</month>
				<year>2006</year>
			</pub-date>
			<volume>17</volume>
			<issue>1</issue>
			<fpage>111</fpage>
			<lpage>124</lpage>
			<history>
				<date date-type="received">
					<day>01</day>
					<month>08</month>
					<year>2005</year>
				</date>
			</history>
			<abstract>
				<p>This paper investigates a variety of statistical cache-based language models built upon three corpora: English, Lithuanian, and Lithuanian base forms. It studies how the cache size, the type of decay function (including custom corpus-derived functions), and the interpolation technique (static vs. dynamic) affect the perplexity of a language model. The best results are achieved by models consisting of three components, a standard 3-gram, a decaying cache 1-gram, and a decaying cache 2-gram, that are combined by linear interpolation with dynamic weight updates. Such a model yielded perplexity improvements of up to 36% and 43% over the 3-gram baseline for Lithuanian words and Lithuanian word base forms, respectively. The best language model of English yielded a perplexity improvement of up to 16%. This suggests that cache-based modeling is of greater utility for free-word-order, highly inflected languages.</p>
			</abstract>
			<kwd-group>
				<label>Keywords</label>
				<kwd>language models</kwd>
				<kwd>n-grams</kwd>
				<kwd>cache models</kwd>
				<kwd>dynamic interpolation</kwd>
				<kwd>perplexity reduction</kwd>
				<kwd>inflected language</kwd>
				<kwd>free word order language</kwd>
				<kwd>Lithuanian</kwd>
			</kwd-group>
		</article-meta>
	</front>
</article>