<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article"><front><journal-meta><journal-id journal-id-type="publisher-id">INFORMATICA</journal-id><journal-title-group><journal-title>Informatica</journal-title></journal-title-group><issn pub-type="epub">0868-4952</issn><issn pub-type="ppub">0868-4952</issn><publisher><publisher-name>VU</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">inf22204</article-id><article-id pub-id-type="doi">10.15388/Informatica.2011.323</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research article</subject></subj-group></article-categories><title-group><article-title>Semi-Automatic Bilingual Corpus Creation with Zero Entropy Alignments</article-title></title-group><contrib-group><contrib contrib-type="Author"><name><surname>Laukaitis</surname><given-names>Algirdas</given-names></name><email xlink:href="mailto:algirdas.laukaitis@fm.vgtu.lt">algirdas.laukaitis@fm.vgtu.lt</email><xref ref-type="aff" rid="j_INFORMATICA_aff_000"/></contrib><contrib contrib-type="Author"><name><surname>Vasilecas</surname><given-names>Olegas</given-names></name><email xlink:href="mailto:olegas@fm.vgtu.lt">olegas@fm.vgtu.lt</email><xref ref-type="aff" rid="j_INFORMATICA_aff_000"/></contrib><contrib contrib-type="Author"><name><surname>Laukaitis</surname><given-names>Ricardas</given-names></name><email xlink:href="mailto:ricardas.laukaitis@vva.lt">ricardas.laukaitis@vva.lt</email><xref ref-type="aff" rid="j_INFORMATICA_aff_001"/></contrib><contrib contrib-type="Author"><name><surname>Plikynas</surname><given-names>Darius</given-names></name><email xlink:href="mailto:darius.plikynas@vva.lt">darius.plikynas@vva.lt</email><xref ref-type="aff" rid="j_INFORMATICA_aff_001"/></contrib><aff 
id="j_INFORMATICA_aff_000">Fundamental Sciences Faculty, Vilnius Gediminas Technical University, Saulėtekio al. 11, LT-10223 Vilnius, Lithuania</aff><aff id="j_INFORMATICA_aff_001">Academy of Business and Management, Research Centre, Basanavičiaus 29A, LT-03109 Vilnius, Lithuania</aff></contrib-group><pub-date pub-type="epub"><day>01</day><month>01</month><year>2011</year></pub-date><volume>22</volume><issue>2</issue><fpage>203</fpage><lpage>224</lpage><history><date date-type="received"><day>01</day><month>10</month><year>2010</year></date><date date-type="accepted"><day>01</day><month>04</month><year>2011</year></date></history><abstract><p>In this paper, we describe a model for aligning books and documents from a bilingual corpus, with the goal of creating a “perfectly” aligned bilingual corpus at the word-to-word level. The presented algorithms differ from existing algorithms in that they take into account the presence of a human translator, whose involvement we try to minimize. We treat the human translator as an oracle who knows the exact alignments, and the goal of the system is to minimize the use of this oracle. The effectiveness of the oracle is measured by the speed at which the oracle can create a “perfectly” aligned bilingual corpus. By a “perfectly” aligned corpus we mean a zero-entropy corpus, because the oracle can make alignments without any probabilistic interpretation, i.e., with 100% confidence. Sentence-level and word-to-word alignments, although treated separately in this paper, are integrated in a single framework. For sentence-level alignments we provide a dynamic programming algorithm that achieves low precision and recall error rates. For word-to-word alignments, an Expectation Maximization algorithm that integrates linguistic dictionaries is suggested as the main tool for the oracle to build a “perfectly” aligned bilingual corpus.
We show empirically that the suggested pre-aligned corpus requires little interaction from the oracle and that a perfectly aligned corpus can be created at almost the speed of human reading. The presented algorithms are language independent, but in this paper we verify them on the English–Lithuanian language pair with two types of text: legal documents and fiction.</p></abstract><kwd-group><label>Keywords</label><kwd>Viterbi alignments</kwd><kwd>dynamic programming</kwd><kwd>string alignments</kwd><kwd>machine translation</kwd><kwd>natural language processing</kwd><kwd>rapid development</kwd><kwd>low-density languages</kwd></kwd-group></article-meta></front></article>