<?xml version="1.0" encoding="utf-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
<journal-title-group><journal-title>Informatica</journal-title></journal-title-group>
<issn pub-type="epub">1822-8844</issn><issn pub-type="ppub">0868-4952</issn><issn-l>0868-4952</issn-l>
<publisher>
<publisher-name>Vilnius University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">INFO1200</article-id>
<article-id pub-id-type="doi">10.15388/Informatica.2018.188</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Research Article</subject></subj-group></article-categories>
<title-group>
<article-title>Sentence Level Alignment of Digitized Books Parallel Corpora</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Laukaitis</surname><given-names>Algirdas</given-names></name><email xlink:href="algirdas.laukaitis@vgtu.lt">algirdas.laukaitis@vgtu.lt</email><xref ref-type="aff" rid="j_info1200_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref><bio>
<p><bold>A. Laukaitis</bold> graduated from the Vilnius University Faculty of Physics. He received his PhD degree from the Institute of Mathematics and Informatics, Vilnius. He is a professor in the Information Systems Department of Vilnius Gediminas Technical University. His research interests include text mining, natural language interfaces, machine translation systems and knowledge management.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Plikynas</surname><given-names>Darius</given-names></name><email xlink:href="darius.plikynas@mii.vu.lt">darius.plikynas@mii.vu.lt</email><xref ref-type="aff" rid="j_info1200_aff_002">2</xref><bio>
<p><bold>D. Plikynas</bold> is affiliated as professor, senior research fellow at the Institute of Data Science and Digital Technologies in Vilnius University. He is also affiliated as professor, chief research fellow at the Department of Business Technologies in Vilnius Gediminas Technical University. He has been involved in a number of EU and nationally financed research projects. He has published 2 monographs, 8 chapters in books, over 40 publications and over 50 conference papers. His main field of interest includes fundamental and applied research (modelling and simulation), covering interdisciplinary research domains in natural and social sciences, e.g. computational intelligence, agent based simulations, complexity research, social networks, distributed cognition.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Ostasius</surname><given-names>Egidijus</given-names></name><email xlink:href="egidijus.ostasius@vgtu.lt">egidijus.ostasius@vgtu.lt</email><xref ref-type="aff" rid="j_info1200_aff_001">1</xref><bio>
<p><bold>E. Ostasius</bold> is an associate professor of Vilnius Gediminas Technical University at the Faculty of Fundamental Sciences, Department of Information Technologies. He was awarded the candidate of mathematical sciences degree at Kaunas Polytechnic Institute in 1989 and the doctor of mathematical sciences degree in 1993. His research interests include analysis of business processes and e-services, modelling, and evaluation in public and commercial sectors, their applications, and related issues.</p></bio>
</contrib>
<aff id="j_info1200_aff_001"><label>1</label>Fundamental Science Faculty, <institution>Vilnius Gediminas Technical University</institution>, <country>Lithuania</country></aff>
<aff id="j_info1200_aff_002"><label>2</label>Institute of Data Science and Digital Technologies, <institution>Vilnius University</institution>, <country>Lithuania</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2018</year></pub-date><pub-date pub-type="epub"><day>1</day><month>1</month><year>2018</year></pub-date><volume>29</volume><issue>4</issue><fpage>693</fpage><lpage>710</lpage><history><date date-type="received"><month>9</month><year>2017</year></date><date date-type="accepted"><month>9</month><year>2018</year></date></history>
<permissions><copyright-statement>© 2018 Vilnius University</copyright-statement><copyright-year>2018</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>In this paper, we propose a framework for extracting translation memory from a corpus of fiction and non-fiction books. In recent years, there have been several proposals to align bilingual corpora and extract translation memory from legal and technical documents. Yet, when it comes to the alignment of a corpus of translated fiction and non-fiction books, the existing alignment algorithms give low-precision results. To solve this low-precision problem, we propose a new method that combines existing alignment algorithms with a proactive learning approach. We define several feature functions that are used to build two classifiers for text filtering and alignment. We report results on the English-Lithuanian language pair and on a bilingual corpus of 200 books. We demonstrate a significant improvement in alignment accuracy over currently available alignment systems.</p>
</abstract>
<kwd-group>
<label>Key words</label>
<kwd>alignment of corpora</kwd>
<kwd>alignment of digitized books</kwd>
<kwd>machine translation</kwd>
<kwd>natural language processing</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="j_info1200_s_001">
<label>1</label>
<title>Introduction</title>
<p>Translation memory extraction is the problem of extracting word and phrase translations from bilingual corpora. These translations are usually extracted from technical document parallel corpora, which can be easily aligned at the sentence level with low error rates. However, for most language pairs there is not enough technical document parallel data to build a high-quality translation memory database.</p>
<p>To extend translation memory databases, there have been numerous approaches to extracting parallel sentences from non-parallel monolingual corpora, such as news articles or web pages. While these methods have been applied to translation dictionary improvement, little attention has been given to corpora of translated books as a source of bilingual sentences. This is surprising, given that a corpus of translated books has far greater potential for extending translation memory than a non-parallel monolingual corpus.</p>
<p>As an example, we can look at the corpus of English-Lithuanian language pair that we created in the past few years. The corpus consists of two parts: 1.8 million sentences from the European Parliament proceedings and 2.1 million sentences from the corpus of 200 books. The domain range of the books’ corpus is more complex and its language is more expressive than the language from European Parliament proceedings. Yet, these 200 books represent only a small fraction of all the books that have translations in English and Lithuanian languages.</p>
<fig id="j_info1200_fig_001">
<label>Fig. 1</label>
<caption>
<p>Differences between technical documents’ and books’ alignment processes.</p>
</caption>
<graphic xlink:href="info1200_g001.jpg"/>
</fig>
<p>We can ask ourselves why so little research has been done on methods and algorithms for bilingual book corpus alignment. To answer this question we can look at Fig. <xref rid="j_info1200_fig_001">1</xref>, where we compare the alignment processes of a technical bilingual corpus and a corpus consisting of fiction and non-fiction books.</p>
<p>In the traditional technical document alignment process, we already have a well-formatted document pair and run an alignment procedure based on one of the sentence-length-based alignment algorithms. Usually we obtain high-quality alignments using the sentence length metric alone, without any additional preprocessing. By adding some additional lexical information, i.e. dictionaries, minor further improvements in alignment quality can be achieved.</p>
<p>On the other hand, we can see from Fig. <xref rid="j_info1200_fig_001">1</xref> that the book alignment process is more complex. There are at least four additional sources of extraneous strings (strings we need to detect and remove) when we try to align a book corpus. First, in some cases we must use an optical character recognition (OCR) process to get the text from a book. This process usually generates errors determined by the OCR software. Even with state-of-the-art OCR systems we can still get some errors. So, we usually prefer an alternative source to OCR systems, i.e. books in some digital format, e.g. <italic>pdf, epub, fb2, doc</italic>, etc.</p>
<p>The next process that can produce additional erroneous strings for alignment algorithms is the requirement to transform various document formats (e.g. <italic>pdf, epub, fb2, doc</italic>) into a single text format that can be used for the final alignment. Many file format converters are available, but for some books they produce transformation errors. We found that these errors are similar in type to the errors we get from the OCR process. Thus, we can tackle both transformation error types with the same classifier.</p>
<p>After we get a text file, we need to use filters to remove various page formatting marks and notes that were entered by the editor and the translator (e.g. translator notes at the end of a book page can be a significant source of errors). We would like to remove these insertions, and for this purpose we developed filters based on regular expressions.</p>
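To illustrate, such filters might look like the following (a minimal sketch; the patterns and names are illustrative examples of ours, not the exact filters used in the system):

```python
import re

# Illustrative patterns for common non-translatable insertions:
# bare page numbers, decorative separators, and footnote markers.
FILTER_PATTERNS = [
    re.compile(r"^\s*\d+\s*$"),        # a page number alone on a line
    re.compile(r"^\s*[-*_]{3,}\s*$"),  # horizontal rules / separators
    re.compile(r"^\s*\[\d+\]\s"),      # footnote markers such as "[3] ..."
]

def filter_lines(lines):
    """Keep only lines not matched by any filter pattern."""
    return [ln for ln in lines if not any(p.match(ln) for p in FILTER_PATTERNS)]
```

In practice each book format tends to need a few extra patterns of its own, so such a filter list grows as new books enter the pipeline.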
<p>Once we have the filtered text files, we still need a special algorithm for book alignment, because existing algorithms usually work well only on relatively small documents. We found that the major source of alignment errors at this stage of corpus processing is the very figurative translation of some books. Sometimes the translator skips several paragraphs simply because some passages in the book require substantial effort to translate. What we would like to do in this case is identify such omissions and mark them for exclusion from the alignment process.</p>
<p>These introductory remarks define the rest of the paper, which is organized as follows. In the next section, we review some related works and discuss a few open issues. In Section <xref rid="j_info1200_s_003">3</xref> we present the general architecture of the system for bilingual book corpus creation. Section <xref rid="j_info1200_s_004">4</xref> describes a statistical inference model for finding chapter, paragraph and sentence punctuation marks based on a conditional random field model. Section <xref rid="j_info1200_s_005">5</xref> describes filtering algorithms for correcting alignments. In Section <xref rid="j_info1200_s_006">6</xref> we describe the various elements of the user interaction model. Our view is that there is always a likelihood that the alignment algorithm can get stuck in a local minimum. In this case we suggest a proactive learning model to query for an alignment anchor point. Section <xref rid="j_info1200_s_007">7</xref> describes empirical tests of the suggested method. Finally, concluding remarks and future work are presented at the end of the paper.</p>
</sec>
<sec id="j_info1200_s_002">
<label>2</label>
<title>Related Work</title>
<p>Several techniques have been proposed to align corpora at the sentence level. Techniques that are unsupervised and language independent mainly use sentence length statistics to find relationships between sentences that are translations of each other. One of the first methods that used this approach was reported in Brown <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1200_ref_004">1991</xref>). It achieved an accuracy of 99% on a randomly selected set of 1000 sentence pairs from the English-French Hansard corpora. The method used only the number of tokens in each sentence, without any use of lexical details.</p>
<p>A similar approach but based on a simple statistical model of lengths in characters was proposed by Gale and Church (<xref ref-type="bibr" rid="j_info1200_ref_007">1993</xref>). Their suggested algorithm is based on a probabilistic score of length difference of the two sentences. In both papers dynamic programming techniques are applied to find the maximum likelihood alignment of bilingual sentences.</p>
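The core of this probabilistic length score can be sketched as follows (a simplified rendering of ours, using the model parameters published by Gale and Church; it is an illustration, not their exact implementation):

```python
import math

# Gale & Church (1993) model the target/source character-length ratio as
# approximately normal, with mean C and variance-per-character S2
# (their published estimates for European language pairs).
C, S2 = 1.0, 6.8

def length_cost(len_src, len_tgt):
    """Cost (negative log of a two-tailed probability) of aligning two
    text segments, based only on their lengths in characters."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    p = math.erfc(abs(delta) / math.sqrt(2.0))  # P(|N(0,1)| >= |delta|)
    return -math.log(p) if p > 0.0 else 1e9
```

In the full algorithm this per-pair cost, plus a prior over match types (1-1, 1-0, 2-1, …), is minimized over the whole document by dynamic programming.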
<p>While these length-based algorithms can achieve small error rates on literal translations, they are not robust with respect to translations where some sentences are skipped or merged. To overcome these challenges, many improvements have been suggested. Chen (<xref ref-type="bibr" rid="j_info1200_ref_006">1993</xref>) developed a sentence-level alignment algorithm that requires an externally supplied bilingual lexicon. This algorithm gave better accuracy than the length-based methods. Moore (<xref ref-type="bibr" rid="j_info1200_ref_012">2002</xref>) proposed a multi-pass search procedure that complements sentence-length-based statistics with the IBM Model-1 translation model (Brown <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1200_ref_005">1993</xref>). Varga <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1200_ref_019">2007</xref>) suggested a similar method; the main improvement over Moore's (<xref ref-type="bibr" rid="j_info1200_ref_012">2002</xref>) approach is the use of a word-by-word dictionary.</p>
<p>A more recent approach, reported in Braune and Fraser (<xref ref-type="bibr" rid="j_info1200_ref_003">2010</xref>), uses several alignment stages. In the first stage, alignments are computed similarly to the method reported in Moore (<xref ref-type="bibr" rid="j_info1200_ref_012">2002</xref>). The second stage obtains one-to-one alignments through dynamic programming, and a merge procedure is then used to obtain many-to-one alignments. The use of an automatic translation system to translate the source sentences for sentence alignment was proposed in Sennrich and Volk (<xref ref-type="bibr" rid="j_info1200_ref_014">2010</xref>). In this work, a map of one-to-one alignments is generated based on the BLEU metric. Various heuristics are then used to refine the one-to-one alignments and to add many-to-one and one-to-many alignments.</p>
<p>Nevertheless, we found that the precision and recall of existing algorithms are too low to be considered practical when it comes to aligning translated fiction books. As shown in Laukaitis and Vasilecas (<xref ref-type="bibr" rid="j_info1200_ref_010">2008</xref>), Laukaitis <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1200_ref_009">2011</xref>), the accuracy of these methods decreases drastically when we try to align text that contains discrepancies, e.g. book page layout segmentation strings, missing sentences and frequent one-to-many alignments.</p>
<p>Similar conclusions were drawn from a re-evaluation of state-of-the-art methods on large collections of publicly available novels in Xu <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1200_ref_020">2015</xref>). In this work, the authors used 24 English-French bilingual books and 17 English-Spanish bilingual books to evaluate their method, which is based on the maximum entropy approach (Berger <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1200_ref_002">1996</xref>). They found that by using existing algorithms and a two-stage approach one can get slightly better precision than with methods based on sentence length and lexical features. Our own research on the corpus of 200 books shows that the precision reported in Xu <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1200_ref_020">2015</xref>) is possible only if the books in the corpus do not contain additional strings, such as translator notes, headers, footers, etc.</p>
<p>We argue in this paper that in order to get high-quality alignments on a corpus of books we must allow a small number of interactions between the reader of the book and the machine learning algorithms. Recent research on computer-assisted translation has suggested several methods for improving translation through an iterative process in which the human translator interacts with a statistical machine translation system (Barrachina <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1200_ref_001">2009</xref>). We found that research in the areas of active learning and crowdsourcing can help to efficiently build queries for the book reader. Many works in the NLP area have investigated the active learning approach, for tasks such as text classification (McCallum and Nigam, <xref ref-type="bibr" rid="j_info1200_ref_011">1998</xref>; Tong and Koller, <xref ref-type="bibr" rid="j_info1200_ref_018">2001</xref>) and information extraction (Thompson <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1200_ref_017">1999</xref>; Settles and Craven, <xref ref-type="bibr" rid="j_info1200_ref_015">2008</xref>). In this paper we show how to adapt these active learning strategies to book alignment at the sentence level.</p>
</sec>
<sec id="j_info1200_s_003">
<label>3</label>
<title>General Framework</title>
<p>We start our presentation of the system by discussing the general framework of the alignment process. Many of the systems discussed in the related work section operate in two stages. Usually, they perform a sparse alignment using lexical features to find rare but highly probable anchor points. Then, these systems try to align the corpus between anchor points using a length- and lexicon-based approach. As mentioned above, such an approach does not work on a book corpus, and more processing stages are required.</p>
<fig id="j_info1200_fig_002">
<label>Fig. 2</label>
<caption>
<p>The general framework for book corpus alignments.</p>
</caption>
<graphic xlink:href="info1200_g002.jpg"/>
</fig>
<p>Therefore, here we present the general book alignment framework through a description of the processes that we developed to accomplish high-quality alignment. In Fig. <xref rid="j_info1200_fig_002">2</xref> we can see the UML activity diagram presenting the workflow in our alignment framework. The workflow is divided into two parts (swimlanes in the UML diagram): 1) the alignment system and 2) the interaction system. All activities that require human interaction are put in the ‘<italic>interaction</italic>’ swimlane. On the other hand, activities that are processed without reader intervention are put under the ‘<italic>alignment system</italic>’ swimlane.</p>
<p>We start with the activity ‘<italic>registration and metadata</italic>’. Its purpose is to put a pair of books into the processing pipeline. In our case this means that the reader puts the English language book in one file system directory and the Lithuanian language book into another. These books can be in any popular e-book format: <italic>pdf</italic>, <italic>mobi</italic>, <italic>epub</italic>, etc. Additionally, we encourage the reader to enter some metadata about the book. We try to extract that metadata from Wikipedia pages, but even if we extract it successfully we still require that the reader confirm that the extracted metadata is correct. This is done because we found that character, location or organization names can be very useful in finding high-quality anchor points.</p>
<p>Once we have put the books into the processing pipeline, we can proceed to transform each e-book from a format like <italic>pdf</italic>, <italic>mobi</italic> or <italic>epub</italic> into the text file format. There are plenty of free programs and web services that can perform this kind of activity. As an example, for <italic>pdf</italic> files we use the Apache PDFBox library. We found that in most cases it correctly extracts text from <italic>pdf</italic> files.</p>
<p>After the transformation process we start the alignment process by entering two mandatory anchor points. We require that the beginning and the end of the book have alignment anchor points confirmed by the reader. This means that usually we must trim all the text before the first chapter of the book and all the text after the last sentence of the last chapter. We must do this because some e-books (e.g. books from Project Gutenberg) contain a significant amount of supplementary information that is not related to the content of the book and that is placed either at the beginning or at the end.</p>
<p>Once we have these two anchor points, automatic alignment extraction can be started. Instead of using a two-stage approach, as many current systems do, we loop over several stages of alignment until final convergence by the alignment metric is achieved or until the reader decides that he is not willing to improve the alignments further. We start this loop with two alignment sub-processes: a process based on sentence length and then a process based on lexical feature functions. The length-based sub-process follows the Gale and Church (<xref ref-type="bibr" rid="j_info1200_ref_007">1993</xref>) approach. It produces an initial set of anchor points. We do not try to align all the sentences at once; instead we align groups of sentences within a predefined window. Then we try to align segments of text within each group of sentences using lexical feature functions. In all the works that we investigated, systems use the maximum entropy approach (Berger <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1200_ref_002">1996</xref>) for this stage. Nevertheless, we found that with many methods (such as Moore, <xref ref-type="bibr" rid="j_info1200_ref_012">2002</xref> and Varga <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1200_ref_019">2007</xref>) we get more than a 50% error rate when these two sub-processes finish.</p>
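The idea of aligning only between confirmed anchor points can be sketched as follows (an illustration of ours; `align_window` stands in for any pairwise aligner, e.g. a length-based or lexical one, applied to the smaller sub-problems):

```python
def align_between_anchors(src_sents, tgt_sents, anchors, align_window):
    """Align each stretch of text between consecutive trusted anchor points.

    `anchors` is a sorted list of (i, j) index pairs into src/tgt that are
    trusted alignments; `align_window` aligns two sentence lists and returns
    window-local (a, b) index pairs.
    """
    links = []
    bounds = [(0, 0)] + sorted(anchors) + [(len(src_sents), len(tgt_sents))]
    for (i0, j0), (i1, j1) in zip(bounds, bounds[1:]):
        window_links = align_window(src_sents[i0:i1], tgt_sents[j0:j1])
        # Re-offset window-local indices back into the full books.
        links.extend((i0 + a, j0 + b) for a, b in window_links)
    return links
```

Each confirmed anchor shrinks the remaining sub-problems, which is why even a handful of reader-supplied anchors can rescue an otherwise stuck alignment.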
<p>We added all the other sub-processes in order to achieve error rates that, at the end of alignment, are less than 5%. All these sub-processes start from three threads: conditional random fields (CRF) alignment, filtering and reporting.</p>
<p>Our CRF alignment algorithm is a supplementary component to the traditional alignment method. We developed it in an effort to improve alignment quality and to solve a problem related to correlation between feature functions. It is well known that the maximum entropy approach suffers from the label bias problem (Lafferty <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1200_ref_008">2001</xref>). To avoid this label bias problem, we label all alignments from the maximum entropy algorithm with a small set of labels. We learn this set of labels from a few manually aligned books using skip-chain conditional random fields (SCCRF). In the next section we present this algorithm in detail.</p>
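As an illustration of the kind of node features such a labelling model can use (the feature names and definitions below are examples of ours, not the exact feature set of the SCCRF model):

```python
def alignment_features(src, tgt, bilingual_dict):
    """Illustrative node features for one candidate sentence pair.

    `bilingual_dict` maps source words to sets of possible translations.
    """
    src_words = src.lower().split()
    tgt_words = set(tgt.lower().split())
    covered = sum(1 for w in src_words
                  if bilingual_dict.get(w, set()) & tgt_words)
    return {
        "len_ratio": len(tgt) / max(len(src), 1),          # character-length ratio
        "dict_coverage": covered / max(len(src_words), 1), # lexical overlap via dictionary
        "both_end_punct": float(src[-1:] == tgt[-1:]),     # matching final character
    }
```

Because a CRF conditions on the whole observation sequence, such features may be freely correlated with each other without harming the model.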
<p>The baseline sentence-length and lexical-association alignment sub-processes do not remove footnotes, headers, page numbers, etc. All these additional strings can lead to a complete misalignment of the corpus. For example, let us compare the English and Lithuanian translations of the novel ‘War and Peace’ by the Russian author Leo Tolstoy. In the original Russian text there are many text blocks written in French. The English translator decided to translate these French passages into English. The Lithuanian translator decided to leave them in French and add translations in footnotes. As a consequence, we got more than an 80% alignment error rate when we tried to align this novel's translations using existing alignment algorithms.</p>
<p>Therefore, we need to filter out all the elements that have no counterpart in the translated text. During this research project, we developed a text filtering method that helps us filter out more than 99% of the strings that have no translations. We found that after this filtering process even the existing methods gained a significant improvement in alignment quality. The output of this filtering process is a set of text segments that our alignment system suggests deleting from the books. Section <xref rid="j_info1200_s_005">5</xref> describes this process in more detail.</p>
<p>One of the keystones of the alignment model is the reader interaction component, which requires that a set of anchor points be generated after each alignment loop. Very often, due to figurative translation, a standard alignment algorithm can get stuck and be unable to find good alignments. Such cases call for natural language understanding techniques that are beyond the capabilities of currently available algorithms, and only a human can help to find a new anchor point from which the alignment algorithm can resume aligning using sentence length information and lexical entries.</p>
<p>Therefore, the system must ask for human help. Thus, to complement the automatic alignment loop (left arrow in Fig. <xref rid="j_info1200_fig_002">2</xref>) we create in parallel another loop for reader interaction. This parallel loop means that the reader can intervene in the alignment process at any time he wishes. The loop starts with a reporting activity, during which the alignment system produces a comprehensive and highly readable report about the current alignment progress. The reader can decide to investigate this report and help the automatic alignment system by manually entering some alignment anchors.</p>
<p>The question remains which anchor points are most useful for the automatic alignment procedure. We found that it is impractical to show the reader the full context of the book and the current alignment state and ask him to find the best alignment. A more intelligent report is required. For this we developed a proactive learning based approach, which we present in Section <xref rid="j_info1200_s_006">6</xref>.</p>
<p>This approach consists of alignment quality evaluation and query generation after each automatic alignment loop. The queries are sorted by their relevance (an information entropy based metric which we discuss in detail in Section <xref rid="j_info1200_s_006">6</xref>). It is up to the reader to decide which of them to answer. Usually we expect that he will answer at least one query, so that the system can evaluate the new information and realign the books accordingly.</p>
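A minimal sketch of such entropy-based ranking (an illustration, not the exact implementation; `label_probs` stands for the classifier's posterior over labels for one candidate anchor):

```python
import math

def query_priority(label_probs):
    """Shannon entropy of the label distribution for one candidate anchor;
    higher entropy means the model is less certain and the reader's answer
    is more informative."""
    return -sum(p * math.log(p) for p in label_probs if p > 0)

def rank_queries(candidates):
    """Sort (anchor, label_probs) pairs so the most uncertain come first."""
    return sorted(candidates, key=lambda c: query_priority(c[1]), reverse=True)
```

Under such a ranking, a candidate the classifier already labels with near-certainty sinks to the bottom of the report, so the reader's limited attention goes to genuinely ambiguous alignment points.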
</sec>
<sec id="j_info1200_s_004">
<label>4</label>
<title>Conditional Random Fields for Sentence Alignment</title>
<p>Linear-chain conditional random fields are undirected graphical models, trained to maximize the conditional probability of a sequence of labels (Lafferty <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1200_ref_008">2001</xref>). It is well known that the maximum entropy approach can suffer from the label bias problem (Lafferty <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1200_ref_008">2001</xref>), and CRF models can be used to solve it. In this section we present an algorithm that uses linear-chain conditional random field models to improve the sentence-level alignment of a bilingual corpus (we use the R package CRF for model training and decoding).</p>
<p>As mentioned in the previous section, in our case the CRF alignment algorithm is a supplementary component to the traditional alignment method. We do not apply it to find the alignment positions. Instead, we use it to label the alignments found by the algorithms described in Moore (<xref ref-type="bibr" rid="j_info1200_ref_012">2002</xref>) and Varga <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1200_ref_019">2007</xref>) with the labels ‘<italic>good</italic>’ and ‘<italic>bad</italic>’.</p>
<p>As an example, we can look at Fig. <xref rid="j_info1200_fig_003">3</xref>. The first and second columns show a few English and Lithuanian sentences from the novel ‘War and Peace’ by the Russian author Leo Tolstoy. These pairs of sentences represent alignments that have been verified by the reader and labelled as ‘good’. The third column, on the other hand, represents distorted alignments that have been misaligned intentionally by randomly shifting and joining some ‘good’ alignments. We label these as ‘bad’ alignments.</p>
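The generation of negative examples by random shifting can be sketched as follows (a simplified illustration of ours; the joining of segments, also used above, is omitted):

```python
import random

def make_negative_examples(good_pairs, max_shift=3, seed=0):
    """Create 'bad' training alignments from verified 'good' ones by
    randomly shifting the target side. Assumes at least two pairs."""
    rng = random.Random(seed)
    n = len(good_pairs)
    bad = []
    for i, (src, _) in enumerate(good_pairs):
        shift = rng.randint(1, min(max_shift, n - 1))
        j = (i + shift) % n          # pair the source with a wrong target
        bad.append((src, good_pairs[j][1]))
    return bad
```

Because the shift is bounded, the resulting negative pairs are near-misses rather than random noise, which makes them harder, and therefore more useful, training examples for the labelling model.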
<fig id="j_info1200_fig_003">
<label>Fig. 3</label>
<caption>
<p>Example of positive ‘<italic>good</italic>’ and negative ‘<italic>bad</italic>’ alignments for CRF model training. We randomly shift the target segments in order to get misaligned (negative) examples (third column in the table).</p>
</caption>
<graphic xlink:href="info1200_g003.jpg"/>
</fig>
<p>We start our analysis from the set of variables <inline-formula id="j_info1200_ineq_001"><alternatives><mml:math>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">a</mml:mi></mml:math><tex-math><![CDATA[$e,l,a$]]></tex-math></alternatives></inline-formula>. Variable <italic>e</italic> represents the English language sentences and variable <italic>l</italic> represents the translation sentences in the Lithuanian language. A hidden alignment variable <italic>a</italic> describes a mapping between sentence punctuation marks. The relationship between these variables in the statistical machine translation model is defined by
<disp-formula id="j_info1200_eq_001">
<label>(1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ P(e|l)=\sum \limits_{a}P(e,a|l).\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>In order to find the alignment variable <italic>a</italic> we must consider the optimization problem
<disp-formula id="j_info1200_eq_002">
<label>(2)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
<mml:mo>=</mml:mo>
<mml:mo movablelimits="false">arg</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo movablelimits="false">max</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \hat{a}=\arg \underset{a}{\max }P(e,a|l),\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_info1200_ineq_002"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\hat{a}$]]></tex-math></alternatives></inline-formula> is the alignment with the highest probability, called the Viterbi alignment. This generative model can be used to find word-level alignments when a parallel corpus of at least several million sentences is available. But, as we already mentioned, for the books corpus this model does not deliver the required recall and precision. In order to align books on sentence punctuation marks we can use maximum entropy models. These discriminative models use conditional probabilities and relevant feature functions. The drawback of the maximum entropy model (Berger <italic>et al.</italic>, <xref ref-type="bibr" rid="j_info1200_ref_002">1996</xref>) is that it makes decisions at each alignment point independently.</p>
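<p>The marginalization in Eq. (1) and the Viterbi choice in Eq. (2) can be sketched as follows; the probability table below is invented toy data for illustration, not values from our corpus.</p>

```python
# Sketch of Eqs. (1) and (2): marginalize P(e, a | l) over the hidden
# alignments a, and pick the Viterbi (highest-probability) alignment.

def marginal_and_viterbi(p_e_a_given_l):
    """p_e_a_given_l maps an alignment a -> P(e, a | l)."""
    p_e_given_l = sum(p_e_a_given_l.values())          # Eq. (1): marginal
    a_hat = max(p_e_a_given_l, key=p_e_a_given_l.get)  # Eq. (2): Viterbi
    return p_e_given_l, a_hat

# Toy distribution over three candidate punctuation-mark alignments.
toy = {"a1": 0.05, "a2": 0.60, "a3": 0.10}
p_marginal, a_hat = marginal_and_viterbi(toy)   # a_hat is "a2"
```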
<p>Nevertheless, we found that the alignment precision of the maximum entropy model can be greatly improved if we align relatively small fragments between fixed anchor points in the books’ corpus. That is why we decided to use CRF models to label alignments and then to use ‘good’ alignments as anchor points in the next alignment iteration.</p>
<p>The linear-chain conditional random field model lets us use correlated feature functions, avoids the label bias problem, and models the dependencies between labels. Usually, linear-chain conditional random fields are defined on a string sequence <italic>x</italic> with a label sequence <italic>y</italic> under a first-order Markov assumption over the sequence 
<disp-formula id="j_info1200_eq_003">
<label>(3)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">λ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mo movablelimits="false">exp</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo movablelimits="false">exp</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ P(y|x,\lambda )=\frac{\exp {\textstyle\textstyle\sum _{t=1}^{T}}{\textstyle\textstyle\sum _{s=1}^{S}}{\lambda _{s}}{\xi _{s}}({y_{t-1}},{y_{t}},{x_{t}})}{{\textstyle\sum _{y}}\exp {\textstyle\textstyle\sum _{t=1}^{T}}{\textstyle\textstyle\sum _{s=1}^{S}}{\lambda _{s}}{\xi _{s}}({y_{t-1}},{y_{t}},{x_{t}})}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
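<p>For a sequence short enough to enumerate, the conditional probability in Eq. (3) can be evaluated by brute force. The labels, the single feature function and its weight below are invented for illustration only.</p>

```python
import itertools
import math

# Brute-force evaluation of the linear-chain CRF of Eq. (3) on a tiny example:
# exponentiate the weighted feature sums and normalize over all label sequences.

LABELS = ["good", "bad"]

def score(y, x, weights, features):
    # Unnormalized log-score: sum_t sum_s lambda_s * xi_s(y_{t-1}, y_t, x_t).
    s = 0.0
    for t in range(1, len(y)):
        for lam, xi in zip(weights, features):
            s += lam * xi(y[t - 1], y[t], x[t])
    return s

def crf_prob(y, x, weights, features):
    num = math.exp(score(y, x, weights, features))
    z = sum(math.exp(score(list(yy), x, weights, features))
            for yy in itertools.product(LABELS, repeat=len(y)))
    return num / z

# One toy feature: reward keeping the same label at adjacent positions.
features = [lambda yp, yt, xt: 1.0 if yp == yt else 0.0]
weights = [2.0]
x = ["p1", "p2", "p3"]                      # three alignment positions
p = crf_prob(["good", "good", "good"], x, weights, features)
```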
<p>Variables <italic>x</italic> represent all sentence pairs in the English–Lithuanian book corpus, i.e. <inline-formula id="j_info1200_ineq_003"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(e,l)$]]></tex-math></alternatives></inline-formula>. <inline-formula id="j_info1200_ineq_004"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\xi _{s}}$]]></tex-math></alternatives></inline-formula> are the feature functions that model relevant information about a particular alignment position. Theoretically, it is possible to use the values of the alignment index <italic>a</italic> as the labels represented by variable <italic>y</italic>. However, such an approach would require much more data than is available for learning. Thus, as we already mentioned above, we use the CRF model to label alignments that are generated by alignment generation processes based on maximum entropy models, sentence length and lexical properties. These labels are represented by variable <italic>y</italic> in Eq. (<xref rid="j_info1200_eq_003">3</xref>).</p>
<p>The training of the model (<xref rid="j_info1200_eq_003">3</xref>) is done using two sets of labels. The first set consists of the labels that we already mentioned above:</p>
<list>
<list-item id="j_info1200_li_001">
<label>•</label>
<p>‘good’ – a good alignment, i.e. other alignment processes can use this information as a new alignment feature.</p>
</list-item>
<list-item id="j_info1200_li_002">
<label>•</label>
<p>‘bad’ – a bad alignment, i.e. other alignment processes can use this information to discourage such alignments from appearing in the next alignment iteration.</p>
</list-item>
</list>
<p>We can use an additional label set after we get <inline-formula id="j_info1200_ineq_005"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\hat{a}$]]></tex-math></alternatives></inline-formula> (Eq. (<xref rid="j_info1200_eq_002">2</xref>)) from the maximum entropy alignment algorithms and tag the resulting alignments with the labels ‘good’ or ‘bad’. These additional labels are:</p>
<list>
<list-item id="j_info1200_li_003">
<label>•</label>
<p>‘chapter’ – a tag for some ‘good’ labels, intended to find chapter names in a book.</p>
</list-item>
<list-item id="j_info1200_li_004">
<label>•</label>
<p>‘move left’ – a tag for some ‘bad’ labels; it suggests moving the target position to the left in order to improve alignments.</p>
</list-item>
<list-item id="j_info1200_li_005">
<label>•</label>
<p>‘move right’ – a tag for some ‘bad’ labels; it suggests moving the target position to the right in order to improve alignments.</p>
</list-item>
</list>
<p>Models like CRFs are appealing because they reduce each problem to the task of finding a feature set that satisfactorily represents the problem at hand. Next, we describe the set of feature functions that we used in order to label alignments.</p>
<p>1. The set of regular expressions that matches possible marks for book chapter (e.g. string pairs like ‘IX’–‘IX’, ‘Chapter 9’–‘IX’, etc.).</p>
<p>2. Paragraph indicators. We defined an indicator function that evaluates paragraph position likelihood.</p>
<p>3. Orthographic features. These are the features that measure how well sentences <italic>e</italic> and <italic>l</italic> match in terms of string overlap.</p>
<list>
<list-item id="j_info1200_li_006">
<label>•</label>
<p>Punctuation marks. We say that alignments with matching punctuation marks like ‘?’, ‘!’, ‘)’, etc., are more likely to be ‘good’. These feature functions can be defined as <italic>p</italic>(<italic>alignment label</italic> = ‘<italic>good</italic>’ | <italic>matched punctuation mark</italic>).</p>
</list-item>
<list-item id="j_info1200_li_007">
<label>•</label>
<p>Capitalization. Feature function returns 1 if both sentences after alignment position <inline-formula id="j_info1200_ineq_006"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${a_{i}}$]]></tex-math></alternatives></inline-formula> start with capital letters.</p>
</list-item>
<list-item id="j_info1200_li_008">
<label>•</label>
<p>Utterance unit marks. We add an additional score to an alignment if the sentences after the alignment position start with marks that indicate an utterance, e.g. ‘"’ marks the beginning of speech in English and ‘–’ does so in Lithuanian.</p>
</list-item>
</list>
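<p>The orthographic features of item 3 can be sketched as simple indicator functions. The punctuation set, the end-of-sentence convention and the utterance marks below are our assumptions for illustration, not the exact definitions used in the system.</p>

```python
# Sketch of the orthographic features: punctuation-mark match, capitalization,
# and utterance unit marks. Each returns 1 when the cue supports a 'good' label.

PUNCT = set("?!)(;:")

def punctuation_match(e_sentence, l_sentence):
    # 1 if both sentences end with the same punctuation mark from PUNCT.
    return int(e_sentence[-1:] == l_sentence[-1:] and e_sentence[-1:] in PUNCT)

def capitalization(e_next, l_next):
    # 1 if both sentences after the alignment position start with a capital.
    return int(e_next[:1].isupper() and l_next[:1].isupper())

def utterance_marks(e_next, l_next):
    # 1 if speech markers line up: '"' in English vs. '–' in Lithuanian.
    return int(e_next.startswith('"') and l_next.startswith('–'))
```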
<p>4. Bilingual dictionary. Dictionaries are an important source of information about sentence alignment position. We set a predefined window of width ‘w’ (measured in number of words) around alignment position <italic>a</italic> and count how many words in this window are translations of each other according to the translation dictionary.</p>
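<p>The dictionary feature can be sketched as follows; the tiny dictionary and the default window width are invented for illustration.</p>

```python
# Sketch of the bilingual dictionary feature: count English words inside a
# window of w words around the alignment position whose dictionary translation
# occurs inside the corresponding Lithuanian window.

def dictionary_feature(e_words, l_words, e_pos, l_pos, dictionary, w=5):
    e_win = e_words[max(0, e_pos - w): e_pos + w]
    l_win = set(l_words[max(0, l_pos - w): l_pos + w])
    return sum(1 for word in e_win
               if any(t in l_win for t in dictionary.get(word.lower(), ())))

# Toy dictionary and a toy sentence pair around an alignment position.
d = {"house": ["namas"], "white": ["baltas"]}
n = dictionary_feature(["the", "white", "house"], ["baltas", "namas"], 1, 0, d)
```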
<p>5. Named entities match. A named entity recognition component indicates whether there are named entities around alignment position <italic>a</italic>. We require that at least a 3-character sub-string of a named entity matches in the language pair.</p>
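<p>The sub-string rule of item 5 can be sketched as below; this is our reading of the 3-character requirement, and the NER component itself is outside the sketch.</p>

```python
# Sketch of the named-entity match: two surface forms count as matched when
# they share a case-insensitive substring of at least min_len characters.

def entity_match(e_entity, l_entity, min_len=3):
    e_entity, l_entity = e_entity.lower(), l_entity.lower()
    for i in range(len(e_entity) - min_len + 1):
        if e_entity[i:i + min_len] in l_entity:
            return True
    return False
```

This handles Lithuanian inflection of foreign names, e.g. ‘London’ vs. ‘Londonas’ still share the sub-string ‘lon’.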
<p>6. Phrase match. The same as the word dictionary match. Usually we have a ‘good’ alignment if there is a long phrase match on both sides of the translation.</p>
<p>7. Length-based match. We increase the alignment score if the lengths of the sentences between sequential alignment points are similar.</p>
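<p>A minimal form of the length-based match feature, assuming lengths are compared as a ratio (one of several possible choices):</p>

```python
# Sketch of the length-based feature: score the spans between two sequential
# alignment points by how similar their lengths (in characters or words) are.

def length_feature(e_len, l_len):
    # Ratio in (0, 1]; 1.0 means equal span lengths between anchor points.
    return min(e_len, l_len) / max(e_len, l_len)
```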
</sec>
<sec id="j_info1200_s_005">
<label>5</label>
<title>CRF Model for Text Filtering</title>
<p>The purpose of the filtering algorithm is to find segments of text in a book and to remove them if they are part of book headers, page numbers, footnotes, etc. These formatting strings are the first type of strings that must be removed in order to get high quality alignments of the bilingual books’ corpus. Inaccuracies of translation, i.e. an overly figurative translation or the omission of some sentences from the text, define the second type of strings that we must consider for removal.</p>
<p>For the first type of strings we developed an algorithm that is based on a set of regular expressions that selects fragments of a text and marks them for removal. Table <xref rid="j_info1200_tab_001">1</xref> presents a few examples of such regular expressions.</p>
<table-wrap id="j_info1200_tab_001">
<label>Table 1</label>
<caption>
<p>Example set of regular expressions used to remove some fragments of text.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Regular expression</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Description</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">1</td>
<td style="vertical-align: top; text-align: left"><monospace>^[ ]{0,2}(\d){1,3}[ \d]{0,3}$</monospace></td>
<td style="vertical-align: top; text-align: left">A line up to 8 characters long that consists only of digits and spaces</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">2</td>
<td style="vertical-align: top; text-align: left"><monospace>(\d){1,3}[ \d]*$</monospace></td>
<td style="vertical-align: top; text-align: left">Select all digits at the end of a line (we found that sometimes PDF file converters add page numbers at the end of a text line instead of inserting them into a new line)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">3</td>
<td style="vertical-align: top; text-align: left"><monospace>(\p{L})\^\~(\p{L})</monospace></td>
<td style="vertical-align: top; text-align: left">Matches two letters that have <monospace>‘^~’</monospace> string between them</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">4</td>
<td style="vertical-align: top; text-align: left"><monospace>\r\n\r\n\r\n</monospace></td>
<td style="vertical-align: top; text-align: left">Three sequential carriage-return/line-feed pairs</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">5</td>
<td style="vertical-align: top; text-align: left"><monospace>^.{1,30}[A-Z][A-Z].{1,30}$</monospace></td>
<td style="vertical-align: top; text-align: left">At least two consecutive upper-case letters in a line of limited length</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">6</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><monospace>^[\*].+$</monospace></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Select a line that starts with an asterisk (useful for footnote detection)</td>
</tr>
</tbody>
</table>
</table-wrap>
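<p>Applying such regular expressions can be sketched as follows, using rows 1 and 6 of Table 1; the book lines are invented, and the actual keep/delete decision is made by the CRF model described below, not by the regexes alone.</p>

```python
import re

# Flag candidate lines for removal using Table-1-style regular expressions.
PATTERNS = [
    re.compile(r"^[ ]{0,2}(\d){1,3}[ \d]{0,3}$"),  # page-number-like line
    re.compile(r"^[\*].+$"),                       # footnote starting with '*'
]

def flag_lines(lines):
    return [i for i, line in enumerate(lines)
            if any(p.match(line) for p in PATTERNS)]

book = ["CHAPTER I", "It was a bright day.", " 42", "* Translator's note."]
flagged = flag_lines(book)   # lines 2 and 3 become removal hypotheses
```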
<p>We tried several statistical models to define the probability that a string matched by a regular expression must be removed. Strings such as page headers can appear periodically; for these strings the linear-chain CRF model can be used. On the other hand, we found that the skip-chain CRF model (Sutton and McCallum, <xref ref-type="bibr" rid="j_info1200_ref_016">2006</xref>) is better for labelling string sequences when we need to model long-distance dependencies. For example, translation footnotes may appear only a few times in a book, and the skip-chain CRF model can capture these long-distance dependencies. Another example of a long-distance dependency is a chapter name that is printed on every second page in the page header area. We would like to keep the chapter name at the beginning of a chapter but remove all subsequent headers. Thus, the probability of a label sequence is modelled as 
<disp-formula id="j_info1200_eq_004">
<label>(4)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">λ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mo movablelimits="false">exp</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo movablelimits="false">exp</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ P(y|x,\lambda )=\frac{\exp \big({\textstyle\textstyle\sum _{t=1}^{T}}{\textstyle\textstyle\sum _{s=1}^{S}}{\lambda _{s}}{\xi _{s}}({y_{t-1}},{y_{t}},x)+{\textstyle\textstyle\sum _{u,v}^{T}}{\textstyle\textstyle\sum _{s=1}^{S}}{\lambda _{s}}{\xi _{s}}({y_{u}},{y_{v}},x)\big)}{{\textstyle\sum _{y}}\exp {\textstyle\textstyle\sum _{t=1}^{T}}{\textstyle\textstyle\sum _{s=1}^{S}}{\lambda _{s}}{\xi _{s}}({y_{t-1}},{y_{t}},x)}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>We can see that the difference between a linear-chain CRF model and a skip-chain CRF model is that we use the term <inline-formula id="j_info1200_ineq_007"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">S</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">v</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:math><tex-math><![CDATA[${\textstyle\sum _{u,v}^{T}}{\textstyle\sum _{s=1}^{S}}{\lambda _{s}}{\xi _{s}}({y_{u}},{y_{v}},x)$]]></tex-math></alternatives></inline-formula> in a skip-chain CRF to model long-distance edges between text segments that were matched by the same regular expression. For the filtering sequence <italic>y</italic> in the skip-chain CRF model we define only three labels: ‘keep’, ‘delete’ and ‘review’. The labels ‘keep’ and ‘delete’ recommend either keeping a string in the text or deleting it. The label ‘review’ means that only part of the string must be removed.</p>
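<p>The extra skip-chain term of Eq. (4) can be sketched as an unnormalized score: on top of the linear-chain transitions, edges <italic>(u, v)</italic> connect distant positions matched by the same regular expression. The labels, weights and skip edge below are toy data for illustration.</p>

```python
# Unnormalized skip-chain CRF score: linear-chain term over adjacent labels
# plus a skip term over long-distance edges (u, v). The observation x is
# unused by these toy indicator features.

def skip_chain_score(y, x, skip_edges, lam_chain=1.0, lam_skip=1.0):
    s = 0.0
    for t in range(1, len(y)):                  # linear-chain term
        s += lam_chain * (y[t - 1] == y[t])
    for u, v in skip_edges:                     # skip-chain term
        s += lam_skip * (y[u] == y[v])          # distant labels should agree
    return s

# Positions 0 and 4 were matched by the same header regex -> skip edge (0, 4),
# which rewards labelling both occurrences 'delete' consistently.
y = ["delete", "keep", "keep", "keep", "delete"]
s = skip_chain_score(y, None, skip_edges=[(0, 4)])
```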
<p>There are two stages of the filtering process at which we try to detect these text segments. At the first stage we use only monolingual feature functions. If we fail to detect text segments that we want to remove at the first stage, they may still be detected at the second stage, where we use bilingual features (the same subset as for the alignment algorithm described in the previous section). We used the following set of monolingual feature functions in our text removal process.</p>
<p>1. Probability <inline-formula id="j_info1200_ineq_008"><alternatives><mml:math>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$P({r_{i}})$]]></tex-math></alternatives></inline-formula>. Each regular expression <italic>i</italic> has its own prior probability, which can be interpreted as the probability that a string it matches in a book must be deleted. As a simple example, consider the regular expression <monospace>^(\d){1,3}$</monospace>. It matches a line that consists of a number of up to three digits. Usually such a line will be a page number in a book, but occasionally it can be something else that we want to keep. Thus, the initial probability that we want to delete content matched by this regular expression was set to 0.95. The coefficient <inline-formula id="j_info1200_ineq_009"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\lambda _{i}}$]]></tex-math></alternatives></inline-formula> for the feature function that represents this regular expression was set to 0.05.</p>
<p>2. Indicator functions that mark whether there are tokens that appear periodically in the text within a window of fixed width.</p>
<p>3. Line length. An indicator function that matches lines shorter than half the average line length in the book.</p>
<p>4. The increase in the alignment probability (<xref rid="j_info1200_eq_003">3</xref>) for the alignment label ‘good’ when we remove a matched text segment.</p>
<p>Finally, we present an algorithm that uses expression (<xref rid="j_info1200_eq_004">4</xref>) and is capable of filtering out strings that can be interpreted as page formatting strings for the bilingual book corpus alignment procedure. The input of the algorithm is a pair of books. The output is a set of strings that the algorithm recommends removing before the final alignment stage.</p>
<fig id="j_info1200_fig_004">
<label>Algorithm 1</label>
<caption>
<p>Filtering of noisy text segments</p>
</caption>
<graphic xlink:href="info1200_g004.jpg"/>
</fig>
<p>The first loop of the algorithm iterates through each possible text segment matched by the regular expressions and puts these segments into the set <italic>U</italic>. The second loop takes any two segments of the set <italic>U</italic> and creates a new segment that is the intersection of these two segments. If this intersection is not empty, the new segment is put into the set <italic>U</italic>. After these two loops, the set <italic>U</italic> contains all hypotheses for text filtering.</p>
<p>The last loop iterates through all the text segments <italic>s</italic> in the set <italic>U</italic> and calculates:</p>
<list>
<list-item id="j_info1200_li_009">
<label>•</label>
<p>The probabilities (<xref rid="j_info1200_eq_003">3</xref>) of ‘good’ alignments with the text segment <italic>s</italic> removed and when it is not removed.</p>
</list-item>
<list-item id="j_info1200_li_010">
<label>•</label>
<p>These probabilities are used as feature functions in assigning labels to segment <italic>s</italic> using (<xref rid="j_info1200_eq_004">4</xref>).</p>
</list-item>
</list>
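<p>The first two loops of Algorithm 1 can be sketched as follows, representing segments as <italic>(start, end)</italic> character spans; the spans are toy data, and the final CRF scoring loop is omitted.</p>

```python
# Build the hypothesis set U: collect regex-matched segments as (start, end)
# spans, then add every non-empty pairwise intersection back into U.

def build_hypothesis_set(matched_spans):
    U = set(matched_spans)
    for (a1, b1) in list(U):
        for (a2, b2) in list(U):
            lo, hi = max(a1, a2), min(b1, b2)
            if (a1, b1) != (a2, b2) and lo < hi:   # non-empty intersection
                U.add((lo, hi))
    return U

U = build_hypothesis_set([(0, 10), (5, 15)])
# The intersection (5, 10) joins the two original segments in U.
```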
</sec>
<sec id="j_info1200_s_006">
<label>6</label>
<title>Proactive Learning</title>
<p>In the previous section we presented the algorithms for filtering and automatic alignment of translated books’ corpora. There is always a possibility that filtering and alignment processes will not achieve the required precision due to errors that appear during the translation and transformation of the original book. In order to correct these discrepancies we need to fully understand the world that is presented by the book author. Currently this full understanding of a text is beyond computer capabilities. Thus in our framework the alignment program can ask the reader for help by presenting several types of queries:</p>
<list>
<list-item id="j_info1200_li_011">
<label>1.</label>
<p>Alignment anchor point. The program can ask the reader to point out two positions in the text, one in <italic>e</italic> and one in <italic>l</italic>, that can be treated as an alignment.</p>
</list-item>
<list-item id="j_info1200_li_012">
<label>2.</label>
<p>Confirm filter decision to delete a sequence of text segments that have been matched by regular expressions.</p>
</list-item>
<list-item id="j_info1200_li_013">
<label>3.</label>
<p>Confirm filter decision to delete sentences from a book because they do not have translation equivalents.</p>
</list-item>
</list>
<p>The starting point for all these questions is how to formulate a comprehensive metric for query selection. One of the most common general methods for measuring informativeness of a query is information entropy. For a discrete random variable <italic>X</italic>, the information entropy is defined as: <inline-formula id="j_info1200_ineq_010"><alternatives><mml:math>
<mml:mi mathvariant="italic">H</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo movablelimits="false">log</mml:mo>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$H(X)=-{\textstyle\sum _{i}}P({x_{i}})\log P({x_{i}})$]]></tex-math></alternatives></inline-formula>.</p>
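As a minimal illustration, the entropy definition above can be computed directly; the following Python sketch uses hypothetical probability values, not data from our corpus:

```python
import math

def entropy(probs):
    """Information entropy H(X) = -sum_i P(x_i) log P(x_i).

    Zero-probability outcomes contribute nothing to the sum,
    since p * log(p) tends to 0 as p tends to 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A uniform distribution maximizes entropy; a peaked one minimizes it,
# which is why entropy is a natural measure of query informativeness.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
assert entropy(uniform) > entropy(peaked)
```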
<p>We can use entropy as the alignment informativeness <inline-formula id="j_info1200_ineq_011"><alternatives><mml:math>
<mml:mi mathvariant="italic">ϕ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\phi (a)$]]></tex-math></alternatives></inline-formula> as follows. Let <inline-formula id="j_info1200_ineq_012"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">˜</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\tilde{a}$]]></tex-math></alternatives></inline-formula> be the most informative alignment in the pool of all alignment hypotheses obtained after applying the standard alignment procedure. We choose an alignment query strategy <inline-formula id="j_info1200_ineq_013"><alternatives><mml:math>
<mml:mi mathvariant="italic">ϕ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\phi (a)$]]></tex-math></alternatives></inline-formula>, which is a function used to evaluate each alignment <italic>a</italic> in the alignment hypothesis space <italic>A</italic>. 
<disp-formula id="j_info1200_eq_005">
<label>(5)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">ϕ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>−</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover>
</mml:mrow>
</mml:munder>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo movablelimits="false">log</mml:mo>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \phi (e,l)=-\sum \limits_{\bar{a}}P(\bar{a}|e,l)\log P(\bar{a}|e,l),\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_info1200_ineq_014"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\bar{a}$]]></tex-math></alternatives></inline-formula> ranges over all possible alignments between sentences in a book from a bilingual corpus.</p>
<p>Then, the most informative alignment <inline-formula id="j_info1200_ineq_015"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">˜</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\tilde{a}$]]></tex-math></alternatives></inline-formula> will be expressed as: 
<disp-formula id="j_info1200_eq_006">
<label>(6)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">˜</mml:mo></mml:mover>
<mml:mo>=</mml:mo>
<mml:mo movablelimits="false">arg</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo movablelimits="false">max</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mi mathvariant="italic">ϕ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \tilde{a}=\arg \underset{a\in A}{\max }\phi (e,l).\]]]></tex-math></alternatives>
</disp-formula>
</p>
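Under Eqs. (5) and (6), query selection reduces to scoring each candidate by the entropy of its alignment posterior and taking the argmax. A hedged Python sketch follows; the posterior distributions are hypothetical placeholders, not output of the actual alignment model:

```python
import math

def informativeness(posterior):
    """phi(e, l): entropy of the alignment posterior P(a|e, l), as in Eq. (5)."""
    return -sum(p * math.log(p) for p in posterior if p > 0.0)

def most_informative(candidates):
    """Eq. (6): pick the candidate whose posterior is most uncertain.

    `candidates` maps a candidate identifier to its posterior distribution
    over possible alignments (hypothetical example data)."""
    return max(candidates, key=lambda a: informativeness(candidates[a]))

candidates = {
    "a1": [0.9, 0.05, 0.05],  # the model is fairly sure: low entropy
    "a2": [0.4, 0.3, 0.3],    # the model is uncertain: an informative query
}
assert most_informative(candidates) == "a2"
```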
<p>For the alignment of books, however, we must modify this entropy-based query selection metric in order to obtain a better decision-making process. To understand the problem with entropy as the metric of informativeness, imagine a completely misaligned chapter between the two books. The maximum entropy of alignments will likely occur somewhere in the middle of the chapter, where the uncertainty about alignments is highest. Nevertheless, it will be difficult for the reader to mark a ‘good’ alignment there, because doing so would certainly require reading through the whole chapter in both languages. We therefore need to select queries that are not only highly informative but also convenient for the reader.</p>
<p>Several possible modifications are suggested in Settles and Craven (<xref ref-type="bibr" rid="j_info1200_ref_015">2008</xref>), whose study showed that in many situations one of the best metrics is the information density metric: 
<disp-formula id="j_info1200_eq_007">
<label>(7)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">ϕ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">ϕ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>×</mml:mo>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">(</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">U</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi mathvariant="italic">sim</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {\phi ^{ID}}(e,l)=\phi (e,l)\times \bigg(\frac{1}{U}{\sum \limits_{u=1}^{U}}\mathit{sim}{\big(el,e{l^{(u)}}\big)^{\beta }}\bigg).\]]]></tex-math></alternatives>
</disp-formula>
</p>
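Equations (7) and (8) can be sketched together: the entropy score is weighted by the average cosine similarity of the alignment’s feature vector to the other ‘good’ alignments on the page, raised to the power <italic>β</italic>. The feature vectors below are hypothetical illustrations:

```python
import math

def cosine_sim(u, v):
    """Eq. (8): cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def information_density(phi, el_vec, good_vecs, beta=1.0):
    """Eq. (7): entropy phi(e, l) weighted by the average similarity of
    the alignment's vector to all 'good' alignment vectors on the page."""
    density = sum(cosine_sim(el_vec, u) ** beta for u in good_vecs) / len(good_vecs)
    return phi * density

# Hypothetical 'good' alignment feature vectors for one page.
good = [[1.0, 2.0, 0.0], [0.5, 1.5, 0.2]]
score = information_density(phi=0.8, el_vec=[0.9, 1.8, 0.1], good_vecs=good)
# For non-negative vectors the density term lies in [0, 1], so the
# weighted score never exceeds the raw entropy phi.
assert 0.0 < score <= 0.8
```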
<p>Here <italic>U</italic> is the set of ‘good’ alignments on one page (excluding the alignment <inline-formula id="j_info1200_ineq_016"><alternatives><mml:math>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi></mml:math><tex-math><![CDATA[$el$]]></tex-math></alternatives></inline-formula>), where the page size is determined by the user-interface software, and <inline-formula id="j_info1200_ineq_017"><alternatives><mml:math>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$e{l^{(u)}}$]]></tex-math></alternatives></inline-formula> is an alignment from <italic>U</italic>. From Eq. (<xref rid="j_info1200_eq_007">7</xref>) we see that the informativeness of the alignment between <italic>e</italic> and <italic>l</italic> is weighted by its average similarity to all other ‘good’ alignments in <italic>U</italic>. The parameter <italic>β</italic> controls the relative importance of the density term. The similarity function <inline-formula id="j_info1200_ineq_018"><alternatives><mml:math>
<mml:mi mathvariant="italic">sim</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathit{sim}()$]]></tex-math></alternatives></inline-formula> is defined as the cosine similarity between two vectors: 
<disp-formula id="j_info1200_eq_008">
<label>(8)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">sim</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">→</mml:mo></mml:mover>
<mml:mo>·</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">→</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">‖</mml:mo><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">→</mml:mo></mml:mover>
<mml:mo stretchy="false">‖</mml:mo>
<mml:mo>×</mml:mo>
<mml:mo stretchy="false">‖</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">→</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo stretchy="false">‖</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathit{sim}(el,e{l^{(u)}})=\frac{\vec{el}\cdot {\vec{el}^{(u)}}}{\| \vec{el}\| \times \| {\vec{el}^{(u)}}\| }.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>Vector <inline-formula id="j_info1200_ineq_019"><alternatives><mml:math>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi></mml:math><tex-math><![CDATA[$el$]]></tex-math></alternatives></inline-formula> is defined from feature functions as a kernel vector: 
<disp-formula id="j_info1200_eq_009">
<label>(9)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">→</mml:mo></mml:mover>
<mml:mo>=</mml:mo>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true">[</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true">]</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \vec{el}=\bigg[{\sum \limits_{t=1}^{T}}{f_{1}}(e,l,{a_{t}},{r_{t}}),\dots ,{\sum \limits_{t=1}^{T}}{f_{J}}(e,l,{a_{t}},{r_{t}})\bigg],\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_info1200_ineq_020"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${f_{j}}(e,l,{a_{t}},{r_{t}})$]]></tex-math></alternatives></inline-formula> is the value of feature <inline-formula id="j_info1200_ineq_021"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${f_{j}}$]]></tex-math></alternatives></inline-formula> for alignment <inline-formula id="j_info1200_ineq_022"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${a_{t}}$]]></tex-math></alternatives></inline-formula>. <italic>e</italic> and <italic>l</italic> are the sets of sentences on one page, as presented in the reader interface. The index <italic>t</italic> runs through all ‘good’ alignments on the page that we want to present to the reader, asking for an anchor point in the alignment set or for confirmation of filter actions. <italic>T</italic> is the number of ‘good’ alignments on one page. The term <inline-formula id="j_info1200_ineq_023"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${r_{t}}$]]></tex-math></alternatives></inline-formula> in Eq. (<xref rid="j_info1200_eq_009">9</xref>) is a set of labels from the model that we presented in the section about CRF for sentence alignment.</p>
<p>Functions <inline-formula id="j_info1200_ineq_024"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${f_{j}}$]]></tex-math></alternatives></inline-formula> in Eq. (<xref rid="j_info1200_eq_009">9</xref>) can be similar to the features of the CRF models used in the sections above. In this research we found that this set of functions can be simplified while still providing a measure of ‘good’ alignment density and modelling some aspects of proactive learning. We therefore chose the following feature functions: 
<list>
<list-item id="j_info1200_li_014">
<label>1.</label>
<p>Number of named entities that can be matched.</p>
</list-item>
<list-item id="j_info1200_li_015">
<label>2.</label>
<p>Probability of the phrase.</p>
</list-item>
<list-item id="j_info1200_li_016">
<label>3.</label>
<p>Number of matched infrequent words.</p>
</list-item>
<list-item id="j_info1200_li_017">
<label>4.</label>
<p>Number of question and exclamation marks.</p>
</list-item>
<list-item id="j_info1200_li_018">
<label>5.</label>
<p>Number of short utterance passages.</p>
</list-item>
</list>
</p>
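Equation (9) and the feature list above can be sketched as follows: each component of the page vector is one feature function summed over the page’s ‘good’ alignments. The feature functions and sentence pairs below are hypothetical placeholders (two of the five listed features are shown):

```python
def page_vector(alignments, features):
    """Eq. (9): vec(el) = [sum_t f_1(a_t), ..., sum_t f_J(a_t)].

    `alignments` is the list of 'good' alignments a_t on one page;
    `features` is the list of feature functions f_j."""
    return [sum(f(a) for a in alignments) for f in features]

# Hypothetical alignments: (English sentence, Lithuanian sentence) pairs.
page = [("Who is there?", "Kas ten?"), ("Hello!", "Labas!")]

features = [
    # Feature 4: number of question and exclamation marks.
    lambda a: a[0].count("?") + a[0].count("!"),
    # Feature 5: short utterance indicator (three words or fewer).
    lambda a: 1 if len(a[0].split()) <= 3 else 0,
]
assert page_vector(page, features) == [2, 2]
```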
</sec>
<sec id="j_info1200_s_007">
<label>7</label>
<title>Evaluation</title>
<p>We begin our evaluation of this framework for book corpus alignment by defining the alignment error rate (AER) (Och and Ney, <xref ref-type="bibr" rid="j_info1200_ref_013">2003</xref>). Originally, AER was defined at the word-to-word level and requires a manually aligned set of ‘sure’ links (used for measuring recall) and ‘possible’ links (used for measuring precision), referred to as <italic>S</italic> and <italic>P</italic>. We redefine AER at the sentence-to-sentence level by defining the sets <italic>S</italic> and <italic>P</italic> as follows: a link <inline-formula id="j_info1200_ineq_025"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">⊆</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi></mml:math><tex-math><![CDATA[${a_{i}}\subseteq S$]]></tex-math></alternatives></inline-formula> if it links the beginning of sentences and <inline-formula id="j_info1200_ineq_026"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">⊆</mml:mo>
<mml:mi mathvariant="italic">P</mml:mi></mml:math><tex-math><![CDATA[${a_{i}}\subseteq P$]]></tex-math></alternatives></inline-formula> if it links phrases in the middle of a sentence. 
<disp-formula id="j_info1200_eq_010">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">AER</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo>∩</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo>∩</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathit{AER}(A,P,S)=1-\frac{|P\cap A|+|S\cap A|}{|A|+|S|}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
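The sentence-level AER above can be computed directly on sets of links; in this Python sketch the link sets are hypothetical examples, with each link written as a pair of sentence indices:

```python
def aer(A, P, S):
    """Alignment error rate (Och and Ney, 2003) on sets of links:
    AER(A, P, S) = 1 - (|P & A| + |S & A|) / (|A| + |S|)."""
    return 1.0 - (len(P & A) + len(S & A)) / (len(A) + len(S))

# Hypothetical links: (English sentence index, Lithuanian sentence index).
S = {(1, 1), (2, 2)}           # 'sure' links (sentence beginnings)
P = S | {(3, 3)}               # 'possible' links include all sure links
A = {(1, 1), (2, 2), (3, 4)}   # system output; (3, 4) matches neither set

assert abs(aer(A, P, S) - 0.2) < 1e-9  # 1 - (2 + 2) / (3 + 2)
```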
<p>We evaluate our framework by comparing the quality of corpus alignment against two other methods. The first (<italic>hunalign</italic>) is the method suggested by Varga <italic>et al.</italic> (<xref ref-type="bibr" rid="j_info1200_ref_019">2007</xref>). Originally it was used for medium-density languages such as Hungarian, Romanian, and Slovenian; we chose it because Lithuanian can also be described as a medium-density language. As a second method for estimating alignment quality we implemented the method of Sennrich and Volk (<xref ref-type="bibr" rid="j_info1200_ref_014">2010</xref>) (<italic>bleualign</italic>). Because this method requires an automatic translation system, we used Google Translate to translate Lithuanian into English.</p>
<p>Table <xref rid="j_info1200_tab_002">2</xref> shows the statistics of the corpus used to evaluate the method (no morphological analysis is used when counting vocabulary words).</p>
<table-wrap id="j_info1200_tab_002">
<label>Table 2</label>
<caption>
<p>Corpus statistics for alignment quality assessment.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Sentences</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Words</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Vocabulary</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1200_ineq_027"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|S|$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><inline-formula id="j_info1200_ineq_028"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$|P|$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td colspan="6" style="vertical-align: top; text-align: center">Fiction books</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">English</td>
<td style="vertical-align: top; text-align: left">1675466</td>
<td style="vertical-align: top; text-align: left">167914026</td>
<td style="vertical-align: top; text-align: left">115847</td>
<td style="vertical-align: top; text-align: left">1193836</td>
<td style="vertical-align: top; text-align: left">2034015</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Lithuanian</td>
<td style="vertical-align: top; text-align: left">1668577</td>
<td style="vertical-align: top; text-align: left">159958126</td>
<td style="vertical-align: top; text-align: left">307710</td>
<td style="vertical-align: top; text-align: left">1193836</td>
<td style="vertical-align: top; text-align: left">2034015</td>
</tr>
<tr>
<td colspan="6" style="vertical-align: top; text-align: center">Non-fiction books</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">English</td>
<td style="vertical-align: top; text-align: left">78952</td>
<td style="vertical-align: top; text-align: left">7912190</td>
<td style="vertical-align: top; text-align: left">20157</td>
<td style="vertical-align: top; text-align: left">58061</td>
<td style="vertical-align: top; text-align: left">95610</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Lithuanian</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">78668</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">7537294</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">53919</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">58061</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">95610</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We considered two questions: whether the suggested sentence alignment method is useful for improving the alignment precision of translated books, and how alignment quality depends on the number of queries a user must answer.</p>
<table-wrap id="j_info1200_tab_003">
<label>Table 3</label>
<caption>
<p>Fiction and non-fiction books sentence-to-sentence alignment error rate for different methods.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Anchor p. Nr.</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">0</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">1</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">2</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">5</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">10</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">15</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">20</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">40</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Alignment method</td>
<td colspan="8" style="vertical-align: top; text-align: center">Fiction books</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"> <italic>hunalign</italic></td>
<td style="vertical-align: top; text-align: left">0.56</td>
<td style="vertical-align: top; text-align: left">0.51</td>
<td style="vertical-align: top; text-align: left">0.43</td>
<td style="vertical-align: top; text-align: left">0.31</td>
<td style="vertical-align: top; text-align: left">0.27</td>
<td style="vertical-align: top; text-align: left">0.22</td>
<td style="vertical-align: top; text-align: left">0.19</td>
<td style="vertical-align: top; text-align: left">0.09</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"> <italic>bleualign</italic></td>
<td style="vertical-align: top; text-align: left">0.55</td>
<td style="vertical-align: top; text-align: left">0.50</td>
<td style="vertical-align: top; text-align: left">0.31</td>
<td style="vertical-align: top; text-align: left">0.30</td>
<td style="vertical-align: top; text-align: left">0.26</td>
<td style="vertical-align: top; text-align: left">0.21</td>
<td style="vertical-align: top; text-align: left">0.19</td>
<td style="vertical-align: top; text-align: left">0.09</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"> <italic>bookalign</italic></td>
<td style="vertical-align: top; text-align: left">0.48</td>
<td style="vertical-align: top; text-align: left">0.43</td>
<td style="vertical-align: top; text-align: left">0.37</td>
<td style="vertical-align: top; text-align: left">0.28</td>
<td style="vertical-align: top; text-align: left">0.23</td>
<td style="vertical-align: top; text-align: left">0.19</td>
<td style="vertical-align: top; text-align: left">0.15</td>
<td style="vertical-align: top; text-align: left">0.05</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Alignment method</td>
<td colspan="8" style="vertical-align: top; text-align: center">Non-fiction books</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"> <italic>hunalign</italic></td>
<td style="vertical-align: top; text-align: left">0.49</td>
<td style="vertical-align: top; text-align: left">0.44</td>
<td style="vertical-align: top; text-align: left">0.40</td>
<td style="vertical-align: top; text-align: left">0.27</td>
<td style="vertical-align: top; text-align: left">0.23</td>
<td style="vertical-align: top; text-align: left">0.20</td>
<td style="vertical-align: top; text-align: left">0.18</td>
<td style="vertical-align: top; text-align: left">0.08</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"> <italic>bleualign</italic></td>
<td style="vertical-align: top; text-align: left">0.46</td>
<td style="vertical-align: top; text-align: left">0.42</td>
<td style="vertical-align: top; text-align: left">0.40</td>
<td style="vertical-align: top; text-align: left">0.25</td>
<td style="vertical-align: top; text-align: left">0.23</td>
<td style="vertical-align: top; text-align: left">0.20</td>
<td style="vertical-align: top; text-align: left">0.17</td>
<td style="vertical-align: top; text-align: left">0.08</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"> <italic>bookalign</italic></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.42</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.38</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.35</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.21</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.18</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.15</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.11</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.04</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In order to answer these questions, we conducted the following experiment. We created a bilingual corpus of 200 books and then used all three methods (<italic>hunalign</italic>, <italic>bleualign</italic>, <italic>bookalign</italic>) to align it. The error rates obtained after this step are shown in the first column of Table <xref rid="j_info1200_tab_003">3</xref>. Clearly, all three alignment methods scored poorly. For example, the error rate of the <italic>hunalign</italic> method means that, on average, 56% of the alignments were erroneous.</p>
<p>After the first alignment iteration, the system generated a query using the proactive learning approach described in this paper. Once the query had been answered, a new alignment iteration started. The second column in Table <xref rid="j_info1200_tab_003">3</xref> shows the error rates obtained after this step. We continued this alignment loop until 40 queries had been answered.</p>
<p>It is clear from the results in Table <xref rid="j_info1200_tab_003">3</xref> that all three methods improved their performance after each answered query. What is particularly interesting, however, is that the number of queries required to bring a book's alignment error rate below 0.1 can differ significantly from book to book. Nevertheless, 40 queries emerged as a limit beyond which all books could be aligned with acceptable quality.</p>
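<p>The iterative procedure described above (align, query a bilingual reader, re-align, and stop once all queries are answered or the error rate is acceptable) can be sketched as follows. This is a minimal illustration, not the system's implementation: <monospace>align</monospace>, <monospace>next_query</monospace>, and the decaying error model are hypothetical stand-ins for the actual aligners (<italic>hunalign</italic>, <italic>bleualign</italic>, <italic>bookalign</italic>) and for the query-selection strategy.</p>

```python
# Hypothetical sketch of the proactive-learning alignment loop.
# All names and the error-decay model are illustrative only.

def align(book_pair, answered_queries):
    """Toy stand-in for an aligner: the error rate shrinks as
    answered reader queries accumulate."""
    base_error = 0.56  # first-column error rate reported for hunalign
    return base_error * (0.95 ** len(answered_queries))

def next_query(book_pair, alignment_error):
    """Stub: pick the alignment the system is least certain about."""
    return {"book_pair": book_pair, "error_at_query": alignment_error}

def proactive_alignment(book_pair, max_queries=40, target_error=0.1):
    """Align, ask the reader, re-align, until the error rate drops
    below target_error or max_queries queries have been answered."""
    answered = []
    error = align(book_pair, answered)
    while error > target_error and len(answered) < max_queries:
        query = next_query(book_pair, error)  # system asks the reader
        answered.append(query)                # reader answers the query
        error = align(book_pair, answered)    # re-align with feedback
    return error, len(answered)

error, n_queries = proactive_alignment(("en_book", "lt_book"))
```

<p>Under this toy decay model the loop terminates within the 40-query budget, mirroring the behaviour observed in Table <xref rid="j_info1200_tab_003">3</xref>, where error rates fall monotonically as queries are answered.</p>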
</sec>
<sec id="j_info1200_s_008">
<label>8</label>
<title>Conclusions</title>
<p>We have proposed a model for the alignment of a bilingual English–Lithuanian book corpus. In this research project we found that some translations were particularly difficult to align because sentences, or even whole paragraphs, were missing from the translated text. An important contribution of this study is that alignment accuracy increased after we applied a new text filtering algorithm. This filtering algorithm was developed using methods from previous studies in statistical machine translation, and we have shown that its accuracy improves as new books are added to the bilingual corpus.</p>
<p>Several factors have a profound impact on alignment accuracy and cannot be resolved by a fully automatic alignment procedure. Therefore, further research is required to incorporate natural language processing and human–computer interaction tools into a bilingual corpus alignment system. One way to address these challenges is to employ bilingual readers and integrate them into a proactive learning framework. We have shown that the alignments generated by currently available algorithms contain errors that can be eliminated if a small number of proactive learning queries are used to filter the text and create manual alignments.</p>
<p>In this paper, we have presented a proactive learning solution to the task of aligning bilingual books. The proposed system improves alignment precision, and empirical results show that it learns from each interaction with a human reader. We introduced several scenarios in which users can choose between options: confirming the suggestions of the automatic alignment algorithm or entering alignments manually. Finally, it is worth noting that, even though we tested the presented method on the English–Lithuanian language pair, the ideas presented here can be applied to other language pairs as well.</p>
</sec>
</body>
<back>
<ref-list id="j_info1200_reflist_001">
<title>References</title>
<ref id="j_info1200_ref_001">
<mixed-citation publication-type="journal"><string-name><surname>Barrachina</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Bender</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Casacuberta</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Civera</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Cubel</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Khadivi</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Lagarda</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Ney</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Tomas</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Vidal</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Vilar</surname>, <given-names>J.M.</given-names></string-name> (<year>2009</year>). <article-title>Statistical approaches to computer-assisted translation</article-title>. <source>Computational Linguistics</source>, <volume>35</volume>(<issue>1</issue>), <fpage>3</fpage>–<lpage>28</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_002">
<mixed-citation publication-type="journal"><string-name><surname>Berger</surname>, <given-names>A.L.</given-names></string-name>, <string-name><surname>Della Pietra</surname>, <given-names>V.J.</given-names></string-name>, <string-name><surname>Della Pietra</surname>, <given-names>S.A.</given-names></string-name> (<year>1996</year>). <article-title>A maximum entropy approach to natural language processing</article-title>. <source>Computational Linguistics</source>, <volume>22</volume>(<issue>1</issue>), <fpage>39</fpage>–<lpage>72</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_003">
<mixed-citation publication-type="chapter"><string-name><surname>Braune</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Fraser</surname>, <given-names>A.</given-names></string-name> (<year>2010</year>). <chapter-title>Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora</chapter-title>. In: <source>Proceedings of the 23rd International Conference on Computational Linguistics: Posters</source>, pp. <fpage>81</fpage>–<lpage>89</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_004">
<mixed-citation publication-type="chapter"><string-name><surname>Brown</surname>, <given-names>P.F.</given-names></string-name>, <string-name><surname>Lai</surname>, <given-names>J.C.</given-names></string-name>, <string-name><surname>Mercer</surname>, <given-names>R.L.</given-names></string-name> (<year>1991</year>). <chapter-title>Aligning sentences in parallel corpora</chapter-title>. In: <source>Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics</source>, pp. <fpage>169</fpage>–<lpage>176</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_005">
<mixed-citation publication-type="journal"><string-name><surname>Brown</surname>, <given-names>P.F.</given-names></string-name>, <string-name><surname>Della Pietra</surname>, <given-names>V.J.</given-names></string-name>, <string-name><surname>Della Pietra</surname>, <given-names>S.A.</given-names></string-name>, <string-name><surname>Mercer</surname>, <given-names>R.L.</given-names></string-name> (<year>1993</year>). <article-title>The mathematics of statistical machine translation: parameter estimation</article-title>. <source>Computational Linguistics</source>, <volume>19</volume>(<issue>2</issue>), <fpage>263</fpage>–<lpage>311</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_006">
<mixed-citation publication-type="chapter"><string-name><surname>Chen</surname>, <given-names>S.F.</given-names></string-name> (<year>1993</year>). <chapter-title>Aligning sentences in bilingual corpora using lexical information</chapter-title>. In: <source>Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics</source>, pp. <fpage>9</fpage>–<lpage>16</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_007">
<mixed-citation publication-type="journal"><string-name><surname>Gale</surname>, <given-names>W.A.</given-names></string-name>, <string-name><surname>Church</surname>, <given-names>K.W.</given-names></string-name> (<year>1993</year>). <article-title>A program for aligning sentences in bilingual corpora</article-title>. <source>Computational Linguistics</source>, <volume>19</volume>(<issue>1</issue>), <fpage>75</fpage>–<lpage>102</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_008">
<mixed-citation publication-type="chapter"><string-name><surname>Lafferty</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>McCallum</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Pereira</surname>, <given-names>F.</given-names></string-name> (<year>2001</year>). <chapter-title>Conditional random fields: probabilistic models for segmenting and labeling sequence data</chapter-title>. In: <source>Proceedings of the Eighteenth International Conference on Machine Learning, ICML</source>, pp. <fpage>282</fpage>–<lpage>289</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_009">
<mixed-citation publication-type="journal"><string-name><surname>Laukaitis</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Vasilecas</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Laukaitis</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Plikynas</surname>, <given-names>D.</given-names></string-name> (<year>2011</year>). <article-title>Semi-automatic bilingual corpus creation with zero entropy alignments</article-title>. <source>Informatica</source>, <volume>22</volume>(<issue>2</issue>), <fpage>203</fpage>–<lpage>224</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_010">
<mixed-citation publication-type="journal"><string-name><surname>Laukaitis</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Vasilecas</surname>, <given-names>O.</given-names></string-name> (<year>2008</year>). <article-title>Multi-alignment templates induction</article-title>. <source>Informatica</source>, <volume>19</volume>(<issue>4</issue>), <fpage>535</fpage>–<lpage>554</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_011">
<mixed-citation publication-type="chapter"><string-name><surname>McCallum</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Nigam</surname>, <given-names>K.</given-names></string-name> (<year>1998</year>). <chapter-title>Employing EM and pool-based active learning for text classification</chapter-title>. In: <source>ICML</source>, Vol. <volume>98</volume>, pp. <fpage>359</fpage>–<lpage>367</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_012">
<mixed-citation publication-type="chapter"><string-name><surname>Moore</surname>, <given-names>R.C.</given-names></string-name> (<year>2002</year>). <chapter-title>Fast and accurate sentence alignment of bilingual corpora</chapter-title>. In: <source>Proceedings of the 5th Conference of the Association for Machine Translation in the Americas, LNAI</source>, Vol. <volume>2499</volume>, pp. <fpage>135</fpage>–<lpage>144</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_013">
<mixed-citation publication-type="journal"><string-name><surname>Och</surname>, <given-names>F.J.</given-names></string-name>, <string-name><surname>Ney</surname>, <given-names>H.</given-names></string-name> (<year>2003</year>). <article-title>A systematic comparison of various statistical alignment models</article-title>. <source>Computational Linguistics</source>, <volume>29</volume>(<issue>1</issue>), <fpage>19</fpage>–<lpage>51</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_014">
<mixed-citation publication-type="chapter"><string-name><surname>Sennrich</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Volk</surname>, <given-names>M.</given-names></string-name> (<year>2010</year>). <chapter-title>MT-based sentence alignment for OCR-generated parallel texts</chapter-title>. In: <source>The Ninth Conference of the Association for Machine Translation in the Americas</source>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_015">
<mixed-citation publication-type="chapter"><string-name><surname>Settles</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Craven</surname>, <given-names>M.</given-names></string-name> (<year>2008</year>). <chapter-title>An analysis of active learning strategies for sequence labeling tasks</chapter-title>. In: <source>Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>, pp. <fpage>1070</fpage>–<lpage>1079</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_016">
<mixed-citation publication-type="chapter"><string-name><surname>Sutton</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>McCallum</surname>, <given-names>A.</given-names></string-name> (<year>2006</year>). <chapter-title>An introduction to conditional random fields for relational learning</chapter-title>. In: <source>Introduction to Statistical Relational Learning</source>, pp. <fpage>93</fpage>–<lpage>128</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_017">
<mixed-citation publication-type="chapter"><string-name><surname>Thompson</surname>, <given-names>C.A.</given-names></string-name>, <string-name><surname>Califf</surname>, <given-names>M.E.</given-names></string-name>, <string-name><surname>Mooney</surname>, <given-names>R.J.</given-names></string-name> (<year>1999</year>). <chapter-title>Active learning for natural language parsing and information extraction</chapter-title>. In: <source>ICML</source>, pp. <fpage>406</fpage>–<lpage>414</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_018">
<mixed-citation publication-type="journal"><string-name><surname>Tong</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Koller</surname>, <given-names>D.</given-names></string-name> (<year>2001</year>). <article-title>Support vector machine active learning with applications to text classification</article-title>. <source>Journal of Machine Learning Research</source>, <volume>2</volume>, <fpage>45</fpage>–<lpage>66</lpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_019">
<mixed-citation publication-type="journal"><string-name><surname>Varga</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Halacsy</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Kornai</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Nagy</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Nemeth</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Tron</surname>, <given-names>V.</given-names></string-name> (<year>2007</year>). <article-title>Parallel corpora for medium density languages</article-title>. <source>Amsterdam Studies in the Theory and History of Linguistic Science</source>, <volume>4</volume>(<issue>292</issue>), <fpage>247</fpage>.</mixed-citation>
</ref>
<ref id="j_info1200_ref_020">
<mixed-citation publication-type="journal"><string-name><surname>Xu</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Max</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Yvon</surname>, <given-names>F.</given-names></string-name> (<year>2015</year>). <article-title>Sentence alignment for literary texts</article-title>. <source>LiLT (Linguistic Issues in Language Technology)</source>, <volume>12</volume>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>