Informatica

Sentence Level Alignment of Digitized Books Parallel Corpora
Volume 29, Issue 4 (2018), pp. 693–710
Algirdas Laukaitis   Darius Plikynas   Egidijus Ostasius  

https://doi.org/10.15388/Informatica.2018.188
Pub. online: 1 January 2018      Type: Research Article      Open Access

Received: 1 September 2017
Accepted: 1 September 2018
Published: 1 January 2018

Abstract

In this paper, we propose a framework for extracting translation memory from a corpus of fiction and non-fiction books. In recent years, there have been several proposals for aligning bilingual corpora and extracting translation memory from legal and technical documents. Yet when applied to a corpus of translated fiction and non-fiction books, the existing alignment algorithms give low-precision results. To solve this problem, we propose a new method that combines existing alignment algorithms with a proactive learning approach. We define several feature functions that are used to build two classifiers for text filtering and alignment. We report results on the English-Lithuanian language pair and on a bilingual corpus from 200 books. We demonstrate a significant improvement in alignment accuracy over currently available alignment systems.
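As a rough illustration of what feature functions over candidate sentence pairs might look like, the sketch below computes two simple signals and combines them into a filtering decision. The abstract does not specify the paper's feature functions or classifiers, so everything here is an assumption: the names `SentencePair`, `length_ratio`, `number_overlap` and `looks_parallel` are hypothetical, and a fixed threshold stands in for a trained classifier.

```python
import re
from dataclasses import dataclass

_NUMBER = re.compile(r"\d+")


@dataclass(frozen=True)
class SentencePair:
    """A candidate alignment (hypothetical type, not from the paper)."""
    source: str  # e.g. an English sentence
    target: str  # e.g. its candidate Lithuanian translation


def length_ratio(pair: SentencePair) -> float:
    """Shorter length divided by longer length, in characters (0..1)."""
    a, b = len(pair.source), len(pair.target)
    if max(a, b) == 0:
        return 1.0  # two empty strings are trivially the same length
    return min(a, b) / max(a, b)


def number_overlap(pair: SentencePair) -> float:
    """Jaccard overlap of the digit sequences found on each side (0..1)."""
    src = set(_NUMBER.findall(pair.source))
    tgt = set(_NUMBER.findall(pair.target))
    if not src and not tgt:
        return 1.0  # no numbers on either side: nothing to contradict
    return len(src & tgt) / len(src | tgt)


def features(pair: SentencePair) -> list[float]:
    """Feature vector that a real classifier could be trained on."""
    return [length_ratio(pair), number_overlap(pair)]


def looks_parallel(pair: SentencePair, min_score: float = 0.5) -> bool:
    """Assumed stand-in for a classifier: threshold on the mean feature value."""
    values = features(pair)
    return sum(values) / len(values) >= min_score


good = SentencePair("He was born in 1984.", "Jis gimė 1984 metais.")
bad = SentencePair("Chapter 12", "Tai buvo labai ilga ir sunki kelionė per kalnus.")
print(looks_parallel(good), looks_parallel(bad))
```

In a real system the threshold would be replaced by a model trained on labelled pairs, and the feature set would be considerably richer than these two signals.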


Biographies

Laukaitis Algirdas
algirdas.laukaitis@vgtu.lt

A. Laukaitis graduated from the Faculty of Physics of Vilnius University. He received his PhD degree from the Institute of Mathematics and Informatics, Vilnius. He is a professor in the Information Systems Department of Vilnius Gediminas Technical University. His research interests include text mining, natural language interfaces, machine translation systems and knowledge management.

Plikynas Darius
darius.plikynas@mii.vu.lt

D. Plikynas is a professor and senior research fellow at the Institute of Data Science and Digital Technologies of Vilnius University. He is also a professor and chief research fellow at the Department of Business Technologies of Vilnius Gediminas Technical University. He has been involved in a number of EU and nationally financed research projects. He has published 2 monographs, 8 chapters in books, over 40 publications and over 50 conference papers. His main fields of interest include fundamental and applied research (modelling and simulation), covering interdisciplinary research domains in the natural and social sciences, e.g. computational intelligence, agent-based simulations, complexity research, social networks, and distributed cognition.

Ostasius Egidijus
egidijus.ostasius@vgtu.lt

E. Ostasius is an associate professor at the Department of Information Technologies, Faculty of Fundamental Sciences, Vilnius Gediminas Technical University. He was awarded the candidate of mathematical sciences degree at Kaunas Polytechnic Institute in 1989 and has been a doctor of mathematical sciences since 1993. His research interests include the analysis, modelling, and evaluation of business processes and e-services in the public and commercial sectors, their applications, and related issues.



Copyright
© 2018 Vilnius University
Open access article under the CC BY license.

Keywords
alignment of corpora, alignment of digitized books, machine translation, natural language processing


INFORMATICA

  • Online ISSN: 1822-8844
  • Print ISSN: 0868-4952