Sentence Level Alignment of Digitized Books Parallel Corpora

Laukaitis, Algirdas; Plikynas, Darius; Ostasius, Egidijus

doi:10.15388/Informatica.2018.188

Volume 29, Issue 4 (2018)

Informatica

Similar articles

Semi-Automatic Bilingual Corpus Creation with Zero Entropy Alignments

Algirdas Laukaitis Olegas Vasilecas Ricardas Laukaitis Darius Plikynas

https://doi.org/10.15388/Informatica.2011.323

Pub. online: 1 Jan 2011 Type: Research Article

Journal: Informatica Volume 22, Issue 2 (2011), pp. 203–224

Abstract

In this paper, we describe a model for aligning books and documents from a bilingual corpus, with the goal of creating a “perfectly” aligned bilingual corpus at the word-to-word level. The presented algorithms differ from existing ones in that they account for the presence of a human translator, whose involvement we try to minimize. We treat the human translator as an oracle who knows the exact alignments, and the goal of the system is to minimize the use of this oracle. The effectiveness of the oracle is measured by the speed at which the oracle can create a “perfectly” aligned bilingual corpus. By a “perfectly” aligned corpus we mean a zero entropy corpus, because the oracle can make alignments without any probabilistic interpretation, i.e., with 100% confidence. Sentence level alignments and word-to-word alignments, although treated separately in this paper, are integrated in a single framework. For sentence level alignments we provide a dynamic programming algorithm which achieves low precision and recall error rates. For word-to-word level alignments, an Expectation Maximization algorithm that integrates linguistic dictionaries is suggested as the main tool for the oracle to build a “perfectly” aligned bilingual corpus. We show empirically that the suggested pre-aligned corpus requires little interaction from the oracle and that a perfectly aligned corpus can be created at almost the speed of human reading. The presented algorithms are language independent, but in this paper we verify them with the English–Lithuanian language pair on two types of text: law documents and fiction literature.
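To make the idea of dynamic programming sentence alignment concrete, the following is a minimal sketch of a generic length-based aligner. It is not the algorithm from the paper, whose details the abstract does not give. The bead types, the penalties in `BEADS`, and the `length_cost` function are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Allowed alignment "beads": (source sentences consumed, target sentences consumed).
# The penalty attached to each bead type is a hypothetical value, not from the paper.
BEADS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Illustrative cost: relative difference in character length."""
    total = src_len + tgt_len
    return abs(src_len - tgt_len) / total if total else 0.0


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return the minimum-cost sequence of beads as (source indices, target indices)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable cell by each allowed bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Backward pass: follow the stored predecessors from the end to the start.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align(source_sentences, target_sentences)` returns a list of beads, each pairing zero, one, or two source sentence indices with zero, one, or two target sentence indices. A practical aligner would replace the toy cost with a statistical length model or dictionary evidence.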

Multi-Alignment Templates Induction

Algirdas Laukaitis Olegas Vasilecas

https://doi.org/10.15388/Informatica.2008.229

Pub. online: 1 Jan 2008 Type: Research Article

Journal: Informatica Volume 19, Issue 4 (2008), pp. 535–554

Abstract

This paper examines approaches for translation between English and morphology-rich languages. Experiments with English–Russian and English–Lithuanian reveal that “pure” statistical approaches on a 10 million word corpus give unsatisfactory translation. Then, several Web-available linguistic resources are suggested for translation. Syntax parsers, bilingual and semantic dictionaries, a bilingual parallel corpus and a monolingual Web-based corpus are integrated in one comprehensive statistical model. Multi-abstraction language representation is used for statistical induction of syntactic and semantic transformation rules called multi-alignment templates. The decoding model is described using feature functions, a log-linear modeling approach and the A* search algorithm. An evaluation of this approach is performed on the English–Lithuanian language pair. The presented experimental results demonstrate that the multi-abstraction approach and hybridization of learning methods can improve the quality of translation.
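As a rough illustration of log-linear modeling with feature functions, the sketch below scores translation candidates as a weighted sum of feature values and normalizes the scores with a softmax. The toy dictionary, the two feature functions, and the weights are hypothetical and are not taken from the paper. The A* search over candidates is omitted, so the candidates are simply enumerated.

```python
import math
from typing import Callable, Dict, List

FeatureFn = Callable[[str, str], float]

# Hypothetical toy English-Lithuanian dictionary, for illustration only.
TOY_DICT = {"big": "didelis", "house": "namas"}


def dict_match(source: str, candidate: str) -> float:
    """Number of source words whose dictionary translation occurs in the candidate."""
    cand_words = set(candidate.split())
    return float(sum(1 for w in source.split() if TOY_DICT.get(w) in cand_words))


def length_penalty(source: str, candidate: str) -> float:
    """Negative absolute difference in word count."""
    return -float(abs(len(source.split()) - len(candidate.split())))


def score(source: str, candidate: str,
          features: Dict[str, FeatureFn], weights: Dict[str, float]) -> float:
    """Log-linear score: weighted sum of feature function values."""
    return sum(weights[name] * fn(source, candidate) for name, fn in features.items())


def posterior(source: str, candidates: List[str],
              features: Dict[str, FeatureFn], weights: Dict[str, float]) -> Dict[str, float]:
    """Normalize exponentiated scores over the given candidate list."""
    scores = [score(source, c, features, weights) for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return {c: e / z for c, e in zip(candidates, exps)}


FEATURES = {"dict_match": dict_match, "length_penalty": length_penalty}
WEIGHTS = {"dict_match": 1.0, "length_penalty": 0.5}  # assumed, untuned weights
```

For example, `posterior("big house", ["didelis namas", "namas"], FEATURES, WEIGHTS)` assigns the higher probability to the candidate that covers both dictionary entries and matches the source length. In a real system the weights would be tuned on held-out data.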

Graph Representation of the Syntactic Structure of the Lithuanian Sentence

Daiva Šveikauskienė

https://doi.org/10.15388/Informatica.2005.107

Pub. online: 1 Jan 2005 Type: Research Article

Journal: Informatica Volume 16, Issue 3 (2005), pp. 407–418

Abstract

The paper offers a new way of presenting the structure of a sentence. Neither of the two widely known methods of representing the syntactic structure of a sentence is adequate when applied to the Lithuanian language. Neither the tree based on the phrase structure principle nor the tree suggested by dependency grammar reflects all the syntactic information that a Lithuanian sentence contains.

The paper points out the differences between the Lithuanian language and other languages and presents the reasons why a Lithuanian sentence should be represented by a graph.

The paper presents a generalized structure of a simple sentence in the Lithuanian language, that is, a structure that embraces all possible instances of a Lithuanian simple sentence. Every sentence of the text would have to activate only one path in the generalized structure.

The Implementation of the Example-Based Machine Translation Technique for German-to-Polish Automatic Translation System

Mirosław Gajer

https://doi.org/10.3233/INF-2002-13404

Pub. online: 1 Jan 2002 Type: Research Article

Journal: Informatica Volume 13, Issue 4 (2002), pp. 417–440

Abstract

High-quality machine translation between human languages has long been an unattainable dream for many computer scientists involved in this fascinating and interdisciplinary field of computer applications. The recently developed example-based machine translation technique seems to be a serious alternative to the existing automatic translation techniques. The paper proposes using the example-based machine translation technique to develop a system able to translate unrestricted German text into Polish. A new approach to example-based machine translation that takes into account the peculiarities of Polish grammar is developed. The preliminary results obtained in developing the proposed system seem very promising and appear to be a step in the right direction towards a fully automatic, high-quality German-to-Polish machine translation system for unrestricted text.

Language Engineering in Lithuania

Joana Lipeikienė Antanas Lipeika

https://doi.org/10.3233/INF-1998-9405

Pub. online: 1 Jan 1998 Type: Research Article

Journal: Informatica Volume 9, Issue 4 (1998), pp. 449–456

Abstract

Language engineering, encompassing natural language processing and speech processing, has become very important for the development of every nation in multilingual Europe. After the Council of the European Union approved conclusions on linguistic and cultural diversity, tools and systems created for every European language are necessary to overcome language barriers and to use all languages in various spheres of human cooperation. The paper gives an overview and a consideration of language engineering in Lithuania.


INFORMATICA

Online ISSN: 1822-8844
Print ISSN: 0868-4952
