Bilingual sentence alignment based on sentence length and word translation

Bibliographic Details
Main Author: Triệu, Hải Long
Other Authors: Nguyễn, Phương Thái
Language: English
Published: ĐHCN 2017
Online Access: http://repository.vnu.edu.vn/handle/VNU_123/43268
Institution: Vietnam National University, Hanoi
Description
Summary: Sentence alignment plays an important role in machine translation. It is an essential task in processing parallel corpora, which are ample and substantial resources for natural language processing. To be used in practical applications, parallel corpora first have to be aligned at the sentence level. This process maps sentences in source-language texts to their corresponding units in target-language texts. Parallel corpora aligned at the sentence level become a useful resource for a number of applications in natural language processing, including statistical machine translation, word sense disambiguation, and cross-language information retrieval. This task also helps to extract structural information and derive statistical parameters from bilingual corpora.

Many sentence alignment algorithms have been proposed, following different approaches, but they can be classified into a few major categories. First, there are methods based on the similarity of sentence lengths, measured in words or characters. These methods are simple but effective for language pairs whose sentence lengths are highly similar. The second set of methods is based on word correspondences or a lexicon. These methods take into account lexical information about the texts, either by matching content in the texts or by using cognates. Because they may also use an external dictionary, these methods are more accurate but slower than the length-based ones. There are also hybrid methods that combine these two approaches and their advantages, and so obtain high-quality alignments.

In this thesis, I summarize general issues related to sentence alignment, evaluate approaches proposed for this task, and focus on the hybrid method, especially the proposal of Moore (2002), an effective method with high precision. From an analysis of the limits of this method, I propose an algorithm that uses a new feature, bilingual word clustering, to improve the quality of Moore's method. The baseline method (Moore, 2002) is introduced through an analysis of its framework, and I describe the advantages as well as the weaknesses of this approach. In addition, I describe the background knowledge, the bilingual word clustering algorithm, and the new feature used in sentence alignment. Finally, I present the experiments performed in this research, together with evaluations that demonstrate the benefits of the proposed method.
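
Illustrative note: the first category mentioned in the summary, alignment based on sentence length, can be sketched with a small dynamic program. The Python sketch below is only an illustration under stated assumptions. It is not the method of Moore (2002), and it is not the algorithm proposed in the thesis. The function names (`length_cost`, `align_by_length`), the log-ratio cost, and the penalty values in `BEAD_PENALTY` are hypothetical choices made for this example.

```python
import math

# Allowed alignment "beads": (source sentences consumed, target sentences consumed)
# mapped to a penalty. These penalty values are illustrative assumptions, not tuned.
BEAD_PENALTY = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def length_cost(src_len, tgt_len):
    """Cost of pairing two spans, using only their lengths in words (assumed form)."""
    return abs(math.log((src_len + 1) / (tgt_len + 1)))


def align_by_length(src, tgt):
    """Align two lists of sentences by dynamic programming over sentence lengths.

    Returns a list of (source_indices, target_indices) pairs.
    """
    s_len = [len(s.split()) for s in src]
    t_len = [len(t.split()) for t in tgt]
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]   # best[i][j]: cheapest cost so far
    back = [[None] * (m + 1) for _ in range(n + 1)]  # back-pointers for the traceback
    best[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = (best[i][j] + penalty
                        + length_cost(sum(s_len[i:ni]), sum(t_len[j:nj])))
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the chosen beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    source = ["a b c d e f g h", "x y"]
    target = ["A B C D E F G", "X Y Z"]
    print(align_by_length(source, target))
```

A sketch like this uses no lexical information at all, which is why, as the summary notes, length-based methods work best for language pairs with similar sentence lengths and are usually combined with word-correspondence evidence in hybrid approaches.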