Bilingual sentence alignment based on sentence length and word translation

Sentence alignment plays an important role in machine translation. It is an essential task in processing parallel corpora, which are ample and substantial resources for natural language processing. In order to apply these abundant materials into useful applications, parallel corpora first have to be align...

Bibliographic Details
Main Author: Triệu, Hải Long
Other Authors: Nguyễn, Phương Thái
Language: English
Published: ĐHCN 2017
Online Access: http://repository.vnu.edu.vn/handle/VNU_123/43268
Institution: Vietnam National University, Hanoi
Language: English
id oai:112.137.131.14:VNU_123-43268
record_format dspace
spelling oai:112.137.131.14:VNU_123-43268 2018-07-26T07:45:45Z Bilingual sentence alignment based on sentence length and word translation Triệu, Hải Long Nguyễn, Phương Thái 2017-05-17T08:20:22Z 2017-05-17T08:20:22Z 2014 Triệu, H. L. (2014). Bilingual sentence alignment based on sentence length and word translation. Master's thesis, Vietnam National University, Hanoi 00051000190 http://repository.vnu.edu.vn/handle/VNU_123/43268 en Thesis, Computer Science (Full) 61 p. + CD-ROM + Summary application/pdf ĐHCN
institution Vietnam National University, Hanoi
building VNU Library & Information Center
country Vietnam
collection VNU Digital Repository
language English
description Sentence alignment plays an important role in machine translation. It is an essential task in processing parallel corpora, which are ample and substantial resources for natural language processing. To put these abundant materials to use in applications, parallel corpora first have to be aligned at the sentence level. This process maps sentences in source-language texts to their corresponding units in target-language texts. Parallel corpora aligned at the sentence level become a useful resource for a number of applications in natural language processing, including statistical machine translation, word disambiguation, and cross-language information retrieval. This task also helps to extract structural information and derive statistical parameters from bilingual corpora. A number of algorithms with different approaches have been proposed for sentence alignment; they may be classified into a few major categories. First, there are methods based on the similarity of sentence lengths, which can be measured in words or characters. These methods are simple but effective for language pairs whose sentence lengths are highly similar. The second set of methods is based on word correspondences or a lexicon. These methods take into account lexical information about the texts, either by matching content in the texts or by using cognates. An external dictionary may be used in these methods, so they are more accurate but slower than the first ones. There are also hybrid methods that combine the advantages of the first two approaches and so obtain quite high-quality alignments. In this thesis, I summarize general issues related to sentence alignment, evaluate approaches proposed for this task, and focus on the hybrid method, especially the proposal of Moore (2002), an effective method with high performance in terms of precision. From an analysis of the limits of this method, I propose an algorithm that uses a new feature, bilingual word clustering, to improve the quality of Moore's method. The baseline method (Moore, 2002) is introduced through an analysis of its framework, and I describe the advantages as well as the weaknesses of this approach. In addition, I describe the background knowledge, the algorithm of bilingual word clustering, and the new feature used in sentence alignment. Finally, the experiments performed in this research are presented, along with evaluations that demonstrate the benefits of the proposed method.
author2 Nguyễn, Phương Thái
author_facet Nguyễn, Phương Thái
Triệu, Hải Long
author Triệu, Hải Long
spellingShingle Triệu, Hải Long
Bilingual sentence alignment based on sentence length and word translation
author_sort Triệu, Hải Long
title Bilingual sentence alignment based on sentence length and word translation
title_short Bilingual sentence alignment based on sentence length and word translation
title_full Bilingual sentence alignment based on sentence length and word translation
title_fullStr Bilingual sentence alignment based on sentence length and word translation
title_full_unstemmed Bilingual sentence alignment based on sentence length and word translation
title_sort bilingual sentence alignment based on sentence length and word translation
publisher ĐHCN
publishDate 2017
url http://repository.vnu.edu.vn/handle/VNU_123/43268
_version_ 1680966653602430976