Building an English-Tagalog tourism corpus and lexicon for a statistical machine translation system

Bibliographic Details
Main Author: Ponay, Charmaine S.
Format: text
Language: English
Published: Animo Repository 2014
Online Access: https://animorepository.dlsu.edu.ph/etd_masteral/4772
Institution: De La Salle University
Description
Summary: Statistical machine translation systems extract a bilingual dictionary and translation rules from a large volume of bilingual training data, then select the most probable translation by statistically disambiguating structural ambiguity. The machine learns how to translate words by observing a large number of examples and inferring constraints from them. The more translation examples there are, the more accurate the translation becomes. This study aimed to build a bilingual corpus of Philippine tourism data and a bilingual lexicon of named entities from the Philippine tourism domain. The output of this project is intended to further enhance the Philippine component of the ASEAN-MT project, which is a statistical machine translation system. The corpus was built manually, and the retrieved data were translated manually. The data consisted of documents from Philippine tourism websites such as itsmorefuninthephilippines.com, www.experiencephilippines.org, www.wowphilippines.ca and http://www.visitmyphilippines.com. Named entities such as people's names, group names, company names, currency units, temporal entities, language names, locations, products, and artistic creations were manually annotated according to the guidelines set by the National Electronics and Computer Technology Centre (NECTEC) (Appendix A). NECTEC is the group that headed the ASEAN-MT project. Data were analysed and evaluated using a statistical machine translation system called MOSES. The corpus was tested by category (Festivals and Events, Provincial Profile, Tourist Attractions, and General Information); the Tourist Attractions category obtained a BLEU score of 76.74. The corpus was also evaluated according to who did the manual translation: BLEU scores of 31.59, 31.87, 24.6 and 64.02 were computed for the translations of translator1, translator2, translator3 and translator4, respectively.
The corpus was further tested by translator per category, yielding BLEU scores of 76.57 and 69.69 for the categories Provincial Profile and General Information under translator2, and 65.73 for the category Tourist Attractions under translator4. However, because of factors such as the number of function words, named entities and numbers, the BLEU score of the corpus as a whole was 34.42. The overall quality of the corpus based on the BLEU score was therefore poor. However, since it obtained a significantly high BLEU score in the Tourist Attractions category, the bilingual corpus of Tourist Attractions can contribute to the quality of translation of the ASEAN-MT project.
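
For readers unfamiliar with the metric, the following is a minimal sketch of how a corpus-level BLEU score on a 0 to 100 scale can be computed. It is not the scoring script used in the thesis, which the record does not identify. It assumes a single reference translation per sentence, pre-tokenized input, n-gram orders up to 4, and no smoothing. The function names `ngrams` and `corpus_bleu` and the sample sentences are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU (0-100), assuming one reference per hypothesis.

    references, hypotheses: lists of token lists, aligned by index.
    """
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # n-grams present in the hypotheses, per order
    ref_len = hyp_len = 0

    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts = ngrams(ref, n)
            hyp_counts = ngrams(hyp, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches drives the score to 0.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Hypothetical example: one reference sentence and one system output.
reference = ["the cat sat on a mat".split()]
system_output = ["the cat sat on the mat".split()]
score = corpus_bleu(reference, system_output)
print(f"BLEU = {score:.2f}")
```

Because the score is the geometric mean of the n-gram precisions, a test set in which long n-grams rarely match scores much lower than one with many exact phrase matches, which is one way per-category and per-translator scores can differ widely.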