A bi-directional example-based English-Tagalog machine translation system

Bibliographic Details
Main Author: Tolentino, Rufino C.
Format: text
Language: English
Published: Animo Repository 2006
Online Access: https://animorepository.dlsu.edu.ph/etd_masteral/3401
Institution: De La Salle University
Description
Summary: A bi-directional English-Tagalog machine translation system named Halo was created using the example-based machine translation (EBMT) approach, in which translation is based primarily on knowledge obtained from the analysis of parallel corpora. The system focuses on building a knowledge base for translation and requires no linguistic knowledge before or during translation. Halo is composed of two major phases: the knowledge extraction phase and the translation phase. Databases of sentence pair examples are extracted from parallel corpora. All words that occur in the stored sentence pairs are indexed with information on their frequency and position. A database structure for this purpose, based on the relational database concept, was also developed. The Dice coefficient is used to establish a relationship between words from the two languages, and the resulting values are used to approximate the most probable translation of each word. Algorithms were developed for the following processes: build-up of the correlation table (dictionary), segmentation of the input text, translation of the segments, and recombination of the translated segments to form the final translation of the whole input text. The system was tested on subsets of parallel corpora from the 1987 Philippine Constitution and the novel Alchemist. A scoring algorithm generates the two highest-scoring candidate translations (1.0 being the maximum score), and the candidate with the highest score is taken as the correct translation. For the Philippine Constitution test data, the average translation score at both the chunk and sentence levels is 0.85 from English to Tagalog and 0.72 from Tagalog to English. For the Alchemist corpus, the average English-to-Tagalog scores are 0.56 at the chunk level and 0.64 at the sentence level; the Tagalog-to-English scores are 0.63 and 0.62 at the chunk and sentence levels, respectively.
The percentage of segments (chunks) translated correctly, determined manually against the expected translations of selected input sentences, is highest (66%) for Tagalog-to-English translation using the Alchemist corpus and lowest (40%) for English-to-Tagalog translation of the same corpus. For the 1987 Philippine Constitution, the percentage of correct translations was evaluated to be 59% for English to Tagalog and 41% for Tagalog to English. The quality of translation depends heavily on the quality and nature of the corpus used. The Philippine Constitution test data had better translation scores because strict and proper translations are necessary for such a legal document. In contrast, the Alchemist test data produced low-quality translations in which most of the segments were not translated correctly, because the corpus is a literary document whose sentences were translated non-literally (or subjectively). In general, the results show acceptable translations at the chunk level, while translations of whole input texts composed of several chunks tend to lose coherence of thought because they are derived from different sentence examples.
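
The role of the Dice coefficient in building the correlation table can be illustrated with a minimal sketch. This is not Halo's implementation: the function names (`dice_table`, `best_translation`), the whitespace tokenization, and the toy sentence pairs are all assumptions made for illustration, and the sketch counts only sentence-level co-occurrence, ignoring the word position information that the thesis also indexes. For a source word s and a target word t, the score is 2 * (number of sentence pairs containing both) / (number of pairs containing s + number of pairs containing t).

```python
from collections import Counter
from itertools import product


def dice_table(sentence_pairs):
    """Build a word correlation table from aligned sentence pairs.

    Hypothetical helper: each word is counted at most once per sentence,
    and tokens are obtained by lowercasing and splitting on whitespace.
    """
    src_freq = Counter()   # sentence pairs in which each source word occurs
    tgt_freq = Counter()   # sentence pairs in which each target word occurs
    pair_freq = Counter()  # sentence pairs in which both words occur
    for src, tgt in sentence_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))
    # Dice coefficient: 2 * |X and Y| / (|X| + |Y|)
    return {
        (s, t): 2 * n / (src_freq[s] + tgt_freq[t])
        for (s, t), n in pair_freq.items()
    }


def best_translation(table, word):
    """Return the target word with the highest Dice score for `word`."""
    candidates = {t: score for (s, t), score in table.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None


# Toy English-Tagalog examples (invented for illustration, not from the thesis corpora).
pairs = [
    ("the house", "ang bahay"),
    ("the dog", "ang aso"),
    ("a house", "isang bahay"),
]
table = dice_table(pairs)
print(best_translation(table, "house"))  # bahay
print(best_translation(table, "dog"))    # aso
```

Because the score is symmetric in the two words, the same co-occurrence counts can serve both translation directions, which fits a bi-directional system. With only three sentence pairs the scores are coarse; larger corpora separate true translations from incidental co-occurrences more reliably.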