Bridging Philippine languages with multilingual neural machine translation

Bibliographic Details
Main Author: Baliber, Renz Iver D.
Format: text
Language: English
Published: Animo Repository 2021
Online Access: https://animorepository.dlsu.edu.ph/etdm_comsci/8
https://animorepository.dlsu.edu.ph/cgi/viewcontent.cgi?article=1008&context=etdm_comsci
Institution: De La Salle University
Description
Summary: The Philippines is home to more than 150 languages that are considered low-resource, resulting in little effort to develop translation systems for most of them. Multilingual neural machine translation (NMT) has become an active area of research as a way to improve the results and processes of translation systems for low-resource languages. However, existing work in multilingual NMT has not analyzed a multilingual model on a closely related, low-resource language group in the context of zero-resource translation. In this study, we have benchmarked translation systems for several Philippine languages and provide an analysis of a Transformer-based multilingual NMT system for morphologically rich, low-resource languages in terms of its capability to translate unseen language pairs using zero-shot translation and pivot-based translation. Our studies show that, due to the architectural design of the Transformer model, common words and sentence-length differences affect the performance of a multilingual NMT model in translating both seen and unseen language pairs. Bicolano, Cebuano, and Hiligaynon consistently perform better than the other languages across translation tasks because they have a good balance of word commonality and sentence-length difference. This work also investigated the effect of increasing the model size and capacity. The larger model built a language-invariant shared representation space and stronger decoding capabilities, which allowed it to perform zero-shot translation, whereas the previous, smaller-capacity model failed to develop such a space and could only produce translations up to English when attempting zero-shot translation. Since we are dealing with low-resource multilingual data, some of the risks involved are domain shift and out-of-vocabulary words.
We have also shown how the multilingual NMT model leverages joint byte-pair encoding and the shared representation space to produce translations for unseen or rare words. Lastly, we have shown that the Transformer-based multilingual NMT model can compete with, or outperform, other translation approaches. In a comparative analysis, several baseline statistical MT models were built and compared against a single multilingual NMT model. The results show that the multilingual NMT model outperforms the statistical MT models both in bidirectional translation between English and Philippine languages and in pivot-based translation between Philippine languages, where the multilingual NMT model retained information and context across multilingual translation, something the statistical MT models failed to do. The multilingual NMT model also produces competitive results against a directly trained NMT model in bidirectional Cebuano-Tagalog translation: its pivot-based approach achieved BLEU scores of 6.72 (Tagalog to Cebuano) and 7.20 (Cebuano to Tagalog), against 9.54 and 10.55 for the directly trained NMT model, even though the multilingual NMT model had no parallel Cebuano-Tagalog data. This demonstrates the effectiveness of a multilingual NMT model in building translation systems for low-resource languages.