Bridging Philippine languages with multilingual neural machine translation

The Philippines is home to more than 150 languages that are considered low-resourced, resulting in little effort to develop translation systems for most of them. To aid in improving the results and processes of translation systems for low-resource languages, multilingual NMT became...


Bibliographic Details
Main Author: Baliber, Renz Iver D.
Format: text
Language: English
Published: Animo Repository 2021
Subjects:
Online Access:https://animorepository.dlsu.edu.ph/etdm_comsci/8
https://animorepository.dlsu.edu.ph/cgi/viewcontent.cgi?article=1008&context=etdm_comsci
Institution: De La Salle University
Language: English
id oai:animorepository.dlsu.edu.ph:etdm_comsci-1008
record_format eprints
spelling oai:animorepository.dlsu.edu.ph:etdm_comsci-10082021-11-22T07:23:33Z Bridging Philippine languages with multilingual neural machine translation Baliber, Renz Iver D. The Philippines is home to more than 150 languages that are considered low-resourced, resulting in little effort to develop translation systems for most of them. To aid in improving the results and processes of translation systems for low-resource languages, multilingual NMT became an active area of research. However, existing works in multilingual NMT disregard the analysis of a multilingual model on a closely related, low-resource language group in the context of zero-resource translation. In this study, we benchmarked translation systems for several Philippine languages and provide an analysis of a transformer-based multilingual NMT system for morphologically rich, low-resource languages in terms of its capability to translate unseen language pairs using zero-shot translation and pivot-based translation. Our studies show that, due to the architectural design of the Transformer model, common words and sentence-length differences affect the performance of a multilingual NMT model in translating both seen and unseen language pairs, with Bicolano, Cebuano, and Hiligaynon consistently performing better than the other languages in various translation tasks by having a good balance of commonality and sentence-length difference. This work also investigated the effect of increasing the model size and capacity, which allowed the model to build a language-invariant shared representation space and stronger decoding capabilities for zero-shot translation; the previous, smaller-capacity model failed to develop a language-invariant shared representation space and could only produce translations up to English when attempting zero-shot translation. Since we are dealing with low-resource multilingual data, some of the risks involved are domain shift and out-of-vocabulary words. We have also shown how the multilingual NMT model leverages joint byte-pair encoding and the shared representation space to produce translations for unseen or rare words. Lastly, we have shown that the transformer-based multilingual NMT model can compete with, or outperform, other translation approaches in a comparative analysis against baseline statistical MT models, where several statistical translation models were produced to compare their performance against a single multilingual NMT model. The results show that the translation performance of the multilingual NMT model is superior to the statistical MT models in both a bidirectional English-Philippine languages translation task and a pivot-based Philippine languages translation task, where the multilingual NMT model retained information and context across multilingual translation, something the statistical MT models failed to do. The multilingual NMT model also produced competitive results against a directly trained NMT model in a bidirectional Cebuano-Tagalog translation task, where the pivot-based approach of the multilingual NMT model scored 6.72 and 7.20 BLEU against the 9.54 and 10.55 BLEU of a directly trained NMT model for the Tagalog-to-Cebuano and Cebuano-to-Tagalog translation tasks, even though the multilingual NMT model had no parallel Cebuano-Tagalog data, demonstrating the effectiveness of a multilingual NMT model in building translation systems for low-resource languages. 2021-07-14T07:00:00Z text application/pdf https://animorepository.dlsu.edu.ph/etdm_comsci/8 https://animorepository.dlsu.edu.ph/cgi/viewcontent.cgi?article=1008&context=etdm_comsci Computer Science Master's Theses English Animo Repository Philippine languages—Translations Translators (Computer programs) Computer Sciences
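For illustration, the pivot-based translation described in the abstract routes a sentence through English when no direct parallel data exists between two Philippine languages (e.g. Tagalog to Cebuano). The minimal Python sketch below shows that two-hop control flow only; the `TranslateFn` interface, the `dummy` translator, and the language codes are hypothetical placeholders, not the thesis system's actual API.

```python
from typing import Callable

# A single multilingual NMT model is modeled here as a callable
# (text, src, tgt) -> translation. This interface is a hypothetical
# stand-in; the thesis does not publish its decode API.
TranslateFn = Callable[[str, str, str], str]

def pivot_translate(translate: TranslateFn, text: str,
                    src: str, tgt: str, pivot: str = "en") -> str:
    """Translate src -> tgt through a pivot language (English by default)."""
    intermediate = translate(text, src, pivot)   # e.g. Tagalog -> English
    return translate(intermediate, pivot, tgt)   # e.g. English -> Cebuano

# Toy usage with a dummy translator, just to show the two-hop flow:
def dummy(text: str, src: str, tgt: str) -> str:
    return f"[{src}->{tgt}] {text}"

print(pivot_translate(dummy, "Kumusta ka?", src="tl", tgt="ceb"))
# prints: [en->ceb] [tl->en] Kumusta ka?
```

This is also why the smaller-capacity model described above could produce translations only "up to English": the first hop succeeded, but the decoder could not carry the shared representation through to a non-English target in zero-shot mode.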
institution De La Salle University
building De La Salle University Library
continent Asia
country Philippines
content_provider De La Salle University Library
collection DLSU Institutional Repository
language English
topic Philippine languages—Translations
Translators (Computer programs)
Computer Sciences
spellingShingle Philippine languages—Translations
Translators (Computer programs)
Computer Sciences
Baliber, Renz Iver D.
Bridging Philippine languages with multilingual neural machine translation
description The Philippines is home to more than 150 languages that are considered low-resourced, resulting in little effort to develop translation systems for most of them. To aid in improving the results and processes of translation systems for low-resource languages, multilingual NMT became an active area of research. However, existing works in multilingual NMT disregard the analysis of a multilingual model on a closely related, low-resource language group in the context of zero-resource translation. In this study, we benchmarked translation systems for several Philippine languages and provide an analysis of a transformer-based multilingual NMT system for morphologically rich, low-resource languages in terms of its capability to translate unseen language pairs using zero-shot translation and pivot-based translation. Our studies show that, due to the architectural design of the Transformer model, common words and sentence-length differences affect the performance of a multilingual NMT model in translating both seen and unseen language pairs, with Bicolano, Cebuano, and Hiligaynon consistently performing better than the other languages in various translation tasks by having a good balance of commonality and sentence-length difference. This work also investigated the effect of increasing the model size and capacity, which allowed the model to build a language-invariant shared representation space and stronger decoding capabilities for zero-shot translation; the previous, smaller-capacity model failed to develop a language-invariant shared representation space and could only produce translations up to English when attempting zero-shot translation. Since we are dealing with low-resource multilingual data, some of the risks involved are domain shift and out-of-vocabulary words. We have also shown how the multilingual NMT model leverages joint byte-pair encoding and the shared representation space to produce translations for unseen or rare words. Lastly, we have shown that the transformer-based multilingual NMT model can compete with, or outperform, other translation approaches in a comparative analysis against baseline statistical MT models, where several statistical translation models were produced to compare their performance against a single multilingual NMT model. The results show that the translation performance of the multilingual NMT model is superior to the statistical MT models in both a bidirectional English-Philippine languages translation task and a pivot-based Philippine languages translation task, where the multilingual NMT model retained information and context across multilingual translation, something the statistical MT models failed to do. The multilingual NMT model also produced competitive results against a directly trained NMT model in a bidirectional Cebuano-Tagalog translation task, where the pivot-based approach of the multilingual NMT model scored 6.72 and 7.20 BLEU against the 9.54 and 10.55 BLEU of a directly trained NMT model for the Tagalog-to-Cebuano and Cebuano-to-Tagalog translation tasks, even though the multilingual NMT model had no parallel Cebuano-Tagalog data, demonstrating the effectiveness of a multilingual NMT model in building translation systems for low-resource languages.
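The description credits joint byte-pair encoding plus the shared representation space for translating rare or unseen words. The sketch below shows the underlying mechanism with an invented merge list: an out-of-vocabulary word decomposes into subword units that related languages share. It is a simplified greedy variant of BPE, not the thesis's actual tokenizer, and the merges are made up for illustration; a real joint vocabulary is learned from the concatenated multilingual corpus.

```python
# Simplified sketch of joint byte-pair encoding (BPE) segmentation.
# Real BPE repeatedly merges the highest-priority adjacent pair; this
# greedy variant applies merges in list order, which is enough to show
# how an unseen word decomposes into shared subword units.

def apply_bpe(word: str, merges: list) -> list:
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # merge the pair in place
            else:
                i += 1
    return symbols

# Hypothetical merges such as might be learned from related Philippine
# languages that share affixes (e.g. the "nag-" verb prefix):
merges = [("n", "a"), ("na", "g"), ("k", "a"), ("ka", "w"),
          ("l", "a"), ("la", "kaw")]

# Even if "naglakaw" (Cebuano: "walked") never appeared for one language
# during training, it still maps onto subwords in the joint vocabulary.
print(apply_bpe("naglakaw", merges))   # ['nag', 'lakaw']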
format text
author Baliber, Renz Iver D.
author_facet Baliber, Renz Iver D.
author_sort Baliber, Renz Iver D.
title Bridging Philippine languages with multilingual neural machine translation
title_short Bridging Philippine languages with multilingual neural machine translation
title_full Bridging Philippine languages with multilingual neural machine translation
title_fullStr Bridging Philippine languages with multilingual neural machine translation
title_full_unstemmed Bridging Philippine languages with multilingual neural machine translation
title_sort bridging philippine languages with multilingual neural machine translation
publisher Animo Repository
publishDate 2021
url https://animorepository.dlsu.edu.ph/etdm_comsci/8
https://animorepository.dlsu.edu.ph/cgi/viewcontent.cgi?article=1008&context=etdm_comsci
_version_ 1718383353250447360