Bridging Philippine languages with multilingual neural machine translation
The Philippines is home to more than 150 languages that are considered low-resourced, resulting in little effort toward developing translation systems for most of its languages. To aid in improving the results and processes of translation systems for low-resource languages, multilingual NMT became...
Saved in:
Main Author: Baliber, Renz Iver D.
Format: text
Language: English
Published: Animo Repository, 2021
Subjects: Philippine languages—Translations; Translators (Computer programs); Computer Sciences
Online Access: https://animorepository.dlsu.edu.ph/etdm_comsci/8 https://animorepository.dlsu.edu.ph/cgi/viewcontent.cgi?article=1008&context=etdm_comsci
Institution: De La Salle University
id: oai:animorepository.dlsu.edu.ph:etdm_comsci-1008
record_format: eprints
institution: De La Salle University
building: De La Salle University Library
continent: Asia
country: Philippines
content_provider: De La Salle University Library
collection: DLSU Institutional Repository
language: English
topic: Philippine languages—Translations; Translators (Computer programs); Computer Sciences
description:
The Philippines is home to more than 150 languages that are considered low-resourced, resulting in little effort toward developing translation systems for most of them. To help improve the results and processes of translation systems for low-resource languages, multilingual NMT became an active area of research. However, existing work in multilingual NMT overlooks the analysis of a multilingual model on a closely related, low-resource language group in the context of zero-resource translation.
In this study, we benchmarked translation systems for several Philippine languages and analyzed a transformer-based multilingual NMT system for morphologically rich, low-resource languages in terms of its capability to translate unseen language pairs using zero-shot translation and pivot-based translation. Our studies show that, due to the architectural design of the Transformer model, common words and sentence-length differences affect the performance of a multilingual NMT in translating both seen and unseen language pairs, with Bicolano, Cebuano, and Hiligaynon consistently performing better than the other languages across translation tasks by having a good balance of word commonality and sentence-length difference.
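Pivot-based translation, as evaluated here, composes two supervised translation hops through a bridge language (English), while zero-shot translation asks one multilingual model to translate a pair it never saw in training. A minimal sketch of the pivot pipeline follows; the `translate` stub and the language codes are hypothetical placeholders, since the abstract does not specify the model's API.

```python
def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical stand-in for the trained multilingual NMT's inference
    call, e.g. prepending a target-language token and decoding."""
    raise NotImplementedError("plug in a real model here")


def pivot_translate(text: str, src: str, tgt: str, pivot: str = "en") -> str:
    """Translate src -> tgt via a pivot language in two supervised hops."""
    intermediate = translate(text, src, pivot)  # hop 1: src -> pivot (English)
    return translate(intermediate, pivot, tgt)  # hop 2: pivot -> tgt


# Example: Cebuano -> Tagalog through English, the unseen pair evaluated above.
# pivot_translate("Maayong buntag.", src="ceb", tgt="tl")
```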
This work also investigated the effect of increasing the model's size and capacity, which allowed it to build a language-invariant shared representation space and stronger decoding capabilities for zero-shot translation. The previous, smaller-capacity model failed to develop such a language-invariant shared representation space and could only produce translations up to English when attempting zero-shot translation.
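A common way to check whether a shared representation space is language-invariant is to compare encoder outputs for parallel sentences across languages; nearby vectors suggest invariance. The probe below is an illustrative assumption, not a method the abstract describes, and `encode` is a hypothetical mean-pooled encoder call.

```python
import numpy as np


def encode(sentence: str, lang: str) -> np.ndarray:
    """Hypothetical: mean-pooled encoder states for one sentence."""
    raise NotImplementedError("plug in the multilingual encoder here")


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two sentence vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# In a language-invariant space, parallel sentences should score near 1.0:
# cosine(encode("Maayong buntag.", "ceb"), encode("Magandang umaga.", "tl"))
```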
Since we are dealing with low-resource multilingual data, some of the risks involved are domain shift and out-of-vocabulary words. We have also shown how the multilingual NMT leverages joint byte-pair encoding and the shared representation space to produce translations for unseen or rare words.
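Joint byte-pair encoding learns a single subword vocabulary over the concatenated corpora of all languages, so related languages share subword units and a rare or unseen word still decomposes into known pieces. Below is a minimal sketch of the standard BPE merge loop (Sennrich et al.'s algorithm); the toy words and merge count are illustrative, not the thesis's actual settings.

```python
import re
from collections import Counter


def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Merge every occurrence of the pair into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}


# Toy joint corpus: words from two related Philippine languages, with
# characters space-separated and an end-of-word marker.
vocab = Counter({
    "m a a y o n g </w>": 5,      # Cebuano "maayong"
    "m a g a n d a n g </w>": 5,  # Tagalog "magandang"
    "b u n t a g </w>": 4,        # Cebuano "buntag"
    "u m a g a </w>": 4,          # Tagalog "umaga"
})

for _ in range(10):  # the number of merges is a hyperparameter
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```

Because the vocabulary is learned jointly, frequent shared pieces (here, the common -ng ending of maayong and magandang) tend to be merged early into a single unit used by both languages, which is what lets the model compose output for words it never saw whole.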
Lastly, we have shown that the transformer-based multilingual NMT can compete with, or outperform, other translation approaches. In a comparative analysis, several statistical MT models were trained as baselines and compared against a single multilingual NMT model. The results show that the translation performance of the multilingual NMT is superior to the statistical MT models in both a bidirectional English-Philippine languages translation task and a pivot-based Philippine languages translation task, where the multilingual NMT model retained information and context across multilingual translation, something the statistical MT models failed to do. The multilingual NMT model also produced competitive results against a directly trained NMT in a bidirectional Cebuano-Tagalog translation task: the pivot-based approach of the multilingual NMT scored 6.72 and 7.20 BLEU against the 9.54 and 10.55 BLEU of a directly trained NMT for Tagalog-to-Cebuano and Cebuano-to-Tagalog, respectively, even though the multilingual NMT had no parallel Cebuano-Tagalog data, demonstrating the effectiveness of a multilingual NMT model for building translation systems for low-resource languages.
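The BLEU comparisons above imply a standard corpus-level scorer. A minimal sketch using sacrebleu is shown below; the toolkit choice and the toy sentences are assumptions, as the abstract does not name its evaluation setup.

```python
import sacrebleu  # pip install sacrebleu

# Toy example: one hypothesis and one reference stream for an illustrative
# Cebuano -> Tagalog test set. Real evaluation would use the full test corpus.
hypotheses = ["magandang umaga po sa inyo"]
references = [["magandang umaga po sa inyong lahat"]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```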
format: text
author: Baliber, Renz Iver D.
title: Bridging Philippine languages with multilingual neural machine translation
publisher: Animo Repository
publishDate: 2021
url: https://animorepository.dlsu.edu.ph/etdm_comsci/8 https://animorepository.dlsu.edu.ph/cgi/viewcontent.cgi?article=1008&context=etdm_comsci