Building the language resource for a Cebuano-Filipino neural machine translation system

Bibliographic Details
Main Authors: Adlaon, Kristine Mae M., Marcos, Nelson
Format: text
Published: Animo Repository 2019
Online Access: https://animorepository.dlsu.edu.ph/faculty_research/2552
https://animorepository.dlsu.edu.ph/context/faculty_research/article/3551/type/native/viewcontent
Institution: De La Salle University
Description
Summary: A parallel corpus is a critical resource in machine-learning-based translation. The task of collecting, extracting, and aligning texts to build an acceptable corpus for translation is very tedious, especially for low-resource languages. In this paper, we present the efforts made to build a parallel corpus for Cebuano and Filipino from two different domains: biblical texts and the web. For the biblical resource, subword unit translation for verbs and a copy-able approach for nouns were applied to correct inconsistencies in translation. This correction mechanism was applied as a preprocessing technique. For Wikipedia, the main web resource, commonly occurring topic segments were extracted from both the source and the target languages. These observed topic segments are unique in 4 different categories. The identification of these topic segments may be used for automatic extraction of sentences. A Recurrent Neural Network was used to implement the translation, using the OpenNMT sequence modeling tool in TensorFlow. The two corpora were then evaluated by using them as two separate inputs to the neural network. Results showed a difference in BLEU score between the two corpora. © 2019 Copyright is held by the owner/author(s). Publication rights licensed to ACM.