Building the language resource for a Cebuano-Filipino neural machine translation system

A parallel corpus is a critical resource in machine-learning-based translation. Collecting, extracting, and aligning texts to build an acceptable corpus for translation is tedious, especially for low-resource languages. In this paper, we present the efforts made to...

Bibliographic Details
Main Authors:	Adlaon, Kristine Mae M., Marcos, Nelson
Format:	text
Published:	Animo Repository 2019
Subjects:	Cebuano language -- Machine translating; Cebuano language -- Transliteration into Filipino; Natural language processing (Computer science); Computer Sciences
Online Access:	https://animorepository.dlsu.edu.ph/faculty_research/2552 https://animorepository.dlsu.edu.ph/context/faculty_research/article/3551/type/native/viewcontent
Institution:	De La Salle University

id	oai:animorepository.dlsu.edu.ph:faculty_research-3551
record_format	eprints
spelling	oai:animorepository.dlsu.edu.ph:faculty_research-3551 2021-09-06T02:31:38Z 2019-06-28T07:00:00Z text text/html Faculty Research Work Animo Repository
building	De La Salle University Library
continent	Asia
country	Philippines
content_provider	De La Salle University Library
collection	DLSU Institutional Repository
description	A parallel corpus is a critical resource in machine-learning-based translation. Collecting, extracting, and aligning texts to build an acceptable corpus for translation is tedious, especially for low-resource languages. In this paper, we present the efforts made to build a parallel corpus for Cebuano and Filipino from two different domains: biblical texts and the web. For the biblical resource, subword-unit translation for verbs and a copy-able approach for nouns were applied as a preprocessing step to correct inconsistencies in translation. For the web resource, mainly Wikipedia, commonly occurring topic segments were extracted from both the source and the target languages. These observed topic segments are unique in four different categories, and identifying them may support automatic extraction of sentences. A recurrent neural network was used to implement the translation, using the OpenNMT sequence modeling tool in TensorFlow. The two corpora were then evaluated by using them as two separate inputs to the neural network. The results show a difference in BLEU score between the two corpora. © 2019 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Similar Items