Neural machine translation with limited resources

Bibliographic Details
Main Author: Mohiuddin, Tasnim
Other Authors: Joty, Shafiq Rayhan
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2022
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access:https://hdl.handle.net/10356/157475
Institution: Nanyang Technological University
Description:

With the advent of deep neural networks, Neural Machine Translation (NMT) systems have achieved state-of-the-art performance on standard translation benchmarks. NMT translates from one language to another with a single neural network trained end-to-end, and within a few years of research these models outperformed traditional statistical systems. Despite this success, NMT has notable limitations. Chief among them is that NMT models are data-hungry: they work well only when a massive amount of parallel training data (bitext) is available and perform poorly when data is limited. Apart from a few mainstream languages such as English, French, and Chinese, most natural languages are low-resource and lack large parallel corpora. Moreover, acquiring large bitext corpora is not viable in most scenarios, especially under resource-constrained conditions. Researchers have tried to extend the success of NMT from high-resource to low-resource languages through techniques such as transfer learning, data augmentation, and pivoting, but these still require strong cross-lingual signals, i.e., substantial parallel data.

One solution is to transfer cross-lingual signals through cross-lingual word embeddings (CLWEs), which can be learned from monolingual data either in an unsupervised way or with the help of a small seed dictionary, making them promising for resource-constrained machine translation (MT). Most successful and predominant CLWE methods (a.k.a. word translation methods) learn a linear mapping function based on the assumption that the two embedding spaces are approximately isomorphic, which often does not hold in practice. We hypothesize that learning the cross-lingual mapping in a projected latent space gives the model enough flexibility to induce the required geometric structures, making the embeddings easier to align. Based on this hypothesis, we propose two novel models for learning CLWEs and empirically show that they are particularly effective for low-resource languages.
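For context, the linear-mapping baseline that this description argues against can be sketched in a few lines. The following is a minimal, illustrative sketch (not the thesis's latent-space method): it computes the closed-form orthogonal Procrustes solution for a supervised seed dictionary, with synthetic embeddings standing in for real ones.

```python
import numpy as np

def procrustes_mapping(X, Y):
    """Closed-form solution of min_W ||XW - Y||_F with W orthogonal.

    X, Y: (n, d) source/target embeddings for n seed-dictionary pairs.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)  # SVD of the cross-covariance
    return U @ Vt

# Synthetic sanity check: if the target space really is a rotation of the
# source space (the isomorphy assumption), the linear map aligns it exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))                  # "source-language" embeddings
R = np.linalg.qr(rng.normal(size=(32, 32)))[0]  # a random rotation
Y = X @ R                                       # isomorphic "target" embeddings
W = procrustes_mapping(X, Y)
print(np.allclose(X @ W, Y))                    # True under isomorphy
```

When the two embedding spaces are not near-isomorphic, as is common for distant and low-resource language pairs, no single linear map aligns them well; this is the limitation that motivates learning the mapping in a more flexible latent space.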
We then turn from word-level to sentence-level translation with limited resources. Specifically, we focus on data augmentation strategies, widely used in NLP and computer vision, to increase model robustness in resource-constrained scenarios. We thoroughly investigate the domain-mismatch issue that hinders the broader success of existing augmentation techniques in NMT. We then introduce a novel data augmentation framework for low-resource NMT that leverages neighboring samples of the original parallel data without explicitly using additional monolingual data, diversifying the in-domain parallel data in a controlled way. In extensive experiments on four low-resource language pairs spanning different domains, our method performs comparably to traditional back-translation, which uses extra in-domain monolingual data.

Typically, NMT systems are trained on heterogeneous data drawn from different domains, sources, topics, styles, and modalities, and data quality varies considerably. During training, all the data are usually concatenated and randomly shuffled; however, not all of it may be useful: some may be redundant, and some may be noisy and detrimental to the final system's performance. These problems are more acute for low-resource languages than for high-resource ones. We therefore explore curriculum training for NMT, i.e., presenting data to the model in a systematic order during training. We introduce a two-stage curriculum training framework in which we fine-tune a base NMT model on selected subsets of the data. To select these subsets, we propose two scoring approaches: deterministic scoring using pre-trained methods, and online scoring that considers the prediction scores of the emerging NMT model. Our curriculum strategies consistently yield better translation quality and faster convergence (approximately 50% fewer updates) on both high- and low-resource languages.
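To make the two-stage idea concrete, here is a minimal sketch under assumed interfaces: `train`, `fine_tune`, and `score_fn` are placeholders, and the toy length-ratio scorer stands in for the thesis's actual deterministic and online scorers, which are far more involved.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

def select_subset(bitext: List[Pair],
                  score_fn: Callable[[Pair], float],
                  keep_fraction: float) -> List[Pair]:
    """Rank sentence pairs by score and keep the top fraction."""
    ranked = sorted(bitext, key=score_fn, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

def two_stage_curriculum(bitext, train, fine_tune, score_fn, keep_fraction=0.5):
    """Stage 1: train a base model on all data.
    Stage 2: fine-tune it on the highest-scoring subset."""
    model = train(bitext)
    subset = select_subset(bitext, score_fn, keep_fraction)
    return fine_tune(model, subset)

# Toy usage with stand-in training functions and a length-based score.
data = [("guten morgen", "good morning"), ("hallo welt", "hello world"),
        ("zzz", "???")]
result = two_stage_curriculum(
    data,
    train=lambda d: {"seen": len(d)},                         # placeholder "training"
    fine_tune=lambda m, d: {**m, "fine_tuned_on": len(d)},    # placeholder "fine-tuning"
    score_fn=lambda pair: -abs(len(pair[0]) - len(pair[1])),  # toy cleanliness proxy
)
print(result)  # {'seen': 3, 'fine_tuned_on': 1}
```

In the deterministic variant described above, the scores would come from pre-trained models and stay fixed; in the online variant, they would be recomputed from the emerging NMT model's own predictions as training progresses.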
School: School of Computer Science and Engineering
Citation: Mohiuddin, T. (2022). Neural machine translation with limited resources. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/157475
DOI: 10.32657/10356/157475
License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).