Improving neural machine translation: data centric approaches
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2023
Online Access: https://hdl.handle.net/10356/170533
Institution: Nanyang Technological University
Summary: Neural machine translation (NMT), where neural networks are used to generate translations, has revolutionized the field of machine translation (MT) over the past ten years, thanks to the introduction of the attention mechanism. From inefficient recurrent structures, NMT has evolved into Transformer models that consist purely of attention layers, which improve scaling efficiency through parallelizable computation. With more data available, scaling has helped achieve outstanding performance not just in translation tasks but also in other natural language processing (NLP) tasks. This has led to wide adoption in real-world products such as Google Translate.
Nonetheless, neural networks come with a major drawback: they require large amounts of supervised, or parallel, data to excel. As a result, translation systems between over-represented languages such as English and French can reach near human-level, commercially viable quality, while under-represented and distant languages such as Nepali remain considerably harder to translate, limiting the benefits for developing countries. The fruits of NMT technology are therefore still inequitably distributed.
Given that human-annotated data are expensive to obtain, there are a few directions for working around the lack of data. One widely pursued direction is to develop better models, layers and training techniques. However, newly developed models are often more complex to implement, while the gains are often incremental. The second approach is to increase the effective size of the original training dataset, or the total information extractable from it, through either organic or synthetic means.
This thesis introduces a series of works that emphasizes the second direction: it attempts to synthetically increase the training data size using purely data manipulation and generation techniques, as well as to inject alternative organic representations of the original data, such as constituency trees, into NMT models. Each work tackles a progressively more challenging problem in machine translation, ranging from supervised translation in high-resource languages to fully unsupervised translation in extremely low-resource and distant languages. In doing so, this series of works strives to contribute to making translation technology more equitable and accessible.
The first work focuses on building a new Transformer architecture that can explicitly, effectively and efficiently absorb and exploit the constituency-tree representation of natural language, which encodes valuable grammatical and semantic information. Specifically, our novel structure, called Hierarchical Accumulation, allows the attention layers to embed constituency trees in a bottom-up fashion that extracts the most structural information, while maintaining the same asymptotic time complexity as self-attention. The resulting Tree-Transformer model achieves large performance gains in supervised MT tasks while consuming much less time than comparable competitors.
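The bottom-up flavor of this idea can be illustrated with a much simpler stand-in: each constituency-tree node is summarized from the representations of the leaves it spans, computed in one pass from leaves to root. The sketch below is only a minimal illustration, not the thesis' actual Hierarchical Accumulation layers; `TreeNode` and `accumulate_bottom_up` are hypothetical names, and mean-pooling stands in for the learned accumulation inside attention.

```python
# Minimal sketch of bottom-up accumulation over a constituency tree.
# Illustrative only: mean-pooling replaces the learned accumulation used
# inside the thesis' attention layers.
from dataclasses import dataclass, field
from typing import List, Optional
import torch

@dataclass
class TreeNode:
    children: List["TreeNode"] = field(default_factory=list)
    embedding: Optional[torch.Tensor] = None   # set on leaves (word embeddings)

def accumulate_bottom_up(node: TreeNode) -> torch.Tensor:
    """Summarize `node` by averaging its children's vectors, recursing so that
    leaf embeddings are folded into ever larger constituents bottom-up."""
    if not node.children:                      # leaf: return its word embedding
        return node.embedding
    child_vecs = [accumulate_bottom_up(c) for c in node.children]
    node.embedding = torch.stack(child_vecs).mean(dim=0)
    return node.embedding

# Toy usage: the tree (S (NP w1 w2) (VP w3)) with random 4-dimensional leaves.
d = 4
w1, w2, w3 = (TreeNode(embedding=torch.randn(d)) for _ in range(3))
tree = TreeNode(children=[TreeNode(children=[w1, w2]), TreeNode(children=[w3])])
print(accumulate_bottom_up(tree).shape)        # torch.Size([4])
```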
Though organic forms like trees can be useful, they are not easy to obtain or exploit for many languages. Thus, in the second work, we shift our effort to dramatically increasing the effective size of the training data synthetically, using a model-based approach called Data Diversification. In particular, we train multiple distinct MT teacher models and use them to translate the source and target texts of the training data into multiple different versions of themselves, multiplying the unique data size at least sevenfold. This scaling effect helps achieve the state of the art (SOTA) on the standard WMT translation tasks. It is also effective at improving translation in low-resource languages like Nepali and Sinhala.
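A minimal sketch of this recipe, under the assumption of generic caller-supplied hooks (`train_teacher` and `translate` are illustrative names, not the thesis' API): k forward and k backward teachers re-translate the training set, and their outputs are concatenated with the original pairs, growing the unique data roughly (2k + 1)-fold, so k = 3 matches the sevenfold figure above.

```python
# Sketch of Data Diversification: teacher models re-translate the training
# data and the synthetic pairs are added back to the original corpus.
from typing import Callable, List, Tuple

Pair = Tuple[str, str]

def data_diversification(
    data: List[Pair],
    train_teacher: Callable[[List[Pair], str, int], object],
    translate: Callable[[object, List[str]], List[str]],
    k: int = 3,
) -> List[Pair]:
    srcs = [s for s, _ in data]
    tgts = [t for _, t in data]
    diversified: List[Pair] = list(data)               # keep the original pairs
    for seed in range(k):
        fwd = train_teacher(data, "src->tgt", seed)    # forward teacher
        bwd = train_teacher(data, "tgt->src", seed)    # backward teacher
        diversified += list(zip(srcs, translate(fwd, srcs)))   # synthetic targets
        diversified += list(zip(translate(bwd, tgts), tgts))   # synthetic sources
    return diversified                                 # final model trains on this

# Toy usage with trivial "teachers" that copy their input (real teachers would
# be trained NMT models):
toy = data_diversification(
    [("hello", "bonjour"), ("cat", "chat")],
    train_teacher=lambda data, direction, seed: direction,
    translate=lambda model, sents: sents,
    k=1,
)
print(len(toy))   # 2 original + 4 synthetic = 6 pairs, i.e. (2k + 1) = 3x
```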
In the third work, we step up the challenge by tackling fully unsupervised machine translation (UMT), where no parallel data is allowed. Modern UMT methods typically involve iterative back-translation, in which the models generate synthetic parallel data from unlabeled monolingual data and train on it in a positive feedback loop. Although back-translation is already a synthetic parallel data generation process, the model's back-translated data stop changing as the model converges, starving it of a diverse data source for further gains. Moreover, adapting the Data Diversification method does not empirically increase data diversity in unsupervised setups. To that end, we propose a new strategy, Cross-model Back-translated Distillation (CBD), which generates additional diverse synthetic data using two distinct UMT models in a two-way back-translation process. In our experiments, CBD achieves the SOTA on the standard WMT English-French, English-German and English-Romanian unsupervised translation tasks. We also show that it is effective because it embraces data diversity, whereas other similar variants do not.
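The following sketch contrasts the two data-generation loops just described. The `translate(model, sentences, direction)` hook is an illustrative stand-in for a real UMT toolkit, and the cross-model function is only one plausible reading of the two-way back-translation process; the exact CBD procedure in the thesis may differ in its details.

```python
# Standard back-translation vs. a cross-model (CBD-style) two-way variant.
from typing import List, Tuple

def back_translate(model, mono_tgt: List[str], translate) -> List[Tuple[str, str]]:
    # Translate target-language monolingual text into the source language and
    # pair the synthetic source with the original target sentence.
    synthetic_src = translate(model, mono_tgt, "tgt->src")
    return list(zip(synthetic_src, mono_tgt))

def cross_model_pairs(model_a, model_b, mono_src: List[str], translate) -> List[Tuple[str, str]]:
    # Two-way back-translation across two distinct UMT models: model A maps the
    # monolingual source text into the target language, model B maps that output
    # back into the source language; the two synthetic sides form a pair that
    # neither model would have produced on its own.
    synthetic_tgt = translate(model_a, mono_src, "src->tgt")
    synthetic_src = translate(model_b, synthetic_tgt, "tgt->src")
    return list(zip(synthetic_src, synthetic_tgt))

# Toy usage with an "echo" translator standing in for real UMT models:
echo = lambda model, sents, direction: [f"{model}:{s}" for s in sents]
print(back_translate("A", ["das Haus"], echo))
print(cross_model_pairs("A", "B", ["the house"], echo))
```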
Our further study shows that scaling CBD up does not bring additional benefits. Meanwhile, the existing UMT pipeline may introduce noise early on because the back-translated data are of low quality. In our quest for a better way to obtain more synthetic data, we hypothesize that genuinely parallel sentences may exist within the unlabeled corpora, and that sentences with similar meanings from different languages can be paired to form new synthetic, but higher-quality, parallel data. Such extra data are then used to augment UMT training alongside back-translation. This task is called unsupervised pseudo-parallel data mining. In particular, we introduce a novel language-agnostic contrastive clustering loss (LAgSwAV), used to finetune pre-trained encoders so that the sentence embeddings they produce from the unlabeled corpora are semantically clustered, which makes it easier to mine pseudo-parallel data with higher accuracy. Our method further extends the state of the art on the WMT English-French, English-German and English-Romanian unsupervised tasks. We also show that the data mined by our model are much more accurate than those of the baselines.
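Once an encoder produces well-clustered, language-agnostic sentence embeddings, the mining step itself reduces to cross-lingual nearest-neighbour search. The sketch below uses mutual nearest neighbours with cosine similarity and a fixed threshold; it is a simplified illustration of the downstream mining step only, and does not reproduce the LAgSwAV loss or the scoring function used in the thesis.

```python
# Simplified pseudo-parallel mining via mutual nearest neighbours.
import numpy as np

def mine_pseudo_parallel(src_emb: np.ndarray, tgt_emb: np.ndarray, threshold: float = 0.8):
    """src_emb: (n_src, d), tgt_emb: (n_tgt, d); returns (i, j, score) triples."""
    # L2-normalise so dot products are cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                                   # (n_src, n_tgt) cosine matrix
    best_tgt = sim.argmax(axis=1)                       # best target for each source
    best_src = sim.argmax(axis=0)                       # best source for each target
    pairs = []
    for i, j in enumerate(best_tgt):
        # Keep only mutual nearest neighbours above the similarity threshold.
        if best_src[j] == i and sim[i, j] >= threshold:
            pairs.append((i, int(j), float(sim[i, j])))
    return pairs

# Toy usage with random embeddings (real ones come from the finetuned encoder).
rng = np.random.default_rng(0)
print(mine_pseudo_parallel(rng.normal(size=(5, 16)), rng.normal(size=(6, 16)), threshold=0.0))
```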
Finally, we tackle the most difficult task in MT: unsupervised machine translation for low-resource and distant languages, such as Nepali and Sinhala. For such languages, there is not even enough unlabeled monolingual data, so the model fails to converge without the help of unlabeled data from other high-resource languages via multilingual pre-training. Furthermore, the aforementioned pseudo-parallel data hypothesis no longer holds in practice for these languages. In our final work, we propose to gradually separate individual languages out of a multilingual UMT model and specialize it on a single language pair through successive finetuning stages. Our method achieves the best performance on 16 different low-resource unsupervised tasks, with languages ranging from the Indic group, such as Nepali and Sinhala, to a language spoken in Kazakhstan.
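A schematic sketch of this gradual language-separation idea, under stated assumptions: start from a multilingual UMT checkpoint and finetune it over several stages, shrinking the language set at each stage until only the target pair remains. `finetune_umt`, the checkpoint name and the schedule are all hypothetical; this is not the thesis' exact recipe.

```python
# Staged specialisation of a multilingual UMT model down to one language pair.
from typing import Callable, List

def staged_separation(
    checkpoint,
    schedules: List[List[str]],                 # each stage keeps fewer languages
    finetune_umt: Callable[[object, List[str]], object],
):
    model = checkpoint
    for langs in schedules:
        model = finetune_umt(model, langs)      # specialise on the remaining set
    return model                                # final model handles one pair

# Toy usage: "finetuning" here just records which languages each stage kept.
model = staged_separation(
    checkpoint="multilingual-ckpt",
    schedules=[["en", "fr", "hi", "ne", "si"], ["en", "hi", "ne", "si"], ["en", "ne"]],
    finetune_umt=lambda model, langs: f"{model} -> {'+'.join(langs)}",
)
print(model)   # shows the progressively narrower language sets
```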