Transfer learning for language model adaptation
Format: Thesis - Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2023
Online Access: https://hdl.handle.net/10356/169892
Institution: Nanyang Technological University
Summary: Language is the pathway that transcends the boundaries of land and culture, and bridging the gap between languages is one of the biggest challenges for Artificial Intelligence (AI) systems. The current success of AI systems is dominated by the supervised learning paradigm, in which gradient-based learning algorithms (e.g., SGD, Adam) are designed to optimize complex, high-dimensional loss surfaces. These algorithms learn from statistical observations that are typically collected with a specific task in mind (e.g., product reviews for sentiment analysis). The use of task-dependent samples makes the learning procedure tedious, as it requires manually annotated data. Moreover, without a sufficient number of samples to represent the underlying distribution, deep learning models tend to lack robustness. Because data collection is inherently random, not every possible observation is captured, which creates an Out-of-Distribution (OOD) problem for learning algorithms.
In search of a generic, task-agnostic distribution, a large collection of text spanning various domains can be regarded as a Standard Natural Text Distribution (SNTD). The general idea of Transfer Learning for Natural Language Processing (NLP) is to utilize this SNTD knowledge for any task-dependent training. Learning the SNTD, followed by a task-adaptation step using a smaller volume of annotated data, has yielded state-of-the-art (SOTA) results on various supervised NLP tasks. However, annotated data for every task in every language is rare.
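The "pretrain on generic text, then adapt to a task with a small annotated set" recipe described above could be realized in many ways; the following is a minimal sketch using the Hugging Face `transformers` library as one possible implementation (an assumption, not the dissertation's prescribed setup). The checkpoint name, label count, and tiny annotated set are illustrative placeholders.

```python
# Hedged sketch: adapt a generically pretrained (SNTD-style) encoder to a
# small supervised task. Checkpoint and data are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-multilingual-cased"   # assumed pretrained encoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A tiny, illustrative annotated set for task adaptation.
texts = ["great product", "terrible experience"]
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss     # one task-specific fine-tuning step
loss.backward()
optimizer.step()
```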
In language models, there are many kinds of distributional variance. One of the most common ways distributional variance is encoded into a language model is when it is trained on monolingual text and the languages are learned disjointly. Word embeddings produced by these language models are then used as pre-trained embedding vectors to adapt to a downstream task. We propose adversarial training to project two monolingual distributions into the same space, and then improve the robustness of the model through augmented fine-tuning with parameter sharing. By projecting the monolingual language distributions into the same cross-lingual space, the distributions of the languages become aware of each other; the projected distributions are semantically aligned in the latent space. Thus, when we train on one distribution, the other automatically adapts to the training data, making it easier to transfer (exchange) knowledge. On top of that, the proposed novel self-training architecture improves cross-lingual transfer by a large margin.
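To make the adversarial projection idea concrete, the sketch below shows a generic GAN-style alignment of two monolingual embedding spaces: a linear mapper projects source-language embeddings into the target space while a discriminator tries to identify which language an embedding came from. This is an illustrative sketch, not the dissertation's exact architecture; the dimensionality, network sizes, and learning rates are arbitrary assumptions.

```python
# Hedged sketch: adversarial alignment of two monolingual embedding spaces.
import torch
import torch.nn as nn

dim = 300  # assumed embedding dimensionality

mapper = nn.Linear(dim, dim, bias=False)          # W: source -> shared space
discriminator = nn.Sequential(                    # predicts "is mapped source?"
    nn.Linear(dim, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1)
)
opt_map = torch.optim.Adam(mapper.parameters(), lr=1e-4)
opt_dis = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def adversarial_step(src_emb: torch.Tensor, tgt_emb: torch.Tensor) -> None:
    """One alignment step on a batch of source/target embeddings."""
    # 1) Train the discriminator to separate mapped-source from target.
    with torch.no_grad():
        mapped = mapper(src_emb)
    dis_in = torch.cat([mapped, tgt_emb], dim=0)
    dis_labels = torch.cat([torch.ones(len(src_emb), 1),
                            torch.zeros(len(tgt_emb), 1)], dim=0)
    opt_dis.zero_grad()
    bce(discriminator(dis_in), dis_labels).backward()
    opt_dis.step()

    # 2) Train the mapper to fool the discriminator (flipped labels).
    opt_map.zero_grad()
    bce(discriminator(mapper(src_emb)), torch.zeros(len(src_emb), 1)).backward()
    opt_map.step()

# Example usage with random stand-in embeddings:
adversarial_step(torch.randn(32, dim), torch.randn(32, dim))
```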
Next, we focus on the jointly trained multilingual language model, where there is no predominant distributional variance. For the multilingual language model, we put more focus on downstream task adaptation. We found that semi-supervised learning with pseudo-augmented data from a pre-trained language model can greatly improve downstream task performance. Finally, we introduce a novel data augmentation framework that uses the neighboring (vicinal) samples of the original training data without explicitly using any parallel text corpora or machine translation system. Our proposed method performs simultaneous self-training with data augmentation and unsupervised sample selection, and it further proposes curriculum strategies for samples from different domains. With extensive experiments on three different cross-lingual tasks spanning many language pairs, we demonstrate the effectiveness of our proposed method.
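The following sketch illustrates one common form of the self-training idea referenced above: a teacher model pseudo-labels unlabeled data, and an unsupervised selection criterion keeps only confident samples. It is a generic sketch in the spirit of the paragraph, not the dissertation's exact framework; the entropy threshold, teacher model, and feature sizes are assumptions.

```python
# Hedged sketch: pseudo-labeling with entropy-based sample selection.
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_pseudo_labels(teacher, unlabeled_batch, max_entropy=0.5):
    """Label unlabeled samples with the teacher and keep only confident ones.

    Confidence is measured by predictive entropy: low entropy means the
    teacher is certain, so the sample is kept for augmented training.
    """
    probs = F.softmax(teacher(unlabeled_batch), dim=-1)        # [batch, n_classes]
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)  # [batch]
    keep = entropy < max_entropy                               # unsupervised selection
    return unlabeled_batch[keep], probs[keep].argmax(dim=-1)   # inputs, pseudo-labels

# Example with a stand-in teacher and random "unlabeled" features:
teacher = torch.nn.Linear(16, 3)
kept_x, pseudo_y = select_pseudo_labels(teacher, torch.randn(8, 16))
```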
While all of the work above focuses on improving task adaptation in multiple languages without supervision, we further investigate how adding a few labeled samples affects multilingual task adaptation. To this end, we leverage a small number of support samples from each language and propose an inference-time, transductive nearest-neighbor approach that utilizes the entropy of the query samples for prediction. We show that our proposed method outperforms full-model/full-head fine-tuning as well as cross-task fine-tuning. We also show an impressive (~37x) gain in computational cost over full inference prediction. However, as language models grow larger, it becomes increasingly difficult to perform efficient inference, especially across multiple tasks.
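As a rough illustration of nearest-neighbor inference over a small support set, the sketch below classifies query embeddings by similarity to labeled support embeddings. The exact role that query-sample entropy plays in the dissertation's method is not specified here, so this sketch simply reports the entropy of the soft class distribution as a confidence signal; that choice, along with the temperature and dimensions, is an assumption for illustration only.

```python
# Hedged sketch: nearest-neighbor prediction from a labeled support set.
import torch
import torch.nn.functional as F

def nn_predict(query_emb, support_emb, support_labels, temperature=0.1):
    """Classify queries by similarity to labeled support embeddings."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(support_emb, dim=-1).T
    weights = F.softmax(sims / temperature, dim=-1)            # [n_query, n_support]
    n_classes = int(support_labels.max()) + 1
    one_hot = F.one_hot(support_labels, n_classes).float()     # [n_support, n_classes]
    class_probs = weights @ one_hot                            # soft votes per class
    entropy = -(class_probs * class_probs.clamp_min(1e-12).log()).sum(-1)
    return class_probs.argmax(dim=-1), entropy                 # prediction, confidence

# Example with random stand-in embeddings and 3 classes:
queries, supports = torch.randn(5, 64), torch.randn(12, 64)
labels = torch.randint(0, 3, (12,))
preds, conf = nn_predict(queries, supports, labels)
```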
The jointly optimized multilingual distribution helps transfer knowledge from high-resource languages to low-resource languages. While working on transductive nearest-neighbor inference, we observe that language models are highly sensitive to task distributions: unless an extremely large language model (>100B parameters) is used, a model adapted to one specific task cannot readily be reused for another. In this dissertation, our final proposed method addresses this issue through multitask prompted learning.
Multitask prompted learning can aid generalization by training on a diverse set of tasks and domains at once, thus enhancing the potential to remove distributional variance for downstream tasks. We propose a semi-parametric prompt tuning method for multitask prompted learning. The novel component of our proposed method is a memory bank from which memory prompts are retrieved based on discrete prompts. Extensive experiments on 31 different tasks from 8 different domains demonstrate the effectiveness of our proposed method.
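To illustrate the general shape of prompt retrieval from a memory bank, the sketch below keys a bank of learnable soft prompts by a mean-pooled encoding of the discrete prompt and mixes the top-k retrieved entries. This is a hedged sketch of the idea described above rather than the dissertation's exact design; the retrieval key, top-k mixing scheme, and all sizes are assumptions made for this example.

```python
# Hedged sketch: retrieving soft prompts from a learnable memory bank.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptMemoryBank(nn.Module):
    def __init__(self, n_slots=64, prompt_len=8, d_model=768, top_k=4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_slots, d_model))                 # retrieval keys
        self.prompts = nn.Parameter(torch.randn(n_slots, prompt_len, d_model))  # memory prompts
        self.top_k = top_k

    def forward(self, discrete_prompt_emb):
        """discrete_prompt_emb: [batch, seq, d_model] token embeddings of the
        discrete (textual) prompt; returns soft prompts to prepend to the input."""
        query = discrete_prompt_emb.mean(dim=1)                        # [batch, d_model]
        scores = F.normalize(query, dim=-1) @ F.normalize(self.keys, dim=-1).T
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)          # [batch, k]
        weights = F.softmax(top_scores, dim=-1)                        # mix retrieved slots
        retrieved = self.prompts[top_idx]                              # [batch, k, len, d]
        soft_prompt = (weights[..., None, None] * retrieved).sum(dim=1)
        return soft_prompt                                             # [batch, prompt_len, d]

# Example: retrieve a soft prompt for a batch of discrete-prompt embeddings.
bank = PromptMemoryBank()
soft = bank(torch.randn(2, 16, 768))   # [2, 8, 768], ready to concatenate
```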
This dissertation explores the adaptability of language models across multiple languages, tasks, and domains. It begins with the fundamental multilingual adaptation problem and extends from there to many different OOD cases with varying resource availability across languages, tasks, and domains.