Transfer learning for language model adaptation

Language is a pathway to democratizing the boundaries of land and culture, and bridging the gap between languages is one of the biggest challenges for Artificial Intelligence (AI) systems. The current success of AI systems is dominated by the supervised learning paradigm, in which gradient-based learning algori...

Full description

Saved in:
Bibliographic Details
Main Author: Bari M. Saiful
Other Authors: Joty Shafiq Rayhan
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access:https://hdl.handle.net/10356/169892
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-169892
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
spellingShingle Engineering::Computer science and engineering
Bari M. Saiful
Transfer learning for language model adaptation
description Language is a pathway to democratizing the boundaries of land and culture, and bridging the gap between languages is one of the biggest challenges for Artificial Intelligence (AI) systems. The current success of AI systems is dominated by the supervised learning paradigm, in which gradient-based learning algorithms (e.g., SGD, Adam) optimize complex, high-dimensional loss surfaces. These algorithms learn from statistical observations that are typically collected for a specific task (e.g., product reviews, sentiment analysis). The use of task-dependent samples makes the learning procedure tedious because it requires manually annotated data. Moreover, without a sufficient number of samples to represent the underlying distribution, deep learning models tend to lack robustness. Because of inherent randomness, not all possible observations are captured during data collection, which creates an Out-of-Distribution (OOD) problem for learning algorithms. In the search for a generic, task-agnostic distribution, a large collection of text spanning various domains can be regarded as a Standard Natural Text Distribution (SNTD). The general idea of transfer learning for Natural Language Processing (NLP) is to exploit SNTD knowledge for any other task-dependent training: learning the SNTD, followed by task adaptation with a smaller volume of annotated data, has yielded state-of-the-art (SOTA) results on various supervised NLP tasks. However, annotated data for every task in every language is rare. Language models exhibit many kinds of distributional variance. One of the most common ways distributional variance is encoded into a language model is when it is trained on monolingual text and learned disjointly; word embeddings produced by such language models are then used as pre-trained embedding vectors for adaptation to a downstream task. We propose adversarial training to project two monolingual distributions into the same space and then improve the robustness of the model through augmented fine-tuning with parameter sharing. By projecting the monolingual language distributions into a shared cross-lingual space, the language distributions become aware of each other and are semantically aligned in the latent space. Thus, when we train on one distribution, the other automatically adapts to the training data, making it easier to transfer (exchange) knowledge. In addition, the proposed novel self-training architecture improves cross-lingual transfer by a large margin. We then focus on jointly trained multilingual language models, where there is no predominant distributional variance, and concentrate on downstream task adaptation. We find that semi-supervised learning with pseudo-augmented data from a pre-trained language model can greatly improve downstream task performance. Finally, we introduce a novel data augmentation framework that uses the neighboring (vicinal) samples of the original training data without explicitly using any parallel text corpora or machine translation system. Our proposed method performs simultaneous self-training with data augmentation and unsupervised sample selection, and it employs curriculum strategies for samples from different domains. Extensive experiments on three different cross-lingual tasks spanning many language pairs demonstrate the effectiveness of our proposed method.
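To make the cross-lingual projection idea above concrete, the following is a minimal sketch, not the thesis implementation, of adversarially mapping one monolingual embedding space onto another with a linear mapper and a discriminator. The tensor names (`src_emb`, `tgt_emb`), network sizes, and hyper-parameters are illustrative assumptions.

```python
# Illustrative sketch: adversarially project source-language word vectors into the
# target-language embedding space so the two distributions become indistinguishable.
import torch
import torch.nn as nn

dim = 300
mapper = nn.Linear(dim, dim, bias=False)           # projects source embeddings into the target space
discriminator = nn.Sequential(                     # tries to tell mapped-source from real target vectors
    nn.Linear(dim, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1)
)
opt_map = torch.optim.Adam(mapper.parameters(), lr=1e-4)
opt_dis = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def adversarial_step(src_emb, tgt_emb):
    """One training step; src_emb and tgt_emb are batches of monolingual word vectors."""
    # 1) Discriminator step: label mapped-source vectors 0 and real target vectors 1.
    with torch.no_grad():
        mapped = mapper(src_emb)
    dis_in = torch.cat([mapped, tgt_emb], dim=0)
    dis_lbl = torch.cat([torch.zeros(len(src_emb), 1), torch.ones(len(tgt_emb), 1)], dim=0)
    opt_dis.zero_grad()
    bce(discriminator(dis_in), dis_lbl).backward()
    opt_dis.step()
    # 2) Mapper step: fool the discriminator so mapped-source vectors look like target vectors.
    opt_map.zero_grad()
    bce(discriminator(mapper(src_emb)), torch.ones(len(src_emb), 1)).backward()
    opt_map.step()
```

Once the two spaces are aligned this way, fine-tuning on annotated data in one language can transfer to the other through the shared representation, which is the effect the paragraph above describes.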
While all of the work above focuses on improving task adaptation in multiple languages without supervision, we further investigate how adding a few labeled samples affects multilingual task adaptation. To this end, we leverage a small number of support samples from each language and propose an inference-time transductive nearest-neighbor approach that uses the entropy of the query samples for prediction. We show that our proposed method outperforms full-model and full-head fine-tuning as well as cross-task fine-tuning, while reducing the computational cost of full inference by roughly 37x. However, as language models grow larger, efficient inference becomes increasingly difficult, especially across multiple tasks. The jointly optimized multilingual distribution helps transfer knowledge from high-resource languages to low-resource languages, yet while working on transductive nearest-neighbor inference we observe that language models are highly sensitive to the task distribution: unless an extremely large language model (>100B parameters) is used, a model adapted to one task cannot readily be reused for another. The final method proposed in this dissertation addresses this issue through multitask prompted learning, which encourages generalization across a diverse set of tasks and domains at once and thereby helps remove distributional variance for downstream tasks. We propose a semi-parametric prompt tuning method for multitask prompted learning; its novel component is a memory bank from which memory prompts are retrieved based on discrete prompts. Extensive experiments on 31 tasks from 8 domains demonstrate the effectiveness of our proposed method. Overall, this dissertation explores the adaptability of language models across multiple languages, tasks, and domains: it begins with the fundamental multilingual adaptation problem and extends to many different OOD settings with varying resource availability across languages, tasks, and domains.
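As an illustration of the few-shot transductive nearest-neighbor inference described above, here is a minimal sketch under stated assumptions: `support_x` and `query_x` are taken to be sentence embeddings from a pre-trained multilingual encoder, queries are classified by soft nearest-neighbor voting over the labeled support set, and the lowest-entropy (most confident) queries are pseudo-labeled and folded back into the support set. The entropy-based selection rule, function names, and hyper-parameters are illustrative guesses, not the thesis's exact procedure.

```python
# Illustrative sketch: transductive nearest-neighbour prediction over a small
# labelled support set, with entropy used to pick confident queries.
import torch
import torch.nn.functional as F

def nn_predict(support_x, support_y, query_x, num_classes, tau=0.1):
    """Soft nearest-neighbour voting: cosine similarity -> softmax over support -> label mix."""
    sims = F.normalize(query_x, dim=-1) @ F.normalize(support_x, dim=-1).T   # [Q, S]
    weights = F.softmax(sims / tau, dim=-1)                                   # attention over support samples
    one_hot = F.one_hot(support_y, num_classes).float()                       # [S, C]
    return weights @ one_hot                                                  # [Q, C] label distribution

def transductive_predict(support_x, support_y, query_x, num_classes, rounds=2, keep=0.5):
    for _ in range(rounds):
        probs = nn_predict(support_x, support_y, query_x, num_classes)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)              # per-query uncertainty
        confident = entropy.argsort()[: int(keep * len(query_x))]             # lowest entropy = most confident
        support_x = torch.cat([support_x, query_x[confident]], dim=0)         # grow the support set
        support_y = torch.cat([support_y, probs[confident].argmax(-1)], dim=0)
    return nn_predict(support_x, support_y, query_x, num_classes).argmax(-1)
```

Because no encoder weights are updated, the only cost at adaptation time is a similarity computation over a handful of support vectors, which is the source of the large inference-cost saving mentioned above.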
author2 Joty Shafiq Rayhan
author_facet Joty Shafiq Rayhan
Bari M. Saiful
format Thesis-Doctor of Philosophy
author Bari M. Saiful
author_sort Bari M. Saiful
title Transfer learning for language model adaptation
title_short Transfer learning for language model adaptation
title_full Transfer learning for language model adaptation
title_fullStr Transfer learning for language model adaptation
title_full_unstemmed Transfer learning for language model adaptation
title_sort transfer learning for language model adaptation
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/169892
_version_ 1779156492723683328
spelling sg-ntu-dr.10356-169892 2023-09-04T07:32:08Z Transfer learning for language model adaptation Bari M. Saiful Joty Shafiq Rayhan School of Computer Science and Engineering srjoty@ntu.edu.sg Engineering::Computer science and engineering Doctor of Philosophy 2023-08-15T04:15:09Z 2023 Thesis-Doctor of Philosophy Bari M. Saiful (2023). Transfer learning for language model adaptation. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/169892 10.32657/10356/169892 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University