Data-efficient domain adaptation for pretrained language models
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/167965 |
Institution: | Nanyang Technological University |
Summary: | Recent advances in Natural Language Processing (NLP) are built on a range of large-scale pretrained language models (PLMs), which are based on deep transformer neural networks. These PLMs simultaneously learn contextualized word representations and language modeling by training the entire model on massive unlabeled corpora using self-supervised learning techniques, bringing about a paradigm shift that moves our focus from customizing different models for different tasks to adapting one PLM to all tasks.
Studying how to adapt a general-purpose PLM to a specific domain of interest is of great significance to the deployment of PLMs. The mainstream practice is to finetune a PLM with a task-specific head on a labeled dataset from the target domain. However, for most target applications labeled data is limited, and in many low-resource scenarios it is scarce. Because a PLM has a huge number of parameters, such small datasets often cannot fully harness the power of its language priors. As a result, even within the same task, a PLM finetuned on one dataset can suffer performance degradation when applied to another dataset with a domain gap, because it has overfitted the original training set. This phenomenon hinders the wide adoption of PLMs in practice, particularly in the face of new domains, and calls for approaches that enhance the generalization performance of PLMs during adaptation without requiring more labeled data.
Early domain adaptation methods, which leverage similar source domains to boost model performance on the target domains, were developed for customized models built on traditional neural networks such as LSTMs. Compared with PLMs, these models are shallow, take longer to converge, and carry no prior knowledge. Studies show that some popular domain adaptation methods can even harm the generalization performance of PLMs on the target domains. The unique characteristics of PLMs, such as their unprecedented scale, rich language priors, and many hitherto underexplored skills, could be uncontrolled factors that make them exhibit different learning behaviors from traditional models. There is therefore a need to develop algorithms designed for PLMs that enhance their domain adaptation performance, thereby accelerating their wide adoption in real-world scenarios.
This thesis aims to explore techniques that make efficient use of the labeled data in the target domain and better adapt a given PLM to the target domains of interest by effectively transferring knowledge from similar source domains. To achieve this goal, we conduct research from three perspectives across the machine learning pipeline, each assuming that only a specified part of the pipeline can be updated with the available computing resources. That is, we keep all other conditions fixed and update only the input data, the model representations, or the output predictions, respectively. We show how to achieve better generalization performance with limited labeled data from the target domains under each scenario. In summary, we propose (1) a new algorithm that generates adversarial perturbations using the domain adaptation objective to enhance the transferability of soft prompt tuning in low-resource scenarios; (2) a new model optimization algorithm that takes into account the next-step gradients of the adversarial domain discriminator when optimizing the task classifiers, in order to accommodate competing losses; and (3) a new federated learning framework that calibrates the conditional probability distribution to adapt the same PLM to multiple domains with different label distributions. We present the specific problems, related work, detailed methods, extensive experiments, and thorough discussions in the following chapters, and shed light on how to build on traditional machine learning methods while catering to newly emerging learning paradigms. |