Adaptation of language models via text augmentation
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University, 2023
Online Access: https://hdl.handle.net/10356/171247
Institution: Nanyang Technological University
Summary: This research aims to adapt a language model to a specific domain using text augmentation techniques. A robust language model requires a large amount of domain-specific text, and this thesis focuses on text augmentation to circumvent limited data and domain mismatch. It aims to increase the amount and diversity of training texts by introducing variations of sentences that may be missing from the training corpus.
We adapt language models to two domains through text augmentation. First, a general language model trained on the GigaSpeech dataset is adapted to a specialized medical domain: an abstractive summarization module is employed to generate medical texts, improving perplexity by 9.7% for n-gram and by 5.56% for recurrent neural network language models. Second, a language model trained on monolingual English and Malay texts is adapted to a code-switching test set: augmentation through a Bayesian classifier with part-of-speech tags reduces perplexity by 1.79%, 1.85%, and 0.42% on three test sets for n-gram language models, and by 1.5% for recurrent neural network language models.
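As a rough illustration of the first augmentation idea, the sketch below generates abstractive summaries of domain documents and collects them as extra training text. It assumes a Hugging Face summarization pipeline; the model name (facebook/bart-large-cnn), the helper augment_with_summaries, and the sample document are illustrative assumptions, not the module used in the thesis.

```python
# Minimal sketch: augment an LM training corpus with abstractive summaries.
# Model choice and length limits are illustrative, not the thesis's setup.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def augment_with_summaries(documents, max_length=60, min_length=20):
    """Return abstractive summaries to append to the LM training corpus."""
    augmented = []
    for doc in documents:
        out = summarizer(doc, max_length=max_length, min_length=min_length)
        augmented.append(out[0]["summary_text"])
    return augmented

# Hypothetical in-domain document, for illustration only.
medical_docs = [
    "The patient presented with acute chest pain radiating to the left arm. "
    "An electrocardiogram showed ST-segment elevation, and the patient was "
    "taken for emergency percutaneous coronary intervention."
]
extra_training_text = augment_with_summaries(medical_docs)
print(extra_training_text)
```

Each summary is a compressed paraphrase of an in-domain document, so appending the summaries to the corpus adds sentence variants that the original texts do not contain.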
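The second augmentation can be sketched in a similar spirit: estimate the probability of a code switch given a word's part-of-speech tag from a small code-switched sample, then swap eligible English words for Malay equivalents in monolingual sentences. The seed observations, the English-Malay lexicon, and the add-one smoothed estimate below are placeholder assumptions; the thesis's actual Bayesian classifier is not specified in this record.

```python
# Minimal sketch: POS-driven English-Malay code-switching augmentation.
import random
from collections import Counter

import nltk
# Newer NLTK releases may instead need "punkt_tab" and
# "averaged_perceptron_tagger_eng".
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Hypothetical seed data: (POS tag, whether that token was switched to Malay).
seed_observations = [
    ("NN", True), ("NN", True), ("NN", False),
    ("VB", True), ("VB", False),
    ("DT", False), ("IN", False), ("JJ", True),
]

# Add-one smoothed estimate of P(switch | POS tag).
switched = Counter(tag for tag, s in seed_observations if s)
totals = Counter(tag for tag, _ in seed_observations)

def p_switch(tag):
    return (switched[tag] + 1) / (totals[tag] + 2)

# Tiny illustrative English->Malay lexicon (not from the thesis).
lexicon = {"eat": "makan", "rice": "nasi", "house": "rumah", "big": "besar"}

def augment(sentence, rng=random.Random(0)):
    """Replace lexicon words with Malay, biased by P(switch | POS)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    out = []
    for word, tag in tagged:
        if word.lower() in lexicon and rng.random() < p_switch(tag):
            out.append(lexicon[word.lower()])
        else:
            out.append(word)
    return " ".join(out)

print(augment("I eat rice in the big house"))
```

Sentences produced this way introduce plausible switch points into the monolingual training text, which is what lets the language model assign better probabilities to the code-switching test sets.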