Adaptation of language models via text augmentation

Bibliographic Details
Main Author: Prachaseree, Chaiyasait
Other Authors: Chng Eng Siong
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University 2023
Online Access:https://hdl.handle.net/10356/171247
Institution: Nanyang Technological University
Description
Summary: This research aims to adapt a language model to a specific domain using text augmentation techniques. A robust language model requires a large amount of domain-specific text. This thesis focuses on text augmentation to circumvent limited data and domain mismatch. It aims to increase the amount and diversity of training texts by introducing variations of sentences that may be missing from the training corpus. Language models are adapted to two domains through text augmentation. First, a general language model trained on the GigaSpeech dataset is adapted to a specialized medical domain: an abstractive summarization module is employed to generate medical texts, improving perplexity by 9.7% for n-gram models and 5.56% for recurrent neural network language models. Second, a language model trained on monolingual English and Malay texts is adapted to a code-switching test set: augmentation through a Bayesian classifier with part-of-speech tags reduces perplexity by 1.79%, 1.85%, and 0.42% on three test sets for n-gram models, and by 1.5% for recurrent neural network language models.
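
The gains above are reported as relative perplexity reductions on held-out test sets. As a concrete illustration only, the sketch below shows how such a before/after comparison can be run; it is not the thesis's pipeline. It assumes NLTK's nltk.lm n-gram API, and the toy corpora, the augmented sentence, and the train_lm/perplexity helpers are hypothetical stand-ins for the thesis's medical and English-Malay data and its summarization- or classifier-based augmenters.

    # Minimal sketch (not the thesis's pipeline): train a trigram language
    # model, then compare test-set perplexity before and after adding
    # augmented sentences to the training corpus. Corpora here are toy
    # placeholders; Laplace smoothing keeps unseen n-grams finite.
    from nltk.lm import Laplace
    from nltk.lm.preprocessing import padded_everygram_pipeline, padded_everygrams

    ORDER = 3  # trigram model

    def train_lm(sentences):
        """Fit a Laplace-smoothed trigram LM on a list of token lists."""
        ngrams, vocab = padded_everygram_pipeline(ORDER, sentences)
        lm = Laplace(ORDER)
        lm.fit(ngrams, vocab)
        return lm

    def perplexity(lm, sentences):
        """Perplexity of the model over padded n-grams of the test sentences."""
        test_ngrams = [g for s in sentences for g in padded_everygrams(ORDER, s)]
        return lm.perplexity(test_ngrams)

    # Hypothetical stand-ins for in-domain training, augmented, and test text.
    base_train = [["the", "scan", "was", "normal"], ["the", "patient", "improved"]]
    augmented = [["the", "scan", "was", "clear"]]  # e.g. a generated variation
    test_set = [["the", "patient", "scan", "was", "clear"]]

    print("baseline ppl: ", perplexity(train_lm(base_train), test_set))
    print("augmented ppl:", perplexity(train_lm(base_train + augmented), test_set))

The same protocol carries over to the recurrent models mentioned in the abstract: hold the test set fixed, retrain on the augmented corpus, and report the relative change in perplexity.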