Adaptation of language models via text augmentation
This research aims to adapt a language model to a specific domain using text augmentation techniques. A robust language model requires a large amount of domain-specific text. This thesis focuses on text augmentation to circumvent limited data and domain mismatch. It aims to increase the amount and diversity...
Main Author: Prachaseree, Chaiyasait
Other Authors: Chng Eng Siong
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University, 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Online Access: https://hdl.handle.net/10356/171247
Institution: Nanyang Technological University
id: sg-ntu-dr.10356-171247
record_format: dspace
spelling:
  sg-ntu-dr.10356-171247 2023-11-02T02:20:48Z
  Adaptation of language models via text augmentation
  Prachaseree, Chaiyasait
  Chng Eng Siong
  School of Computer Science and Engineering
  ASESChng@ntu.edu.sg
  Engineering::Computer science and engineering::Computing methodologies::Document and text processing
  This research aims to adapt a language model to a specific domain using text augmentation techniques. A robust language model requires a large amount of domain-specific text. This thesis focuses on text augmentation to circumvent limited data and domain mismatch, aiming to increase the amount and diversity of training texts by introducing variations of sentences that may be missing in the training corpus. We address two domain-adaptation scenarios through text augmentation. First, a general language model trained on the Gigaspeech dataset is adapted to a specialized medical domain: an abstractive summarization module is employed to generate medical texts, improving perplexity by 9.7% for n-gram and 5.56% for recurrent neural network language models. Second, a language model trained on monolingual English and Malay texts is adapted to a code-switching test set: augmentation through a Bayesian classifier with part-of-speech tags reduces perplexity by 1.79%, 1.85%, and 0.42% on three test sets for n-gram models, and by 1.5% for recurrent neural network language models.
  Master of Engineering
  2023-10-25T00:22:22Z 2023-10-25T00:22:22Z 2023
  Thesis-Master by Research
  Prachaseree, C. (2023). Adaptation of language models via text augmentation. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/171247
  https://hdl.handle.net/10356/171247
  10.32657/10356/171247
  en
  This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
  application/pdf
  Nanyang Technological University
institution: Nanyang Technological University
building: NTU Library
continent: Asia
country: Singapore
content_provider: NTU Library
collection: DR-NTU
language: English
topic: Engineering::Computer science and engineering::Computing methodologies::Document and text processing
spellingShingle: Engineering::Computer science and engineering::Computing methodologies::Document and text processing; Prachaseree, Chaiyasait; Adaptation of language models via text augmentation
description: This research aims to adapt a language model to a specific domain using text augmentation techniques. A robust language model requires a large amount of domain-specific text. This thesis focuses on text augmentation to circumvent limited data and domain mismatch, aiming to increase the amount and diversity of training texts by introducing variations of sentences that may be missing in the training corpus.
We address two domain-adaptation scenarios through text augmentation. First, a general language model trained on the Gigaspeech dataset is adapted to a specialized medical domain: an abstractive summarization module is employed to generate medical texts, improving perplexity by 9.7% for n-gram and 5.56% for recurrent neural network language models. Second, a language model trained on monolingual English and Malay texts is adapted to a code-switching test set: augmentation through a Bayesian classifier with part-of-speech tags reduces perplexity by 1.79%, 1.85%, and 0.42% on three test sets for n-gram models, and by 1.5% for recurrent neural network language models.
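The abstract above reports all results in terms of perplexity. As a rough, self-contained illustration of what that metric measures (not the thesis's actual implementation), the sketch below trains a bigram language model with add-one smoothing on a base corpus and on a base-plus-augmented corpus, then compares test-set perplexity; every corpus, sentence, and function name here is invented for the example.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences."""
    uni, bi = Counter(), Counter()
    for toks in sentences:
        toks = ["<s>"] + toks + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def perplexity(sentences, uni, bi):
    """Per-token perplexity under an add-one-smoothed bigram model."""
    vocab = len(uni)
    log_prob, n_tokens = 0.0, 0
    for toks in sentences:
        toks = ["<s>"] + toks + ["</s>"]
        for prev, cur in zip(toks, toks[1:]):
            p = (bi[(prev, cur)] + 1) / (uni[prev] + vocab)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# Hypothetical corpora: a small base training set, extra augmented
# sentences (e.g. from summarization), and an in-domain test set.
base = [s.split() for s in ["the patient was stable", "dosage was increased"]]
augmented = base + [s.split() for s in ["the patient remained stable overnight"]]
test = [s.split() for s in ["the patient was stable overnight"]]

for name, corpus in [("base", base), ("base+augmented", augmented)]:
    uni, bi = train_bigram(corpus)
    print(name, round(perplexity(test, uni, bi), 2))
```

The code-switching experiment augments monolingual text by predicting where a bilingual speaker might switch languages. This record does not spell out the classifier's exact form, so the following stand-in scores each token's part-of-speech tag with a smoothed naive-Bayes-style estimate of switch probability learned from a tiny labeled sample, then substitutes a dictionary translation at high-scoring positions; the labeled sample, lexicon, and threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical labeled sample: (POS tag, 1 if a switch occurred at this token).
labeled = [("NOUN", 1), ("NOUN", 1), ("VERB", 0), ("NOUN", 0),
           ("ADJ", 1), ("VERB", 0), ("ADJ", 0), ("NOUN", 1)]

switch_counts = Counter(tag for tag, y in labeled if y == 1)
total_counts = Counter(tag for tag, _ in labeled)

def switch_prob(tag, alpha=1.0, n_tags=4):
    """Smoothed estimate of P(switch | POS tag); n_tags covers DET,
    NOUN, VERB, ADJ in this toy tag set."""
    return (switch_counts[tag] + alpha) / (total_counts[tag] + alpha * n_tags)

# Hypothetical English->Malay lexicon for substitution at switch points.
en2ms = {"house": "rumah", "big": "besar", "eat": "makan"}

def augment(tokens, tags, threshold=0.5):
    """Replace tokens whose POS tag suggests a likely switch point."""
    out = []
    for tok, tag in zip(tokens, tags):
        if switch_prob(tag) >= threshold and tok in en2ms:
            out.append(en2ms[tok])
        else:
            out.append(tok)
    return out

print(augment(["the", "big", "house"], ["DET", "ADJ", "NOUN"]))
# -> ['the', 'big', 'rumah'] (synthetic code-switched variant)
```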
author2: Chng Eng Siong
author_facet: Chng Eng Siong; Prachaseree, Chaiyasait
format: Thesis-Master by Research
author: Prachaseree, Chaiyasait
author_sort: Prachaseree, Chaiyasait
title: Adaptation of language models via text augmentation
title_short: Adaptation of language models via text augmentation
title_full: Adaptation of language models via text augmentation
title_fullStr: Adaptation of language models via text augmentation
title_full_unstemmed: Adaptation of language models via text augmentation
title_sort: adaptation of language models via text augmentation
publisher: Nanyang Technological University
publishDate: 2023
url: https://hdl.handle.net/10356/171247
_version_: 1781793708331499520