Adaptation of language models via text augmentation

This research aims to adapt a language model to a specific domain using text augmentation techniques. A robust language model requires large amounts of domain-specific text; this thesis focuses on text augmentation to circumvent limited data and domain mismatch. It aims to increase the amount and diversity of training texts by introducing variations of sentences that may be missing in the training corpus. We address two domain-adaptation settings through text augmentation. First, a general language model trained on the Gigaspeech dataset is adapted to a specialized medical domain: an abstractive summarization module is employed to generate medical texts, improving perplexity by 9.7% for n-gram and 5.56% for recurrent neural network language models. Second, a language model trained on monolingual English and Malay texts is adapted to a code-switching test set: augmentation through a Bayesian classifier with part-of-speech tags reduces perplexity by 1.79%, 1.85%, and 0.42% on three test sets for n-gram models and by 1.5% for recurrent neural network language models.
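
For context on the figures above: the perplexity reductions are relative percentages. The following Python sketch (illustrative only, not code from the thesis; all numbers are invented) shows how perplexity is computed from per-token log-probabilities and how such a relative reduction is derived:

    import math

    def perplexity(log_probs):
        # Perplexity from a list of natural-log token probabilities.
        return math.exp(-sum(log_probs) / len(log_probs))

    def relative_reduction(baseline_ppl, adapted_ppl):
        # Percentage drop in perplexity after domain adaptation.
        return 100.0 * (baseline_ppl - adapted_ppl) / baseline_ppl

    # Invented example values: a baseline model at 120.0 perplexity
    # adapted down to 108.4 corresponds to roughly a 9.7% reduction,
    # the granularity at which the abstract reports its results.
    print(relative_reduction(120.0, 108.4))  # -> 9.67 (approx.)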

Bibliographic Details
Main Author: Prachaseree, Chaiyasait
Other Authors: Chng Eng Siong
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Online Access:https://hdl.handle.net/10356/171247
Institution: Nanyang Technological University
Record ID: sg-ntu-dr.10356-171247
DOI: 10.32657/10356/171247
Degree: Master of Engineering
School: School of Computer Science and Engineering
Contact: Chng Eng Siong (ASESChng@ntu.edu.sg)
Date Issued: 2023-10-25
Citation: Prachaseree, C. (2023). Adaptation of language models via text augmentation. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/171247
License: CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)