Adaptation of language models via text augmentation

This research aims to adapt a language model to a specific domain using text augmentation techniques. A robust language model requires large amounts of domain-specific text; this thesis focuses on text augmentation to circumvent limited data and domain mismatch. It aims to increase the amount and diversity of training texts by introducing variations of sentences that may be missing in the training corpus. We address two domain-adaptation settings through text augmentation. First, a general language model trained on the Gigaspeech dataset is adapted to a specialized medical domain: an abstractive summarization module is employed to generate medical texts, improving perplexity by 9.7% for n-gram and 5.56% for recurrent neural network language models. Second, a language model trained on monolingual English and Malay texts is adapted to a code-switching test set: augmentation through a Bayesian classifier with part-of-speech tags reduces perplexity by 1.79%, 1.85%, and 0.42% on three test sets for n-gram models and by 1.5% for recurrent neural network language models.
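
For context on the figures above: the perplexity reductions are relative percentages. The following Python sketch (illustrative only, not code from the thesis; all numbers are invented) shows how perplexity is computed from per-token log-probabilities and how such a relative reduction is derived:

    import math

    def perplexity(log_probs):
        # Perplexity from a list of natural-log token probabilities.
        return math.exp(-sum(log_probs) / len(log_probs))

    def relative_reduction(baseline_ppl, adapted_ppl):
        # Percentage drop in perplexity after domain adaptation.
        return 100.0 * (baseline_ppl - adapted_ppl) / baseline_ppl

    # Invented example values: a baseline model at 120.0 perplexity
    # adapted down to 108.4 corresponds to roughly a 9.7% reduction,
    # the granularity at which the abstract reports its results.
    print(relative_reduction(120.0, 108.4))  # -> 9.67 (approx.)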

Bibliographic Details
Main Author: Prachaseree, Chaiyasait
Other Authors: Chng Eng Siong
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Online Access:https://hdl.handle.net/10356/171247
Institution: Nanyang Technological University
Record ID: sg-ntu-dr.10356-171247
DOI: 10.32657/10356/171247
Degree: Master of Engineering
School: School of Computer Science and Engineering
Contact: Chng Eng Siong (ASESChng@ntu.edu.sg)
Date Issued: 2023-10-25
Citation: Prachaseree, C. (2023). Adaptation of language models via text augmentation. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/171247
License: CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)