Data augmentation for name entity recognition

The objective of this thesis is to develop text augmentation approaches for Named Entity Recognition tasks in low-resource domain settings. The field of Named Entity Recognition has advanced rapidly due to the contributions of Deep Learning. Deep Learning techniques have become the mainstream ap...

Bibliographic Details
Main Author:	Kyaw, Zin Tun
Other Authors:	Chng Eng Siong
Format:	Thesis-Master by Research
Language:	English
Published:	Nanyang Technological University 2022
Subjects:	Engineering::Computer science and engineering::Information systems; Engineering::Computer science and engineering::Information systems::Information storage and retrieval
Online Access:	https://hdl.handle.net/10356/161703
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-161703
record_format	dspace
spelling	sg-ntu-dr.10356-161703 2022-10-04T01:04:35Z Data augmentation for name entity recognition Kyaw, Zin Tun Chng Eng Siong School of Computer Science and Engineering Media and Interactive Computing Lab (MICL) ASESChng@ntu.edu.sg Engineering::Computer science and engineering::Information systems Engineering::Computer science and engineering::Information systems::Information storage and retrieval Master of Engineering 2022-09-15T08:57:10Z 2022-09-15T08:57:10Z 2022 Thesis-Master by Research Kyaw, Z. T. (2022). Data augmentation for name entity recognition. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/161703 https://hdl.handle.net/10356/161703 10.32657/10356/161703 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Computer science and engineering::Information systems; Engineering::Computer science and engineering::Information systems::Information storage and retrieval
spellingShingle	Engineering::Computer science and engineering::Information systems Engineering::Computer science and engineering::Information systems::Information storage and retrieval Kyaw, Zin Tun Data augmentation for name entity recognition
description	The objective of this thesis is to develop text augmentation approaches for Named Entity Recognition (NER) tasks in low-resource domain settings. The field of NER has advanced rapidly thanks to Deep Learning, whose techniques have become the mainstream approach for the majority of Natural Language Processing tasks. Neural network models learn from data more efficiently and produce state-of-the-art results compared with traditional approaches. However, one constraint of the Deep Learning approach is the need for a large volume of annotated data. NER often faces low-resource issues, i.e. there are insufficient annotated examples with entities. When Deep Learning based NER systems are trained on a relatively small dataset, they are unable to learn good representations of the named entities, and their performance degrades drastically. Text augmentation generates additional artificial text derived from existing data. The idea is to increase the size of the training data and ultimately improve model performance. In this thesis, we explore domain-independent text augmentation for NER tasks via text generation using two approaches: Finite State Transducers and Abstractive Text Summarization. The objective is to evaluate the effectiveness of these two techniques as data augmentation for NER by comparing performance on the original datasets against the augmented datasets. The Finite State Transducer is a template-based approach, while abstractive text summarization uses Google's Pegasus deep neural network model. The proposed approach consists of the following steps. Firstly, using the OpenGrm Thrax grammar compiler and the OpenFst library, sentences in the original dataset are handcrafted into regular expressions and the named entity values are replaced by a set of variables. Each variable then stores the possible values of the particular entity found in the dataset. By performing word replacements for each entity variable, more sentences can be generated from the template, creating previously unseen combinations of entity values. Secondly, we use Google's pre-trained Pegasus summarization model to transform each original sentence into several semantically similar sentences. Finally, the text generated by the two methods is combined to form the augmented text. To verify the effectiveness of these two approaches, we train three state-of-the-art BERT-based NER models, namely BERT, DistilBERT and RoBERTa. Our results on the Groningen Meaning Bank corpus show that fine-tuning with text augmentation improved the F1 score of the BERT and DistilBERT NER models by 0.3% and 0.7% respectively over the baseline system, while the performance of the RoBERTa-based model decreased by 0.2%. Our conclusion is that the performance of our proposed text augmentation is model dependent, as it showed improvement on smaller pre-trained models such as BERT and DistilBERT but not on the relatively large RoBERTa model.
author2	Chng Eng Siong
author_facet	Chng Eng Siong Kyaw, Zin Tun
format	Thesis-Master by Research
author	Kyaw, Zin Tun
author_sort	Kyaw, Zin Tun
title	Data augmentation for name entity recognition
title_short	Data augmentation for name entity recognition
title_full	Data augmentation for name entity recognition
title_fullStr	Data augmentation for name entity recognition
title_full_unstemmed	Data augmentation for name entity recognition
title_sort	data augmentation for name entity recognition
publisher	Nanyang Technological University
publishDate	2022
url	https://hdl.handle.net/10356/161703
_version_	1746219672813961216
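
The template-based step described in the abstract (entity values replaced by variables, then refilled with other values seen in the dataset) can be illustrated with a minimal sketch. This is not the thesis implementation: it uses plain Python in place of OpenGrm Thrax and OpenFst, and the template, the entity values, and the PER/GEO/TIM labels are hypothetical examples chosen only for illustration.

```python
import itertools

# Hypothetical template: a token in braces is an entity variable (slot);
# every other token is ordinary text and receives the "O" tag.
TEMPLATE = "{PER} flew from {GEO} to {GEO} on {TIM}"

# Hypothetical entity values. The abstract says each variable stores the
# values of that entity found in the dataset; these are made up.
ENTITY_VALUES = {
    "PER": ["John Smith", "Mary Lee"],
    "GEO": ["Singapore", "New York"],
    "TIM": ["Monday"],
}


def is_slot(part):
    """Return True if the template token is an entity variable."""
    return part.startswith("{") and part.endswith("}")


def fill_template(template, values):
    """Yield (tokens, BIO tags) for every combination of slot values."""
    parts = template.split()
    slots = [p[1:-1] for p in parts if is_slot(p)]
    for combo in itertools.product(*(values[s] for s in slots)):
        chosen = iter(combo)
        tokens, tags = [], []
        for part in parts:
            if is_slot(part):
                label = part[1:-1]
                words = next(chosen).split()
                tokens.extend(words)
                # First word of the entity gets B-, the rest get I-.
                tags.extend([f"B-{label}"] + [f"I-{label}"] * (len(words) - 1))
            else:
                tokens.append(part)
                tags.append("O")
        yield tokens, tags


augmented = list(fill_template(TEMPLATE, ENTITY_VALUES))
for tokens, tags in augmented[:2]:
    print(list(zip(tokens, tags)))
```

With these example values the single template expands into 2 x 2 x 2 x 1 = 8 labelled sentences. Some of them repeat the same location in both GEO slots, so a practical grammar would probably need to filter such combinations.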


Similar Items