Data augmentation for name entity recognition
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Master by Research |
Language: | English |
Published: | Nanyang Technological University, 2022 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/161703 |
Institution: | Nanyang Technological University |
Summary: | The objective of this thesis is to develop text augmentation approaches for Named Entity
Recognition (NER) tasks in low-resource domain settings. The field of NER
has advanced rapidly owing to the contributions of Deep Learning, whose techniques
have become the mainstream approach for the majority of Natural Language Processing tasks.
Neural network models learn from data more efficiently and produce state-of-the-art
results compared to traditional approaches. However, one constraint of Deep Learning approaches
is the need for a large volume of annotated data. NER often
faces low-resource issues, i.e., there are insufficient annotated examples containing entities. When NER
systems based on Deep Learning techniques are trained on a relatively small dataset, they
are unable to learn good representations of the named entities, and hence their
performance degrades drastically.
Text augmentation generates additional artificial text derived from existing
data. The idea is to increase the size of the training data and ultimately improve model
performance. In this thesis, we explore domain-independent text augmentation
for NER tasks via text generation using two approaches: Finite State Transducers
and Abstractive Text Summarization. The objective is to evaluate the effectiveness
of these two data augmentation techniques
for NER by comparing performance on the original datasets against the augmented datasets. The Finite
State Transducer approach is template-based, while the Abstractive Text Summarization approach uses
Google’s Pegasus deep neural network model. The proposed approach consists of the following steps. First,
using the OpenGrm Thrax grammar compiler and the OpenFst library, sentences in the
original dataset are handcrafted into regular expressions in which the named entity values are
replaced by a set of variables. Each variable stores the possible values of the particular
entity found in the dataset. By performing word replacements for each entity variable, more
sentences can be generated from the template, creating previously unseen combinations of entity
values. Second, we use Google’s pre-trained Pegasus summarization model to transform each
original sentence into several semantically similar sentences. Finally, the texts generated by these
two methods are combined to form the augmented text.
To verify the effectiveness of these two approaches, we train three state-of-the-art BERT-based
NER models, namely BERT, DistilBERT, and RoBERTa. Our results on the Groningen
Meaning Bank corpus show that fine-tuning with text augmentation improved the F1 score of the BERT
and DistilBERT NER models by 0.3% and 0.7% respectively over the baseline system, while
the performance of the RoBERTa-based model dropped by 0.2%. We conclude that the performance
of our proposed text augmentation is model-dependent: it showed improvement on smaller
pre-trained models such as BERT and DistilBERT but not on the relatively large RoBERTa
model. |
---|
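The template step described in the summary can be illustrated with a minimal sketch. This is not the thesis's Thrax/OpenFst grammar: it is a plain-Python stand-in that assumes a template is a token list with entity slots, and the template, slot names, entity values, and BIO labels below are all hypothetical. It only shows how substituting every stored value into each slot yields new tagged sentences.

```python
import itertools

# Hypothetical template: literal tokens plus slots named after entity types.
TEMPLATE = ["{PER}", "visited", "{GEO}", "on", "{TIM}", "."]

# Hypothetical slot values; in the thesis these are the entity values
# collected from the original dataset.
SLOT_VALUES = {
    "PER": ["Barack Obama", "Angela Merkel"],
    "GEO": ["Singapore", "New York"],
    "TIM": ["Monday", "Friday"],
}


def is_slot(item):
    """Return True if a template item is an entity slot such as '{PER}'."""
    return item.startswith("{") and item.endswith("}")


def expand(template, slot_values):
    """Yield (tokens, tags) for every combination of slot values."""
    slots = [item[1:-1] for item in template if is_slot(item)]
    for combo in itertools.product(*(slot_values[s] for s in slots)):
        fill = iter(combo)
        tokens, tags = [], []
        for item in template:
            if is_slot(item):
                label = item[1:-1]
                words = next(fill).split()
                tokens.extend(words)
                # BIO tagging: first word gets B-, the rest get I-.
                tags.extend(["B-" + label] + ["I-" + label] * (len(words) - 1))
            else:
                tokens.append(item)
                tags.append("O")
        yield tokens, tags


augmented = list(expand(TEMPLATE, SLOT_VALUES))
```

With two values per slot, this single template expands to eight tagged sentences, which is the sense in which word replacement produces unseen combinations of entity values.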