Deep learning-based text augmentation for named entity recognition
Saved in:
Main Author:
Other Authors:
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University, 2023
Subjects:
Online Access: https://hdl.handle.net/10356/171105
Institution: Nanyang Technological University
Language: English
Summary: This thesis is focused on the development of an effective text augmentation method for Named Entity Recognition (NER) in the low-resource setting.
NER, an important sequence labeling task in Natural Language Processing, is used to identify predefined entities in text. NER datasets tend to be small, making the creation of additional text via text augmentation a plausible solution.
Existing NER text augmentation methods suffer from label corruption and a lack of context diversity. To address these limitations, this thesis proposes Contextual and Semantic Structure-based Interpolation (CASSI), a structure-based text augmentation scheme that combines two semantically similar sentences. Candidate augmentations are produced by swapping sub-trees of the two sentences' dependency parse trees that contain subjects, objects, or complements. The final augmentation is selected by filtering candidates through language model scoring and a metric that uses the Jaccard similarity between the original pair and the candidates to improve specificity.
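As a rough illustration of the subtree-swap idea described in the abstract, the sketch below parses two sentences, finds sub-trees rooted at subject, object, or complement dependencies, and swaps same-role sub-trees to form candidate augmentations. This is not the thesis's implementation: the choice of spaCy and the `en_core_web_sm` model, the dependency-label set, the 0.9 similarity threshold, and the example sentences are all illustrative assumptions, and the language model scoring step and the entity-label bookkeeping needed for NER training data are omitted.

```python
# Illustrative sketch of structure-based augmentation by subtree swapping.
# Assumes spaCy for dependency parsing; LM scoring and NER label handling
# from the thesis are intentionally left out.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed parser; any dependency parser works

# Dependency labels treated as swappable (subjects, objects, complements) -- assumed set.
SWAP_DEPS = {"nsubj", "dobj", "obj", "attr", "acomp", "xcomp"}

def swappable_subtrees(doc):
    """Return (dependency label, span) pairs for sub-trees rooted at swappable tokens."""
    spans = []
    for token in doc:
        if token.dep_ in SWAP_DEPS:
            subtree = sorted(token.subtree, key=lambda t: t.i)
            spans.append((token.dep_, doc[subtree[0].i : subtree[-1].i + 1]))
    return spans

def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def candidate_augmentations(sent_a, sent_b):
    """Swap same-role sub-trees between two sentences to form candidate augmentations."""
    doc_a, doc_b = nlp(sent_a), nlp(sent_b)
    candidates = []
    for dep_a, span_a in swappable_subtrees(doc_a):
        for dep_b, span_b in swappable_subtrees(doc_b):
            if dep_a != dep_b:
                continue  # only swap sub-trees playing the same grammatical role
            new_text = sent_a.replace(span_a.text, span_b.text, 1)
            if new_text != sent_a:
                candidates.append(new_text)
    # Crude stand-in for the thesis's filtering: drop near-duplicates of either source.
    return [c for c in candidates
            if jaccard(c, sent_a) < 0.9 and jaccard(c, sent_b) < 0.9]

if __name__ == "__main__":
    s1 = "The central bank raised interest rates in March."
    s2 = "The finance ministry announced new spending plans."
    for cand in candidate_augmentations(s1, s2):
        print(cand)
```

In the actual method, the surviving candidates would additionally be scored by a language model and the Jaccard-based metric against the original sentence pair before one augmentation is selected.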
Experiments show that CASSI consistently outperforms existing methods at multiple resource levels and across multiple languages. Compared to the best-performing baseline, it achieves an average relative improvement in Micro-F1 of 4.28% to 25.97% on subsets of CoNLL 2002/03, and of 1.56% across three noisy text datasets.