Deep learning-based text augmentation for named entity recognition

Bibliographic Details
Main Author: Surana, Tanmay
Other Authors: Chng Eng Siong
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University 2023
Online Access:https://hdl.handle.net/10356/171105
Institution: Nanyang Technological University
Description
Summary: This thesis focuses on developing an effective text augmentation method for Named Entity Recognition (NER) in the low-resource setting. NER, an important sequence labeling task in Natural Language Processing, identifies predefined entities in text. NER datasets tend to be small, making the creation of additional text via text augmentation a plausible solution. Existing NER text augmentation methods suffer from label corruption and a lack of context diversity. To address these limitations, this thesis proposes Contextual and Semantic Structure-based Interpolation (CASSI), a structure-based text augmentation scheme that combines two semantically similar sentences. Candidate augmentations are produced by swapping sub-trees of the sentences' dependency parse trees that contain subjects, objects, or complements. The final augmentation is selected by filtering the candidates with language-model scoring and a metric based on the Jaccard similarity between the original pair and each candidate, which improves specificity. Experiments show that CASSI consistently outperforms existing methods across multiple resource levels and multiple languages. Compared to the best-performing baseline, it achieves an average relative Micro-F1 improvement of 4.28% to 25.97% on subsets of CoNLL 2002/03, and of 1.56% across three noisy text datasets.
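
To make the described mechanism concrete, the generate-then-filter pipeline can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the method as implemented in the thesis: spaCy is assumed for dependency parsing, the set of swap-eligible dependency labels is a guess at "subjects, objects, or complements", and lm_score is a hypothetical placeholder for real language-model fluency scoring.

    # Sketch of a CASSI-style candidate generation and filtering pipeline.
    # Illustrative only; labels, thresholds, and scoring are assumptions.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Assumed dependency labels covering subjects, objects, and complements.
    SWAP_DEPS = {"nsubj", "dobj", "obj", "attr", "acomp"}

    def subtree_spans(doc):
        """Surface strings of sub-trees rooted at swap-eligible tokens."""
        return [(tok.dep_, "".join(t.text_with_ws for t in tok.subtree).strip())
                for tok in doc if tok.dep_ in SWAP_DEPS]

    def candidates(sent_a, sent_b):
        """Swap same-role sub-trees between two semantically similar sentences."""
        doc_a, doc_b = nlp(sent_a), nlp(sent_b)
        out = []
        for dep_a, span_a in subtree_spans(doc_a):
            for dep_b, span_b in subtree_spans(doc_b):
                if dep_a == dep_b and span_a != span_b and span_a in sent_a:
                    out.append(sent_a.replace(span_a, span_b, 1))
        return out

    def jaccard(s, t):
        """Word-level Jaccard similarity between two sentences."""
        a, b = set(s.lower().split()), set(t.lower().split())
        return len(a & b) / len(a | b) if a | b else 0.0

    def lm_score(sentence):
        # Hypothetical stand-in: in practice, fluency would be scored with a
        # pretrained language model (e.g., negative perplexity).
        return -abs(len(sentence.split()) - 15)

    def best_augmentation(sent_a, sent_b, max_overlap=0.8):
        """Drop candidates too close to either original (specificity filter),
        then pick the most fluent survivor under the placeholder LM score."""
        pool = [c for c in candidates(sent_a, sent_b)
                if max(jaccard(c, sent_a), jaccard(c, sent_b)) < max_overlap]
        return max(pool, key=lm_score) if pool else None

The structure mirrors the summary above: sub-tree swaps supply diverse contexts while preserving the entity-bearing structure, and the Jaccard-based filter discards candidates that overlap too heavily with the original pair.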