Deep learning-based text augmentation for named entity recognition

This thesis is focused on the development of an effective text augmentation method for Named Entity Recognition (NER) in the low-resource setting. NER, an important sequence labeling task in Natural Language Processing, is used to identify predefined entities in text. NER datasets tend to be smal...

Bibliographic Details
Main Author: Surana, Tanmay
Other Authors: Chng Eng Siong
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access: https://hdl.handle.net/10356/171105
Institution: Nanyang Technological University
id sg-ntu-dr.10356-171105
record_format dspace
spelling sg-ntu-dr.10356-171105
2023-11-02T02:20:48Z
Deep learning-based text augmentation for named entity recognition
Surana, Tanmay
Chng Eng Siong
School of Computer Science and Engineering
ASESChng@ntu.edu.sg
Master of Engineering
2023-10-16T02:23:27Z
2023
Thesis-Master by Research
Surana, T. (2023). Deep learning-based text augmentation for named entity recognition. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/171105
10.32657/10356/171105
en
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
application/pdf
Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Engineering::Computer science and engineering
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Computer science and engineering::Information systems::Information storage and retrieval
description This thesis develops an effective text augmentation method for Named Entity Recognition (NER) in the low-resource setting. NER, an important sequence labeling task in Natural Language Processing, identifies predefined entities in text. NER datasets tend to be small, which makes generating additional training text through augmentation a plausible solution. Existing NER text augmentation methods suffer from label corruption and a lack of context diversity. To address these limitations, this thesis proposes Contextual and Semantic Structure-based Interpolation (CASSI), a structure-based text augmentation scheme that combines two semantically similar sentences. It produces candidate augmentations by replacing sub-trees of the sentences' dependency parse trees that contain subjects, objects, or complements. The final augmentation is selected by filtering the candidates through Language Model scoring and a metric that uses Jaccard Similarity between the original pair and the candidates to improve specificity. Experiments show that CASSI consistently outperforms existing methods across multiple resource levels and languages. Compared to the best-performing baseline, it shows an average relative improvement in Micro-F1 of 4.28% to 25.97% on subsets of CoNLL 2002/03, and 1.56% across three noisy text datasets.
author2 Chng Eng Siong
author_facet Chng Eng Siong
Surana, Tanmay
format Thesis-Master by Research
author Surana, Tanmay
title Deep learning-based text augmentation for named entity recognition
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/171105
_version_ 1781793875131629568
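
The description above mentions a candidate-filtering metric based on Jaccard Similarity between the original sentence pair and the candidates, but the record does not give its exact form. The following is only a minimal sketch of the general idea, not the thesis's method: it assumes whitespace tokenization and lowercase matching, and the helper names jaccard and rank_candidates, as well as the choice to average the overlap with both source sentences, are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token collections."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def rank_candidates(source_pair, candidates):
    """Hypothetical ranking: mean Jaccard overlap with both source sentences.

    Assumption for illustration only; the actual CASSI metric is not
    specified in this record.
    """
    first, second = (s.lower().split() for s in source_pair)
    scored = []
    for cand in candidates:
        tokens = cand.lower().split()
        score = (jaccard(tokens, first) + jaccard(tokens, second)) / 2
        scored.append((score, cand))
    # Highest-overlap candidates first.
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    pair = ("john lives in paris", "mary works in london")
    for score, cand in rank_candidates(pair, ["john works in london", "the cat sat"]):
        print(f"{score:.3f}  {cand}")
```

In a fuller pipeline, a score like this would presumably be combined with the Language Model score that the description also mentions; how the two are weighted is not stated in this record.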