On the data scarcity problem of neural-based named entity recognition
Saved in:

Main Author: | Zhou, Ran |
---|---|
Other Authors: | Erik Cambria; Miao Chun Yan |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science |
Online Access: | https://hdl.handle.net/10356/173481 |
Institution: | Nanyang Technological University |
id | sg-ntu-dr.10356-173481 |
---|---|
record_format | dspace |
institution | Nanyang Technological University |
building | NTU Library |
continent | Asia |
country | Singapore |
content_provider | NTU Library |
collection | DR-NTU |
language | English |
topic | Computer and Information Science |
description |
The data scarcity problem in neural-based Named Entity Recognition (NER) refers to the challenge of limited annotated data available for training NER models.
Collecting and annotating large amounts of labeled data for various languages and domains can be time-consuming, expensive, and sometimes even impractical.
This lack of labeled data can hinder the performance of neural-based NER models, as they require a substantial number of annotated examples to learn effectively.
With limited training data, neural-based NER models may struggle to generalize and to accurately identify unseen named entities in out-of-domain text or in a different language. They are also prone to overfitting, where the model becomes too specific to the training data and fails to generalize to new data, reducing overall performance.
Addressing the data scarcity problem in neural-based NER involves exploring alternative approaches to mitigate the impact of limited labeled data.
Common strategies include data augmentation techniques such as word or entity replacement and synthetic data generation, as well as leveraging external resources like knowledge bases or dictionaries.
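As a rough, hypothetical illustration of entity replacement (not the augmentation framework proposed in this thesis), one can swap each entity span in a BIO-tagged sentence for a same-type entity drawn from a gazetteer; the gazetteer below is invented for the example:

```python
import random

# Invented gazetteer purely for illustration.
GAZETTEER = {
    "PER": [["Ada", "Lovelace"], ["Alan", "Turing"]],
    "LOC": [["Singapore"], ["Paris"]],
}

def replace_entities(tokens, tags, rng=None):
    """Replace each BIO-tagged entity span with a random same-type entity,
    re-emitting aligned B-/I- tags so tokens and labels stay in sync."""
    rng = rng or random.Random(0)
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tags) and tags[j] == f"I-{etype}":
                j += 1
            # Fall back to the original span if the type is not in the gazetteer.
            new_ent = rng.choice(GAZETTEER.get(etype, [tokens[i:j]]))
            out_tokens.extend(new_ent)
            out_tags.extend(["B-" + etype] + ["I-" + etype] * (len(new_ent) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tags[i])
            i += 1
    return out_tokens, out_tags
```

Replacing whole spans and regenerating the B-/I- tags is what keeps token-label alignment intact even when the replacement entity has a different length.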
Much existing work focuses on the common data-scarce scenario of cross-lingual NER, where training data is available in the source language but the target language has few or no annotations.
For example, consistency training encourages the model's predictions to be consistent across different representations of the same input, and can be used to improve the robustness and generalization of NER models across different languages.
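As an illustrative sketch of this idea (an assumed formulation, not necessarily the one used in the thesis), a token-level consistency loss between two views of the same sentence, say the original text and its translation, can be a symmetric KL divergence between the per-token label distributions:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consistency_loss(logits_a, logits_b, eps=1e-9):
    """Symmetric KL divergence between per-token label distributions for two
    views of the same input, averaged over tokens. logits_*: (tokens, labels)."""
    p = softmax(logits_a)
    q = softmax(logits_b)
    kl_pq = (p * (np.log(p + eps) - np.log(q + eps))).sum(-1)
    kl_qp = (q * (np.log(q + eps) - np.log(p + eps))).sum(-1)
    return float((kl_pq + kl_qp).mean() / 2)
```

The loss is zero when the two views agree exactly and grows as the label distributions diverge, so minimizing it pushes the model toward language-invariant predictions.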
Moreover, self-training has been applied to enhance the NER model's knowledge of the target language's linguistic characteristics and entity patterns by taking advantage of the abundant unlabeled text in the target language.
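For instance, in its simplest form (a simplified recipe, not the thesis's exact procedure), self-training pseudo-labels unlabeled target-language sentences with a source-trained model and keeps only confident predictions for retraining:

```python
import numpy as np

def select_pseudo_labeled(probs, threshold=0.9):
    """probs: (num_tokens, num_labels) softmax outputs of the source-trained
    model on one unlabeled target-language sentence. Returns the predicted
    label ids and whether every token clears the confidence threshold, in
    which case the sentence is added to the training set as pseudo-labeled."""
    preds = probs.argmax(axis=-1)
    keep = bool(probs.max(axis=-1).min() >= threshold)
    return preds, keep
```

Simple confidence filtering like this is the usual baseline; its weakness is that confidently wrong pseudo-labels slip through, which motivates the denoising approaches discussed below.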
In this thesis, we present our research to address the data scarcity problem of neural-based NER. Our contributions are as follows.
Firstly, we propose a novel data augmentation framework for low-resource NER, which effectively improves entity diversity, alleviates the token-label misalignment problem, and proves effective under monolingual, cross-lingual, and multilingual experimental settings.
Secondly, we present a consistency training method for cross-lingual NER, which propagates reliable supervision signals from the source language to the target language, aligns the representation space between languages, and alleviates overfitting on the source language. Evaluated on a range of cross-lingual transfer pairs, our method outperforms various baseline methods.
Finally, we introduce an improved self-training method for cross-lingual NER, where contrastive learning is utilized to facilitate classification and prototype learning is used to iteratively denoise pseudo-labeled target-language data. The proposed method yields significant improvements over existing self-training methods and achieves state-of-the-art performance.
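A minimal sketch of the prototype-denoising idea (with hypothetical feature vectors; the thesis's actual procedure is iterative and more involved): compute one prototype per pseudo-label class as the mean feature of that class, then drop tokens whose pseudo-label disagrees with their nearest prototype:

```python
import numpy as np

def prototype_denoise(features, pseudo_labels):
    """features: (n, d) token representations; pseudo_labels: (n,) label ids.
    Returns a boolean mask marking tokens whose pseudo-label agrees with the
    nearest class prototype (mean feature vector of that class)."""
    classes = np.unique(pseudo_labels)
    protos = np.stack([features[pseudo_labels == c].mean(axis=0) for c in classes])
    # Euclidean distance from every token to every prototype, via broadcasting.
    dists = np.linalg.norm(features[:, None, :] - protos[None, :, :], axis=-1)
    nearest = classes[dists.argmin(axis=1)]
    return nearest == pseudo_labels
```

The intuition is that mislabeled tokens tend to sit far from the centroid of their assigned class, so disagreement with the nearest prototype flags likely label noise.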
In conclusion, we have shown that effective data augmentation methods, consistency training frameworks, and improved self-training schemes can largely alleviate the data scarcity problem in neural-based named entity recognition. |
author | Zhou, Ran |
---|---|
author2 | Erik Cambria |
format | Thesis-Doctor of Philosophy |
title | On the data scarcity problem of neural-based named entity recognition |
publisher | Nanyang Technological University |
publishDate | 2024 |
url | https://hdl.handle.net/10356/173481 |
_version_ | 1794549416901738496 |
spelling | Record sg-ntu-dr.10356-173481, last updated 2024-03-07T08:52:06Z. Supervisors: Erik Cambria, Miao Chun Yan (School of Computer Science and Engineering; ASCYMiao@ntu.edu.sg, cambria@ntu.edu.sg). Degree: Doctor of Philosophy, 2023; deposited 2024-02-07. Citation: Zhou, R. (2023). On the data scarcity problem of neural-based named entity recognition. Doctoral thesis, Nanyang Technological University, Singapore. DOI: 10.32657/10356/173481. Funding: Alibaba Group through the Alibaba Innovative Research (AIR) Program and the Alibaba-NTU Singapore Joint Research Institute (JRI). License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Format: application/pdf. Publisher: Nanyang Technological University. |