On the data scarcity problem of neural-based named entity recognition
Saved in:

Main Author: | Zhou, Ran |
---|---|
Other Authors: | Erik Cambria; Miao Chun Yan |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science |
Online Access: | https://hdl.handle.net/10356/173481 |
Institution: | Nanyang Technological University |
id | sg-ntu-dr.10356-173481 |
---|---|
record_format | dspace |
institution | Nanyang Technological University |
building | NTU Library |
continent | Asia |
country | Singapore |
content_provider | NTU Library |
collection | DR-NTU |
language | English |
topic | Computer and Information Science |
description |
The data scarcity problem in neural-based Named Entity Recognition (NER) refers to the challenge of limited annotated data available for training NER models.
Collecting and annotating large amounts of labeled data for various languages and domains can be time-consuming, expensive, and sometimes even impractical.
This lack of labeled data can hinder the performance of neural-based NER models, as they require a substantial number of annotated examples to learn effectively.
With limited training data, neural-based NER models may struggle to generalize and to accurately identify unseen named entities in out-of-domain text or in a different language. They are also prone to overfitting, where the model becomes too specific to the training data and fails to generalize to new data, reducing overall performance.
Addressing the data scarcity problem in neural-based NER involves exploring alternative approaches to mitigate the impact of limited labeled data.
Common strategies include data augmentation techniques such as word or entity replacement and synthetic data generation, as well as leveraging external resources like knowledge bases or dictionaries.
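As a rough, hypothetical illustration of entity replacement (not the augmentation framework proposed in this thesis), one can swap each entity span in a BIO-tagged sentence for a same-type entity drawn from a gazetteer; the gazetteer below is invented for the example:

```python
import random

# Invented gazetteer purely for illustration.
GAZETTEER = {
    "PER": [["Ada", "Lovelace"], ["Alan", "Turing"]],
    "LOC": [["Singapore"], ["Paris"]],
}

def replace_entities(tokens, tags, rng=None):
    """Replace each BIO-tagged entity span with a random same-type entity,
    re-emitting aligned B-/I- tags so tokens and labels stay in sync."""
    rng = rng or random.Random(0)
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tags) and tags[j] == f"I-{etype}":
                j += 1
            # Fall back to the original span if the type is not in the gazetteer.
            new_ent = rng.choice(GAZETTEER.get(etype, [tokens[i:j]]))
            out_tokens.extend(new_ent)
            out_tags.extend(["B-" + etype] + ["I-" + etype] * (len(new_ent) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tags[i])
            i += 1
    return out_tokens, out_tags
```

Replacing whole spans and regenerating the B-/I- tags is what keeps token-label alignment intact even when the replacement entity has a different length.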
Much existing work focuses on the common data-scarce scenario of cross-lingual NER, where training data is available in the source language but the target language has few or no annotations.
For example, consistency training encourages the model's predictions to be consistent across different representations of the same input, and can be used to improve the robustness and generalization of NER models across different languages.
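As an illustrative sketch of this idea (an assumed formulation, not necessarily the one used in the thesis), a token-level consistency loss between two views of the same sentence, say the original text and its translation, can be a symmetric KL divergence between the per-token label distributions:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consistency_loss(logits_a, logits_b, eps=1e-9):
    """Symmetric KL divergence between per-token label distributions for two
    views of the same input, averaged over tokens. logits_*: (tokens, labels)."""
    p = softmax(logits_a)
    q = softmax(logits_b)
    kl_pq = (p * (np.log(p + eps) - np.log(q + eps))).sum(-1)
    kl_qp = (q * (np.log(q + eps) - np.log(p + eps))).sum(-1)
    return float((kl_pq + kl_qp).mean() / 2)
```

The loss is zero when the two views agree exactly and grows as the label distributions diverge, so minimizing it pushes the model toward language-invariant predictions.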
Moreover, self-training has been applied to enhance the NER model's knowledge of the target language's linguistic characteristics and entity patterns by taking advantage of the abundant unlabeled text in the target language.
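For instance, in its simplest form (a simplified recipe, not the thesis's exact procedure), self-training pseudo-labels unlabeled target-language sentences with a source-trained model and keeps only confident predictions for retraining:

```python
import numpy as np

def select_pseudo_labeled(probs, threshold=0.9):
    """probs: (num_tokens, num_labels) softmax outputs of the source-trained
    model on one unlabeled target-language sentence. Returns the predicted
    label ids and whether every token clears the confidence threshold, in
    which case the sentence is added to the training set as pseudo-labeled."""
    preds = probs.argmax(axis=-1)
    keep = bool(probs.max(axis=-1).min() >= threshold)
    return preds, keep
```

Simple confidence filtering like this is the usual baseline; its weakness is that confidently wrong pseudo-labels slip through, which motivates the denoising approaches discussed below.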
In this thesis, we present our research to address the data scarcity problem of neural-based NER. Our contributions are as follows.
Firstly, we propose a novel data augmentation framework for low-resource NER, which effectively improves entity diversity, alleviates the token-label misalignment problem, and proves effective under monolingual, cross-lingual, and multilingual experimental settings.
Secondly, we present a consistency training method for cross-lingual NER, which propagates reliable supervision signals from the source language to the target language, aligns the representation space between languages, and alleviates overfitting on the source language. Evaluated on a range of cross-lingual transfer pairs, our method outperforms various baseline methods.
Finally, we introduce an improved self-training method for cross-lingual NER, where contrastive learning is utilized to facilitate classification and prototype learning is used to iteratively denoise pseudo-labeled target-language data. The proposed method yields significant improvements over existing self-training methods and achieves state-of-the-art performance.
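A minimal sketch of the prototype-denoising idea (with hypothetical feature vectors; the thesis's actual procedure is iterative and more involved): compute one prototype per pseudo-label class as the mean feature of that class, then drop tokens whose pseudo-label disagrees with their nearest prototype:

```python
import numpy as np

def prototype_denoise(features, pseudo_labels):
    """features: (n, d) token representations; pseudo_labels: (n,) label ids.
    Returns a boolean mask marking tokens whose pseudo-label agrees with the
    nearest class prototype (mean feature vector of that class)."""
    classes = np.unique(pseudo_labels)
    protos = np.stack([features[pseudo_labels == c].mean(axis=0) for c in classes])
    # Euclidean distance from every token to every prototype, via broadcasting.
    dists = np.linalg.norm(features[:, None, :] - protos[None, :, :], axis=-1)
    nearest = classes[dists.argmin(axis=1)]
    return nearest == pseudo_labels
```

The intuition is that mislabeled tokens tend to sit far from the centroid of their assigned class, so disagreement with the nearest prototype flags likely label noise.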
In conclusion, we have shown that effective data augmentation methods, consistency training frameworks, and improved self-training schemes can largely alleviate the data scarcity problem in neural-based named entity recognition. |
author | Zhou, Ran |
---|---|
author2 | Erik Cambria |
format | Thesis-Doctor of Philosophy |
title | On the data scarcity problem of neural-based named entity recognition |
publisher | Nanyang Technological University |
publishDate | 2024 |
url | https://hdl.handle.net/10356/173481 |
_version_ | 1794549416901738496 |
spelling | Record sg-ntu-dr.10356-173481, last updated 2024-03-07T08:52:06Z. Supervisors: Erik Cambria, Miao Chun Yan (School of Computer Science and Engineering; ASCYMiao@ntu.edu.sg, cambria@ntu.edu.sg). Degree: Doctor of Philosophy, 2023; deposited 2024-02-07. Citation: Zhou, R. (2023). On the data scarcity problem of neural-based named entity recognition. Doctoral thesis, Nanyang Technological University, Singapore. DOI: 10.32657/10356/173481. Funding: Alibaba Group through the Alibaba Innovative Research (AIR) Program and the Alibaba-NTU Singapore Joint Research Institute (JRI). License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Format: application/pdf. Publisher: Nanyang Technological University. |