Active learning with applications in biomedical document annotation

The rapidly increasing volume of published biomedical research literature is challenging individual biomedical researchers to keep up to date with all the latest developments in their own research fields. With the advancement of natural language processing and Semantic Web technologies, more a...

Bibliographic Details
Main Author:	Han, Xu
Other Authors:	Kwoh Chee Keong
Format:	Theses and Dissertations
Language:	English
Published:	2017
Subjects:	DRNTU::Engineering::Computer science and engineering
Online Access:	http://hdl.handle.net/10356/71680
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-71680
record_format	dspace
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Computer science and engineering
spellingShingle	DRNTU::Engineering::Computer science and engineering Han, Xu Active learning with applications in biomedical document annotation
description	The rapidly increasing volume of published biomedical research literature makes it difficult for individual biomedical researchers to keep up to date with the latest developments in their own research fields. With the advancement of natural language processing and Semantic Web technologies, increasingly sophisticated biomedical natural language processing systems are becoming available to lessen the burden of information overload that individual biomedical researchers face. However, building and evaluating such biomedical text mining systems requires manually annotated corpora, and constructing these corpora is time-consuming and expensive because it demands substantial, tedious effort from human annotators.
Active learning is an approach to resolving this issue: it helps annotators reduce the time and effort needed for the corpus annotation process. In the widely used passive learning method, documents are selected randomly and independently from the underlying distribution, whereas in active learning a selection module repeatedly queries un-annotated documents to single out the most informative document for manual annotation, and updates its learned rules to maximize overall efficiency.
In this study, we first propose a document-scoring-based active learning method for ontological event extraction. Our method significantly reduces the amount of annotated corpora needed to saturate event extraction performance, compared both with random selection of corpora for annotation, which is the common practice, and with previous active learning methods for corpus selection. We evaluated all the active learning methods using the TEES event extraction system on the BioNLP Shared Tasks datasets, showing that our method helps the system achieve its previously reported performance with only 60%-70% of the original training data.
We then propose a committee-based active learning method for event extraction and named entity recognition. The method is based on two systems, as follows. We first employ an event extraction system to filter potential false negatives among unlabeled documents, that is, documents from which the system does not extract any event. We then adopt a statistical method to rank these potential false negatives 1) by using a language model that measures the probabilities of the expression of multiple events in documents and 2) by using a named entity recognition system that locates the named entities that can be event arguments (e.g., proteins). The proposed method further deals with unknown words in test data by using word similarity measures. We also apply our active learning method to the task of named entity recognition. We evaluate the proposed method on the BioNLP Shared Tasks datasets, and show that it achieves better performance than previous methods such as entropy-based and Gibbs-error-based methods and a conventional committee-based method. We also show that incorporating named entity recognition into active learning for event extraction, together with the unknown word handling, further improves the active learning method. In addition, adapting the active learning method to named entity recognition tasks also improves document selection for manual annotation of named entities.
Finally, we propose a novel clustering-based active learning method for the biomedical NER task. We show that the underlying NER system using the proposed method outperforms those using other state-of-the-art active learning methods, including density-based, Gibbs-error-based and entropy-based approaches, as well as random selection. We compare variations of our proposed method and find the optimal design of the active learning method, which is to use the vector representation of named entities, to select documents that are ‘representative’ and ‘informative’, and to use the Shared Nearest Neighbor (SNN) clustering approach. In particular, the optimal variant of the proposed method achieves a deficiency gain of 36.3% over random selection.
The proposed active learning method is a promising research direction, and we will conduct further research to exploit the full potential of this method.
author2	Kwoh Chee Keong
author_facet	Kwoh Chee Keong Han, Xu
format	Theses and Dissertations
author	Han, Xu
author_sort	Han, Xu
title	Active learning with applications in biomedical document annotation
title_short	Active learning with applications in biomedical document annotation
title_full	Active learning with applications in biomedical document annotation
title_fullStr	Active learning with applications in biomedical document annotation
title_full_unstemmed	Active learning with applications in biomedical document annotation
title_sort	active learning with applications in biomedical document annotation
publishDate	2017
url	http://hdl.handle.net/10356/71680
_version_	1759858060828868608
spelling	sg-ntu-dr.10356-71680 2023-03-04T00:50:01Z Active learning with applications in biomedical document annotation Han, Xu Kwoh Chee Keong School of Computer Science and Engineering Bioinformatics Research Centre DRNTU::Engineering::Computer science and engineering Doctor of Philosophy 2017-05-18T07:57:40Z 2017-05-18T07:57:40Z 2017 Thesis Han, X. (2017). Active learning with applications in biomedical document annotation. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/71680 10.32657/10356/71680 en 134 p. application/pdf
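
The abstract in this record contrasts passive learning (random document selection) with active learning, where a selection module repeatedly queries the un-annotated pool for the most informative document. The sketch below illustrates that generic pool-based loop using prediction entropy, one of the baseline criteria the abstract mentions. It is not the thesis's own scoring, committee or clustering method; the function names and the `annotate`, `train` and `predict_proba` callbacks are hypothetical placeholders chosen for illustration.

```python
import math
from typing import Any, Callable, List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def run_active_learning(
    pool: Sequence[Any],
    annotate: Callable[[Any], Any],            # hypothetical: stands in for the human annotator
    train: Callable[[List[Tuple[Any, Any]]], Any],       # hypothetical: fits a model on labeled pairs
    predict_proba: Callable[[Any, Any], Sequence[float]],  # hypothetical: model, doc -> label distribution
    budget: int,
) -> Tuple[List[Tuple[Any, Any]], Any]:
    """Pool-based loop: score each unlabeled document, annotate the top one, retrain."""
    unlabeled = list(pool)
    labeled: List[Tuple[Any, Any]] = []
    model = train(labeled)
    for _ in range(min(budget, len(unlabeled))):
        # Informativeness here is the entropy of the current model's prediction
        # (an assumed criterion; other scores can be substituted).
        chosen = max(unlabeled, key=lambda doc: entropy(predict_proba(model, doc)))
        unlabeled.remove(chosen)
        labeled.append((chosen, annotate(chosen)))
        model = train(labeled)  # update the learned rules before the next query
    return labeled, model
```

Replacing the `max(...)` call with a random choice from `unlabeled` gives the passive-learning baseline described in the abstract, so the two strategies can be compared under the same annotation budget.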
