Active learning with applications in biomedical document annotation

The rapidly increasing volume of published biomedical research literature is challenging individual biomedical researchers to keep up to date with all the latest developments in their own research fields. With the advancement of natural language processing and Semantic Web technologies, more a...

Bibliographic Details
Main Author: Han, Xu
Other Authors: Kwoh Chee Keong
Format: Theses and Dissertations
Language: English
Published: 2017
Online Access: http://hdl.handle.net/10356/71680
Institution: Nanyang Technological University
id sg-ntu-dr.10356-71680
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering
spellingShingle DRNTU::Engineering::Computer science and engineering
Han, Xu
Active learning with applications in biomedical document annotation
description The rapidly increasing volume of published biomedical research literature makes it difficult for individual biomedical researchers to keep up to date with the latest developments in their own research fields. With advances in natural language processing and Semantic Web technologies, increasingly sophisticated biomedical natural language processing systems are becoming available to lessen the information overload that individual researchers face. However, building and evaluating such biomedical text mining systems requires manually annotated corpora, and constructing these corpora is time-consuming and expensive because it demands much tedious effort from human annotators.

Active learning is an approach to this problem that helps annotators reduce the time and effort needed for corpus annotation. In the widely used passive learning method, documents are selected randomly and independently from the underlying distribution, whereas in active learning a selection module repeatedly queries un-annotated documents to single out the most informative document for manual annotation, and updates its learned rules to maximize overall efficiency.

In this study, we first propose a document-scoring-based active learning method for ontological event extraction. Our method significantly reduces the amount of annotated corpora needed to saturate event extraction performance, compared both to random selection of corpora for annotation, which is the common practice, and to previous active learning methods for corpus selection. We evaluated all the active learning methods using the TEES event extraction system on the BioNLP Shared Tasks datasets, showing that our method helps the system reach its previously reported performance with only 60%-70% of the original training data.

We then propose a committee-based active learning method for event extraction and named entity recognition. The method is based on two systems. We first employ an event extraction system to filter potential false negatives among unlabeled documents, that is, documents from which the system does not extract any event. We then adopt a statistical method to rank these potential false negatives 1) by using a language model that measures the probabilities of the expression of multiple events in documents and 2) by using a named entity recognition system that locates the named entities that can be event arguments (e.g., proteins). The proposed method further deals with unknown words in test data by using word similarity measures. We also apply our active learning method to the task of named entity recognition. We evaluate the proposed method on the BioNLP Shared Tasks datasets and show that it achieves better performance than previous methods such as entropy-based and Gibbs-error-based methods and a conventional committee-based method. We also show that incorporating named entity recognition into active learning for event extraction, together with the unknown word handling, further improves the active learning method. In addition, adapting the active learning method to named entity recognition tasks also improves document selection for manual annotation of named entities.

Finally, we propose a novel clustering-based active learning method for the biomedical NER task. We show that the underlying NER system using the proposed method outperforms those using other state-of-the-art active learning methods, including density-, Gibbs-error-, and entropy-based approaches, as well as random selection. We compare variations of our proposed method and find the optimal design, which is to use the vector representation of named entities, to select documents that are ‘representative’ and ‘informative’, and to use the Shared Nearest Neighbor (SNN) clustering approach. In particular, the optimal variant of the proposed method achieves a deficiency gain of 36.3% over random selection.

The proposed active learning method is a promising research direction, and we will conduct further research to exploit its full potential.
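The contrast the abstract draws between passive and active learning can be illustrated with a minimal sketch of a pool-based selection loop. This is not the thesis's document-scoring, committee-based, or clustering-based method; it uses plain entropy-based uncertainty sampling, which the abstract names only as a baseline. All function names, the toy word-count model, and the example documents are hypothetical and chosen for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Model = Callable[[str], List[float]]  # maps a document to [p_negative, p_positive]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_informative(pool: Sequence[str], predict_proba: Model) -> int:
    """Index of the pool document whose prediction has the highest entropy."""
    return max(range(len(pool)), key=lambda i: entropy(predict_proba(pool[i])))


def train(labeled: Sequence[Tuple[str, int]]) -> Model:
    """Toy stand-in for a real extractor: smoothed per-word label counts."""
    counts = {}  # word -> [times seen in negative docs, times seen in positive docs]
    for doc, label in labeled:
        for word in set(doc.split()):
            counts.setdefault(word, [0, 0])[label] += 1

    def predict_proba(doc: str) -> List[float]:
        neg, pos = 1.0, 1.0  # add-one smoothing so an untrained model is maximally unsure
        for word in set(doc.split()):
            word_neg, word_pos = counts.get(word, [0, 0])
            neg += word_neg
            pos += word_pos
        total = neg + pos
        return [neg / total, pos / total]

    return predict_proba


def active_learning_loop(
    pool: Sequence[str], oracle: Callable[[str], int], budget: int
) -> List[Tuple[str, int]]:
    """Repeatedly query the most uncertain document, label it, and retrain."""
    remaining = list(pool)
    labeled: List[Tuple[str, int]] = []
    model = train(labeled)
    for _ in range(min(budget, len(remaining))):
        doc = remaining.pop(select_most_informative(remaining, model))
        labeled.append((doc, oracle(doc)))  # the oracle plays the human annotator
        model = train(labeled)
    return labeled


if __name__ == "__main__":
    docs = ["A binds B", "C is a protein", "D binds E in cells", "cells were cultured"]
    # Hypothetical annotator: a document is positive if it mentions a binding event.
    print(active_learning_loop(docs, lambda d: int("binds" in d), budget=2))
```

Passive learning would correspond to replacing `select_most_informative` with a uniformly random choice from the remaining pool.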
author2 Kwoh Chee Keong
author_facet Kwoh Chee Keong
Han, Xu
format Theses and Dissertations
author Han, Xu
author_sort Han, Xu
title Active learning with applications in biomedical document annotation
title_short Active learning with applications in biomedical document annotation
title_full Active learning with applications in biomedical document annotation
title_fullStr Active learning with applications in biomedical document annotation
title_full_unstemmed Active learning with applications in biomedical document annotation
title_sort active learning with applications in biomedical document annotation
publishDate 2017
url http://hdl.handle.net/10356/71680
_version_ 1759858060828868608
spelling sg-ntu-dr.10356-716802023-03-04T00:50:01Z Active learning with applications in biomedical document annotation Han, Xu Kwoh Chee Keong School of Computer Science and Engineering Bioinformatics Research Centre DRNTU::Engineering::Computer science and engineering Doctor of Philosophy 2017-05-18T07:57:40Z 2017-05-18T07:57:40Z 2017 Thesis Han, X. (2017). Active learning with applications in biomedical document annotation. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/71680 10.32657/10356/71680 en 134 p. application/pdf