Active learning with applications in biomedical document annotation
Main Author: | |
---|---|
Other Authors: | |
Format: | Theses and Dissertations |
Language: | English |
Published: | 2017 |
Subjects: | |
Online Access: | http://hdl.handle.net/10356/71680 |
Institution: | Nanyang Technological University |
Summary: | The rapidly increasing volume of published biomedical research literature makes it challenging for individual biomedical researchers to keep up to date with the latest developments in their own research fields. With the advancement of natural language processing and Semantic Web technologies, increasingly sophisticated biomedical natural language processing systems are becoming available to lessen the burden of information overload that individual biomedical researchers face. However, building and evaluating such biomedical text mining systems critically requires manually annotated corpora, and constructing these corpora is time-consuming and expensive because it demands considerable, tedious effort from human annotators.
Active learning is an approach that addresses this issue and helps annotators reduce the time and effort needed for corpus annotation. In the widely used passive learning method, documents are selected randomly and independently from the underlying distribution, whereas in active learning a selection module is allowed to repeatedly query unannotated documents in order to single out the most informative document for manual annotation and to update its learned rules, so as to maximize overall efficiency.
In this study, we first propose a document scoring based active learning method for ontological event extraction. Our method can significantly reduce the amount of annotated corpora needed to saturate event extraction performance, compared both to random selection of corpora for annotation, which is the common practice, and to previous active learning methods for corpus selection. We evaluated the performance of all the active learning methods using the TEES event extraction system against the BioNLP Shared Tasks datasets, showing that our method can help the system achieve its previously reported performance with only 60%-70% of the original training data.
We then propose a committee-based active learning method for event extraction and named entity recognition. The method is based on two systems, as follows. We first employ an event extraction system to filter potential false negatives among unlabeled documents, that is, documents from which the system does not extract any event. We then adopt a statistical method to rank these potential false negatives 1) by using a language model that measures the probabilities of the expression of multiple events in documents and 2) by using a named entity recognition system that locates the named entities that can be event arguments (e.g., proteins). The proposed method further deals with unknown words in test data by using word similarity measures. We also apply our active learning method to the task of named entity recognition. We evaluate the proposed method against the BioNLP Shared Tasks datasets and show that it achieves better performance than previous methods such as entropy and Gibbs error based methods and a conventional committee-based method. We also show that incorporating named entity recognition into active learning for event extraction, together with the unknown word handling, further improves the active learning method. In addition, adapting the active learning method to named entity recognition tasks also improves document selection for manual annotation of named entities.
Finally, we propose a novel clustering based active learning method for the biomedical NER task. We show that the underlying NER system using the proposed method outperforms those using other state-of-the-art active learning methods, including density, Gibbs error and entropy based approaches, as well as random selection. We compare variations of our proposed method and find the optimal design of the active learning method, which is to use the vector representation of named entities, to select documents that are ‘representative’ and ‘informative’, and to use the Shared Nearest Neighbor (SNN) clustering approach. In particular, the optimal variant of the proposed method achieves a deficiency gain of 36.3% over random selection.
The proposed active learning method is a promising research direction, and we will conduct further research to exploit the full potential of this method. |
---|
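The contrast the summary draws between passive (random) selection and active selection can be sketched as a small pool-based loop. This is only a generic illustration that uses prediction entropy as the informativeness score, which the summary names as a baseline. It is not the thesis's document scoring, committee-based, or clustering-based method. The function names and the `annotate` and `train` callbacks are hypothetical placeholders for a human annotator and a trainable model.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Model = Callable[[str], Sequence[float]]  # maps a document to class probabilities


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_random(pool: List[str], rng: random.Random) -> str:
    """Passive baseline: pick the next document uniformly at random."""
    return rng.choice(pool)


def select_most_informative(pool: List[str], predict_proba: Model) -> str:
    """Active selection: pick the document the current model is least sure about."""
    return max(pool, key=lambda doc: entropy(predict_proba(doc)))


def active_learning_loop(
    pool: List[str],
    annotate: Callable[[str], str],                  # hypothetical: human annotator
    train: Callable[[List[Tuple[str, str]]], Model],  # hypothetical: model trainer
    rounds: int,
) -> List[Tuple[str, str]]:
    """Repeatedly query the pool, annotate the chosen document, and retrain."""
    pool = list(pool)  # work on a copy so the caller's list is untouched
    labeled: List[Tuple[str, str]] = []
    predict_proba: Model = lambda doc: (0.5, 0.5)  # uninformed model before training
    for _ in range(min(rounds, len(pool))):
        doc = select_most_informative(pool, predict_proba)
        pool.remove(doc)
        labeled.append((doc, annotate(doc)))
        predict_proba = train(labeled)  # update the model with the new annotation
    return labeled
```

Replacing `select_most_informative` with `select_random` inside the loop gives the passive baseline, so the two strategies can be compared on how many annotated documents each needs to reach a given performance level.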