EVENT GEOPARSER WITH EVENT EXTRACTION INTEGRATION, PSEUDO-LOCATION ENTITY IDENTIFICATION AND SEMANTIC EXPLORATION FROM INDONESIAN NEWS CORPORA
Main Author: | |
---|---|
Format: | Dissertations |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/53182 |
Institution: | Institut Teknologi Bandung |
Summary: | The constant expansion of the World Wide Web has created the largest repository
of natural-language text, with many types of information served in
varying forms such as web pages, news articles, social media posts, and blogs.
Numerous Geographic Information Retrieval (GIR) systems have been developed
to extract geospatial information from unstructured text
and to retrieve that information efficiently through a mapping interface.
One of the core components of a GIR system is the geoparser, which typically performs
toponym recognition, disambiguation, and coordinate resolution of toponyms in
unstructured text.
However, geoparsing is still an open problem because of the ambiguity of
toponyms and other noise present in the text, especially in news stories,
where many events across places are mentioned together with event arguments of
various types, such as geospatial, temporal, and numerical. Existing geoparsers
can resolve locations at the toponym level or the document level, but they lack the
capability to resolve them optimally at the event level. For this purpose,
integration with event extraction methods seems to be a promising approach.
However, it has not been extensively studied, much less for Indonesian news
corpora. The main hypothesis of this dissertation is that integrating event
extraction improves the performance and quality of event
geolocation from text. The second hypothesis is that a semantic exploration process
improves the generalizability of the model.
The research explored geoparsing techniques with an event-level resolution scope
and makes four main contributions. The first contribution is a novel event geoparser
model and its implementation, which improves the quality and performance of
resolving event locations by integrating an event extraction method across three
stages: 1) toponym-level geoparsing, 2) event extraction, and 3) event-level
geoparsing. The geoparsing task is modeled mostly as a sequence labeling problem
using an LSTM-CRF architecture with handcrafted features provided by our
proposed Aggregated Topic Model (ATM). The ATM provides semantically related
event keywords for event triggers, based on a very large number of document tags.
These keywords supply a keyword-matching feature that increases the weighted F-1 scores
for the entity and event extraction tasks. This is further exploited as a binary Smallest
Administrative Level (SAL) document-level geospatial feature, along with an event
label feature, to improve the identification and classification of pseudo-location
entities.
The second contribution is a labeled topic model called the Aggregated Topic Model
(ATM), which enables the exploration of semantic relatedness between tokens based
on multilabeled document tags. ATM addresses a limitation of Labeled LDA by
splitting the corpus into partitions and training on them separately; the resulting
models are then aggregated to build the final model.
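The partition, train, and aggregate flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how the per-partition models are trained or merged, so a simple per-tag word count stands in for Labeled LDA, and aggregation is assumed to be a sum of counts followed by normalization. The function names `partition`, `train_partition`, and `aggregate` and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict


def partition(docs, size):
    """Split the corpus into consecutive partitions of at most `size` documents."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]


def train_partition(docs):
    """Stand-in for training a labeled topic model on one partition.

    ASSUMPTION: per-tag word counts replace Labeled LDA purely for illustration.
    Each document is a (tokens, tags) pair.
    """
    counts = defaultdict(Counter)
    for tokens, tags in docs:
        for tag in tags:
            counts[tag].update(tokens)
    return counts


def aggregate(models):
    """Merge partition models into one tag -> word -> probability table.

    ASSUMPTION: merging is done by summing counts, then normalizing per tag.
    """
    total = defaultdict(Counter)
    for model in models:
        for tag, counter in model.items():
            total[tag].update(counter)
    return {
        tag: {word: n / sum(counter.values()) for word, n in counter.items()}
        for tag, counter in total.items()
    }


# Toy corpus (hypothetical): tokenized documents with multilabel tags.
docs = [
    (["quake", "tremor"], ["earthquake"]),
    (["flood", "water"], ["flood"]),
    (["quake", "magnitude"], ["earthquake", "disaster"]),
]
model = aggregate([train_partition(p) for p in partition(docs, 2)])
print(model["earthquake"])
```

With plain counts, the aggregated table is identical to one trained on the whole corpus. That equivalence would not hold for an actual topic model, where the aggregation rule is a real design choice.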
Our third contribution is the Spatial Minimality Centroid Distance (SMCD-ADM)
algorithm, which improves the Spatial Minimality (SM) algorithm by adding
a Centroid Distance metric to avoid degenerate cases in disambiguation.
This improves the toponym resolution step by 5.71% compared to SM.
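To illustrate why a centroid-distance term can help, here is a minimal sketch of spatial-minimality disambiguation with a centroid-distance tie-breaker. It is not the dissertation's SMCD-ADM implementation: the abstract does not specify how the two metrics are combined, so this sketch assumes a lexicographic ordering (bounding-box area first, then summed distance to the centroid), treats latitude and longitude as planar coordinates, and uses hypothetical names and coordinates throughout.

```python
from itertools import product
from math import hypot


def bbox_area(points):
    """Area of the axis-aligned bounding box around (lat, lon) points."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return (max(lats) - min(lats)) * (max(lons) - min(lons))


def centroid_spread(points):
    """Sum of planar distances from each point to the centroid of all points."""
    clat = sum(p[0] for p in points) / len(points)
    clon = sum(p[1] for p in points) / len(points)
    return sum(hypot(p[0] - clat, p[1] - clon) for p in points)


def resolve(candidates):
    """Pick one candidate coordinate per toponym.

    `candidates` maps each toponym to a list of (lat, lon) candidates.
    ASSUMPTION: combinations are ranked by bounding-box area, with the
    centroid spread used only to break ties.
    """
    names = list(candidates)
    best = min(
        product(*(candidates[name] for name in names)),
        key=lambda combo: (bbox_area(combo), centroid_spread(combo)),
    )
    return dict(zip(names, best))


# Degenerate case (hypothetical data): all points share one latitude, so every
# combination has zero bounding-box area and area alone cannot decide.
candidates = {"A": [(0.0, 0.0)], "B": [(0.0, 1.0), (0.0, 50.0)]}
print(resolve(candidates))
```

In this toy input, area-only minimization ties between both candidates for "B"; the centroid term selects the candidate closer to "A". The exhaustive search over combinations grows exponentially with the number of ambiguous toponyms, so a practical system would need pruning.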
As the fourth contribution, we constructed the first annotated event geoparsing
dataset with disambiguated toponyms and event labels in Bahasa Indonesia. The
main dataset used in this work is the first geoparsed and event extraction corpus in
Bahasa Indonesia; it was constructed from several news outlets and covers four main
topics: earthquakes, floods, accidents, and fires.
The ATM is used together with semantic similarity provided by Word2Vec to
explore the corpus and construct a semantic gazetteer based on a large number of
document tags. We compared the performance against a baseline LSTM-CRF with a
standard gazetteer and part-of-speech tags, yielding an F-1 improvement of around
2.46% for the entity extraction step, a 10.76% improvement for the event classification
step, 13.88% for the argument extraction step, and ultimately a significant
improvement of 23.43% for the pseudo-location identification step. As a consequence
of event extraction, the model can also extract various numerical arguments
associated with events that happened at the grounded toponyms in the text.
We conclude that integrating event extraction into geoparsing, together with
pseudo-location identification and semantic exploration, increases the quality
and performance of geoparsing.
|
---|