EVENT GEOPARSER WITH EVENT EXTRACTION INTEGRATION, PSEUDO-LOCATION ENTITY IDENTIFICATION AND SEMANTIC EXPLORATION FROM INDONESIAN NEWS CORPORA
Main Author: | Dewandaru, Agung |
---|---|
Format: | Dissertations |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/53182 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:53182 |
---|---|
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
The constant expansion of the World Wide Web has created the largest repository
of natural language text, carrying many types of information in varying forms
such as web pages, news articles, social media posts and blogs.
Numerous Geographic Information Retrieval (GIR) systems have been developed
to extract geospatial information from unstructured text
and to retrieve that information efficiently through a mapping interface.
One of the core components of a GIR system is the geoparser, which typically
performs toponym recognition, disambiguation and coordinate resolution on
unstructured text.
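The three geoparser steps named above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy gazetteer, its made-up coordinates and populations, the function names, and the "most populous candidate" disambiguation heuristic, which is a common baseline and not the method of this dissertation.

```python
from typing import NamedTuple


class Place(NamedTuple):
    name: str
    admin: str
    lat: float
    lon: float
    population: int


# Toy gazetteer with made-up values (hypothetical); a real system would load
# a full gazetteer with many candidates per name.
GAZETTEER = {
    "sukamaju": [
        Place("Sukamaju", "District A", -6.50, 107.00, 12000),
        Place("Sukamaju", "District B", -7.20, 108.30, 4000),
    ],
    "bandung": [Place("Bandung", "West Java", -6.91, 107.61, 2500000)],
}


def recognize(tokens):
    """Toponym recognition: indices of tokens that appear in the gazetteer."""
    return [i for i, tok in enumerate(tokens) if tok.lower() in GAZETTEER]


def disambiguate(name):
    """Disambiguation: pick the most populous candidate (simple baseline)."""
    return max(GAZETTEER[name.lower()], key=lambda p: p.population)


def geoparse(text):
    """Coordinate resolution: return (toponym, lat, lon) for each toponym."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    results = []
    for i in recognize(tokens):
        place = disambiguate(tokens[i])
        results.append((tokens[i], place.lat, place.lon))
    return results
```

For example, `geoparse("Banjir melanda Sukamaju dan Bandung.")` resolves the ambiguous name "Sukamaju" to the more populous of its two toy candidates.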
However, geoparsing is still an open problem because of the ambiguity of
toponyms and other noise present in the text, especially in news stories
where many events across different places are mentioned together with event
arguments of various types, such as geospatial, temporal and numerical. Existing
geoparsers can resolve locations at the toponym level or the document level, but
they lack the capability to resolve them optimally at the event level. For this
purpose, integration with event extraction methods seems to be a promising approach.
However, it has not been studied extensively, much less for Indonesian news
corpora. The main hypothesis of this dissertation is that integrating event
extraction improves the quality and performance of event
geolocation from text. The second hypothesis is that a semantic exploration
process improves the generalizability of the model.
The research explored geoparsing techniques with an event-level resolution scope
and makes four main contributions. The first contribution is a novel event
geoparser model and its implementation, which improves the quality and performance
of resolving locations from events by integrating event extraction in three
stages: 1) toponym-level geoparsing, 2) event extraction and 3) event-level
geoparsing. The geoparsing task is modeled mostly as a sequence labeling problem
using an LSTM-CRF architecture with handcrafted features provided by our
proposed Aggregated Topic Model (ATM). The ATM provides semantically related
event keywords for event triggers, derived from a very large number of document
tags. These keywords supply a keyword-matching feature that increases the weighted
F-1 scores for the entity and event extraction tasks. This is further exploited as
a binary Smallest Administrative Level (SAL) document-level geospatial feature
which, along with an event label feature, improves the identification and
classification of pseudo-location entities.
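As a rough illustration of a keyword-matching feature for a sequence labeler, the sketch below marks each token that appears in a per-event keyword list. The keyword lists, feature names and function name are hypothetical; the abstract does not specify how the actual feature is encoded.

```python
# Hypothetical event keyword lists, standing in for keywords that a topic
# model might associate with each event type.
EVENT_KEYWORDS = {
    "flood": {"banjir", "genangan"},
    "earthquake": {"gempa"},
}


def keyword_match_features(tokens, event_keywords):
    """Return one feature dict per token.

    "kw_match" is True when the token matches any event keyword, and
    "kw_<event>" is added for every event type whose list contains it.
    """
    features = []
    for tok in tokens:
        lowered = tok.lower()
        matched = sorted(
            ev for ev, kws in event_keywords.items() if lowered in kws
        )
        feat = {"kw_match": bool(matched)}
        for ev in matched:
            feat[f"kw_{ev}"] = True
        features.append(feat)
    return features
```

Feature dictionaries of this shape could be concatenated with other per-token inputs, such as word embeddings or part-of-speech tags, before being passed to a sequence labeling model.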
The second contribution is a labeled topic model called the Aggregated Topic Model
(ATM), which enables the exploration of semantic relatedness between tokens based
on multi-labeled document tags. ATM addresses a limitation of Labeled LDA by
splitting the corpus into partitions and training a model on each partition
separately; the partition models are then aggregated to build the final model.
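The partition-then-aggregate idea can be sketched as follows. This is only a structural illustration under strong assumptions: a simple per-tag word count stands in for Labeled LDA training, and summing counts across partitions stands in for the aggregation step, whose actual form is not given in the abstract. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def split(corpus, n_parts):
    """Split the corpus into n_parts partitions (round-robin)."""
    return [corpus[i::n_parts] for i in range(n_parts)]


def train_partition(docs):
    """Stand-in for training a labeled topic model on one partition:
    count how often each word occurs in documents carrying each tag."""
    counts = defaultdict(Counter)
    for words, tags in docs:
        for tag in tags:
            counts[tag].update(words)
    return counts


def aggregate(partition_models):
    """Merge partition models by summing counts, then normalize each tag's
    counts into a word distribution."""
    total = defaultdict(Counter)
    for model in partition_models:
        for tag, counter in model.items():
            total[tag].update(counter)
    return {
        tag: {w: c / sum(counter.values()) for w, c in counter.items()}
        for tag, counter in total.items()
    }


def train_aggregated(corpus, n_parts):
    """Train one model per partition and aggregate them."""
    return aggregate(train_partition(p) for p in split(corpus, n_parts))
```

Each corpus item is a `(words, tags)` pair. Because every partition is processed independently, no single training run has to hold the whole corpus or the whole tag set at once, which is the property the paragraph above attributes to ATM.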
Our third contribution is the Spatial Minimality Centroid Distance (SMCD-ADM)
algorithm, which improves on the Spatial Minimality (SM) algorithm by adding
a Centroid Distance metric to avoid degenerate cases in disambiguation.
It also improves the toponym resolution step by 5.71% compared to SM.
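One plausible reading of how a centroid distance can rescue a spatial-minimality heuristic is sketched below. It is an assumption-laden illustration, not the SMCD-ADM algorithm itself: the candidate combination with the smallest bounding-box area is chosen, and the mean distance to the centroid breaks ties, for example when several combinations have zero area. Coordinates are treated as planar for simplicity, and all names are hypothetical.

```python
from itertools import product
from math import hypot


def bbox_area(points):
    """Area of the axis-aligned bounding box around (lat, lon) points."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return (max(lats) - min(lats)) * (max(lons) - min(lons))


def centroid_distance(points):
    """Mean planar distance from each point to the centroid of all points."""
    clat = sum(p[0] for p in points) / len(points)
    clon = sum(p[1] for p in points) / len(points)
    return sum(hypot(p[0] - clat, p[1] - clon) for p in points) / len(points)


def resolve(candidates):
    """Pick one referent per toponym.

    candidates maps each toponym to a list of (lat, lon) candidates. The
    combination with the smallest bounding-box area wins; the centroid
    distance breaks ties between combinations of equal area.
    """
    names = list(candidates)
    best = min(
        product(*(candidates[n] for n in names)),
        key=lambda combo: (bbox_area(combo), centroid_distance(combo)),
    )
    return dict(zip(names, best))
```

With `{"A": [(0.0, 0.0)], "B": [(0.0, 10.0), (0.0, 1.0)]}`, both combinations lie on one latitude and have zero bounding-box area, so area alone cannot separate them; the centroid distance selects the nearer candidate for "B". Exhaustive enumeration grows exponentially with the number of ambiguous toponyms, so this form is only suitable for small examples.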
As the fourth contribution, we constructed the first annotated event geoparsing
dataset with disambiguated toponyms and event labels in Bahasa Indonesia. This
corpus, the main dataset used in this work, was constructed from several news
outlets and covers four main topics: earthquakes, floods, accidents and fires.
The ATM is used together with semantic similarity provided by Word2Vec to
explore the corpus and construct a semantic gazetteer based on a large number of
document tags. We compared the performance against a baseline LSTM-CRF with a
standard gazetteer and part-of-speech tags. The F-1 improvement was about
2.46% for the entity extraction step, 10.76% for the event classification
step and 13.88% for the argument extraction step, and culminated in a significant
improvement of 23.43% for the pseudo-location identification step. As a by-product
of event extraction, the model is also able to extract various numerical arguments
associated with the events that happened at the grounded toponyms in the text.
We conclude that integrating event extraction into geoparsing, together with
pseudo-location identification and semantic exploration, increases the quality
and performance of geoparsing.
|
author |
Dewandaru, Agung |
keywords |
geoparser, event geolocation, geographic information retrieval, event extraction, semantic relatedness, semantic similarity, toponym resolution |
timestamp |
2021-03-01T13:24:12Z |