EVENT GEOPARSER WITH EVENT EXTRACTION INTEGRATION, PSEUDO-LOCATION ENTITY IDENTIFICATION AND SEMANTIC EXPLORATION FROM INDONESIAN NEWS CORPORA

Bibliographic Details
Main Author:	Dewandaru, Agung
Format:	Dissertations
Language:	Indonesia
Online Access:	https://digilib.itb.ac.id/gdl/view/53182
Institution:	Institut Teknologi Bandung

id	id-itb.:53182
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	The constant expansion of the World Wide Web has created the largest repository of natural language text, with many types of information served in forms such as web pages, news articles, social media posts, and blogs. Numerous Geographic Information Retrieval (GIR) systems have been developed to extract geospatial information from unstructured text and to retrieve that information efficiently through a mapping interface. One of the core components of a GIR system is the geoparser, which typically performs toponym recognition, disambiguation, and coordinate resolution on unstructured text. However, geoparsing is still an open problem because of toponym ambiguity and other noise in the text, especially in news stories, where many events across different places are mentioned together with event arguments of various types, such as geospatial, temporal, and numerical. Existing geoparsers can resolve locations at the toponym level or the document level, but they lack the capability to resolve locations optimally at the event level. For this purpose, integration with event extraction methods seems to be a promising approach. However, it has not been extensively studied, much less in the Indonesian news domain. The main hypothesis of this dissertation is that integrating event extraction benefits the performance and improves the quality of event geolocation from text. The second hypothesis is that a semantic exploration process improves the generalizability of the model. The research explores geoparsing techniques with an event-level resolution scope and makes four main contributions.
The first contribution is a novel event geoparser model and its implementation, which improves the quality and performance of resolving locations from events by integrating an event extraction method in three stages: 1) toponym-level geoparsing, 2) event extraction, and 3) event-level geoparsing. The geoparsing task is modeled mostly as a sequence labeling problem handled by an LSTM-CRF architecture, which uses handcrafted features provided by our proposed Aggregated Topic Model (ATM). The ATM provides semantically related event keywords for event triggers, derived from a very large number of document tags. These keywords supply a keyword-matching feature that increases the weighted F1 scores for the entity and event extraction tasks. This is further exploited as a binary Smallest Administrative Level (SAL) document-level geospatial feature which, together with an event label feature, improves the identification and classification of pseudo-location entities. The second contribution is a labeled topic model called the Aggregated Topic Model (ATM), which enables the exploration of semantic relatedness between tokens based on multi-labeled document tags. ATM addresses a limitation of Labeled LDA by splitting the corpus into partitions that are trained separately and then aggregated to build the final model. The third contribution is the Spatial Minimality Centroid Distance (SMCD-ADM) algorithm, which improves the Spatial Minimality (SM) algorithm by adding a centroid distance metric in order to avoid degenerate cases in disambiguation. This improves the toponym resolution step by 5.71% compared to SM. As the fourth contribution, we constructed the first annotated event geoparsing dataset in Bahasa Indonesia with disambiguated toponyms and event labels. This dataset, the main dataset used in this work, is the first geoparsing and event extraction corpus in Bahasa Indonesia. It is constructed from several news outlets and covers four main topics: earthquakes, floods, accidents, and fires.
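The centroid distance idea mentioned for the third contribution can be illustrated with a minimal Python sketch. This is not the dissertation's SMCD-ADM algorithm, whose exact formulation is not given in this record. It assumes that Spatial Minimality picks the combination of candidate locations with the smallest bounding-box area, and that a degenerate case arises when several combinations share the same area (for example, zero area when all points lie on one latitude). The function names, the planar treatment of latitude and longitude, and the coordinates are all hypothetical.

```python
from itertools import product
from math import hypot


def bbox_area(points):
    """Area of the lat/lon bounding box (planar approximation, in squared degrees)."""
    lats = [p[1] for p in points]
    lons = [p[2] for p in points]
    return (max(lats) - min(lats)) * (max(lons) - min(lons))


def mean_centroid_distance(points):
    """Mean planar distance from each point to the centroid of all points."""
    c_lat = sum(p[1] for p in points) / len(points)
    c_lon = sum(p[2] for p in points) / len(points)
    return sum(hypot(p[1] - c_lat, p[2] - c_lon) for p in points) / len(points)


def resolve(candidates, use_centroid=True):
    """Pick one candidate per toponym.

    candidates: dict mapping a toponym to a list of (label, lat, lon) tuples.
    With use_centroid=False only the bounding-box area is minimized (plain SM
    under the assumption above); with True, ties are broken by centroid distance.
    """
    names = list(candidates)

    def score(combo):
        if use_centroid:
            return (bbox_area(combo), mean_centroid_distance(combo))
        return (bbox_area(combo),)

    best = min(product(*(candidates[n] for n in names)), key=score)
    return {name: cand[0] for name, cand in zip(names, best)}


# Made-up coordinates: every point shares latitude 0.0, so all boxes have zero area.
toy_candidates = {
    "A": [("A-far", 0.0, 110.0), ("A-near", 0.0, 100.0)],
    "B": [("B-only", 0.0, 100.5)],
}

if __name__ == "__main__":
    print(resolve(toy_candidates, use_centroid=False))
    print(resolve(toy_candidates, use_centroid=True))
```

In this toy input both combinations have a zero-area bounding box, so the area criterion alone cannot separate them and simply returns the first one enumerated. The centroid distance term prefers the combination whose points lie closer together.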
The ATM is used together with the semantic similarity provided by Word2Vec to explore the corpus and construct a semantic gazetteer based on a large number of document tags. We compared the model against a baseline LSTM-CRF with a standard gazetteer and part-of-speech tags. The comparison shows F1 improvements of about 2.46% on the entity extraction step, 10.76% on the event classification step, and 13.88% on the argument extraction step, which eventually result in a significant improvement of 23.43% on the pseudo-location identification step. As a consequence of event extraction, the model is also able to extract various numerical arguments associated with the events that happened at the grounded toponyms in the text. We conclude that integrating event extraction into geoparsing, together with pseudo-location identification and semantic exploration, does increase the quality and performance of geoparsing.
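As a rough illustration of how embedding similarity can grow a semantic gazetteer from seed keywords, the following Python sketch expands each event type with words whose vectors are close to a seed word. It is only a sketch under stated assumptions: the real system combines ATM with Word2Vec embeddings, whereas the two-dimensional vectors, the threshold of 0.8, and the function names here are hypothetical and chosen by hand.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_gazetteer(seeds, vectors, threshold=0.8):
    """Add every vocabulary word that is similar enough to any seed word.

    seeds: dict mapping an event type to a list of seed keywords.
    vectors: dict mapping a word to its embedding (stand-in for Word2Vec output).
    """
    gazetteer = {}
    for event, seed_words in seeds.items():
        related = set(seed_words)
        for word, vec in vectors.items():
            for seed in seed_words:
                if seed in vectors and cosine(vec, vectors[seed]) >= threshold:
                    related.add(word)
                    break
        gazetteer[event] = related
    return gazetteer


# Hand-made toy embeddings, not trained vectors.
toy_vectors = {
    "banjir": [1.0, 0.0],      # flood
    "genangan": [0.9, 0.1],    # inundation
    "gempa": [0.0, 1.0],       # earthquake
    "guncangan": [0.1, 0.9],   # shaking
}
toy_seeds = {"flood": ["banjir"], "earthquake": ["gempa"]}

if __name__ == "__main__":
    print(expand_gazetteer(toy_seeds, toy_vectors))
```

A threshold keeps the expansion conservative. In practice the cutoff would need tuning against annotated data, since a low value pulls unrelated words into an event's keyword set.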
format	Dissertations
author	Dewandaru, Agung
title	EVENT GEOPARSER WITH EVENT EXTRACTION INTEGRATION, PSEUDO-LOCATION ENTITY IDENTIFICATION AND SEMANTIC EXPLORATION FROM INDONESIAN NEWS CORPORA
url	https://digilib.itb.ac.id/gdl/view/53182
_version_	1822929254196707328
spelling	id-itb.:53182 2021-03-01T13:24:12Z Dewandaru, Agung Indonesia Dissertations geoparser, event geolocation, geographic information retrieval, event extraction, semantic relatedness, semantic similarity, toponym resolution INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/53182 text

Similar Items