COREFERENCE RESOLUTION IN INDONESIAN LANGUAGE USING WORD LEVEL COREFERENCE RESOLUTION ARCHITECTURE

Coreference resolution is a problem in the field of text processing: finding all mentions that refer to the same entity in the real world. Coreference resolution can help solve other natural language processing problems, namely entity linking, machine translation, summarization, chatbots, and question answering.

Bibliographic Details
Main Author: Muslim, Fajar
Format: Theses
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/66654
Institution: Institut Teknologi Bandung
Language: Indonesian
id id-itb.:66654
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesian
description Coreference resolution is a problem in the field of text processing: finding all mentions that refer to the same entity in the real world. Coreference resolution can help solve other natural language processing problems, namely entity linking, machine translation, summarization, chatbots, and question answering. Research on coreference resolution (coref) for Indonesian is still minimal, and existing Indonesian coref studies are hard to compare with one another because they use different data. Indonesian coref faces two kinds of problems: dataset problems and algorithm problems. The dataset problem is that there is no standard dataset that can serve as a benchmark. The algorithm problems are that no prior work applies the latest deep learning architectures that achieve competitive performance on English datasets, and that the best previous work still uses a pipelined system approach.

This thesis is part of a joint research project by ITB, Prosa.ai, and AI Singapore. The project covers the creation of the Coreference Resolution in the Indonesian Language (COIN) dataset, with standards adapted from the OntoNotes dataset, and modeling with the c2f-coref and wl-coref architectures. The scope of this thesis is to build the program code and run experiments with the word-level coreference resolution (wl-coref) architecture. In parallel, engineers from AI Singapore ran experiments with the Higher-order Coreference Resolution with Coarse-to-fine Inference (c2f-coref) architecture over several BERT encoder variants. The analysis was carried out jointly to compare the performance of the models.

The wl-coref architecture was chosen as the solution in this thesis for its efficiency and competitive performance. Its steps are to find coreference links between word tokens, then construct spans from the tokens that have coreference links. Adapting wl-coref required two changes. First, among the pairwise (hand-crafted) features, only the distance between spans is used, because the other pairwise features are not available in the COIN dataset. Second, wl-coref needs dependency relations as input to the span construction module; the COIN dataset does not provide this information, so the dependency relations were generated with the Stanza library.

In the experiments, the wl-coref architecture (F1 score 76.24) outperforms the c2f-coref architecture (F1 score 76.02), though the difference is small. A likely cause is that the dependency relations for Indonesian wl-coref are generated automatically with Stanza, whereas the English data is annotated manually, so the Indonesian wl-coref pipeline carries more errors. The best encoder for both the wl-coref and c2f-coref architectures in Indonesian is XLM-RoBERTa-large. IndoSpanBERT-large performs competitively, just below XLM-RoBERTa-large, making it a good encoder choice at a lighter model size. Tests on the LEA metric show that a model that scores well on the CoNLL metric tends to score well on the LEA metric too, even though the two metrics use different calculation approaches.
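Since the description above names the Stanza library as the source of the dependency relations, a minimal sketch of that step may help. This is illustrative only (the example sentence is invented), not the thesis code:

    import stanza

    # Download the Indonesian models once (requires network access).
    stanza.download("id")

    # The default Indonesian pipeline includes the dependency parser.
    nlp = stanza.Pipeline("id")

    doc = nlp("Budi membeli buku itu karena ia menyukainya.")
    for sentence in doc.sentences:
        for word in sentence.words:
            # word.head is the 1-based index of the head word (0 means root);
            # word.deprel is the Universal Dependencies relation label.
            head = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
            print(f"{word.text}\t{word.deprel}\t{head}")

Each (head, deprel) pair is the kind of dependency information the span construction module consumes in place of gold annotations.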
Observing mention recall across several mention types and mention lengths shows that mention types with many instances tend to have better mention recall than types with few instances, and that the longer a mention is, the lower the model's recall on it tends to be. The hyperparameter tuning experiment in this thesis shows that the default hyperparameters from Dobrovolskii's (2021) study are the best.
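To make the mention recall analysis concrete, the following small sketch (function name and spans are hypothetical, not taken from the thesis) buckets mention recall by mention length, the grouping the paragraph above describes:

    from collections import defaultdict

    def mention_recall_by_length(gold_mentions, predicted_mentions):
        # Both arguments are iterables of (start, end) token spans.
        predicted = set(predicted_mentions)
        found = defaultdict(int)
        total = defaultdict(int)
        for start, end in gold_mentions:
            length = end - start + 1
            total[length] += 1
            if (start, end) in predicted:
                found[length] += 1
        # Recall per mention length: fraction of gold mentions recovered.
        return {length: found[length] / total[length] for length in sorted(total)}

    # Toy example: the long six-token mention is the one the model misses.
    gold = [(0, 0), (2, 2), (4, 7), (10, 15)]
    pred = [(0, 0), (2, 2), (4, 7)]
    print(mention_recall_by_length(gold, pred))  # {1: 1.0, 4: 1.0, 6: 0.0}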
format Theses
author Muslim, Fajar
title COREFERENCE RESOLUTION IN INDONESIAN LANGUAGE USING WORD LEVEL COREFERENCE RESOLUTION ARCHITECTURE
url https://digilib.itb.ac.id/gdl/view/66654
spelling id-itb.:66654 2022-06-29T18:46:16Z COREFERENCE RESOLUTION IN INDONESIAN LANGUAGE USING WORD LEVEL COREFERENCE RESOLUTION ARCHITECTURE Muslim, Fajar Indonesian Theses coreference resolution, dataset, word-level architecture, XLM-RoBERTa, IndoSpanBERT, mention recall INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/66654 text