COREFERENCE RESOLUTION IN INDONESIAN LANGUAGE USING WORD LEVEL COREFERENCE RESOLUTION ARCHITECTURE
Main Author: Muslim, Fajar
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/66654
Institution: Institut Teknologi Bandung
id: id-itb.:66654
institution: Institut Teknologi Bandung
building: Institut Teknologi Bandung Library
continent: Asia
country: Indonesia
content_provider: Institut Teknologi Bandung
collection: Digital ITB
language: Indonesian
description:
Coreference resolution is the task in text processing of finding all mentions that refer to the same real-world entity. It supports other natural language processing tasks such as entity linking, machine translation, summarization, chatbots, and question answering. Research on coreference resolution (coref) for Indonesian is still scarce, and existing Indonesian coref studies are difficult to compare with one another because they use different data.
Indonesian coref faces two kinds of problems: dataset problems and algorithm problems. The dataset problem is that there is no standard dataset that can serve as a benchmark. The algorithm problems are that no study has applied the latest deep learning architectures that achieve competitive performance on English datasets, and that the best previous research still relies on a pipelined system approach.
This thesis is part of joint research conducted by ITB, Prosa.ai, and AI Singapore. The research covers the creation of the Coreference Resolution in the Indonesian Language (COIN) dataset, whose annotation standards are adapted from the OntoNotes dataset, and modeling with the c2f-coref and wl-coref architectures. The scope of this thesis is to implement and run experiments with the word-level coreference resolution (wl-coref) architecture. In parallel, engineers from AI Singapore ran experiments with the Higher-order Coreference Resolution with Coarse-to-fine Inference (c2f-coref) architecture under varying BERT encoders. The analysis was carried out jointly to compare the performance of the models.
The wl-coref architecture was chosen as the solution in this thesis for its efficiency and competitive performance. It works in two steps: first find coreference links between word tokens, then construct spans from the tokens that have coreference links. Adapting wl-coref to the COIN dataset required two changes. First, among the pairwise (hand-crafted) features, only the distance between spans is used, because the other pairwise features are not available in COIN. Second, the span construction module requires dependency relation data, which COIN does not provide, so the dependency relations are generated with the Stanza library.
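As an illustration of the distance feature, coref systems typically discretize the raw antecedent distance into log-spaced buckets before embedding it. The bucket edges below follow the common scheme used in English e2e-coref systems; the exact edges used in this thesis may differ, so treat this as a sketch:

```python
def distance_bucket(d: int) -> int:
    """Map a raw antecedent distance to one of 10 buckets:
    0, 1, 2, 3, 4, 5-7, 8-15, 16-31, 32-63, 64+.
    Assumption: bucket edges follow the common e2e-coref scheme."""
    if d <= 4:
        return d
    if d <= 7:
        return 5
    if d <= 15:
        return 6
    if d <= 31:
        return 7
    if d <= 63:
        return 8
    return 9
```

The bucket index is then typically looked up in a small learned embedding table and concatenated with the pair representation.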
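A minimal sketch of generating the missing dependency relations with Stanza's Indonesian pipeline follows. The `subtree_span` helper is a toy illustration of why span construction benefits from dependency trees (expanding a linked head word to a candidate span); the actual wl-coref span-construction module is a learned model, and the example sentence is invented:

```python
import stanza

# Assumption: the Indonesian ("id") Stanza models are available for download.
stanza.download("id")
nlp = stanza.Pipeline("id")  # default processors include the dependency parser

def subtree_span(sentence, head_id):
    """Toy span construction: expand a head word to the contiguous
    token range covered by its dependency subtree. wl-coref itself
    learns span boundaries with a trained module; this only shows
    why dependency relations are fed to span construction."""
    children = {w.id: [] for w in sentence.words}
    for w in sentence.words:
        if w.head != 0:  # head == 0 marks the sentence root
            children[w.head].append(w.id)
    stack, ids = [head_id], []
    while stack:
        i = stack.pop()
        ids.append(i)
        stack.extend(children[i])
    lo, hi = min(ids), max(ids)
    return " ".join(w.text for w in sentence.words if lo <= w.id <= hi)

# Invented example: "The new president said that he will work hard."
doc = nlp("Presiden baru itu mengatakan bahwa dia akan bekerja keras.")
sent = doc.sentences[0]
for word in sent.words:
    print(word.id, word.text, word.head, word.deprel)
print(subtree_span(sent, 1))  # expand the head word "Presiden" to a span
```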
Based on the experimental results, the wl-coref architecture (F1 score 76.24) outperforms the c2f-coref architecture (F1 score 76.02), but the gap is small. One likely cause is that the dependency relation data for Indonesian wl-coref is generated automatically with Stanza, whereas the English data is annotated manually, so the Indonesian wl-coref pipeline can accumulate more errors. The best encoder for both the wl-coref and c2f-coref architectures in Indonesian is XLM-RoBERTa-large. IndoSpanBERT-large performs competitively, just below XLM-RoBERTa-large, so it can be a good encoder choice when a lighter model is needed. Tests on the LEA metric show that a model that scores well on the CoNLL metric tends to score well on the LEA metric too, even though the two metrics use different calculation approaches.
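For reference, the two metrics are indeed defined quite differently. The CoNLL score is the average of the MUC, B-cubed, and CEAF_e F1 scores, while LEA (Moosavi and Strube, 2016) is link-based and entity-aware. A sketch of LEA recall, following the standard definition:

\[
\mathrm{Recall}_{\mathrm{LEA}}
  = \frac{\sum_{k \in K} |k| \cdot \dfrac{\sum_{r \in R} \mathrm{link}(k \cap r)}{\mathrm{link}(k)}}
         {\sum_{k \in K} |k|},
\qquad
\mathrm{link}(e) = \binom{|e|}{2},
\]

where K and R are the sets of key (gold) and response (predicted) entities; precision is obtained by swapping the roles of K and R, and F1 is their harmonic mean.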
Observation of mention recall across several mention types and mention lengths shows that mention types with many instances tend to have better mention recall than mention types with few instances, and that recall tends to decrease as mentions get longer. The hyperparameter tuning experiment in this thesis shows that the default hyperparameters from Dobrovolskii's (2021) study remain the best.
keywords: coreference resolution, dataset, word-level architecture, XLM-RoBERTa, IndoSpanBERT, mention recall
publisher: INSTITUT TEKNOLOGI BANDUNG