COREFERENCE RESOLUTION IN INDONESIAN LANGUAGE USING WORD LEVEL COREFERENCE RESOLUTION ARCHITECTURE
Main Author: Muslim, Fajar
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/66654
Institution: Institut Teknologi Bandung
id: id-itb.:66654
institution: Institut Teknologi Bandung
building: Institut Teknologi Bandung Library
continent: Asia
country: Indonesia
content_provider: Institut Teknologi Bandung
collection: Digital ITB
language: Indonesian
description:
Coreference resolution is the task in text processing of finding all mentions that refer to the same real-world entity. It supports other natural language processing tasks such as entity linking, machine translation, summarization, chatbots, and question answering. Research on coreference resolution (coref) for Indonesian is still scarce, and existing Indonesian coref studies are difficult to compare with one another because they use different data.
Indonesian coref faces two kinds of problems: dataset problems and algorithm problems. The dataset problem is that there is no standard dataset that can serve as a benchmark. The algorithm problems are that no study has applied the latest deep learning architectures that achieve competitive performance on English datasets, and that the best previous research still relies on a pipelined system approach.
This thesis is part of joint research conducted by ITB, Prosa.ai, and AI Singapore. The research covers the creation of the Coreference Resolution in the Indonesian Language (COIN) dataset, whose annotation standards are adapted from the OntoNotes dataset, and modeling with the c2f-coref and wl-coref architectures. The scope of this thesis is to implement and run experiments with the word-level coreference resolution (wl-coref) architecture. In parallel, engineers from AI Singapore ran experiments with the Higher-order Coreference Resolution with Coarse-to-fine Inference (c2f-coref) architecture under varying BERT encoders. The analysis was carried out jointly to compare the performance of the models.
The wl-coref architecture was chosen as the solution in this thesis for its efficiency and competitive performance. It works in two steps: first find coreference links between word tokens, then construct spans from the tokens that have coreference links. Adapting wl-coref to the COIN dataset required two changes. First, among the pairwise (hand-crafted) features, only the distance between spans is used, because the other pairwise features are not available in COIN. Second, the span construction module requires dependency relation data, which COIN does not provide, so the dependency relations are generated with the Stanza library.
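As an illustration of the distance feature, coref systems typically discretize the raw antecedent distance into log-spaced buckets before embedding it. The bucket edges below follow the common scheme used in English e2e-coref systems; the exact edges used in this thesis may differ, so treat this as a sketch:

```python
def distance_bucket(d: int) -> int:
    """Map a raw antecedent distance to one of 10 buckets:
    0, 1, 2, 3, 4, 5-7, 8-15, 16-31, 32-63, 64+.
    Assumption: bucket edges follow the common e2e-coref scheme."""
    if d <= 4:
        return d
    if d <= 7:
        return 5
    if d <= 15:
        return 6
    if d <= 31:
        return 7
    if d <= 63:
        return 8
    return 9
```

The bucket index is then typically looked up in a small learned embedding table and concatenated with the pair representation.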
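A minimal sketch of generating the missing dependency relations with Stanza's Indonesian pipeline follows. The `subtree_span` helper is a toy illustration of why span construction benefits from dependency trees (expanding a linked head word to a candidate span); the actual wl-coref span-construction module is a learned model, and the example sentence is invented:

```python
import stanza

# Assumption: the Indonesian ("id") Stanza models are available for download.
stanza.download("id")
nlp = stanza.Pipeline("id")  # default processors include the dependency parser

def subtree_span(sentence, head_id):
    """Toy span construction: expand a head word to the contiguous
    token range covered by its dependency subtree. wl-coref itself
    learns span boundaries with a trained module; this only shows
    why dependency relations are fed to span construction."""
    children = {w.id: [] for w in sentence.words}
    for w in sentence.words:
        if w.head != 0:  # head == 0 marks the sentence root
            children[w.head].append(w.id)
    stack, ids = [head_id], []
    while stack:
        i = stack.pop()
        ids.append(i)
        stack.extend(children[i])
    lo, hi = min(ids), max(ids)
    return " ".join(w.text for w in sentence.words if lo <= w.id <= hi)

# Invented example: "The new president said that he will work hard."
doc = nlp("Presiden baru itu mengatakan bahwa dia akan bekerja keras.")
sent = doc.sentences[0]
for word in sent.words:
    print(word.id, word.text, word.head, word.deprel)
print(subtree_span(sent, 1))  # expand the head word "Presiden" to a span
```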
Based on the experimental results, the wl-coref architecture (F1 score 76.24) outperforms the c2f-coref architecture (F1 score 76.02), but the gap is small. One likely cause is that the dependency relation data for Indonesian wl-coref is generated automatically with Stanza, whereas the English data is annotated manually, so the Indonesian wl-coref pipeline can accumulate more errors. The best encoder for both the wl-coref and c2f-coref architectures in Indonesian is XLM-RoBERTa-large. IndoSpanBERT-large performs competitively, just below XLM-RoBERTa-large, so it can be a good encoder choice when a lighter model is needed. Tests on the LEA metric show that a model that scores well on the CoNLL metric tends to score well on the LEA metric too, even though the two metrics use different calculation approaches.
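For reference, the two metrics are indeed defined quite differently. The CoNLL score is the average of the MUC, B-cubed, and CEAF_e F1 scores, while LEA (Moosavi and Strube, 2016) is link-based and entity-aware. A sketch of LEA recall, following the standard definition:

\[
\mathrm{Recall}_{\mathrm{LEA}}
  = \frac{\sum_{k \in K} |k| \cdot \dfrac{\sum_{r \in R} \mathrm{link}(k \cap r)}{\mathrm{link}(k)}}
         {\sum_{k \in K} |k|},
\qquad
\mathrm{link}(e) = \binom{|e|}{2},
\]

where K and R are the sets of key (gold) and response (predicted) entities; precision is obtained by swapping the roles of K and R, and F1 is their harmonic mean.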
Observation of mention recall across several mention types and mention lengths shows that mention types with many instances tend to have better mention recall than mention types with few instances, and that recall tends to decrease as mentions get longer. The hyperparameter tuning experiment in this thesis shows that the default hyperparameters from Dobrovolskii's (2021) study remain the best.
keywords: coreference resolution, dataset, word-level architecture, XLM-RoBERTa, IndoSpanBERT, mention recall
publisher: INSTITUT TEKNOLOGI BANDUNG