ABSTRACT MEANING REPRESENTATION PARSER DEVELOPMENT FOR CROSS-LINGUAL INDONESIAN-ENGLISH WITH BART, INPUT CONCATENATION, AND DATASET AUGMENTATION
Main Author:
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/73945
Institution: Institut Teknologi Bandung
Summary: Abstract meaning representation (AMR) is a representation that highlights the semantics of a sentence. AMR parsers for Indonesian sentences have been developed in both monolingual and bilingual settings. The annotation for Indonesian AMR still covers only limited concepts and relations, so English AMR is used: the cross-lingual approach transforms Indonesian sentences into English AMR. The current best AMR parsing model, AMRBART, is based on the BART language model and uses graph pre-training techniques. The multilingual BART model, mBART, is required for cross-lingual AMR parsing, but training mBART takes a long time and requires significant memory. The available training data has not been evaluated, and efforts to increase both its quantity and quality are needed. Bilingual training methods have also not yet exploited the interrelation between Indonesian and English sentences. In this study, a cross-lingual AMR parser is developed to parse Indonesian sentences into English AMR.
A cross-lingual AMR parser model for Indonesian sentences is developed with graph pre-training on the mBART model. Vocabulary trimming is performed based on the training dataset to reduce the time and memory required during model training.
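A minimal sketch of what the vocabulary-trimming step could look like, assuming a Hugging Face mBART checkpoint and a plain-text training corpus; the checkpoint name, the file path, and the id-remapping detail are illustrative assumptions, not the thesis's exact procedure:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50")  # assumed checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50")

# 1. Collect every token id that actually occurs in the training corpus,
#    always keeping the special tokens (incl. mBART language codes).
used_ids = set(tokenizer.all_special_ids)
with open("train.txt", encoding="utf-8") as f:               # hypothetical corpus file
    for line in f:
        used_ids.update(tokenizer(line.rstrip("\n"))["input_ids"])
keep = sorted(used_ids)

# 2. Slice the shared embedding matrix down to the kept rows.
old_emb = model.get_input_embeddings().weight.data           # (V_old, d)
new_emb = torch.nn.Embedding(len(keep), old_emb.size(1))
new_emb.weight.data.copy_(old_emb[keep])
model.set_input_embeddings(new_emb)
model.tie_weights()                                          # re-tie the LM head
model.final_logits_bias = model.final_logits_bias[:, keep]   # mBART's per-token bias buffer
model.config.vocab_size = len(keep)

# 3. Every corpus id must now be remapped (old id -> position in `keep`),
#    and the tokenizer rebuilt accordingly, before training starts.
old_to_new = {old: new for new, old in enumerate(keep)}
```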
The model is trained using two techniques: bilingual and concatenation. The training dataset that yields the best performance is determined through augmentation and filtration techniques over the AMR 2.0 and AMR 3.0 datasets.
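The abstract does not spell out the concatenation input format; a hedged sketch of one plausible reading, in which the Indonesian sentence and its English pair are joined into a single source sequence (the separator string, field order, and helper name are assumptions):

```python
# Hypothetical helper: build one (source, target) pair for concatenation training.
def build_concat_example(id_sent: str, en_sent: str, amr_graph: str,
                         sep: str = " </s> "):
    source = id_sent.strip() + sep + en_sent.strip()   # Indonesian + English in one input
    target = amr_graph.strip()                         # linearized English AMR
    return source, target

src, tgt = build_concat_example(
    "Anak itu ingin tidur.",                           # "The child wants to sleep."
    "The child wants to sleep.",
    "(w / want-01 :ARG0 (c / child) :ARG1 (s / sleep-01 :ARG0 c))",
)
```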
In this study, three experimental objectives are pursued: (1) validating the model after its vocabulary is trimmed, (2) evaluating the model trained with the concatenation method, and (3) determining the best dataset construction for the cross-lingual AMR parsing model. The experiments are conducted using the SMATCH metric on the validation data of the AMR 3.0 dataset, utilizing OPUS-MT to generate the paired Indonesian sentences. The model is then evaluated on the cross-lingual AMR parsing task over the AMR 3.0 test data and on entailment or paraphrase detection over the WReTE dataset, with classification based on SMATCH scores.
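The abstract states that OPUS-MT generates the Indonesian sentence pairs; a minimal sketch of that augmentation step, assuming the public Helsinki-NLP/opus-mt-en-id checkpoint (the specific checkpoint and the batching details are assumptions):

```python
from transformers import MarianMTModel, MarianTokenizer

mt_name = "Helsinki-NLP/opus-mt-en-id"                 # assumed OPUS-MT checkpoint
mt_tokenizer = MarianTokenizer.from_pretrained(mt_name)
mt_model = MarianMTModel.from_pretrained(mt_name)

def translate_en_to_id(sentences):
    """Translate a batch of English sentences into Indonesian."""
    batch = mt_tokenizer(sentences, return_tensors="pt",
                         padding=True, truncation=True)
    generated = mt_model.generate(**batch, max_length=256)
    return mt_tokenizer.batch_decode(generated, skip_special_tokens=True)

english_sents = ["The child wants to sleep."]          # sentences from the AMR corpus
pairs = list(zip(translate_en_to_id(english_sents), english_sents))
```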
Based on the experimental results, three conclusions can be drawn: (1) removing part of the vocabulary from the model can speed up training without drastically sacrificing performance, (2) the concatenation training method improves performance, and (3) the AMR 3.0 dataset with augmentation and filtration produces the best-performing model. The best model from the experiments achieved a SMATCH score of 68.1, surpassing previous research models.
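SMATCH compares two AMR graphs by the maximum number of matching triples under a variable alignment. A sketch of how such scores can be computed with the public `smatch` package (an assumption about tooling; the thesis may use its own scorer):

```python
import smatch

def smatch_f1(predicted_amr: str, gold_amr: str) -> float:
    """F1 over matching triples between two single-line PENMAN AMR strings."""
    best, test, gold = smatch.get_amr_match(predicted_amr, gold_amr)
    smatch.match_triple_dict.clear()   # reset the package's memo between pairs
    _, _, f1 = smatch.compute_f(best, test, gold)
    return f1

print(smatch_f1(
    "(w / want-01 :ARG0 (c / child) :ARG1 (s / sleep-01 :ARG0 c))",
    "(w / want-01 :ARG0 (c / child) :ARG1 (s / sleep-01 :ARG0 c))",
))  # identical graphs -> 1.0
```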
Testing on the entailment or paraphrase detection task using the WReTE dataset yielded better performance than previous AMR parser models on the validation data, with an F1 score of 0.748; further testing on the test data yielded an F1 score of 0.758. However, this does not surpass the performance of encoder models such as fine-tuned mBERT and fine-tuned IndoBERT-lite-large-p2 from IndoNLU, which achieved F1 scores of 0.844 and 0.854 respectively.
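The abstract says classification is based on SMATCH scores; one hedged reading is to parse both WReTE sentences to English AMR and threshold their pairwise SMATCH F1. The `parse_to_amr` callable and the 0.5 threshold are hypothetical stand-ins, and `smatch_f1` is the helper from the sketch above:

```python
# `parse_to_amr` is a hypothetical callable wrapping the trained parser;
# `smatch_f1` is defined in the previous sketch; 0.5 is an assumed threshold.
def detect_paraphrase(sent_a: str, sent_b: str, parse_to_amr,
                      threshold: float = 0.5) -> bool:
    amr_a = parse_to_amr(sent_a)       # Indonesian sentence -> English AMR
    amr_b = parse_to_amr(sent_b)
    return smatch_f1(amr_a, amr_b) >= threshold
```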