ABSTRACT MEANING REPRESENTATION PARSER DEVELOPMENT FOR CROSS-LINGUAL INDONESIAN-ENGLISH WITH BART, INPUT CONCATENATION, AND DATASET AUGMENTATION

Bibliographic Details
Main Author: Nafkhan Alzamzami, Moch.
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/73945
Institution: Institut Teknologi Bandung
Description
Summary: Abstract meaning representation (AMR) is a representation that highlights the semantics of a sentence. AMR parsers for Indonesian sentences have been developed in both monolingual and bilingual settings. Indonesian AMR annotation still covers a limited set of concepts and relations, so English AMR is used instead: the cross-lingual approach transforms Indonesian sentences into English AMR. The current best AMR parsing model, AMRBART, is based on the BART language model and uses graph pre-training techniques. Cross-lingual AMR parsing requires the multilingual BART model, mBART, but training mBART takes a long time and requires significant memory. The available training data has not been evaluated, and efforts are needed to increase both its quantity and quality. In addition, bilingual training methods have not yet exploited the interrelation between Indonesian and English sentences.

In this study, a cross-lingual AMR parser that maps Indonesian sentences to English AMR is developed with graph pre-training on the mBART model. Vocabulary trimming based on the training dataset is performed to reduce the time and memory required during training. The model is trained with two techniques, bilingual and concatenation, and the training dataset that produces the best performance is determined through augmentation and filtration techniques using the AMR 2.0 and AMR 3.0 datasets. Three experimental objectives are pursued: (1) validating the model after its vocabulary is trimmed, (2) evaluating the model trained with the concatenation method, and (3) determining the best dataset construction for the cross-lingual AMR parsing model. The experiments are conducted with the SMATCH metric on the validation data of the AMR 3.0 dataset, using OPUS-MT to generate paired Indonesian sentences. The model is then evaluated on cross-lingual AMR parsing with the AMR 3.0 test data and on entailment or paraphrase detection with the WReTE dataset, where classification is based on SMATCH scores.

Based on the experimental results, three conclusions can be drawn: (1) deleting part of the model's vocabulary speeds up training without drastically sacrificing performance, (2) the concatenation training method improves performance, and (3) the AMR 3.0 dataset with augmentation and filtration produces the best-performing model. The best model from the experiments achieved a SMATCH score of 68.1, surpassing previous research models. On the entailment or paraphrase detection task with the WReTE dataset, it outperformed previous AMR parser models on the validation data with an F1 score of 0.748, and reached an F1 score of 0.758 on the test data. However, it has not surpassed encoder models such as fine-tuned mBERT and fine-tuned IndoBERT-lite-large-p2 from IndoNLU, which achieved F1 scores of 0.844 and 0.854 respectively.
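
To make the techniques above concrete, the sketches below illustrate them in Python; all checkpoint names, file paths, separators, and thresholds are illustrative assumptions, not details taken from the thesis. First, vocabulary trimming: the idea is to keep only the mBART embedding rows for tokens that actually occur in the training data, shrinking the model and speeding up training. A minimal sketch, assuming a HuggingFace mBART checkpoint and a plain-text training corpus:

```python
import torch
from transformers import MBart50Tokenizer, MBartForConditionalGeneration

tokenizer = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

# Collect every token id that actually occurs in the training corpus,
# always keeping the special tokens (pad, eos, language codes, ...).
keep_ids = set(tokenizer.all_special_ids)
with open("train.txt", encoding="utf-8") as corpus:  # placeholder file name
    for line in corpus:
        keep_ids.update(tokenizer(line.strip())["input_ids"])
keep_ids = sorted(keep_ids)

# Slice the shared embedding matrix down to the kept rows, shrink the
# output bias to match, and retie the LM head. The tokenizer itself would
# also need its vocabulary remapped to the new ids (omitted here).
old_emb = model.get_input_embeddings().weight.data
new_emb = torch.nn.Embedding(len(keep_ids), old_emb.size(1))
new_emb.weight.data.copy_(old_emb[keep_ids])
model.set_input_embeddings(new_emb)
model.register_buffer("final_logits_bias", model.final_logits_bias[:, keep_ids])
model.config.vocab_size = len(keep_ids)
model.tie_weights()
```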
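Second, the concatenation training method pairs each Indonesian sentence with its English counterpart in a single encoder input so the model can exploit their interrelation. A sketch of the data construction, with an assumed `</s>` separator and the classic "boy wants to go" AMR as the target:

```python
# A sketch of the input-concatenation idea: the parallel Indonesian and
# English sentences are joined into one encoder input, with the linearized
# AMR graph as the decoding target. The separator token and field names
# are assumptions, not taken from the thesis.
def build_example(id_sentence: str, en_sentence: str, linearized_amr: str) -> dict:
    source = f"{id_sentence} </s> {en_sentence}"  # assumed separator
    return {"source": source, "target": linearized_amr}

example = build_example(
    "Anak laki-laki itu ingin pergi.",  # Indonesian: "The boy wants to go."
    "The boy wants to go.",
    "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))",
)
print(example["source"])
```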
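Third, dataset augmentation with OPUS-MT: English sentences from the AMR corpora can be machine-translated into Indonesian to create paired training data and then filtered. The checkpoint name and the simple length-ratio filter below are plausible stand-ins, not the thesis's actual choices:

```python
from transformers import MarianMTModel, MarianTokenizer

# Helsinki-NLP's English->Indonesian OPUS-MT checkpoint; this particular
# model name is an assumption about the setup, not confirmed by the abstract.
NAME = "Helsinki-NLP/opus-mt-en-id"
tokenizer = MarianTokenizer.from_pretrained(NAME)
model = MarianMTModel.from_pretrained(NAME)

def translate(sentences: list[str]) -> list[str]:
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    return tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True)

def keep_pair(en: str, idn: str, max_ratio: float = 2.0) -> bool:
    # Crude length-ratio filter as one plausible filtration heuristic;
    # the actual filtering criteria are not stated in the abstract.
    longer, shorter = max(len(en), len(idn)), max(1, min(len(en), len(idn)))
    return longer / shorter <= max_ratio

en = "The boy wants to go."
idn = translate([en])[0]
if keep_pair(en, idn):
    print(en, "->", idn)
```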
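Finally, classification based on SMATCH scores for entailment or paraphrase detection: both sentences are parsed to AMR, and the pair is labeled positive when the SMATCH score between the two predicted graphs clears a threshold. A sketch using the reference `smatch` package, with a placeholder parser and an illustrative threshold:

```python
import smatch  # pip install smatch

def parse_to_amr(sentence: str) -> str:
    # Placeholder for the trained cross-lingual AMR parser.
    raise NotImplementedError

def is_paraphrase(sent_a: str, sent_b: str, threshold: float = 0.5) -> bool:
    amr_a, amr_b = parse_to_amr(sent_a), parse_to_amr(sent_b)
    # SMATCH between the two predicted graphs; the pair is labeled
    # positive when the F-score clears the (illustrative) threshold.
    match, test, gold = smatch.get_amr_match(amr_a, amr_b)
    _, _, f_score = smatch.compute_f(match, test, gold)
    return f_score >= threshold
```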