CROSS-LINGUAL ABSTRACT MEANING REPRESENTATION PARSING FOR INDONESIAN SENTENCES

Bibliographic Details
Main Author: Rachman Putra, Aditya
Format: Theses
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/66555
Institution: Institut Teknologi Bandung
Description
Summary: Abstract Meaning Representation (AMR) is a semantic representation of a sentence. The latest AMR parser for Indonesian was built with a machine-learning approach using XGBoost and a dependency parser, but it has several caveats: only a relatively small number of Indonesian sentence–AMR graph pairs are available, and the variety of concepts and relations it can represent is limited compared to AMR for English. This research builds a cross-lingual AMR parser, that is, a model that generates English AMR graphs from Indonesian sentences. The model is based on a Pointer-Generator Network to identify concepts and a biaffine attention classifier to identify the relations between those concepts. Because a cross-lingual AMR model is trained with target-language resources (in this case English), the training corpus was built from two types of silver datasets: silver par, which consists of parallel sentences from PANL-BPPT with the English half parsed by an English AMR parser, and silver trans, which contains the train and dev sets of AMR 2.0 translated to Indonesian with Opus-MT.

Three experiments were carried out: first, comparing the silver datasets, silver par versus silver trans; second, comparing training schemes, namely zero-shot, bilingual, and language-specific; and third, comparing multilingual word embeddings, namely mBERT, XLM-R, and mT5. The silver trans dataset gave the best performance, the best training scheme was the bilingual scheme using both the Indonesian silver dataset and the English AMR 2.0 dataset, and the multilingual word embedding with the best performance was mT5. The resulting model performs comparably to cross-lingual AMR parsers for German, Italian, Spanish, and Chinese.
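The two silver-dataset constructions described above can be sketched as follows. This is a minimal illustration, not the author's actual code: the function names `build_silver_par` and `build_silver_trans` are my own, and the `parse_en` and `translate_en_id` callables stand in for the English AMR parser and the Opus-MT translation step respectively.

```python
def build_silver_par(parallel_pairs, parse_en):
    """Silver par: pair each Indonesian sentence with the AMR obtained by
    parsing its English counterpart from a parallel corpus (PANL-BPPT)."""
    return [(id_sent, parse_en(en_sent)) for id_sent, en_sent in parallel_pairs]

def build_silver_trans(amr_pairs, translate_en_id):
    """Silver trans: keep the gold English AMR graph (AMR 2.0), but replace
    the English sentence with its machine translation into Indonesian."""
    return [(translate_en_id(en_sent), amr) for en_sent, amr in amr_pairs]
```

In both cases the result is a list of (Indonesian sentence, English AMR) pairs; the noise enters through a different component each time (the English parser for silver par, the MT system for silver trans), which is exactly what the first experiment compares.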
Compared with a translate-and-parse baseline (gold test data translated to English, then parsed with an English AMR parser), our model still falls short. Further analysis shows that the cross-lingual AMR parser has difficulty with very short sentences, especially those consisting only of entities, with incomplete sentences such as hashtags and article dates, and with very long sentences. Extrinsic evaluation was also carried out on the WReTE paraphrase dataset, with classification based on the Smatch score between the AMR graphs of each sentence pair. On this task the model outperforms the Indo4B-based model and similar models using Indonesian AMR, but still underperforms IndoBERT- and mBERT-based models.
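The Smatch-based paraphrase classification mentioned above can be illustrated with a minimal sketch. Smatch F1 is computed from the number of matching triples between the two AMR graphs (precision over the test graph's triples, recall over the gold graph's); the threshold value below is illustrative, not the one used in the thesis.

```python
def smatch_f1(matched, n_test, n_gold):
    """Smatch F1 from triple counts: precision = matched / n_test,
    recall = matched / n_gold, combined as the harmonic mean."""
    if matched == 0:
        return 0.0
    precision = matched / n_test
    recall = matched / n_gold
    return 2 * precision * recall / (precision + recall)

def is_paraphrase(matched, n_test, n_gold, threshold=0.5):
    """Classify a sentence pair as a paraphrase if the Smatch F1 between
    their AMR graphs reaches the threshold (threshold is illustrative)."""
    return smatch_f1(matched, n_test, n_gold) >= threshold
```

For example, two graphs of ten triples each that share eight matching triples get F1 = 0.8 and would be classified as a paraphrase under this threshold.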