CROSS-LINGUAL ABSTRACT MEANING REPRESENTATION PARSING FOR INDONESIAN SENTENCES

Bibliographic Details
Main Author: Rachman Putra, Aditya
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/66555
Institution: Institut Teknologi Bandung
id id-itb.:66555
spelling id-itb.:66555 2022-06-28T19:57:35Z Keywords: Cross-lingual, Abstract Meaning Representation, Silver Dataset, Stog, Paraphrase
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
description Abstract Meaning Representation (AMR) is a semantic representation of a sentence. The latest AMR parser for Indonesian was built with a machine learning approach that combines XGBoost with a dependency parser, and it has several limitations: only a relatively small number of AMR graph pairs exist for Indonesian, and the variety of concepts and relations that can be represented is limited compared with AMR for English. This research builds a cross-lingual AMR parser, that is, a model that generates English AMR from Indonesian sentences. The model is based on a Pointer Generator Network to identify concepts and a biaffine attention classifier to identify the relations between those concepts. Because a cross-lingual AMR model is trained on resources in the target language (here, English), the training corpus was built from two types of silver dataset. The silver par dataset consists of parallel sentences from PANL-BPPT whose English side was parsed with an English AMR parser. The silver trans dataset contains the train and dev sets of AMR 2.0 translated into Indonesian using Opus-MT.
Three experiments were carried out. The first compares the two silver datasets, silver par and silver trans. The second compares training schemes: zero-shot, bilingual, and language-specific. The third compares multilingual word embeddings: mBERT, XLM-R, and mT5. The silver trans dataset performs best, and the best training scheme is the bilingual one, which uses both the Indonesian silver dataset and the English AMR 2.0 dataset. The best-performing multilingual word embedding in this study is mT5. The resulting model performs comparably to cross-lingual AMR parsers for German, Italian, Spanish, and Chinese, but it still falls short of the translate-and-parse baseline (gold test data translated into English, then parsed with an English AMR parser).
Further analysis shows that the cross-lingual AMR parser has difficulty with very short sentences (especially those that consist only of entities), incomplete sentences such as hashtags and article dates, and very long sentences. Extrinsic testing was also carried out on the WReTE paraphrase dataset, with classification based on the Smatch score. The model outperforms the Indo4B-based model and similar models that use Indonesian AMR, but it still performs worse than the IndoBERT-based and mBERT-based models.
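The Smatch-based classification used in the extrinsic test can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the two AMR graphs are already given as triples with aligned variable names, so a plain set-overlap F1 stands in for real Smatch (which also searches for the best variable alignment). The function names, the toy graphs, and the 0.5 threshold are all hypothetical; the abstract does not state the threshold that was used.

```python
from typing import Iterable, Tuple

# An AMR triple: (source variable, relation, target variable or concept).
Triple = Tuple[str, str, str]


def triple_f1(a: Iterable[Triple], b: Iterable[Triple]) -> float:
    """F1 overlap of two triple sets.

    Simplified stand-in for Smatch: variables are ASSUMED to be
    pre-aligned, so no alignment search is performed.
    """
    set_a, set_b = set(a), set(b)
    matched = len(set_a & set_b)
    if matched == 0:
        return 0.0
    precision = matched / len(set_a)
    recall = matched / len(set_b)
    return 2 * precision * recall / (precision + recall)


def is_paraphrase(a: Iterable[Triple], b: Iterable[Triple],
                  threshold: float = 0.5) -> bool:
    """Label a sentence pair as a paraphrase when the score reaches
    the threshold (0.5 is an arbitrary, hypothetical choice)."""
    return triple_f1(a, b) >= threshold


# Hypothetical toy graphs for illustration only.
g_want = {
    ("w", "instance", "want-01"), ("b", "instance", "boy"),
    ("g", "instance", "go-02"), ("w", "ARG0", "b"),
    ("w", "ARG1", "g"), ("g", "ARG0", "b"),
}
g_desire = {
    ("w", "instance", "desire-01"), ("b", "instance", "boy"),
    ("g", "instance", "go-02"), ("w", "ARG0", "b"),
    ("w", "ARG1", "g"), ("g", "ARG0", "b"),
}
g_sleep = {
    ("s", "instance", "sleep-01"), ("c", "instance", "cat"),
    ("s", "ARG0", "c"),
}

if __name__ == "__main__":
    print(is_paraphrase(g_want, g_desire))  # five of six triples shared
    print(is_paraphrase(g_want, g_sleep))   # no triples shared
```

In a real setup the parser would produce both graphs from the Indonesian sentence pair, and the threshold would be tuned on a development split rather than fixed by hand.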