CROSS-LINGUAL ABSTRACT MEANING REPRESENTATION PARSING FOR INDONESIAN SENTENCES

Abstract Meaning Representation (AMR) is a semantic representation of a sentence. The latest AMR parser for Indonesian was built with a machine learning approach using XGBoost and a dependency parser, and it has several limitations. There are only relatively small instances of AMR gra...

Bibliographic Details
Main Author:	Rachman Putra, Aditya
Format:	Theses
Language:	Indonesian
Online Access:	https://digilib.itb.ac.id/gdl/view/66555
Institution:	Institut Teknologi Bandung

id	id-itb.:66555
spelling	id-itb.:66555 2022-06-28T19:57:35Z CROSS-LINGUAL ABSTRACT MEANING REPRESENTATION PARSING FOR INDONESIAN SENTENCES Rachman Putra, Aditya Indonesia Theses Cross-lingual, Abstract Meaning Representation, Silver Dataset, Stog, Paraphrase INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/66555 text
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesian
description	Abstract Meaning Representation (AMR) is a semantic representation of a sentence. The latest AMR parser for Indonesian was built with a machine learning approach using XGBoost and a dependency parser, and it has several limitations. Only a relatively small number of AMR graph pairs are available for Indonesian. Also, the variety of concepts and relations that can be represented is limited compared to AMR for English. In this research a cross-lingual AMR parser is built, that is, a model that generates English AMR from Indonesian sentences. The cross-lingual AMR parser is based on a Pointer Generator Network to identify concepts and a biaffine attention classifier to identify the relations between those concepts. Because the cross-lingual AMR model is trained using target-language resources (in this case English), the training corpus was built from two types of silver dataset. The silver dataset par consists of parallel sentences from PANL-BPPT, with the English side parsed by an AMR parser for English. The silver dataset trans contains the train and dev sets of AMR 2.0 translated to Indonesian using Opus-MT. Three experiments were carried out in this study. The first compares the silver datasets used, silver par and silver trans. The second compares training schemes: zero-shot, bilingual, and language-specific. The third compares alternative multilingual word embeddings: mBERT, XLM-R, and mT5. Based on these experiments, the silver trans dataset gives the best performance, and the best training scheme is the bilingual scheme, which uses both the Indonesian silver dataset and the English AMR 2.0 dataset. The multilingual word embedding that produces the best performance in this study is mT5. The resulting model performs comparably to cross-lingual AMR parsers for German, Italian, Spanish, and Chinese. Compared with a translate-and-parse baseline (gold test data translated to English, then parsed with an English AMR parser), our model still falls short. Further analysis shows that the cross-lingual AMR parser has difficulty handling very short sentences, especially those consisting only of entities; incomplete sentences such as hashtags and article dates; and very long sentences. Extrinsic testing was also carried out on the WReTE paraphrase dataset, with classification based on the Smatch value. The model performs better than the Indo4B-based model and than similar models that use Indonesian AMR. However, it still performs worse than IndoBERT-based and mBERT-based models.
format	Theses
author	Rachman Putra, Aditya
title	CROSS-LINGUAL ABSTRACT MEANING REPRESENTATION PARSING FOR INDONESIAN SENTENCES
url	https://digilib.itb.ac.id/gdl/view/66555

Similar Items