CROSS-LINGUAL ABSTRACT MEANING REPRESENTATION PARSING FOR INDONESIAN SENTENCES
Abstract Meaning Representation (AMR) is a semantic representation of a sentence. The latest AMR parser for Indonesian was built with a machine learning approach using XGBoost and a dependency parser, and it has several limitations. There are only relatively small instances of AMR gra...
Main Author: | Rachman Putra, Aditya |
---|---|
Format: | Theses |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/66555 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:66555 |
---|---|
spelling |
id-itb.:66555 2022-06-28T19:57:35Z CROSS-LINGUAL ABSTRACT MEANING REPRESENTATION PARSING FOR INDONESIAN SENTENCES Rachman Putra, Aditya Indonesia Theses Cross-lingual, Abstract Meaning Representation, Silver Dataset, Stog, Paraphrase INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/66555 text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Abstract Meaning Representation (AMR) is a semantic representation of a
sentence. The latest AMR parser for Indonesian was built with a machine learning
approach using XGBoost and a dependency parser, and it has several limitations.
Only a relatively small number of AMR graph pairs exist for Indonesian. Also,
the variety of concepts and relations that can be represented is limited
compared to AMR for English. This research builds a cross-lingual AMR parser,
that is, a model that generates English AMR from sentences in Indonesian.
The cross-lingual AMR parser model was designed around a Pointer Generator
Network to identify concepts and a biaffine attention classifier to identify
the relations between these concepts. Because the cross-lingual AMR model is
trained using target-language resources (in this case English), the training
corpus was built from two types of silver datasets. The silver par dataset
consists of parallel sentences from PANL-BPPT whose English side was parsed
with an AMR parser for English. The silver trans dataset contains the train and
dev sets of AMR 2.0 translated into Indonesian using Opus-MT. Three tests were
carried out in this study. The first compares the silver datasets used, silver
par and silver trans. The second compares the training schemes: zero-shot,
bilingual, and language-specific. The third compares alternative multilingual
word embeddings: mBERT, XLM-R, and mT5.
Based on these tests, the silver trans dataset gives the best performance, and
the best training scheme is the bilingual scheme, which uses both the Indonesian
silver dataset and the English AMR 2.0 dataset. The multilingual word embedding
that produces the best performance in this study is mT5. The resulting model
performs comparably to cross-lingual AMR parsers for German, Italian, Spanish,
and Chinese. Compared with the translate-and-parse baseline (gold test data
translated to English, then parsed with an English AMR parser), our model still
falls short. Further analysis shows that the cross-lingual AMR parser has
difficulty handling very short sentences, especially those consisting of
entities; incomplete sentences such as hashtags and article dates; and very long
sentences. Extrinsic testing was also carried out on the WReTE paraphrase
dataset, with classification based on the Smatch score. The model performs
better than the Indo4B-based model and similar models using Indonesian AMR.
However, it still performs worse than IndoBERT-based and mBERT-based models. |
format |
Theses |
author |
Rachman Putra, Aditya |
title |
CROSS-LINGUAL ABSTRACT MEANING REPRESENTATION PARSING FOR INDONESIAN SENTENCES |
url |
https://digilib.itb.ac.id/gdl/view/66555 |
_version_ |
1822933079253057536 |