CROSS-LINGUAL ABSTRACT MEANING REPRESENTATION PARSING FOR INDONESIAN SENTENCES
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/66555
Institution: Institut Teknologi Bandung
Summary: Abstract Meaning Representation (AMR) is a semantic representation of a sentence. The latest AMR parser for Indonesian was built using a machine learning approach with XGBoost and a dependency parser, but it has several caveats: only a relatively small number of Indonesian sentence–AMR graph pairs exist, and the variety of concepts and relations it can represent is limited compared to AMR for English. In this research, a cross-lingual AMR parser is built, that is, a model that generates English AMR from sentences in Indonesian.
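To illustrate the task, the sketch below shows an English AMR graph in PENMAN notation for a short Indonesian sentence, with a regex that lists its concepts. The sentence and graph are our own illustrative example, not drawn from the thesis corpus.

```python
import re

# Illustrative English AMR (PENMAN notation) for the Indonesian
# sentence "Anak itu ingin pergi" ("The child wants to go").
amr = """
(w / want-01
   :ARG0 (c / child)
   :ARG1 (g / go-02
            :ARG0 c))
"""

# Each "/ concept" pair introduces a node; labels such as :ARG0 are
# the relations between concepts that the parser must identify.
concepts = re.findall(r"/ ([\w-]+)", amr)
print(concepts)  # ['want-01', 'child', 'go-02']
```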
The cross-lingual AMR parser model was designed based on a Pointer-Generator Network to identify concepts and a biaffine attention classifier to identify the relations between those concepts. Because the cross-lingual AMR model is trained using target-language resources (in this case English), the training corpus was built from two types of silver datasets: silver par, which consists of parallel sentences from PANL-BPPT with the English side parsed by an English AMR parser, and silver trans, which contains the train and dev sets of AMR 2.0 translated to Indonesian using Opus-MT. Three experiments were carried out: first, comparing the silver datasets used (silver par versus silver trans); second, comparing training schemes (zero-shot, bilingual, and language-specific); and third, comparing alternative multilingual word embeddings (mBERT, XLM-R, and mT5).
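The three training schemes differ only in which data the model sees. A minimal sketch, assuming toy placeholder datasets (the actual thesis corpora are PANL-BPPT/AMR 2.0 and far larger):

```python
import random

# Placeholder (sentence, AMR) pairs; contents are illustrative only.
english_gold = [("The child wants to go", "(w / want-01 ...)")]
indonesian_silver = [("Anak itu ingin pergi", "(w / want-01 ...)")]

def build_training_set(scheme):
    """Assemble training data under one of the three schemes compared."""
    if scheme == "zero-shot":            # English data only
        data = list(english_gold)
    elif scheme == "language-specific":  # Indonesian silver data only
        data = list(indonesian_silver)
    elif scheme == "bilingual":          # both corpora mixed together
        data = english_gold + indonesian_silver
    else:
        raise ValueError(scheme)
    random.shuffle(data)
    return data

print(len(build_training_set("bilingual")))  # 2
```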
Based on these experiments, the silver trans dataset gives the best performance, with the best training scheme being the bilingual scheme, which uses both the Indonesian silver dataset and the English AMR 2.0 dataset. The multilingual word embedding that produces the best performance in this study is mT5. The model performs comparably to cross-lingual AMR parsers for German, Italian, Spanish, and Chinese. Compared with a translate-and-parse baseline (gold test data translated to English, then parsed with an English AMR parser), our model still falls short. Further analysis shows that the cross-lingual AMR parser has difficulty handling very short sentences, especially those consisting only of entities; incomplete sentences such as hashtags and article dates; and very long sentences. Extrinsic evaluation was also carried out on the WReTE paraphrase dataset, with classification based on the Smatch score. The model performs better than the Indo4B-based model and similar models using Indonesian AMR, but still underperforms IndoBERT- and mBERT-based models.
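The Smatch-based paraphrase classification can be sketched as thresholding an F1 score over the two sentences' AMR triples. The sketch below is a simplification: real Smatch searches over variable alignments, whereas here the variable names are assumed to be already aligned, and the threshold value is illustrative rather than the one tuned in the thesis.

```python
def triple_f1(pred, gold):
    """F1 over AMR triples, assuming variable names are pre-aligned
    (real Smatch searches over variable mappings to maximize this)."""
    pred, gold = set(pred), set(gold)
    matched = len(pred & gold)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)

def is_paraphrase(amr_a, amr_b, threshold=0.5):
    # threshold is an illustrative value, not the one used in the study
    return triple_f1(amr_a, amr_b) >= threshold

# Two toy AMRs differing in one concept ("boy" vs. "girl").
amr_a = [("w", "instance", "want-01"), ("w", "ARG0", "b"), ("b", "instance", "boy")]
amr_b = [("w", "instance", "want-01"), ("w", "ARG0", "b"), ("b", "instance", "girl")]
print(is_paraphrase(amr_a, amr_b))  # triple F1 = 2/3, so True
```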