INDONESIAN-JAVANESE XLM MACHINE TRANSLATION AIDED WITH GPT-3.5 DATA AUGMENTATION

Machine translation is one way to help preserve regional languages, of which Indonesia has more than 700. An effective starting point is a machine translation model focused on Javanese, the regional language with the largest number of speaker...

Bibliographic Details
Main Author: Diandaru, Ryandito
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/79562
Institution: Institut Teknologi Bandung
id id-itb.:79562
timestamp 2024-01-10T08:19:57Z
keywords Javanese, GPT-3.5, Machine Translation, Data Augmentation
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Machine translation is one way to help preserve regional languages, of which Indonesia has more than 700. An effective starting point is a machine translation model focused on Javanese, the regional language with the largest number of speakers in Indonesia, at 68 million people. Javanese has a smaller bilingual corpus than other languages in the world, and that limited size is itself a challenge for building a machine translation model. Building an Indonesian-Javanese model therefore requires data augmentation, such as back-translation, to enlarge the bilingual corpus using the existing monolingual corpus. Meanwhile, Large Language Models (LLMs) that can help with this problem are starting to emerge. GPT-3.5 has recently attracted attention for its reasoning and logic capabilities, which had not previously been observed in language models. However, the use of LLMs for underrepresented languages such as Javanese has not been explored much. This research evaluates and explores the performance of GPT-3.5 in translating Indonesian to Javanese, as well as its use as a data augmentation method. The evaluation consisted of three main experiments. The first was prompt engineering for GPT-3.5 in Indonesian-Javanese machine translation. The second compared several data augmentation methods for machine translation, some using GPT-3.5 and some not. The third compared prompting conditions in a bilingual sentence production task, referred to as parallel sentence generation. The first and third experiments were run in zero-shot and few-shot conditions.
The experimental results show that the most effective prompting method for GPT-3.5 in translating Indonesian to Javanese is the few-shot approach. Compared to prompting with behavior context, the few-shot approach consistently increased the BLEU score, by an average of 1.01. In the comparison of back-translation and parallel sentence generation as data augmentation methods, parallel sentence generation produced the highest average BLEU score, 16. Parallel sentence generation with the few-shot approach achieved a score competitive with the zero-shot approach despite using less synthetic data. Sentences generated with the few-shot approach also showed a lower level of mismatch than those from the zero-shot approach, with a difference of around 11.34%. It can therefore be concluded that the few-shot approach to parallel sentence generation produces sentences of superior quality.
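The zero-shot and few-shot prompting conditions mentioned in the description can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function name `build_prompt`, the prompt wording, and the example sentence pairs are all assumptions made for illustration, and no model call is made.

```python
def build_prompt(source, examples=()):
    """Build a translation prompt (hypothetical format, not the thesis prompt).

    With no examples the prompt is zero-shot; with one or more
    (Indonesian, Javanese) pairs it becomes few-shot.
    """
    lines = ["Translate the following Indonesian sentence into Javanese."]
    for indonesian, javanese in examples:
        # Each demonstration pair is shown in the same layout as the query.
        lines.append(f"Indonesian: {indonesian}")
        lines.append(f"Javanese: {javanese}")
    # The query sentence comes last, leaving the Javanese side for the model.
    lines.append(f"Indonesian: {source}")
    lines.append("Javanese:")
    return "\n".join(lines)


# Illustrative demonstration pairs (assumed, not from the thesis corpus).
EXAMPLES = [
    ("Saya makan nasi.", "Aku mangan sega."),
    ("Dia pergi ke pasar.", "Dheweke lunga menyang pasar."),
]

zero_shot = build_prompt("Selamat pagi.")
few_shot = build_prompt("Selamat pagi.", EXAMPLES)
print(few_shot)
```

The only difference between the two conditions in this sketch is whether demonstration pairs precede the query, which is the usual meaning of zero-shot versus few-shot prompting.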
format Final Project
author Diandaru, Ryandito
title INDONESIAN-JAVANESE XLM MACHINE TRANSLATION AIDED WITH GPT-3.5 DATA AUGMENTATION
url https://digilib.itb.ac.id/gdl/view/79562