INDONESIAN-JAVANESE XLM MACHINE TRANSLATION AIDED WITH GPT-3.5 DATA AUGMENTATION

Bibliographic Details
Main Author: Diandaru, Ryandito
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/79562
Institution: Institut Teknologi Bandung
Description
Summary: Machine translation is one way to help preserve Indonesia's more than 700 regional languages. An effective starting point is a machine translation model for Javanese, the regional language with the most speakers in Indonesia, about 68 million people. Javanese nevertheless has a small bilingual corpus compared with other languages in the world, and corpus size is itself a challenge in building a machine translation model. Building an Indonesian-Javanese machine translation model therefore calls for data augmentation, such as back-translation, to enlarge the bilingual corpus using the existing monolingual corpus. Meanwhile, Large Language Models (LLMs) that can help with this problem are starting to emerge. GPT-3.5 has recently attracted attention for reasoning and logic capabilities that had not previously been observed in language models. However, the use of LLMs for underrepresented languages such as Javanese has seen little exploration.
This research evaluates and explores the performance of GPT-3.5 in translating Indonesian to Javanese, as well as its use as a data augmentation method. The work comprises three main experiments. The first is prompt engineering for GPT-3.5 on Indonesian-Javanese machine translation. The second compares several data augmentation methods for machine translation, some that use GPT-3.5 and some that do not. The third compares prompting conditions in a bilingual sentence production task, referred to as parallel sentence generation. The first and third experiments were run under zero-shot and few-shot conditions.
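The zero-shot and few-shot conditions differ only in whether the prompt carries example translations. A minimal Python sketch of how such prompts might be assembled follows. The instruction wording, the function name `build_translation_prompt`, and the example sentence pairs are illustrative assumptions, not the prompts or data used in this work.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text; the actual prompt wording is not given in this record.
INSTRUCTION = "Translate the following Indonesian sentence into Javanese."


def build_translation_prompt(
    source: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Return a zero-shot prompt (no examples) or a few-shot prompt (with examples)."""
    lines = [INSTRUCTION, ""]
    # Each example contributes one solved Indonesian/Javanese pair to the prompt.
    for indonesian, javanese in examples:
        lines.append(f"Indonesian: {indonesian}")
        lines.append(f"Javanese: {javanese}")
        lines.append("")
    # The sentence to translate comes last, with the Javanese side left open.
    lines.append(f"Indonesian: {source}")
    lines.append("Javanese:")
    return "\n".join(lines)


# Illustrative sentence pairs only; not taken from the corpus used in this work.
FEW_SHOT_EXAMPLES = [
    ("Saya makan nasi.", "Aku mangan sega."),
    ("Dia pergi ke pasar.", "Dheweke lunga menyang pasar."),
]

zero_shot = build_translation_prompt("Kami belajar di sekolah.")
few_shot = build_translation_prompt("Kami belajar di sekolah.", FEW_SHOT_EXAMPLES)
```

Either string would then be sent to the language model as the user message; how many examples to include and how to select them are the kinds of choices a prompt engineering experiment would vary.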
The experiments show that the best-performing prompting method for GPT-3.5 in translating Indonesian to Javanese is the few-shot approach. Compared with prompting with behavior context, the few-shot approach consistently increased the BLEU score, by an average of 1.01. In the comparison of data augmentation by back-translation and by parallel sentence generation, parallel sentence generation produced the highest average BLEU score, namely 16. Parallel sentence generation with the few-shot approach achieved a score competitive with the zero-shot approach, despite using a smaller amount of synthetic data. In addition, sentences generated with the few-shot approach showed a lower level of mismatch than those from the zero-shot approach, with a difference of around 11.34%. It can therefore be concluded that the few-shot approach produces higher-quality sentences in parallel sentence generation.
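The two augmentation strategies compared above differ in where the synthetic text comes from. In the sketch below, back-translation pairs authentic Javanese sentences with machine-translated Indonesian sources, while parallel sentence generation asks a generator for both sides of each pair. This follows the standard formulation of back-translation, which is assumed here rather than confirmed for this work; `reverse_translate` and `generate_pair` are hypothetical callables standing in for a Javanese-to-Indonesian model and an LLM call.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (Indonesian source, Javanese target)


def back_translate(
    javanese_monolingual: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Pair each authentic Javanese sentence with a synthetic Indonesian source."""
    return [(reverse_translate(jv), jv) for jv in javanese_monolingual]


def generate_parallel(
    n_requests: int,
    generate_pair: Callable[[], Pair],
) -> List[Pair]:
    """Request n_requests synthetic pairs and keep those with two non-empty sides."""
    pairs: List[Pair] = []
    for _ in range(n_requests):
        indonesian, javanese = generate_pair()
        if indonesian.strip() and javanese.strip():
            pairs.append((indonesian, javanese))
    return pairs


def augment(real_pairs: Iterable[Pair], synthetic_pairs: Iterable[Pair]) -> List[Pair]:
    """Combine the original bilingual corpus with synthetic pairs for training."""
    return list(real_pairs) + list(synthetic_pairs)
```

In back-translation only the source side is synthetic, so the model still learns to produce authentic Javanese. In parallel sentence generation both sides are synthetic, which is why the quality of the generated pairs matters for the resulting training corpus.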