FEW-SHOT LEARNING IN INDONESIAN LANGUAGE DOMAIN TEXT CLASSIFICATION

The lack of labelled data is a long-standing problem in the natural language processing (NLP) field, particularly for low-resource languages such as Indonesian. Transfer learning via pre-trained transformer-based language models (LMs) has been a common approach to address this. The two most popular types of pre-trained models are encoder-only models such as BERT and decoder-only models such as GPT. Standard finetuning is the de facto transfer-learning method, but another approach, few-shot learning, can be used when very little data is available. To understand the effectiveness of pre-trained LMs in this low-resource setting, this thesis presents a comprehensive study of prompt-based few-shot learning methods on IndoNLU, an existing Indonesian natural language understanding benchmark. Three methods were tested: standard finetuning and two few-shot learning methods, namely prompt-based finetuning (LM-BFF) and few-shot in-context learning. The language models tested fall into three categories: multilingual models (XGLM and XLM-R), English monolingual models (GPT-Neo), and Indonesian monolingual models (IndoBERT and IndoGPT). In-context learning with the multilingual decoder model XGLM was found to outperform the English GPT-Neo models, and prompt-based finetuning using LM-BFF with XLM-R generally outperformed in-context learning by up to ~20 macro-averaged F1 points.
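
To make the two few-shot methods concrete, the sketch below first assembles a k-shot in-context prompt of the kind a decoder model such as XGLM would complete, then scores a cloze template with a masked-LM encoder in the LM-BFF style. This is a minimal illustration under assumed details, not the thesis code: the Indonesian review sentences, the template "Ulasan ini <mask>.", the verbalizer words "bagus"/"buruk", and the use of the public xlm-roberta-base checkpoint are all assumptions made for the example.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    def build_icl_prompt(demos, query):
        # Few-shot in-context learning: concatenate k labelled demonstrations
        # with the unlabelled query; a decoder LM (e.g. XGLM) predicts the
        # label as the continuation. No parameters are updated.
        blocks = [f"Ulasan: {t}\nSentimen: {y}" for t, y in demos]
        blocks.append(f"Ulasan: {query}\nSentimen:")
        return "\n\n".join(blocks)

    demos = [
        ("Makanannya enak sekali.", "positif"),      # hypothetical shots
        ("Pelayanannya sangat lambat.", "negatif"),
    ]
    prompt = build_icl_prompt(demos, "Tempatnya nyaman dan bersih.")

    # LM-BFF-style prompt-based classification: wrap the input in a cloze
    # template and score each class by the logit of its label word (the
    # "verbalizer") at the mask position; in LM-BFF these same logits supply
    # the loss when the encoder is finetuned on the k labelled examples.
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

    verbalizer = {"positif": "bagus", "negatif": "buruk"}  # assumed words
    text = "Tempatnya nyaman dan bersih."
    template = f"{text} Ulasan ini {tokenizer.mask_token}."

    inputs = tokenizer(template, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        mask_logits = model(**inputs).logits[0, mask_pos]

    def label_word_id(word):
        # first sub-word piece of the label word, a common approximation
        return tokenizer(word, add_special_tokens=False).input_ids[0]

    scores = {label: mask_logits[label_word_id(word)].item()
              for label, word in verbalizer.items()}
    print(prompt)
    print("predicted:", max(scores, key=scores.get))

The design difference the abstract measures is visible here: in-context learning spends its few examples inside the prompt and never updates the model, while LM-BFF turns the same examples into gradient updates against the mask-position logits, which is consistent with the reported gap in favour of prompt-based finetuning.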

Bibliographic Details
Main Author: Mahendra G H, Rayza
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/68650
Institution: Institut Teknologi Bandung
id id-itb.:68650
spelling id-itb.:68650 | 2022-09-19T07:59:08Z | FEW-SHOT LEARNING IN INDONESIAN LANGUAGE DOMAIN TEXT CLASSIFICATION | Mahendra G H, Rayza | Indonesia | Theses | NLP, prompt-based few-shot learning, transformer, LM-BFF, in-context learning | INSTITUT TEKNOLOGI BANDUNG | https://digilib.itb.ac.id/gdl/view/68650 | text
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
format Theses
author Mahendra G H, Rayza
title FEW-SHOT LEARNING IN INDONESIAN LANGUAGE DOMAIN TEXT CLASSIFICATION
url https://digilib.itb.ac.id/gdl/view/68650
_version_ 1822933711923970048