LOW-RESOURCE CLICKBAIT SPOILING FOR INDONESIAN USING MULTILINGUAL PRE-TRAINED LANGUAGE MODELS

Bibliographic Details
Main Author: Putu Intan Maharani, Ni
Format: Theses
Language: Indonesia
Online Access:https://digilib.itb.ac.id/gdl/view/80973
Institution: Institut Teknologi Bandung
Language: Indonesia
id id-itb.:80973
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Clickbait spoiling is a new task that aims to produce spoilers for posts or headlines that contain clickbait. Previous research addressed this task for English by treating it as a question answering task; the best model achieved promising performance, with a BERTScore of 77.03 for phrase-type spoilers and 51.06 for passage-type spoilers. Because labeled clickbait spoiling data in Indonesian is scarce, implementing clickbait spoiling for Indonesian is a challenge. In this research, Indonesian clickbait spoiling test data was constructed to evaluate the performance of the developed clickbait spoiling models, and several training approaches for multilingual pre-trained language models were explored under this limited-label setting.

The Indonesian test set, the Indonesian Clickbait Spoiling Corpus, was constructed from online news articles, specifically article titles and contents. The IndoSUM dataset was used as a starting point. To keep only article titles that contain clickbait, a clickbait classification model was developed using the IndoBERT pre-trained model and the Indonesian clickbait classification dataset CLICK-ID. Extractive spoiler annotation was then carried out, and the type of each annotated spoiler was identified. The annotated data was validated by two validators who are native Indonesian speakers.

Given the limited Indonesian training data for clickbait spoiling, experiments were conducted on several training techniques: zero-shot cross-lingual learning, further fine-tuning, semi-supervised consistency training, and adapters.

For zero-shot cross-lingual learning, the English-language Webis Clickbait Spoiling Corpus 2022 is used to train multilingual pre-trained language models such as mBERT, XLM-RoBERTa, and mDeBERTaV3. The model is trained as a question answering task and then evaluated on the Indonesian Clickbait Spoiling Corpus. This zero-shot cross-lingual approach serves as the baseline.

Previous research on clickbait spoiling applied further fine-tuning, in which the language model is first fine-tuned on question answering data. In this research, the multilingual pre-trained language model is first fine-tuned on question answering data, namely SQuADv2 in English or IDK-MRC in Indonesian, and then fine-tuned again on the Webis Clickbait Spoiling Corpus 2022.

A semi-supervised learning approach, consistency training, is also used; it exploits unlabeled data during training. The objective is to minimize a supervised loss on labeled data together with a consistency loss on unlabeled data. Unlabeled data was collected from IndoSUM and the online news site CNBC Indonesia, with the clickbait classification model described above again used to filter for clickbait article titles. To compute the consistency loss, the unlabeled data is augmented by paraphrasing through back-translation.
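All of the training approaches above cast clickbait spoiling as extractive question answering: the clickbait post or title acts as the question, the linked article as the context, and the spoiler as the answer span. A minimal sketch of that formulation with the Hugging Face transformers library is shown below; the example post and article are invented, and the QA head of the untuned checkpoint is randomly initialized, so meaningful spoilers only appear after fine-tuning on the Webis Clickbait Spoiling Corpus 2022 as described above.

    import torch
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    model_name = "xlm-roberta-base"  # stands in for mBERT / XLM-RoBERTa / mDeBERTaV3
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)  # QA head freshly initialized

    # Invented example; the real data comes from the Webis Clickbait Spoiling Corpus 2022
    # (training) and the Indonesian Clickbait Spoiling Corpus (evaluation).
    post = "Penelitian terbaru ungkap minuman yang bikin panjang umur"
    article = ("Sebuah studi terbaru menemukan bahwa kopi tanpa gula "
               "dikaitkan dengan umur yang lebih panjang ...")

    inputs = tokenizer(post, article, truncation="only_second",
                       max_length=384, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)

    # Greedy span decoding: pick the highest-scoring start and end token.
    # A real decoder also enforces end >= start and excludes the question tokens.
    start = int(out.start_logits.argmax())
    end = int(out.end_logits.argmax())
    print(tokenizer.decode(inputs["input_ids"][0, start:end + 1]))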
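The semi-supervised consistency-training objective described above can be sketched as follows. The sketch assumes the original unlabeled batch and its back-translated paraphrase are padded to the same length and uses a KL divergence between the start/end distributions as the consistency term; the batch names are hypothetical and the thesis may define the loss differently, so this is an illustration rather than the actual implementation.

    import torch
    import torch.nn.functional as F

    def consistency_training_step(model, labeled_batch, unlabeled_batch, paraphrased_batch, lam=1.0):
        """Illustrative objective: supervised span loss on labeled English data plus a
        consistency loss between predictions on unlabeled Indonesian examples and
        their back-translated paraphrases."""
        # Supervised loss: standard extractive-QA start/end cross-entropy.
        sup = model(input_ids=labeled_batch["input_ids"],
                    attention_mask=labeled_batch["attention_mask"],
                    start_positions=labeled_batch["start_positions"],
                    end_positions=labeled_batch["end_positions"])
        supervised_loss = sup.loss

        # Consistency loss: the prediction on the original unlabeled text serves as a
        # soft target for the prediction on its paraphrase. (Token-position alignment
        # between the two inputs is glossed over in this sketch.)
        with torch.no_grad():
            tgt = model(input_ids=unlabeled_batch["input_ids"],
                        attention_mask=unlabeled_batch["attention_mask"])
        prd = model(input_ids=paraphrased_batch["input_ids"],
                    attention_mask=paraphrased_batch["attention_mask"])
        consistency_loss = (
            F.kl_div(F.log_softmax(prd.start_logits, dim=-1),
                     F.softmax(tgt.start_logits, dim=-1), reduction="batchmean")
            + F.kl_div(F.log_softmax(prd.end_logits, dim=-1),
                       F.softmax(tgt.end_logits, dim=-1), reduction="batchmean")
        )
        return supervised_loss + lam * consistency_loss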
In addition, an adapter-based approach (parameter-efficient transfer learning) is used. Task adapters are added to multilingual pre-trained language models such as mBERT and XLM-R base. The task adapter is trained on question answering data (SQuADv2 or IDK-MRC) and the Webis Clickbait Spoiling Corpus 2022. During training, the pre-trained language adapter corresponding to the language of the training data is activated; when inferring on the Indonesian Clickbait Spoiling Corpus, the pre-trained Indonesian language adapter is activated instead.

In general, the further fine-tuning, semi-supervised consistency training, and adapter approaches produce models that are competitive with the baseline (zero-shot cross-lingual learning). The further fine-tuning approach using SQuADv2 outperforms the other approaches, with a SQuAD F1 score of 41.519 and an IndoSBERT score of 59.522 (XLM-R large). Consistency training also largely outperforms the baseline, most notably with mDeBERTaV3, which improves by a significant margin. The adapter approach produces models competitive with the baseline; the XLM-R base model with a task adapter trained on SQuADv2 and the Webis Clickbait Spoiling Corpus 2022 outperforms the baseline. In a manual evaluation of a sample of model outputs, two categories of generated spoilers were identified: valid spoilers (71%, with three sub-categories) and invalid spoilers (29%, with two sub-categories).
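The adapter setup described above, a language adapter stacked with a task adapter and the language adapter swapped at inference time, could look roughly like the sketch below using the AdapterHub adapters library. The adapter names and the hub identifiers (en/wiki@ukp, id/wiki@ukp) are assumptions for illustration, not necessarily the thesis's actual configuration, and the API differs slightly in the older adapter-transformers package.

    import adapters
    from adapters.composition import Stack
    from transformers import AutoModelForQuestionAnswering

    model = AutoModelForQuestionAnswering.from_pretrained("xlm-roberta-base")
    adapters.init(model)  # enable adapter support on the vanilla transformers model

    # Pre-trained language adapters (identifiers assumed; check AdapterHub for the exact names).
    model.load_adapter("en/wiki@ukp", load_as="en")
    model.load_adapter("id/wiki@ukp", load_as="id")

    # New task adapter for clickbait spoiling treated as extractive QA.
    model.add_adapter("clickbait_qa")
    model.train_adapter("clickbait_qa")  # freeze the backbone and the language adapters

    # Training on English data (SQuADv2, Webis Clickbait Spoiling Corpus 2022):
    model.active_adapters = Stack("en", "clickbait_qa")
    # ... training loop ...

    # Inference on the Indonesian Clickbait Spoiling Corpus:
    model.active_adapters = Stack("id", "clickbait_qa")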
format Theses
author Putu Intan Maharani, Ni
spellingShingle Putu Intan Maharani, Ni
LOW-RESOURCE CLICKBAIT SPOILING FOR INDONESIAN USING MULTILINGUAL PRE-TRAINED LANGUAGE MODELS
author_facet Putu Intan Maharani, Ni
author_sort Putu Intan Maharani, Ni
title LOW-RESOURCE CLICKBAIT SPOILING FOR INDONESIAN USING MULTILINGUAL PRE-TRAINED LANGUAGE MODELS
title_short LOW-RESOURCE CLICKBAIT SPOILING FOR INDONESIAN USING MULTILINGUAL PRE-TRAINED LANGUAGE MODELS
title_full LOW-RESOURCE CLICKBAIT SPOILING FOR INDONESIAN USING MULTILINGUAL PRE-TRAINED LANGUAGE MODELS
title_fullStr LOW-RESOURCE CLICKBAIT SPOILING FOR INDONESIAN USING MULTILINGUAL PRE-TRAINED LANGUAGE MODELS
title_full_unstemmed LOW-RESOURCE CLICKBAIT SPOILING FOR INDONESIAN USING MULTILINGUAL PRE-TRAINED LANGUAGE MODELS
title_sort low-resource clickbait spoiling for indonesian using multilingual pre-trained language models
url https://digilib.itb.ac.id/gdl/view/80973
_version_ 1822281777202331648
spelling id-itb.:80973 2024-03-17T04:36:55Z LOW-RESOURCE CLICKBAIT SPOILING FOR INDONESIAN USING MULTILINGUAL PRE-TRAINED LANGUAGE MODELS Putu Intan Maharani, Ni Indonesia Theses clickbait, clickbait spoiling, multilingual pre-trained language model, zero-shot cross-lingual learning, further fine-tuning, semi-supervised learning, consistency training, adapter, task adapter INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/80973 text