LOW-RESOURCE CLICKBAIT SPOILING FOR INDONESIAN USING MULTILINGUAL PRE-TRAINED LANGUAGE MODELS
Clickbait spoiling is a new task that aims to produce spoilers from posts or headlines that contain clickbait. Previous research completed this task in English by treating this task as a question answering task. The best model produced has promising performance with a BERTScore of 77.03 for the p...
Saved in:
Main Author: | Putu Intan Maharani, Ni
---|---
Format: | Theses
Language: | Indonesian
Online Access: | https://digilib.itb.ac.id/gdl/view/80973
Institution: | Institut Teknologi Bandung
id: id-itb.:80973
institution: Institut Teknologi Bandung
building: Institut Teknologi Bandung Library
continent: Asia
country: Indonesia
content_provider: Institut Teknologi Bandung
collection: Digital ITB
language: Indonesian
description:
Clickbait spoiling is a new task that aims to produce spoilers for posts or headlines that contain clickbait. Previous research addressed this task in English by treating it as a question answering task. The best model achieved promising performance, with a BERTScore of 77.03 for phrase-type spoilers and 51.06 for passage-type spoilers. Because labeled clickbait spoiling data in Indonesian is scarce, implementing clickbait spoiling for Indonesian is challenging. In this research, Indonesian clickbait spoiling test data was constructed to evaluate the performance of the developed clickbait spoiling models. In addition, experiments with several training approaches for multilingual pre-trained language models were carried out using limited labeled training data.
The Indonesian clickbait spoiling test data, the Indonesian Clickbait Spoiling Corpus, was constructed from online news article data, specifically article titles and contents. IndoSUM data was used as the starting point for constructing the clickbait spoiling data. To filter the data so that it contains only clickbait article titles, a clickbait classification model was developed using the IndoBERT pre-trained model and the Indonesian clickbait classification dataset CLICK-ID. Extractive spoiler annotation was then carried out, and the type of each annotated spoiler was identified. The annotated data was subsequently validated by two validators who are native Indonesian speakers.
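As an illustration of the filtering step, the sketch below scores candidate article titles with a clickbait classifier. It assumes a text-classification checkpoint fine-tuned from IndoBERT on CLICK-ID; the model name and label string shown are placeholders, not the exact artifacts built in this research.

```python
# Hypothetical sketch: keep only IndoSUM article titles that the clickbait
# classifier labels as clickbait, before extractive spoiler annotation.
from transformers import pipeline

# Placeholder checkpoint: assumed to be IndoBERT already fine-tuned on CLICK-ID.
clf = pipeline("text-classification", model="indobenchmark/indobert-base-p1")

titles = [
    "Kamu Tidak Akan Percaya Apa yang Ditemukan di Dalam Gua Ini",
    "Pemerintah Umumkan Jadwal Libur Nasional 2024",
]

# The label string depends on how the classifier was fine-tuned.
clickbait_titles = [t for t in titles if clf(t)[0]["label"] == "clickbait"]
print(clickbait_titles)
```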
Given the limited clickbait spoiling training data in Indonesian, experiments were carried out on model training techniques such as zero-shot cross-lingual learning, further fine-tuning, semi-supervised learning with consistency training, and adapters. For the zero-shot cross-lingual learning approach, the English-language Webis Clickbait Spoiling Corpus 2022 was used to train multilingual pre-trained language models such as mBERT, XLM-RoBERTa, and mDeBERTaV3. The models were trained on a question answering task and then evaluated on the Indonesian Clickbait Spoiling Corpus. This zero-shot cross-lingual learning approach serves as the baseline.
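The question answering framing can be sketched as follows: the clickbait title acts as the question and the article body as the context, so the predicted answer span is the spoiler. The checkpoint below is an off-the-shelf multilingual QA model used purely for illustration, not the model trained on the Webis Clickbait Spoiling Corpus 2022 in this research.

```python
# Sketch of clickbait spoiling cast as extractive question answering.
from transformers import pipeline

# Illustrative multilingual QA checkpoint, not the thesis's trained model.
qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

clickbait_title = "Ternyata Ini Alasan Harga Beras Naik Drastis"
# Toy article body for illustration only.
article_text = "Harga beras naik karena gagal panen di beberapa daerah penghasil utama."

spoiler = qa(question=clickbait_title, context=article_text)
print(spoiler["answer"], spoiler["score"])
```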
Previous research on clickbait spoiling applied further fine-tuning, in which the language model was first fine-tuned on question answering training data. In this research, the multilingual pre-trained language models were first fine-tuned on question answering training data, namely SQuADv2 in English and IDK-MRC in Indonesian, and then fine-tuned again on the Webis Clickbait Spoiling Corpus 2022.
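A minimal sketch of this two-stage setup, assuming a Hugging Face Trainer workflow, is given below. The model name, hyperparameters, and the one-example toy datasets are placeholders standing in for SQuADv2/IDK-MRC and the Webis Clickbait Spoiling Corpus 2022, not the actual experimental configuration.

```python
# Hypothetical two-stage (further fine-tuning) sketch: stage 1 on general QA data,
# stage 2 continues fine-tuning the same weights on clickbait spoiling data.
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForQuestionAnswering.from_pretrained(base)

def qa_features(question, context, answer_text, answer_start):
    """Tokenize one (question, context) pair and map the answer span to token indices."""
    enc = tokenizer(question, context, truncation=True,
                    padding="max_length", max_length=256)
    enc["start_positions"] = enc.char_to_token(answer_start, sequence_index=1)
    enc["end_positions"] = enc.char_to_token(
        answer_start + len(answer_text) - 1, sequence_index=1)
    return enc

# Toy one-example stand-ins for the real corpora.
general_qa = [qa_features("Siapa presiden pertama Indonesia?",
                          "Presiden pertama Indonesia adalah Soekarno.",
                          "Soekarno", 34)]
spoiling = [qa_features("Ternyata ini alasan harga beras naik",
                        "Harga beras naik karena gagal panen di beberapa daerah.",
                        "gagal panen", 24)]

# Stage 1: fine-tune on general question answering data.
Trainer(model=model,
        args=TrainingArguments("stage1", num_train_epochs=1, report_to="none"),
        train_dataset=general_qa).train()

# Stage 2: continue fine-tuning the same weights on clickbait spoiling data.
Trainer(model=model,
        args=TrainingArguments("stage2", num_train_epochs=1, report_to="none"),
        train_dataset=spoiling).train()
```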
A semi-supervised learning approach, consistency training, which makes use of unlabeled data during training, was also applied. This training technique aims to minimize a supervised loss on labeled data and a consistency loss on unlabeled data. Unlabeled data was collected from IndoSUM and the online news site CNBC Indonesia, and the clickbait classification model mentioned previously was again used to filter for clickbait article titles. To compute the consistency loss, the unlabeled data is augmented by paraphrasing through back-translation.
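The combined objective can be sketched as follows under the general consistency-training recipe; this is a conceptual sketch, not the thesis's exact formulation. It assumes `model` is an extractive QA model returning start/end logits and that the back-translated paraphrase keeps the context tokenization compatible (for example, only the clickbait title is paraphrased).

```python
# Conceptual sketch of the consistency-training objective: supervised span loss on
# labeled data plus a consistency loss that pushes predictions on a back-translated
# paraphrase toward predictions on the original unlabeled example.
import torch
import torch.nn.functional as F

def consistency_step(model, labeled_batch, unlabeled_batch, augmented_batch, lam=1.0):
    # Supervised loss: the model's built-in span loss on labeled spoiling data.
    sup_loss = model(**labeled_batch).loss  # batch contains start/end positions

    # Reference predictions on the original unlabeled example (no gradient).
    with torch.no_grad():
        ref = model(**unlabeled_batch)
    # Predictions on the back-translated paraphrase of the same example.
    aug = model(**augmented_batch)

    cons_loss = (
        F.kl_div(F.log_softmax(aug.start_logits, dim=-1),
                 F.softmax(ref.start_logits, dim=-1), reduction="batchmean")
        + F.kl_div(F.log_softmax(aug.end_logits, dim=-1),
                   F.softmax(ref.end_logits, dim=-1), reduction="batchmean")
    )
    return sup_loss + lam * cons_loss
```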
Additionally, the adapter approach, a parameter-efficient transfer learning technique, was used. Task adapters were added to multilingual pre-trained language models such as mBERT and XLM-R base. The task adapter was trained on question answering data (SQuADv2 or IDK-MRC) and the Webis Clickbait Spoiling Corpus 2022. During training, the pre-trained language adapter corresponding to the language of the training data is activated; when inferring on the Indonesian Clickbait Spoiling Corpus, the pre-trained Indonesian language adapter is activated.
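The sketch below illustrates the bottleneck-adapter idea with a frozen language adapter stacked before a trainable task adapter. It is a generic illustration (module names and sizes are made up here), not the exact AdapterHub configuration used in this research.

```python
# Generic bottleneck-adapter sketch: language adapters stay frozen and are swapped
# by input language, while the task adapter is the only trained component.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()

    def forward(self, hidden_states):
        # Residual bottleneck: project down, non-linearity, project up, add back.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

class AdapterStack(nn.Module):
    """A frozen language adapter followed by a trainable task adapter."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.language_adapters = nn.ModuleDict({
            "en": BottleneckAdapter(hidden_size),
            "id": BottleneckAdapter(hidden_size),
        })
        self.task_adapter = BottleneckAdapter(hidden_size)
        for p in self.language_adapters.parameters():
            p.requires_grad = False  # language adapters are not updated

    def forward(self, hidden_states, lang: str):
        # Activate the language adapter matching the input language, then the task adapter.
        return self.task_adapter(self.language_adapters[lang](hidden_states))

# Example: English hidden states during training, Indonesian at inference.
h = torch.randn(2, 128, 768)  # (batch, seq_len, hidden)
stack = AdapterStack(hidden_size=768)
train_out = stack(h, lang="en")
infer_out = stack(h, lang="id")
print(train_out.shape, infer_out.shape)
```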
In general, the further fine-tuning, semi-supervised consistency training, and adapter approaches produce models that are competitive with the baseline (zero-shot cross-lingual learning). The further fine-tuning approach using SQuADv2 outperforms the other approaches, with a SQuAD F1 score of 41.519 and an IndoSBERT score of 59.522 (XLM-R large). Consistency training also produces models that largely outperform the baseline, with the mDeBERTaV3 language model improving by a significant margin. The adapter approach produces models that are competitive with the baseline; the XLM-R base model with a task adapter trained on SQuADv2 and the Webis Clickbait Spoiling Corpus 2022 outperforms the baseline. In a manual evaluation of a sample of model outputs, two categories describing the generated spoilers were identified: valid spoilers (71%, with three sub-categories) and invalid spoilers (29%, with two sub-categories).
format: Theses
author: Putu Intan Maharani, Ni
title: LOW-RESOURCE CLICKBAIT SPOILING FOR INDONESIAN USING MULTILINGUAL PRE-TRAINED LANGUAGE MODELS
url: https://digilib.itb.ac.id/gdl/view/80973
keywords: clickbait, clickbait spoiling, multilingual pre-trained language model, zero-shot cross-lingual learning, further fine-tuning, semi-supervised learning, consistency training, adapter, task adapter