IDENTIFYING PLAUSIBILITY PHRASES IN INSTRUCTIONAL TEXTS USING BOOSTINGBERT AND ADABOOST.RT
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/82499 |
Institution: | Institut Teknologi Bandung |
Summary: | The coherence of each word or phrase in instructional text is crucial because
incorrect word choice can lead to different outcomes. This research aimed to
develop models to identify word or phrase coherence in instructional texts for
classification and regression tasks. This topic is similar to the one in SemEval 2022
task 7, and we use the same two tasks: classification and regression. Word
coherence, or phrase plausibility, is tested by evaluating how well a word or phrase
fits when substituted into the text, given the surrounding context. This method is
similar to masked language modeling (MLM), the technique used to train BERT. To improve
the performance of the model, ensemble learning is used, specifically boosting
with DeBERTaV3, an advanced variant of BERT, as the weak learner. The model’s
performance is compared with the best models in SemEval 2022 task 7, and
the advantages and disadvantages of the model are analyzed.
The training phase of the boosting method runs iteratively and sequentially,
with each iteration focusing on the incorrect predictions of the previous one. In this
final project, two models are developed using two modifications of the AdaBoost algorithm.
The BoostingBERT technique is used to develop the model for the classification task, while
the AdaBoost.RT technique is used to develop the model for the regression task. Both
techniques are implemented with DeBERTaV3 as the weak learner.
In addition, data preparation and imbalanced-data handling are applied to the
training dataset used in SemEval 2022.
The developed models achieved fourth place in both the regression and classification
tasks of SemEval 2022 task 7. In the classification task, the model achieved an accuracy
of 64.24%, demonstrating its ability to classify the coherence of words or phrases
with a relatively high level of accuracy. Meanwhile, in the regression task, the
model achieved a Spearman’s rank correlation of 0.765. However, the final model
size was quite large, reaching 9.8 GB for each task. Additionally, the model struggled
to predict the neutral label in the classification task and low-score data in the
regression task. |
---|
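The iterative reweighting described in the summary can be illustrated with a minimal sketch of AdaBoost.RT as it is commonly described in the literature. This is not the thesis implementation: a one-feature regression stump stands in for DeBERTaV3 as the weak learner, and the names `fit_stump` and `adaboost_rt`, the relative-error threshold `phi`, and the exponent `power` are assumptions chosen for illustration. Targets are assumed to be non-zero, since the algorithm uses relative error.

```python
import math


def fit_stump(xs, ys, w):
    """Weighted regression stump on one feature (stand-in weak learner)."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        thr = (lo + hi) / 2
        left = [(y, wi) for x, y, wi in zip(xs, ys, w) if x <= thr]
        right = [(y, wi) for x, y, wi in zip(xs, ys, w) if x > thr]
        # Weighted mean of the targets on each side of the split.
        ml = sum(y * wi for y, wi in left) / sum(wi for _, wi in left)
        mr = sum(y * wi for y, wi in right) / sum(wi for _, wi in right)
        err = sum(wi * (y - ml) ** 2 for y, wi in left)
        err += sum(wi * (y - mr) ** 2 for y, wi in right)
        if best is None or err < best[0]:
            best = (err, thr, ml, mr)
    _, thr, ml, mr = best
    return lambda x: ml if x <= thr else mr


def adaboost_rt(xs, ys, rounds=10, phi=0.1, power=2):
    """AdaBoost.RT sketch: phi and power are illustrative defaults."""
    m = len(xs)
    d = [1.0 / m] * m  # sample distribution, uniform at the start
    learners, alphas = [], []
    for _ in range(rounds):
        f = fit_stump(xs, ys, d)
        # Absolute relative error of each sample (requires y != 0).
        are = [abs(f(x) - y) / abs(y) for x, y in zip(xs, ys)]
        # Error rate: total weight of samples whose error exceeds phi.
        eps = sum(di for di, a in zip(d, are) if a > phi)
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # keep log(1/beta) finite
        beta = eps ** power
        # Shrink the weight of well-predicted samples, then renormalize,
        # so the next learner focuses on the poorly predicted ones.
        d = [di * (beta if a <= phi else 1.0) for di, a in zip(d, are)]
        z = sum(d)
        d = [di / z for di in d]
        learners.append(f)
        alphas.append(math.log(1.0 / beta))
    total = sum(alphas)
    # Final prediction: weighted average of the weak learners.
    return lambda x: sum(a * f(x) for a, f in zip(alphas, learners)) / total


# Toy data only; these are not SemEval scores.
toy_x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
toy_y = [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 5.0]
model = adaboost_rt(toy_x, toy_y)
predictions = [model(x) for x in toy_x]
```

In a setting like the one the summary describes, the stump would be replaced by a fine-tuned transformer, and the sample distribution would typically drive weighted resampling or a weighted loss; which of these the thesis uses is not stated in this record.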