IDENTIFYING PLAUSIBILITY PHRASES IN INSTRUCTIONAL TEXTS USING BOOSTINGBERT AND ADABOOST.RT

Bibliographic Details
Main Author: Sumerta Yoga, Gede
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/82499
Institution: Institut Teknologi Bandung
Description
Summary: The coherence of each word or phrase in an instructional text is crucial, because an incorrect word choice can lead to a different outcome. This research aimed to develop models that identify word or phrase coherence in instructional texts for classification and regression tasks. The topic follows SemEval 2022 Task 7 and uses the same two tasks: classification and regression. Word coherence, or phrase plausibility, is tested by evaluating how well a word or phrase fits the surrounding context when it is substituted into the text. This is similar to masked language modeling (MLM), the technique used to train BERT. To increase model performance, ensemble learning is used, specifically boosting with DeBERTaV3, an advanced variant of BERT, as the weak learner. The models’ performance is compared with the best models in SemEval 2022 Task 7, and their advantages and disadvantages are analyzed. Boosting trains iteratively and sequentially, with each iteration focusing on the incorrect predictions of the previous one. In this final project, two models were developed using two modifications of the AdaBoost algorithm: BoostingBERT for the classification task and AdaBoost.RT for the regression task. Both implementations use DeBERTaV3 as the weak learner. The work also covers data preparation and the handling of imbalanced data in the training dataset used by SemEval 2022. The developed models achieved fourth place in both the regression and the classification tasks of SemEval 2022 Task 7. In the classification task, the model achieved an accuracy of 64.24%, indicating that it classifies the coherence of words or phrases reasonably well. In the regression task, the model achieved a Spearman’s rank correlation of 0.765.
However, the final model was quite large, reaching 9.8 GB for each task. In addition, the model struggled to predict the neutral label in the classification task and low-score data in the regression task.
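
The iterative re-weighting that the summary describes for the regression task can be illustrated with a minimal sketch of one AdaBoost.RT round, following the commonly published formulation of the algorithm. This is not the thesis's implementation: the helper names `adaboost_rt_round` and `combine`, the threshold `phi`, the exponent `power`, and the clamping of the error rate are all assumptions made for illustration, and the weak learner (DeBERTaV3 in the thesis) is replaced by plain lists of predictions.

```python
import math


def adaboost_rt_round(weights, preds, targets, phi=0.1, power=2):
    """One AdaBoost.RT re-weighting round (hypothetical helper).

    weights: current example weights, summing to 1
    preds:   weak-learner predictions for this round
    targets: gold scores, assumed non-zero
    phi:     relative-error threshold separating "correct" from "incorrect"
    """
    # Absolute relative error of each prediction.
    are = [abs(p - y) / abs(y) for p, y in zip(preds, targets)]
    # Error rate: total weight of the examples whose error exceeds phi.
    eps = sum(w for w, e in zip(weights, are) if e > phi)
    # Assumption: clamp eps so that beta is never exactly 0 or 1.
    eps = min(max(eps, 1e-10), 1.0 - 1e-10)
    beta = eps ** power
    # Shrink the weights of well-predicted examples, then renormalize,
    # so the next weak learner focuses on the poorly predicted ones.
    new_weights = [w * (beta if e <= phi else 1.0) for w, e in zip(weights, are)]
    total = sum(new_weights)
    return [w / total for w in new_weights], beta


def combine(round_preds, betas):
    """Final output: average of weak learners weighted by log(1 / beta)."""
    coefs = [math.log(1.0 / b) for b in betas]
    total = sum(coefs)
    n_examples = len(round_preds[0])
    return [
        sum(c * preds[i] for c, preds in zip(coefs, round_preds)) / total
        for i in range(n_examples)
    ]


# Toy usage: four examples, of which only the third is predicted badly.
new_weights, beta = adaboost_rt_round(
    weights=[0.25, 0.25, 0.25, 0.25],
    preds=[1.05, 2.1, 3.0, 5.2],
    targets=[1.0, 2.0, 4.0, 5.0],
)
```

In this toy run the third example ends up with most of the weight, which is the sense in which each boosting iteration concentrates on the previous iteration's incorrect predictions.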