COMPLEXITY, WORD FEATURES, SENTENCE FEATURES, BERT, ROBERTA, XLNET, STACKING.

Bibliographic Details
Main Author: Stanley Yoga Setiawan, Stefanus
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/66605
Institution: Institut Teknologi Bandung
Description
Summary: The complexity of the words or phrases in a sentence is one indicator of the literacy level required to read a text. Information about the literacy level of a text can in turn be used to determine the complexity of a corpus, and corpus complexity can affect how well an artificial intelligence system understands the context of a text. This final project builds a model that predicts the complexity value of a word (subtask 1) or a phrase (subtask 2) that appears in a sentence. In the earlier SemEval 2021 Task 1 competition, BERT and RoBERTa were the two contextual pretrained embeddings that achieved the best performance on both subtasks. The research in this final project focuses on adding word and sentence features to both a contextual pretrained embedding-based model and a static embedding-based model in order to improve on the results of that competition. The experiments show that word and sentence features improve the performance of the individual models and of the stacked ensemble. The best stacking model ranked first in subtask 1 with a Pearson correlation of 0.7887 and second in subtask 2 with a Pearson correlation of 0.8590. Further analysis shows that the model tends to predict higher complexity for rarely used words or phrases than for frequently used ones.
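
The summary reports results as Pearson correlations and combines models by stacking, but the record does not say which meta-learner was used. The sketch below is only an illustration under assumptions: it computes the Pearson correlation between predicted and gold complexity scores, and fits a plain least-squares linear meta-learner on the predictions of several base models. The function names (pearson, fit_stacker, predict_stacker) and all numbers are hypothetical toy values, not taken from the thesis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_stacker(base_preds, targets):
    """Fit [intercept, w1, w2, ...] by least squares (normal equations).

    base_preds holds one row per example; each row lists the complexity
    scores predicted by the base models for that example.
    """
    rows = [[1.0] + list(r) for r in base_preds]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear(xtx, xty)


def predict_stacker(weights, row):
    """Combine one row of base-model predictions into a stacked prediction."""
    return weights[0] + sum(w * p for w, p in zip(weights[1:], row))


if __name__ == "__main__":
    # Toy data (made up): predictions of two base models and gold scores.
    base = [[0.10, 0.30], [0.40, 0.20], [0.60, 0.90], [0.80, 0.50]]
    gold = [0.22, 0.28, 0.74, 0.66]
    weights = fit_stacker(base, gold)
    stacked = [predict_stacker(weights, row) for row in base]
    print("weights:", weights)
    print("Pearson:", pearson(stacked, gold))
```

In practice the meta-learner would be fitted on out-of-fold predictions of the base models and evaluated on held-out data, so that the reported correlation is not inflated by fitting and scoring on the same examples.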