TEXT DATA AUGMENTATION USING SYNONYMS ON INDONESIAN TEXT CLASSIFICATION

Bibliographic Details
Main Author: Abdurrahman
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/39980
Institution: Institut Teknologi Bandung
Description
Summary: One determinant of the quality of a machine-learning-based language processing model is the availability of data. Annotating data can take a long time, and publicly available Indonesian data is still scarce, which can hamper research on Indonesian language processing. Besides making the most of small datasets, text data augmentation can be used to improve the evaluation results of a classification model. Here, augmentation creates a new sentence by replacing a few words in the original sentence with their synonyms. Two factors must be considered: the number of words replaced and the choice of synonym for each replaced word. The number of words replaced is calculated by multiplying the length of the sentence by the degree of augmentation. Synonym candidates are obtained from a thesaurus, and the synonym for each word is chosen by searching over the candidate word sequences with the beam search algorithm, so that the word sequence with the best probability is obtained. The probability is produced by a language model. Experiments were carried out with a corpus of sentences collected from various news sites and with datasets from the automotive domain. The corpus used to train the language model contains approximately 1 million sentences. The language model is then used in the augmentation process to produce the dataset used to train aspect categorization and sentiment classification models. The best language model was obtained by building a 5-gram neural language model. With this language model, the best augmentation degrees were found to be 0.5 for aspect categorization and 0.3 for sentiment classification. Text data augmentation increases the evaluation results by 0.03 to 0.04.
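The procedure described in the summary can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the toy thesaurus, the bigram log-probability table (a stand-in for the 5-gram neural language model), the rounding down of the replacement count, and the rule of replacing the first words that have thesaurus entries are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def num_replacements(sentence_length: int, degree: float) -> int:
    """Sentence length times augmentation degree (rounding down is an assumption)."""
    return int(sentence_length * degree)


def make_bigram_scorer(
    bigram_log_probs: Dict[Tuple[str, str], float], unseen: float = -10.0
) -> Callable[[str, str], float]:
    """Toy stand-in for a language model: log P(word | previous word)."""
    def log_prob(prev: str, word: str) -> float:
        return bigram_log_probs.get((prev, word), unseen)
    return log_prob


def beam_search_augment(
    tokens: List[str],
    thesaurus: Dict[str, List[str]],
    log_prob: Callable[[str, str], float],
    degree: float,
    beam_width: int = 3,
) -> List[str]:
    """Replace some words with synonyms, keeping the most probable word sequence."""
    k = num_replacements(len(tokens), degree)
    # Assumption: the words to replace are the first k that have thesaurus entries.
    replaceable = [i for i, w in enumerate(tokens) if thesaurus.get(w)]
    targets = set(replaceable[:k])

    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for i, word in enumerate(tokens):
        # Target positions branch over synonyms; other positions keep the original word.
        options = thesaurus[word] if i in targets else [word]
        expanded = []
        for prefix, score in beams:
            prev = prefix[-1] if prefix else "<s>"
            for option in options:
                expanded.append((prefix + [option], score + log_prob(prev, option)))
        # Keep only the beam_width highest-scoring partial sentences.
        expanded.sort(key=lambda hyp: hyp[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][0]


# Illustrative data only; the entries and scores are made up.
toy_thesaurus = {"mobil": ["kendaraan", "oto"], "bagus": ["baik", "apik"]}
toy_scores = {
    ("<s>", "kendaraan"): -1.0, ("<s>", "oto"): -3.0,
    ("kendaraan", "ini"): -1.0, ("oto", "ini"): -2.0,
    ("ini", "sangat"): -0.5,
    ("sangat", "baik"): -0.7, ("sangat", "apik"): -2.5,
}
augmented = beam_search_augment(
    ["mobil", "ini", "sangat", "bagus"],
    toy_thesaurus,
    make_bigram_scorer(toy_scores),
    degree=0.5,
)
print(" ".join(augmented))
```

With a degree of 0.5, the four-word example sentence has two words replaced. A real setup would swap the bigram table for a trained language model and the toy dictionary for an Indonesian thesaurus.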