TEXT DATA AUGMENTATION USING SYNONYMS ON INDONESIAN TEXT CLASSIFICATION
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/39980 |
Institution: | Institut Teknologi Bandung |
Summary: | One determinant of the quality of a machine-learning-based language processing model is
the availability of data. Annotating data takes a long time, and publicly available Indonesian
data are still scarce, which can hamper research on Indonesian language processing.
Besides making the most of small datasets, text data augmentation can be applied to improve the
evaluation results of a classification model. Text data augmentation creates a new sentence
by replacing a few words in the original sentence with
their synonyms.
Text data augmentation must take two factors into account: the number of
words replaced and the choice of synonym for each replaced word. The number of
words replaced is calculated by multiplying the length of the sentence by the degree of
augmentation. Synonym candidates are obtained from a thesaurus, while the synonyms are
selected by scoring candidate word sequences with a beam search
algorithm, so that the word sequence with the best probability is obtained. The
probabilities are produced by a language model.
Experiments were carried out with a corpus containing sentences collected from various news
sites and datasets in the automotive domain. The corpus used to train the language model
contains approximately 1 million sentences. The language model is then used in
the augmentation process to produce the dataset used to train aspect categorization and
sentiment classification models. The best language model was obtained by building a 5-gram
neural language model. Using this language model, the best augmentation
degrees were found to be 0.5 and 0.3 for aspect categorization and sentiment classification, respectively.
Text data augmentation increases the evaluation results by 0.03 to 0.04. |
---|
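
The augmentation procedure described in the summary can be illustrated with a minimal Python sketch. It is not the author's implementation. The function names (`num_replacements`, `augment`, `toy_logprob`), the toy thesaurus, and the toy bigram scores are all hypothetical, and the 5-gram neural language model is replaced here by a small lookup table. Two further assumptions are labeled in the comments: the product of sentence length and augmentation degree is rounded to the nearest integer, and the words to replace are the first ones that have a thesaurus entry, since the summary does not say how positions are chosen.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical scoring function type: log-probability of `word` given the prefix.
Scorer = Callable[[Sequence[str], str], float]


def num_replacements(sentence_length: int, degree: float) -> int:
    """Number of words to replace = sentence length x augmentation degree.

    Rounding to the nearest integer is an assumption; the summary only
    states the multiplication.
    """
    return min(sentence_length, round(sentence_length * degree))


def augment(
    words: Sequence[str],
    thesaurus: Dict[str, List[str]],
    logprob: Scorer,
    degree: float,
    beam_width: int = 3,
) -> List[str]:
    """Build one augmented sentence by synonym replacement with beam search."""
    # Assumption: replace the first n words that have thesaurus entries.
    replaceable = [i for i, w in enumerate(words) if thesaurus.get(w)]
    targets = set(replaceable[: num_replacements(len(words), degree)])

    # Each beam is (cumulative log-probability, words chosen so far).
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for i, word in enumerate(words):
        # Replaced positions offer their synonyms; other positions keep the word.
        options = thesaurus[word] if i in targets else [word]
        expanded = [
            (score + logprob(seq, option), seq + [option])
            for score, seq in beams
            for option in options
        ]
        # Keep only the `beam_width` most probable partial sentences.
        expanded.sort(key=lambda beam: beam[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][1]


# Toy stand-ins for the thesaurus and the language model (illustrative only).
TOY_THESAURUS = {"mobil": ["kendaraan", "oto"], "bagus": ["baik", "elok"]}
TOY_BIGRAMS = {
    ("<s>", "oto"): -1.0,
    ("<s>", "kendaraan"): -2.0,
    ("oto", "ini"): -1.0,
    ("kendaraan", "ini"): -1.5,
    ("ini", "sangat"): -1.0,
    ("sangat", "baik"): -0.5,
    ("sangat", "elok"): -1.0,
}


def toy_logprob(prefix: Sequence[str], word: str) -> float:
    """Bigram lookup with a fixed penalty for unseen pairs."""
    previous = prefix[-1] if prefix else "<s>"
    return TOY_BIGRAMS.get((previous, word), -5.0)


sentence = ["mobil", "ini", "sangat", "bagus"]
augmented = augment(sentence, TOY_THESAURUS, toy_logprob, degree=0.5)
```

With an augmentation degree of 0.5, two of the four words are replaced, and the beam keeps the synonym combination that the toy scores rate as most probable. In a real setting, `toy_logprob` would be swapped for a trained language model.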