TEXT DATA AUGMENTATION USING SYNONYMS ON INDONESIAN TEXT CLASSIFICATION

One determinant of the quality of a machine-learning-based language processing model is the availability of data. Annotating data is time-consuming, and publicly available Indonesian data is still scarce. This can hamper research on Indonesian language processing. In addition to optimizi...

Bibliographic Details
Main Author: Abdurrahman
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/39980
Institution: Institut Teknologi Bandung
id id-itb.:39980
spelling id-itb.:39980 2019-06-28T14:57:29Z TEXT DATA AUGMENTATION USING SYNONYMS ON INDONESIAN TEXT CLASSIFICATION Abdurrahman Indonesia Final Project augmentation, text, synonym, language model INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/39980
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description One determinant of the quality of a machine-learning-based language processing model is the availability of data. Annotating data is time-consuming, and publicly available Indonesian data is still scarce, which can hamper research on Indonesian language processing. Besides making the most of small datasets, text data augmentation can be used to improve the evaluation results of a classification model. Text data augmentation creates a new sentence by replacing a few words in the original sentence with their synonyms. Two factors must be considered: the number of words replaced and the choice of synonym for each replaced word. The number of words replaced is calculated by multiplying the sentence length by the degree of augmentation. Synonym candidates are obtained from a thesaurus, and the synonyms are selected by searching over the probabilities of candidate word sequences with the beam search algorithm, so that the word sequence with the best probability is obtained. The probabilities are produced by a language model. Experiments were carried out with a corpus of sentences collected from various news sites and with datasets in the automotive domain. The corpus used to train the language model contains approximately 1 million sentences. The language model is then used in the augmentation process to produce the dataset used to train aspect categorization and sentiment classification models. The best language model was a 5-gram neural language model. With this language model, the best augmentation degrees were found to be 0.5 for aspect categorization and 0.3 for sentiment classification. Text data augmentation increased the evaluation results by 0.03 to 0.04.
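The augmentation procedure summarized in the description can be illustrated with a minimal Python sketch. It is not the author's implementation. Several pieces are assumptions made only for illustration: the names `num_replacements`, `augment`, `TOY_THESAURUS`, and `toy_log_prob` are hypothetical; a tiny hand-written bigram table stands in for the 5-gram neural language model; the product of sentence length and degree is rounded to the nearest integer; and the words to replace are simply the first ones that have a thesaurus entry, since the description does not say how positions are chosen.

```python
from typing import Callable, Dict, List, Tuple


def num_replacements(sentence_length: int, degree: float) -> int:
    """Words to replace = sentence length x augmentation degree.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return max(0, min(sentence_length, round(sentence_length * degree)))


def augment(
    tokens: List[str],
    thesaurus: Dict[str, List[str]],
    log_prob: Callable[[str, str], float],
    degree: float,
    beam_width: int = 2,
) -> List[str]:
    """Replace some words with synonyms, choosing synonyms by beam search."""
    k = num_replacements(len(tokens), degree)
    # Assumption: replace the first k words that have synonym candidates.
    replaceable = [i for i, tok in enumerate(tokens) if thesaurus.get(tok)]
    targets = set(replaceable[:k])

    # Each hypothesis is (words so far, cumulative log-probability).
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for i, tok in enumerate(tokens):
        options = thesaurus[tok] if i in targets else [tok]
        expanded: List[Tuple[List[str], float]] = []
        for words, score in beams:
            prev = words[-1] if words else "<s>"
            for cand in options:
                expanded.append((words + [cand], score + log_prob(prev, cand)))
        # Keep only the beam_width most probable partial sentences.
        expanded.sort(key=lambda hyp: hyp[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][0]


# Toy thesaurus and toy bigram log-probabilities (illustrative values only).
TOY_THESAURUS: Dict[str, List[str]] = {
    "mobil": ["oto", "kendaraan"],
    "bagus": ["baik", "elok"],
}
TOY_BIGRAM_LOGPROBS: Dict[Tuple[str, str], float] = {
    ("<s>", "kendaraan"): -1.0,
    ("<s>", "oto"): -3.0,
    ("kendaraan", "ini"): -1.0,
    ("oto", "ini"): -1.5,
    ("ini", "sangat"): -0.5,
    ("sangat", "baik"): -0.7,
    ("sangat", "elok"): -2.5,
}


def toy_log_prob(prev: str, word: str) -> float:
    """Stand-in language model: table lookup with a fixed floor for unseen pairs."""
    return TOY_BIGRAM_LOGPROBS.get((prev, word), -5.0)


if __name__ == "__main__":
    sentence = ["mobil", "ini", "sangat", "bagus"]
    print(augment(sentence, TOY_THESAURUS, toy_log_prob, degree=0.5))
```

With a degree of 0.5, two of the four words in the toy sentence are replaced, and the search returns the candidate sentence that the toy table scores highest, here `kendaraan ini sangat baik`. In practice, `toy_log_prob` would be replaced by a trained language model that conditions on more than one previous word.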
format Final Project
author Abdurrahman
title TEXT DATA AUGMENTATION USING SYNONYMS ON INDONESIAN TEXT CLASSIFICATION
url https://digilib.itb.ac.id/gdl/view/39980
_version_ 1821997948578299904