TEXT DATA AUGMENTATION USING SYNONYMS ON INDONESIAN TEXT CLASSIFICATION

Bibliographic Details
Main Author:	Abdurrahman
Format:	Final Project
Language:	Indonesian
Online Access:	https://digilib.itb.ac.id/gdl/view/39980
Institution:	Institut Teknologi Bandung

id	id-itb.:39980
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	The quality of a machine-learning language processing model depends in part on the availability of data. Annotating data is time-consuming, and publicly available Indonesian data is still scarce, which can hamper research on Indonesian language processing. Besides making the most of small datasets, text data augmentation can be used to improve the evaluation results of a classification model. Text data augmentation creates a new sentence by replacing a few words in the original sentence with their synonyms. Two factors must be considered: the number of words replaced and the choice of synonym for each replaced word. The number of words replaced is calculated by multiplying the sentence length by the degree of augmentation. Synonym candidates are obtained from a thesaurus, and the synonyms are selected with the beam search algorithm, which traces the probability of each candidate wording so that the word sequence with the best probability is obtained. The probabilities are produced by a language model. Experiments were carried out on a corpus of sentences collected from various news sites and on datasets in the automotive domain. The corpus used to train the language model contains approximately 1 million sentences. The language model is then used in the augmentation process to produce the datasets used to train aspect categorization and sentiment classification models. The best language model was a 5-gram neural language model. With this language model, the best augmentation degrees were found to be 0.5 for aspect categorization and 0.3 for sentiment classification. Text data augmentation increased the evaluation results by 0.03 to 0.04.
format	Final Project
author	Abdurrahman
url	https://digilib.itb.ac.id/gdl/view/39980
_version_	1821997948578299904
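
The augmentation procedure summarized in the description can be sketched roughly as follows. This is an illustrative sketch only, not the author's implementation: a toy add-one-smoothed bigram model stands in for the 5-gram neural language model, and the function names (`train_bigram_lm`, `augment`), the `beam_width` parameter, the rounding of the replacement count (truncation), and the random choice of which words to replace are all assumptions that the record does not specify.

```python
import math
import random
from collections import Counter


def train_bigram_lm(sentences):
    """Toy add-one-smoothed bigram model (a stand-in for the 5-gram neural LM)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def log_prob(prev, word):
        # Log-probability of `word` following `prev`, with add-one smoothing.
        return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

    return log_prob


def augment(sentence, thesaurus, log_prob, degree, beam_width=3, rng=None):
    """Replace some words with synonyms, choosing synonyms by beam search."""
    rng = rng or random.Random(0)
    words = sentence.split()

    # Number of words replaced = sentence length * degree of augmentation.
    # Truncating to an integer is an assumption.
    n_replace = int(len(words) * degree)

    # Assumption: positions are sampled at random among words that have
    # at least one thesaurus entry.
    replaceable = [i for i, w in enumerate(words) if w in thesaurus]
    positions = set(rng.sample(replaceable, min(n_replace, len(replaceable))))

    # Each beam entry is (cumulative log-probability, words chosen so far).
    beams = [(0.0, [])]
    for i, word in enumerate(words):
        candidates = thesaurus[word] if i in positions else [word]
        expanded = []
        for score, seq in beams:
            prev = seq[-1] if seq else "<s>"
            for cand in candidates:
                expanded.append((score + log_prob(prev, cand), seq + [cand]))
        # Keep only the `beam_width` most probable partial sentences.
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]

    return " ".join(beams[0][1])


if __name__ == "__main__":
    # Hypothetical toy corpus and thesaurus, for illustration only.
    lm = train_bigram_lm(["mobil ini baik", "mobil ini baik sekali"])
    thesaurus = {"bagus": ["baik", "elok"]}
    print(augment("mobil ini bagus", thesaurus, lm, degree=0.34))
```

With a degree of 0 no positions are selected, so the sentence is returned unchanged. Larger degrees replace more words, and the language model decides which synonym combination reads most naturally.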

Similar Items