Integration of feature subset selection methods for sentiment analysis

Feature selection, the search for an optimal feature subset in a real-world domain, is one of the main challenges in sentiment analysis. The complexity of optimal feature subset selection grows exponentially with the number of features for analysing and organizing data in high-dimensional spaces tha...

Bibliographic Details
Main Author: Yousefpour, Alireza
Format: Thesis
Language: English
Published: 2019
Subjects: QA75 Electronic computers. Computer science
Online Access: http://eprints.utm.my/id/eprint/98112/1/AlirezaYousefpourPSC2019.pdf
http://eprints.utm.my/id/eprint/98112/
http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:143783
Institution: Universiti Teknologi Malaysia
id my.utm.98112
record_format eprints
spelling my.utm.98112 2022-11-14T10:09:51Z http://eprints.utm.my/id/eprint/98112/ Integration of feature subset selection methods for sentiment analysis Yousefpour, Alireza QA75 Electronic computers. Computer science 2019 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/id/eprint/98112/1/AlirezaYousefpourPSC2019.pdf Yousefpour, Alireza (2019) Integration of feature subset selection methods for sentiment analysis. PhD thesis, Universiti Teknologi Malaysia, Faculty of Engineering - School of Computing. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:143783
institution Universiti Teknologi Malaysia
building UTM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Malaysia
content_source UTM Institutional Repository
url_provider http://eprints.utm.my/
language English
topic QA75 Electronic computers. Computer science
description Feature selection, the search for an optimal feature subset in a real-world domain, is one of the main challenges in sentiment analysis. The complexity of optimal feature subset selection grows exponentially with the number of features, and analysing and organizing data in such high-dimensional spaces leads to high-dimensional problems. To overcome this problem, this study attempted to enhance feature subset selection in high-dimensional data by removing irrelevant and redundant features using filter and wrapper approaches. First, a filter method based on the dispersion of samples in feature space, known as the mutual standard deviation method, was developed to minimize intra-class distances and maximize inter-class distances. Filter-based methods have several advantages: they scale easily to high-dimensional datasets and are computationally simple and fast. In addition, they depend only on the feature selection space and ignore the hypothesis model space. Hence, the next step of this study developed a new feature ranking approach by integrating various filter methods. Ordinal-based and frequency-based integrations of different filter methods were developed. Finally, a hybrid harmony search based on a search strategy was developed and used to enhance feature subset selection and to overcome the problem of ignoring the dependency of feature selection on the classifier. To this end, a search strategy on feature space that integrates filter and wrapper approaches was introduced to find a semantic relationship among the model selections and subsets of the search features. Comparative experiments were performed on five sentiment datasets: movie, music, book, electronics, and kitchen reviews. A sizeable performance improvement was noted: the proposed integration-based feature subset selection method achieved 98.32% accuracy in sentiment classification using POS-based features on movie reviews.
Finally, a statistical test based on accuracy showed significant differences between the proposed methods and the baseline methods in almost all comparisons under k-fold cross-validation. The findings show that the mutual standard deviation and integration-based feature subset selection methods outperformed the baseline methods in terms of accuracy.
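The abstract names ordinal-based and frequency-based integration of filter rankings but does not define them. The Python sketch below is an illustration under stated assumptions, not the thesis's method: it assumes "ordinal-based" means ordering features by their mean rank position across filters, and "frequency-based" means ordering features by how many filters place them in their top k. The function names, the `top_k` parameter, and the toy scores are all hypothetical.

```python
from collections import Counter


def rank_features(scores):
    """Order feature names from highest to lowest score (ties broken by name)."""
    return sorted(scores, key=lambda f: (-scores[f], f))


def ordinal_integration(filter_scores):
    """Assumed 'ordinal-based' integration: sort by mean rank position across filters."""
    rankings = [rank_features(s) for s in filter_scores]
    features = rankings[0]
    mean_rank = {
        f: sum(r.index(f) for r in rankings) / len(rankings) for f in features
    }
    return sorted(features, key=lambda f: (mean_rank[f], f))


def frequency_integration(filter_scores, top_k):
    """Assumed 'frequency-based' integration: sort by number of top_k appearances."""
    votes = Counter()
    for scores in filter_scores:
        votes.update(rank_features(scores)[:top_k])
    features = rank_features(filter_scores[0])
    # Counter returns 0 for features that never reached any top_k list.
    return sorted(features, key=lambda f: (-votes[f], f))


# Toy relevance scores from three hypothetical filter methods (made-up numbers).
toy_scores = [
    {"good": 0.9, "bad": 0.8, "the": 0.1, "plot": 0.3},
    {"good": 0.7, "bad": 0.9, "the": 0.2, "plot": 0.1},
    {"good": 0.8, "bad": 0.6, "the": 0.1, "plot": 0.5},
]

ordinal_order = ordinal_integration(toy_scores)
frequency_order = frequency_integration(toy_scores, top_k=2)
print(ordinal_order)
print(frequency_order)
```

Under these assumptions, either combined ordering could be truncated to a candidate subset and passed to a wrapper stage that evaluates subsets with a classifier; the abstract does not specify how the thesis connects the two stages.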
format Thesis
author Yousefpour, Alireza
title Integration of feature subset selection methods for sentiment analysis
publishDate 2019
url http://eprints.utm.my/id/eprint/98112/1/AlirezaYousefpourPSC2019.pdf
http://eprints.utm.my/id/eprint/98112/
http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:143783
_version_ 1751536149304705024