Lexical knowledge-based machine learning method for sentiment analysis

Bibliographic Details
Main Author: Heng, Lai Xiang
Other Authors: Cong Gao
Format: Final Year Project
Language: English
Published: 2015
Subjects:
Online Access: http://hdl.handle.net/10356/62824
Institution: Nanyang Technological University
Description
Summary: Sentiment analysis and classification require labelled reviews (each marked as positive or negative) before any further data mining or natural language processing can be done. Reviews are usually labelled manually, which is time-consuming and demanding. In this paper, we propose a new learning algorithm that combines supervised learning with pre-compiled opinion lexicons. Because the algorithm does not require manual labelling of reviews, it greatly reduces the manpower and time needed. For this project, customers’ restaurant reviews are drawn from the Yelp dataset. The algorithm has five steps: 1) Build two pseudo documents, one positive and one negative. 2) Compute the pairwise similarity between each review document and the positive and negative documents, using either cosine similarity or Euclidean distance. 3) Label each review as positive or negative based on the similarity results. 4) Rank the reviews. 5) Select the top 2,000 reviews, 1,000 each from the positively and negatively labelled documents, to build the sentiment classification model. In the experiment, we examined both Naïve Bayes and Support Vector Machine (SVM) classifiers. Three feature extraction methods were used to train the classifiers: a bag-of-words model, a bag-of-words model with stopwords removed, and significant bigrams. With the Naïve Bayes classifier, significant bigrams performed best of the three, achieving 67% accuracy, whereas the bag-of-words model performed worst. The SVM classifier, on the other hand, performed well with both the bag-of-words model and the bag-of-words model with stopwords removed, achieving an accuracy of about 99%. However, this may indicate overfitting due to the large, sparse feature set. Nevertheless, the experiment shows that automated labelling of reviews is possible, bringing the goal one step closer.
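
The five labelling steps in the summary can be illustrated with a minimal Python sketch. This is not the project's code: the tiny word lists, the function names (tokenize, cosine_similarity, pseudo_label), the construction of the pseudo documents directly from lexicon words, and the choice to rank reviews by the gap between their two similarity scores are all assumptions made for illustration. Only the cosine similarity option is shown.

```python
import math
import re
from collections import Counter

# Hypothetical miniature lexicons (assumption); a real run would load a
# pre-compiled opinion lexicon with thousands of entries.
POSITIVE_WORDS = ["good", "great", "delicious", "friendly", "excellent"]
NEGATIVE_WORDS = ["bad", "terrible", "bland", "rude", "slow"]


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pseudo_label(reviews, top_n):
    """Return the top_n most confidently positive and negative reviews."""
    # Step 1: build the two pseudo documents (here, straight from the lexicons).
    pos_doc = Counter(POSITIVE_WORDS)
    neg_doc = Counter(NEGATIVE_WORDS)

    scored = []
    for review in reviews:
        vec = Counter(tokenize(review))
        # Step 2: similarity of the review to each pseudo document.
        pos_sim = cosine_similarity(vec, pos_doc)
        neg_sim = cosine_similarity(vec, neg_doc)
        if pos_sim == neg_sim:
            continue  # no lexical evidence either way; leave unlabelled
        # Step 3: label by whichever pseudo document is closer.
        label = "pos" if pos_sim > neg_sim else "neg"
        scored.append((abs(pos_sim - neg_sim), label, review))

    # Step 4: rank by the similarity gap (an assumed ranking criterion).
    scored.sort(key=lambda item: item[0], reverse=True)

    # Step 5: keep the top_n reviews of each class as training data.
    positives = [r for _, label, r in scored if label == "pos"][:top_n]
    negatives = [r for _, label, r in scored if label == "neg"][:top_n]
    return positives, negatives
```

With top_n set to 1,000, the two returned lists would correspond to the 2,000 automatically labelled reviews that the summary describes as training data for the Naïve Bayes and SVM classifiers.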