Gauging online and offline public opinion for social media monitoring

Bibliographic Details
Main Author: Satapathy, Ranjan
Other Authors: Erik Cambria
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access: https://hdl.handle.net/10356/151548
Institution: Nanyang Technological University
Description
Summary: This thesis examines ill-formed text (microtext) and its impact on sentiment analysis. Microtext originates on Twitter, WhatsApp, Facebook and other social media platforms. Words or phrases that are not in their standard language form, such as “lol” (laugh out loud), “c u 2nite” (see you tonight) and “plz” (please), are called out-of-vocabulary (OOV) terms. Words or phrases in their standard forms, for example “tonight”, “love” or “talk to you later”, are called in-vocabulary (IV) terms. The motivation behind microtext normalization is the growing presence of microtext on both online and offline platforms. Microtext might appear to be ill-formed text that calls for find-and-replace methods. Traditionally, in a Natural Language Processing (NLP) context, microtext features such as abbreviations and the shortening or lengthening of words are normalized, but phonemic and graphemic variation is often ignored. However, the features of microtext can be traced back to social, cognitive and geographic influences. To understand social media text, it is essential to develop a microtext normalization module for traditional and modern NLP models. The first part of this thesis covers unsupervised learning methods in two chapters, 4 and 5. The first method creates a lexicon to transform the most frequent microtexts and emoticons on social media and applies it to sentiment analysis. It shows an accuracy increase of 4% when microtext normalization is applied before sentiment analysis. The method is then extended to a pre-trained chatbot to improve the chatbot’s understanding of microtext (social media slang), reaching a mean BLEU score of 0.8 on the SMS and Tweet dataset. The second method uses a phonetic-based approach that transforms any microtext into its phonetic equivalent before normalizing it. It shows an accuracy improvement of 6% for sentiment analysis on SenticNet.
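The lexicon-based step described in the summary can be illustrated with a minimal sketch. The lexicon below is hypothetical: its entries are taken from the examples quoted in the summary ("lol", "c u 2nite", "plz"), split into token-level mappings for illustration, and it is not the lexicon built in the thesis. The function name `normalize_microtext` is likewise an assumption.

```python
import re

# Hypothetical toy lexicon built from the examples in the summary;
# the thesis's actual lexicon is not reproduced here.
MICROTEXT_LEXICON = {
    "lol": "laugh out loud",
    "plz": "please",
    "c": "see",
    "u": "you",
    "2nite": "tonight",
}


def normalize_microtext(text, lexicon=MICROTEXT_LEXICON):
    """Replace each out-of-vocabulary (OOV) token with its in-vocabulary (IV) form.

    Tokens that are not in the lexicon are left unchanged, and punctuation
    and spacing between tokens are preserved.
    """
    def replace(match):
        token = match.group(0)
        # Look up the lowercased token; fall back to the original token.
        return lexicon.get(token.lower(), token)

    # \w+ matches runs of letters, digits and underscores, so "2nite" is one token.
    return re.sub(r"\w+", replace, text)


print(normalize_microtext("c u 2nite plz"))
```

A lookup of this kind only covers entries that were added in advance, which is the maintenance problem the summary raises when it motivates the later sequence-to-sequence model.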
The second part of the unsupervised learning work, in Chapter 5, introduces an IPA-based method for microtext normalization and sentiment analysis. Epitran is used to transform a word to its International Phonetic Alphabet (IPA) equivalent; the IPA is an alphabetic system of phonetic notation based primarily on the Latin script. The method significantly improves on Soundex in both the microtext normalization task and sentiment analysis. The results also show very little redundancy when words are transformed to IPA phonetics. In addition, the accuracy of polarity detection with this method improves over the baseline by 5%. Chapter 6 discusses a simple sequence-to-sequence Deep Learning method. Because language is dynamic, a lexicon is not easy to maintain. Deep Learning replaces the static lexicon with a probabilistic model that transforms microtext into its standard form. The findings suggest that the semiotic class of microtext was easy to transform, whereas the phonetic class was not handled easily. The proposed model improves the pre-trained sentiment analysis model by 4%. We also introduce a corpus for English microtext normalization. The corpus contains sentences with microtext along with their correctly spelt counterparts and their correct polarity, so it can be used for two NLP problems: microtext normalization and sentiment analysis. It consists of 38% positive, 39.9% negative and 22.1% neutral sentences. A pre-trained model reaches an accuracy of 46.4% on OOV text for sentiment analysis, showing considerable room for improvement. In summary, this thesis introduces several approaches to microtext normalization and focuses on their application to sentiment analysis. Microtext cannot be treated as a one-dimensional problem in NLP. It encapsulates many factors, such as regional (geographical), ethnic (national and racial) and social (class, age, gender, socioeconomic status and education) ones, which cannot be ignored given the growing importance of social media in our daily life.
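The summary compares the IPA-based method against Soundex and notes that IPA transcriptions show very little redundancy. A minimal sketch of standard American Soundex, written here for illustration and not taken from the thesis, suggests why a coarse phonetic key can be redundant: several distinct words collapse to the same code. The function name `soundex` and the example words are assumptions.

```python
# Standard American Soundex digit groups.
_GROUPS = (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
)
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(word):
    """Return the 4-character Soundex key of a word (illustrative sketch).

    Non-alphabetic characters are dropped, so the digit in "2nite" is ignored.
    """
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels and y get no code
        if code and code != prev:
            key += code
        prev = code
    # Pad with zeros or truncate to the fixed length of four.
    return (key + "000")[:4]


for w in ("luv", "love", "live", "leave"):
    print(w, soundex(w))
```

In this sketch "luv" and "love" share a key, which is the desired match, but so do "live" and "leave", and "2nite" does not share a key with "tonight" because the leading digit is discarded.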