A statistical feature extraction tool for mining short text data

Bibliographic Details
Main Author: Chan, Oliver Isaac L.
Format: text
Language: English
Published: Animo Repository 2015
Online Access: https://animorepository.dlsu.edu.ph/etd_masteral/5049
Institution: De La Salle University
Description
Summary: The surging global popularity and influence of social media opens many doors for instantaneous information sharing, marketing, self-promotion and crowdsourcing. With the good comes the bad: social media also faces criticism and problems such as cyberbullying, invasion of privacy, online harassment and false reporting. For NLP practitioners, social media data is interesting because of its properties; it differs from traditional text data in that its texts are short. This research aims to create an easy-to-use tool for users beyond NLP practitioners, such as students, site administrators and blog owners. To make the feature extraction tool universal, the following issues were addressed: language independence, domain independence, data sparsity, informal grammar patterns and feature selection/reduction. The tool was designed and built as a set of modules that handle each of these issues. It emphasizes extracting features and building a data model that goes beyond the commonly used Bag-of-Words by utilizing the co-occurrence properties of the extracted features. Third-party performance evaluation surveys were conducted to evaluate the tool on ease of use, efficiency, accuracy, usability and completeness; the respondents gave a favorable evaluation. For benchmarking and general testing, the tool was subjected to four experiments with varying domains, languages, numbers of classes and data sources. All datasets used in the experiments were gathered from previous studies, and the results of each experiment were benchmarked against the results reported in the corresponding previous work. Overall, the tool produced higher accuracies in each experiment than the respective reported results.
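
The summary does not say how the tool computes its co-occurrence features, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes whitespace tokenization and counts unordered token pairs that appear in the same short text; the function names `tokenize`, `cooccurrence_features` and `vectorize` are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Assumption: lowercase whitespace splitting as a simple,
    # language-independent stand-in for a real tokenizer.
    return text.lower().split()


def cooccurrence_features(texts):
    """Count single tokens and unordered token pairs per short text."""
    unigrams = Counter()
    pairs = Counter()
    for text in texts:
        # Deduplicate and sort so each pair has one canonical order.
        tokens = sorted(set(tokenize(text)))
        unigrams.update(tokens)
        pairs.update(combinations(tokens, 2))
    return unigrams, pairs


def vectorize(text, vocab_pairs):
    """Binary vector: 1 if both tokens of a pair occur in the text."""
    tokens = set(tokenize(text))
    return [1 if a in tokens and b in tokens else 0 for a, b in vocab_pairs]
```

Unlike a plain Bag-of-Words vector, each dimension here records that two tokens appeared together in one message, which is one way to add context to very short texts. A real pipeline would also need the feature selection/reduction step the summary mentions, since the number of pairs grows quickly with vocabulary size.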