A statistical feature extraction tool for mining short text data

The surging popularity and influence of social media worldwide opens many doors for instantaneous information sharing, marketing, self-promotion and crowdsourcing. With the good comes the bad: social media also faces criticism and issues such as cyberbullying, inv...

Bibliographic Details
Main Author: Chan, Oliver Isaac L.
Format: text
Language: English
Published: Animo Repository 2015
Subjects: Social media; Electronic data processing--Data entry; Computer Sciences
Online Access: https://animorepository.dlsu.edu.ph/etd_masteral/5049
Institution: De La Salle University
id oai:animorepository.dlsu.edu.ph:etd_masteral-11887
record_format eprints
spelling oai:animorepository.dlsu.edu.ph:etd_masteral-11887 2024-05-27T00:22:39Z A statistical feature extraction tool for mining short text data Chan, Oliver Isaac L. 2015-01-01T08:00:00Z text https://animorepository.dlsu.edu.ph/etd_masteral/5049 Master's Theses English Animo Repository Social media Electronic data processing--Data entry Computer Sciences
institution De La Salle University
building De La Salle University Library
continent Asia
country Philippines
content_provider De La Salle University Library
collection DLSU Institutional Repository
language English
topic Social media
Electronic data processing--Data entry
Computer Sciences
description The surging popularity and influence of social media worldwide opens many doors for instantaneous information sharing, marketing, self-promotion and crowdsourcing. With the good comes the bad: social media also faces criticism and issues such as cyberbullying, invasion of privacy, online harassment and false reporting. For NLP practitioners, social media data is interesting because of its properties; it differs from traditional text data in that its texts are short. This research aims to create an easy-to-use tool for users beyond NLP practitioners, such as students, site administrators and blog owners. To make the feature extraction tool universal, the following issues were addressed: language independence, domain independence, data sparsity, informal grammar patterns and feature selection/reduction. The tool was designed and built as a set of modules, each handling one of these issues. It emphasizes extracting and building a data model that goes beyond the commonly used Bag-of-Words by utilizing the co-occurrence properties of the extracted features. Third-party evaluation surveys assessed the tool on ease of use, efficiency, accuracy, usability and completeness, and the respondents evaluated it favorably. For benchmarking and general testing, the tool was subjected to four experiments with varying domains, languages, numbers of classes and data sources. All datasets used in the experiments were gathered from past research, and the results of each experiment were benchmarked against the results reported in the corresponding previous work. Overall, the tool produced higher accuracies in each experiment than the respective reported results.
format text
author Chan, Oliver Isaac L.
title A statistical feature extraction tool for mining short text data
publisher Animo Repository
publishDate 2015
url https://animorepository.dlsu.edu.ph/etd_masteral/5049
_version_ 1800919025273798656
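
The description above mentions a data model that goes beyond Bag-of-Words by using the co-occurrence properties of extracted features, but this record does not say how the tool computes them. The following is only a minimal Python sketch of the general idea, assuming whitespace tokenization and unordered token pairs counted within each short text. The function names and the sample posts are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Assumption: lowercase and split on whitespace, which avoids
    # language-specific rules but is cruder than a real tokenizer.
    return text.lower().split()


def bag_of_words(text):
    # Plain Bag-of-Words: how often each token occurs in one text.
    return Counter(tokenize(text))


def cooccurrence_features(text):
    # Each unordered pair of distinct tokens that appear together in the
    # same short text becomes one feature (pairs are sorted alphabetically).
    vocab = sorted(set(tokenize(text)))
    return Counter(combinations(vocab, 2))


# Hypothetical short texts standing in for social media posts.
posts = ["so happy today", "not happy today", "so tired"]

bow = [bag_of_words(p) for p in posts]
pairs = [cooccurrence_features(p) for p in posts]

# Corpus-level pair counts: in how many posts each pair co-occurs.
corpus_pairs = sum(pairs, Counter())
```

In this sketch the first two posts share the Bag-of-Words tokens "happy" and "today", while the pair ("happy", "not") appears only in the second post, which illustrates the kind of signal pair features can add for short texts.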