Forum date understanding and mining

With the explosive growth of data online from terabytes to petabytes, a large amount of data is being collected and warehoused. People are drowning in data and starving for knowledge. Traditional techniques such as statistics and database systems are not...

Bibliographic Details
Main Author: Sim, Edwin Wong Loong
Other Authors: Sun Aixin
Format: Final Year Project
Language: English
Published: 2014
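Subjects: DRNTU::Engineering::Computer science and engineering::Computer systems organization::Computer system implementation
Online Access: http://hdl.handle.net/10356/59103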
Institution: Nanyang Technological University
id sg-ntu-dr.10356-59103
record_format dspace
spelling sg-ntu-dr.10356-59103 2023-03-03T20:26:49Z Forum date understanding and mining Sim, Edwin Wong Loong Sun Aixin School of Computer Engineering DRNTU::Engineering::Computer science and engineering::Computer systems organization::Computer system implementation Bachelor of Engineering (Computer Science) 2014-04-22T09:08:48Z 2014-04-22T09:08:48Z 2014 2014 Final Year Project (FYP) http://hdl.handle.net/10356/59103 en Nanyang Technological University 40 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering::Computer systems organization::Computer system implementation
description With the explosive growth of online data from terabytes to petabytes, large amounts of data are being collected and warehoused. People are drowning in data and starving for knowledge. Traditional techniques such as statistics and database systems are not suited to extracting sufficient knowledge because of the sheer volume and high dimensionality of the data. Hence, data mining techniques were developed to perform non-trivial extraction of implicit, previously unknown and potentially useful knowledge from large amounts of data. As computers become cheaper and more powerful, the analysis of huge amounts of data becomes possible. In this project, the student was given a sample dataset retrieved from a popular local forum, www.hardwarezone.com. The data was retrieved in Extensible Markup Language (XML) format, and the student had to process it to determine the communication patterns among forum users. By using text mining, a form of text analytics that involves information retrieval, the study of word frequency distributions and pattern recognition, the student was able to derive high-quality information from the text. The data was broken down into terms and documents, and the frequency of each term in each document was calculated. Using techniques such as case folding, stop word removal, lemmatization and stemming, the student cleaned the data and made it more efficient for analysis. Term frequency–inverse document frequency (tf-idf) weighting was used to normalise the term frequencies. A single-pass clustering method based on cosine similarity was then applied to the data to find the relations among terms. From the clusters formed, the student identified the occurrence of a certain event on a particular day. Through this project, the student gained a deeper understanding of how data is processed and analyzed, and a much greater interest in data mining and data processing. The techniques used and learned in this project will be very helpful for future data analysis.
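The pipeline named in the description (case folding, stop word removal, tf-idf weighting, and single-pass clustering with cosine similarity) can be illustrated with a minimal Python sketch. It is not the project's actual code. The stop-word list, the sample posts, the 0.2 threshold, the running-mean centroid update, and all function names are assumptions made for illustration. The sketch also assumes that forum posts are the units being clustered, and it omits lemmatization and stemming to stay within the standard library.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on", "for", "my"}


def tokenize(text):
    """Case-fold the text, split it into alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_pass_cluster(vectors, threshold=0.2):
    """Visit each vector once: join the most similar cluster whose centroid
    similarity reaches the threshold, otherwise start a new cluster."""
    clusters = []  # each cluster: {"centroid": dict, "members": [doc indices]}
    for index, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": dict(vec), "members": [index]})
        else:
            # Update the centroid as the running mean of its member vectors.
            k = len(best["members"])
            terms = set(best["centroid"]) | set(vec)
            best["centroid"] = {
                t: (best["centroid"].get(t, 0.0) * k + vec.get(t, 0.0)) / (k + 1)
                for t in terms
            }
            best["members"].append(index)
    return [cluster["members"] for cluster in clusters]


# Hypothetical forum posts, invented for this example.
posts = [
    "My laptop battery is draining fast",
    "Laptop battery draining after update",
    "Best chicken rice in Singapore",
    "Chicken rice stall in Singapore",
]
print(single_pass_cluster(tfidf_vectors(posts)))  # -> [[0, 1], [2, 3]]
```

Single-pass clustering depends on the order in which posts arrive and on the threshold chosen, so a different ordering or threshold can yield different clusters. Grouping posts by date before clustering would be one way to look for an event on a particular day, as the description mentions.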
author2 Sun Aixin
author_facet Sun Aixin
Sim, Edwin Wong Loong
format Final Year Project
author Sim, Edwin Wong Loong
author_sort Sim, Edwin Wong Loong
title Forum date understanding and mining
publishDate 2014
url http://hdl.handle.net/10356/59103
_version_ 1759854711709630464