Forum data understanding and mining
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | 2014 |
Subjects: | |
Online Access: | http://hdl.handle.net/10356/59103 |
Institution: | Nanyang Technological University |
Summary: | With the explosive growth of online data from terabytes to petabytes, a large amount
of data is being collected and warehoused. People are drowning in data and starving
for knowledge. Traditional techniques such as statistics and database systems are
not suited to extracting sufficient knowledge because of the sheer volume and high
dimensionality of the data. Hence, data mining techniques were developed to perform
non-trivial extraction of implicit, previously unknown and potentially useful knowledge
from large amounts of data. As computers get cheaper and more powerful, the
analysis of huge amounts of data has become possible.
In this project, the student was given a sample dataset retrieved from a
popular local forum, www.hardwarezone.com. The data was retrieved in Extensible
Markup Language (XML) format, and the student had to process it to
determine the communication patterns among forum users. By using text mining,
a form of text analytics that involves information retrieval, the study of word
frequency distributions and pattern recognition, the student was able to derive
high-quality information from the text.
The data was broken down into terms and documents, and the frequency of each
term was calculated for each document. Using techniques such as case folding,
stop word removal, lemmatization and stemming, the student was able to clean the
data and make it more efficient to analyse. A weighting scheme, term
frequency–inverse document frequency (tf-idf), was used to normalise the frequency
of each term.
A single-pass clustering method using cosine similarity was applied to the data to find
the relations between terms. From the clusters formed, the student identified the
occurrence of a certain event on a particular day.
From this project, the student gained a deeper understanding of how data is
processed and analysed. It also greatly increased the student's interest in data
mining and data processing. The techniques used and learned in this project will
be very helpful for future data analysis. |
---|
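
The pipeline outlined in the summary (case folding, stop word removal, tf-idf weighting, then single-pass clustering with cosine similarity) can be illustrated with a minimal Python sketch. It is not the project's actual code. The stop-word list, the sample posts and the similarity threshold are all hypothetical, lemmatization and stemming are left out to keep the example dependency-free, and the tf-idf variant shown (relative term frequency times ln(N/df)) is one common form that the summary does not confirm. The sketch also clusters posts rather than terms, which is an assumption; the same routine could be applied to term vectors.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, far smaller than a real one.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "for", "my"}


def tokenize(text):
    """Case-fold, keep alphabetic tokens only, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass_cluster(vectors, threshold):
    """Visit each vector once: join the most similar cluster whose
    similarity reaches the threshold, otherwise start a new cluster.

    A cluster is represented by the sum of its members' vectors; cosine
    similarity ignores scale, so the sum acts like a centroid.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": dict(vec), "members": [i]})
        else:
            best["members"].append(i)
            for term, weight in vec.items():
                best["centroid"][term] = best["centroid"].get(term, 0.0) + weight
    return [c["members"] for c in clusters]


# Hypothetical forum posts, not taken from the project's dataset.
posts = [
    "New graphics card launch today",
    "Graphics card prices after the launch",
    "Best laptop for students",
    "Laptop battery problems",
]
clusters = single_pass_cluster(tfidf_vectors(posts), threshold=0.1)
print(clusters)  # [[0, 1], [2, 3]]
```

The result of single-pass clustering depends on both the threshold and the order in which items are visited. With these sample posts, raising the threshold to 0.2 leaves the two laptop posts in separate clusters, because their cosine similarity is only about 0.11.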