Term co-occurrence evolution study


Bibliographic Details
Main Author: Tan, Bernard Mao Sheng
Other Authors: Sun Aixin
Format: Final Year Project
Language: English
Published: 2014
Online Access: http://hdl.handle.net/10356/58954
Institution: Nanyang Technological University
Description
Summary: Large volumes of data are created continuously and stored in raw form. In this project, we introduce a prototype application that uses a series of algorithms to convert this raw data into a form suitable for study. The project focuses on how the co-occurrence of terms evolves over time. To implement the application, research was carried out to identify ways of transforming the raw data into forms that are easier to manipulate. Various API libraries are used, covering natural language processing, multi-threading and data indexing. Because the project focuses on studying term co-occurrence evolution, the prototype is designed with a graphical user interface and with real-time performance in mind. The application lets users run analyses interactively, and each analysis completes within seconds. Results are displayed in two forms: a line chart and a detailed table. Users can manipulate the line chart directly by dynamically selecting the co-occurring terms they are interested in. To make the analysis results clearer, the application includes ranking algorithms that rank the resulting terms by their interestingness. By default, when an analysis completes, the application ranks the terms, plots the top 5 most interesting terms on the line chart and sorts the entries in the detailed table. Because the application handles large volumes of data, it needs to be optimised and fast. To this end, the data is preprocessed and multi-threading is added to the analysis process, using the system’s computing power to speed up the analysis. Although the objective of identifying interesting co-occurring terms is achieved, improvements and additional features could extend the application’s potential. Recommendations include better multi-threading logic and better ranking algorithms.
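
The summary does not specify how co-occurrence is counted or how interestingness is scored, so the following Python sketch is only one possible reading of the idea, not the project's implementation. It assumes documents arrive as (period, text) pairs, treats two terms as co-occurring when they appear in the same document, tokenises by whitespace, and uses the range of a term's counts across periods as a stand-in interestingness score. The function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def cooccurrence_by_period(docs, target):
    """Count, per period, the documents in which each term appears with `target`.

    docs: iterable of (period, text) pairs (assumed input shape).
    Returns {term: Counter({period: document_count})}.
    """
    series = defaultdict(Counter)
    for period, text in docs:
        terms = set(text.lower().split())  # naive whitespace tokenisation (assumption)
        if target in terms:
            for term in terms - {target}:
                series[term][period] += 1
    return series


def rank_by_variation(series, periods, top_n=5):
    """Rank terms by how much their co-occurrence count varies across periods.

    The score (max count minus min count) is a hypothetical stand-in for
    the "interestingness" ranking mentioned in the summary.
    """
    def score(term):
        counts = [series[term][p] for p in periods]  # missing periods count as 0
        return max(counts) - min(counts)

    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(series, key=lambda t: (-score(t), t))[:top_n]


# Hypothetical sample data: five short documents over two years.
docs = [
    ("2012", "apple pie recipe"),
    ("2012", "apple pie baking"),
    ("2013", "apple phone launch"),
    ("2013", "apple phone review"),
    ("2013", "apple phone sales"),
]
series = cooccurrence_by_period(docs, "apple")
top_terms = rank_by_variation(series, ["2012", "2013"], top_n=2)
```

Each entry of `series` is the data behind one line of a chart such as the one described above, and `top_terms` plays the role of the default selection of terms to plot. A real system would likely replace the whitespace split with a proper natural language processing tokeniser and read the counts from a prebuilt index rather than raw text.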