Bursty feature representation for clustering text streams

Text representation plays a crucial role in classical text mining, where the primary focus was on static text. Nevertheless, well-studied static text representations including TFIDF are not optimized for non-stationary streams of information such as news, discussion board messages, and blogs. We the...

Full description

Saved in:
Bibliographic Details
Main Authors: HE, Qi, CHANG, Kuiyu, LIM, Ee Peng, ZHANG, Jun
Format: text
Language:English
Published: Institutional Knowledge at Singapore Management University 2007
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/1273
http://doi.org/10.1137/1.9781611972771.50
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Singapore Management University
Language: English
Description
Summary:Text representation plays a crucial role in classical text mining, where the primary focus was on static text. Nevertheless, well-studied static text representations including TFIDF are not optimized for non-stationary streams of information such as news, discussion board messages, and blogs. We therefore introduce a new temporal representation for text streams based on bursty features. Our bursty text representation differs significantly from traditional schemes in that it 1) dynamically represents documents over time, 2) amplifies a feature in proportional to its burstiness at any point in time, and 3) is topic independent. Our bursty text representation model was evaluated against a classical bagof-words text representation on the task of clustering TDT3 topical text streams. It was shown to consistently yield more cohesive clusters in terms of cluster purity and cluster/class entropies. This new temporal bursty text representation can be extended to most text mining tasks involving a temporal dimension, such as modeling of online blog pages.