Visualization and analysis of document clusters produced by self-organizing maps

Bibliographic Details
Main Author: Landrito, Maynard R.
Format: text
Language: English
Published: Animo Repository 2013
Online Access: https://animorepository.dlsu.edu.ph/etd_masteral/4372
Institution: De La Salle University
Description
Summary: Information overload from the huge number of available text documents makes them increasingly difficult to organize and analyze. To alleviate this problem, text document clustering is used to automatically group related documents together. However, documents usually produce very high-dimensional data, making them resource-intensive to process. The Random Projection Method (RPM) is shown to reduce the dimensionality of a large document dataset. This dimensionality reduction scheme is then coupled with Self-Organizing Maps (SOM) to organize the documents in the dataset. K-Means clustering is then performed on the SOM units to produce clusters of the documents organized within the SOM. Various SOM-based properties are introduced, along with a method to measure and visualize them. These allowed for detailed analysis of the clusters and aided in finding outliers in the dataset, overlap between clusters, the concentration of documents within clusters, possible subclusters, and the quality of different parts of clusters, among others. Cross-referencing between different property visualizations provided internal validation of the observations. For future work, the different SOM-based properties and their visualizations can be used for interactive document selection, recommendation systems, and quality measurement.
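
The pipeline described in the summary (random projection, then a SOM, then K-Means on the SOM units) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation: the toy data, the Gaussian projection matrix, the 4x4 grid, the learning-rate and radius schedules, and all function names are assumptions chosen for brevity.

```python
import math
import random

def random_projection(docs, out_dim, rng):
    """Project each document vector to out_dim dimensions with a random Gaussian matrix."""
    in_dim = len(docs[0])
    scale = out_dim ** -0.5  # assumed scaling; keeps distances roughly comparable
    R = [[rng.gauss(0.0, 1.0) * scale for _ in range(out_dim)] for _ in range(in_dim)]
    return [[sum(x * R[i][j] for i, x in enumerate(d)) for j in range(out_dim)]
            for d in docs]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, candidates):
    """Index of the candidate vector closest to vec."""
    return min(range(len(candidates)), key=lambda i: sq_dist(vec, candidates[i]))

def train_som(data, rows, cols, epochs, rng):
    """Train a rows x cols SOM; returns a flat list of unit weight vectors."""
    dim = len(data[0])
    units = [list(rng.choice(data)) for _ in range(rows * cols)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = 0.5 * frac                            # assumed linear decay
        radius = max(rows, cols) / 2.0 * frac + 0.5
        for vec in rng.sample(data, len(data)):    # shuffled pass over the data
            br, bc = divmod(nearest(vec, units), cols)  # best-matching unit
            for u, w in enumerate(units):
                ur, uc = divmod(u, cols)
                grid_d2 = (ur - br) ** 2 + (uc - bc) ** 2
                h = math.exp(-grid_d2 / (2 * radius ** 2))  # neighborhood weight
                for k in range(dim):
                    w[k] += lr * h * (vec[k] - w[k])
    return units

def kmeans(points, k, iters, rng):
    """Plain K-Means; returns one cluster label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p, centers) for p in points]

rng = random.Random(0)
# Toy stand-in for term vectors: 40 documents, 50 terms, two rough topics.
docs = [[rng.random() + (3.0 if (t < 25) == (topic == 0) else 0.0) for t in range(50)]
        for topic in (0, 1) for _ in range(20)]

reduced = random_projection(docs, 10, rng)     # 50 -> 10 dimensions
units = train_som(reduced, 4, 4, 20, rng)      # 16 SOM units
unit_labels = kmeans(units, 2, 10, rng)        # cluster the units, not the documents
doc_clusters = [unit_labels[nearest(d, units)] for d in reduced]
```

Each document inherits the cluster of its best-matching SOM unit, so K-Means runs on 16 unit vectors rather than on every document, which is the main reason for clustering the units instead of the raw data.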