Clustering together with learning representations

Document clustering is a useful and practical machine learning methodology, with various real-world applications, such as search optimization, document recommendation, and tag generation of papers and records. It realizes the process of arranging a batch of pdf documents into many separate subgroups...

Full description

Saved in:
Bibliographic Details
Main Author: Yu, Shuaiqi
Other Authors: Lihui Chen
Format: Final Year Project
Language:English
Published: Nanyang Technological University 2022
Subjects:
Online Access:https://hdl.handle.net/10356/158048
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Nanyang Technological University
Language: English
Description
Summary:Document clustering is a useful and practical machine learning methodology, with various real-world applications, such as search optimization, document recommendation, and tag generation of papers and records. It realizes the process of arranging a batch of pdf documents into many separate subgroups. To achieve more efficient clustering, we introduce representation learning, which is an unsupervised learning approach that self-studies the features from unlabeled data. In this project, we aim at implementing and studying a series of representation learning methods which are more suitable for clustering tasks on web documents such as Reuters-10k dataset. Specifically, the deep fuzzy clustering GrDNFCS has been implemented and explored to reproduce automatically categorize web documents reported in the paper. A new approach named CLDFC, where a contrastive loss is introduced into GrDNFCS is proposed and designed to improve accuracy of clustering. Based on our preliminary study, CLDEC shows 2.5% improvement in accuracy and reduce time complexity of average 60s per epoch compared with GrDNFCS. Experiments on several other clustering models will be included for comparisons.