Automatic topic detection of documents


Bibliographic Details
Main Author: Chia, Darren Kok Seng
Other Authors: Mao Kezhi
Format: Final Year Project
Language: English
Published: 2019
Online Access: http://hdl.handle.net/10356/78439
Description
Summary: The increasing volume of documents uploaded to the internet every day makes it difficult for users to find relevant articles on specific topics. This motivates the development of a model for “Automatic Topic Detection of documents” using natural language processing. This report gives a brief literature review of representation learning, covering text representation learning and data pre-processing techniques, and of pattern recognition, covering three supervised pattern classification techniques: Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB) and K Nearest Neighbour (KNN). The research team built an automatic topic detection model using the Python machine learning library scikit-learn. The experiments used the “20 newsgroups” dataset, which consists of nearly 20,000 documents divided into 20 topics. Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) representations were used together with several pre-processing techniques and the three classifiers named above. Preliminary results for each classifier, and a comparison among them, show that SVM is the best of the three. The team also analyzed how parameter tuning and data pre-processing affect classification accuracy.
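
The representation and classification steps named in the summary can be illustrated with a minimal sketch. It is not the report's code: the report used scikit-learn, whereas this version uses only the Python standard library so that the weighting is visible. The six training sentences, the stop-word list, the IDF variant ln(N/df) + 1, the cosine-similarity KNN with k = 3, and all function names are assumptions made for illustration. The two label strings are borrowed from the 20 newsgroups category names.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real pre-processing step would use a fuller one.
STOP_WORDS = {"the", "a", "and", "on", "of", "to", "in", "every"}

def tokenize(text):
    """Lowercase, keep alphabetic runs, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def bag_of_words(docs):
    """Bag of Words: one term-count mapping per document."""
    return [Counter(tokenize(d)) for d in docs]

def idf_weights(bows):
    """Inverse document frequency, using the assumed variant ln(N / df) + 1."""
    n = len(bows)
    df = Counter()
    for bow in bows:
        df.update(bow.keys())  # each document counts once per term
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def tfidf(bow, idf):
    """TF-IDF vector scaled to unit length; terms unseen in training are dropped."""
    vec = {t: c * idf[t] for t, c in bow.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def knn_predict(query_vec, train_vecs, train_labels, k=3):
    """Majority vote among the k training documents most similar to the query."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(query_vec, train_vecs[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy stand-in corpus (invented sentences, not taken from the dataset).
train_docs = [
    "The rocket launch reached orbit around the moon",
    "NASA plans a new satellite and space telescope mission",
    "Astronauts on the space station observed the orbit of the satellite",
    "The team scored a late goal to win the hockey game",
    "The goalie stopped every shot in the playoff game",
    "Hockey players skate fast and score on the ice",
]
train_labels = ["sci.space"] * 3 + ["rec.sport.hockey"] * 3

train_bows = bag_of_words(train_docs)
idf = idf_weights(train_bows)
train_vecs = [tfidf(bow, idf) for bow in train_bows]

def classify(text, k=3):
    """Predict a topic label for one new document."""
    query_vec = tfidf(bag_of_words([text])[0], idf)
    return knn_predict(query_vec, train_vecs, train_labels, k)

print(classify("The satellite entered orbit after the launch"))
print(classify("A fast hockey game with a late goal"))
```

On this toy corpus the first query shares vocabulary only with the space sentences and the second only with the hockey sentences, so the vote is unanimous in each case. On the full dataset the same steps would normally be delegated to a library, and the choice of classifier, k, and pre-processing would need the kind of tuning the summary describes.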