Theme identification using machine learning techniques
Main Authors: | , |
---|---|
Format: | Article |
Language: | English |
Published: | ASASI 2021 |
Subjects: | |
Online Access: | http://irep.iium.edu.my/104247/2/104247_Theme%20identification.pdf http://irep.iium.edu.my/104247/ https://asasijournal.id/index.php/jiae/article/view/24 https://doi.org/10.51662/jiae.v1i2.24 |
Institution: | Universiti Islam Antarabangsa Malaysia |
Summary: | With the abundance of online research platforms, much information presented in PDF files, such as articles and journals, can be obtained easily. As a result, students completing research projects often have many downloaded PDF articles on their laptops. However, identifying the target articles manually within the collection can be tiring, as most articles consist of several pages that need to be analyzed. Reading each article to determine whether it relates to a theme, and then organizing the articles by theme, is time- and energy-consuming. To address this problem, a PDF file organizer that implements a theme identifier is needed. This work therefore focuses on automatic text classification using machine learning methods to build a theme identifier, employed in the PDF file organizer, that classifies articles into two themes: augmented reality and machine learning. A total of 1000 text documents covering both themes were used to build the classification model. A preprocessing step for data cleaning was performed, followed by TF-IDF feature extraction for text vectorization and to reduce sparse vectors. 80% of the dataset was used for training, and the remaining 20% was used to validate the trained models. The classification models proposed in this work are Linear SVM and Multinomial Naïve Bayes. The accuracy of the models was evaluated using a confusion matrix. For the Linear SVM model, grid-search optimization was performed to determine the optimal value of the Cost parameter. |
---|
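
The pipeline described in the summary (cleaning, TF-IDF vectorization, an 80/20 split, a Multinomial Naïve Bayes classifier, and a confusion matrix) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not say which tools they used, so the sketch relies only on the Python standard library. The toy corpus is invented for illustration, and every function and variable name (`tokenize`, `fit_idf`, `tfidf`, `MultinomialNB`, `confusion_matrix`) is hypothetical. The Linear SVM model and the grid search over its Cost parameter are not shown here.

```python
import math
import random
from collections import Counter


def tokenize(text):
    """Minimal cleaning: lowercase and keep alphabetic tokens only."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return cleaned.split()


def fit_idf(token_lists):
    """Smoothed inverse document frequency from the training documents."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector (dict); terms unseen during training are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


class MultinomialNB:
    """Multinomial Naive Bayes over non-negative feature weights."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed value)

    def fit(self, vectors, labels, vocab):
        self.classes = sorted(set(labels))
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(labels))
            totals = Counter()
            for v in rows:
                totals.update(v)  # adds the weights term by term
            denom = sum(totals.values()) + self.alpha * len(vocab)
            self.log_like[c] = {
                t: math.log((totals[t] + self.alpha) / denom) for t in vocab
            }
        return self

    def predict(self, vector):
        def score(c):
            return self.log_prior[c] + sum(
                w * self.log_like[c][t] for t, w in vector.items()
            )
        return max(self.classes, key=score)


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Toy corpus, invented for illustration (not the paper's 1000 documents).
AR, ML = "augmented reality", "machine learning"
corpus = [
    ("Augmented reality headset overlays virtual objects on the scene", AR),
    ("Camera pose tracking for augmented reality markers", AR),
    ("Mobile augmented reality display and rendering", AR),
    ("Virtual overlay registration in augmented scenes", AR),
    ("Augmented reality headset calibration and tracking", AR),
    ("Machine learning model training with gradient descent", ML),
    ("Supervised learning classifier and neural network", ML),
    ("Training data and features for machine learning", ML),
    ("Neural network model evaluation and accuracy", ML),
    ("Machine learning classifier with feature selection", ML),
]

# 80/20 split after a seeded shuffle, mirroring the split in the summary.
random.Random(0).shuffle(corpus)
split = int(0.8 * len(corpus))
train, test = corpus[:split], corpus[split:]

# Fit IDF on the training documents only, then train the classifier.
train_tokens = [tokenize(text) for text, _ in train]
idf = fit_idf(train_tokens)
model = MultinomialNB().fit(
    [tfidf(tokens, idf) for tokens in train_tokens],
    [label for _, label in train],
    list(idf),
)

# Evaluate on the held-out 20% with a confusion matrix.
y_true = [label for _, label in test]
y_pred = [model.predict(tfidf(tokenize(text), idf)) for text, _ in test]
matrix = confusion_matrix(y_true, y_pred, [AR, ML])
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / len(test)
print("confusion matrix:", matrix, "accuracy:", accuracy)
```

The IDF weights are fitted on the training split only, so no information from the held-out documents leaks into the features. A real run would extract text from the PDF files first and would likely add stop-word removal, which this sketch omits.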