Job scam detection using classification algorithms

Bibliographic Details
Main Author: Sim, Keith Shi Jie
Other Authors: Josephine Chong Leng Leng
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Online Access: https://hdl.handle.net/10356/181115
Institution: Nanyang Technological University
Description
Summary: Scams are the most common type of cybercrime in Singapore, and the majority of them are job scams. Applicant Tracking Systems (ATS) and their automation capabilities make it easy for scammers to post fraudulent job listings on online recruitment portals such as Monster, and to collect up to 1,000 resumes a day. The objective of this study is to build on the foundational knowledge established by past researchers and to identify the feature extraction techniques and classification models that are most effective at identifying fake job advertisements. The study applies modern Natural Language Processing (NLP) techniques, such as transformers and word embeddings, to the Employment Scam Aegean Dataset (EMSCAD) from the University of the Aegean and evaluates their effectiveness. The models that used these techniques achieved the highest F1 scores in the study, highlighting their effectiveness in the classification task. These results support prior research and indicate that feature selection improves performance regardless of the classification model chosen. Additionally, embedding features generally perform better than a custom ruleset of features. Although the results show that transformers and word embeddings are effective, they are subject to limitations arising from the imbalanced EMSCAD dataset and from the maximum sequence length of the transformer models used in this study. Future work in this area can therefore focus on creating a more robust, comprehensive and balanced dataset than EMSCAD, and on fine-tuning other transformer models, such as BigBird and Longformer, which can handle longer text sequences.
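The summary reports F1 scores and notes that the dataset is imbalanced. A minimal sketch of why F1 is more informative than accuracy in that setting follows. The labels are invented toy values, not EMSCAD data, and the function names are hypothetical; this is not the study's evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (fraudulent) class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no fraud detected: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up labels: 95 legitimate listings (0) and 5 fraudulent ones (1).
y_true = [0] * 95 + [1] * 5

# Baseline that labels every listing as legitimate.
always_legit = [0] * 100

# Hypothetical detector: 2 false alarms, catches 4 of the 5 fraudulent listings.
detector = [0] * 93 + [1] * 2 + [1] * 4 + [0]

print("always-legit accuracy:", accuracy(y_true, always_legit))
print("always-legit F1:      ", f1_score(y_true, always_legit))
print("detector accuracy:    ", accuracy(y_true, detector))
print("detector F1:          ", round(f1_score(y_true, detector), 4))
```

On these toy labels the always-legitimate baseline scores high accuracy while its F1 for the fraudulent class is zero, which is why a class-sensitive metric such as F1 is a reasonable choice when fraudulent listings are rare.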