Job scam detection using classification algorithms
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/181115 |
Institution: | Nanyang Technological University |
Summary: | Scams are the most common type of cybercrime in Singapore, and the majority of them are job scams. Applicant Tracking Systems (ATS) and their automation capabilities make it easy for scammers to post fraudulent job listings on online recruitment portals such as Monster, and to collect up to 1000 resumes a day. The objective of this study is to build on the findings of past researchers and identify the feature extraction techniques and classification models that are most effective at identifying fake job advertisements. This study applies modern Natural Language Processing (NLP) techniques, such as transformers and word embeddings, to the Employment Scam Aegean Dataset (EMSCAD) from the University of the Aegean to study their effectiveness. The models that used these techniques achieved the highest F1 scores in the study, highlighting their effectiveness in the classification task. These results support prior research and show that feature selection improves performance regardless of the classification model chosen. Additionally, embedding features generally perform better than a custom ruleset of features. Although these results show that transformers and word embeddings are effective, they are subject to limitations arising from the imbalanced EMSCAD dataset and the maximum sequence length of the transformer models used in this study. Hence, future work in this area can focus on creating a more robust, comprehensive and balanced dataset than EMSCAD, and on fine-tuning other transformer models, such as BigBird and Longformer, that can handle longer text sequences. |
---|
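
The summary reports F1 scores and notes that EMSCAD is imbalanced. As a minimal sketch of why F1 is more informative than accuracy on such data, the Python below compares the two metrics for a classifier that labels every listing as legitimate and for one that actually flags some fraudulent listings. The label counts, the predictions, and the helper names `precision_recall_f1` and `accuracy` are illustrative assumptions, not data or code from the study.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical imbalanced sample: 95 legitimate (0) and 5 fraudulent (1) listings.
y_true = [0] * 95 + [1] * 5

# Baseline that never flags fraud.
always_legit = [0] * 100

# Hypothetical detector: 2 false alarms, 4 of the 5 frauds caught, 1 missed.
detector = [0] * 93 + [1] * 2 + [1] * 4 + [0] * 1

for name, y_pred in [("always_legit", always_legit), ("detector", detector)]:
    _, _, f1 = precision_recall_f1(y_true, y_pred)
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.2f}, F1={f1:.2f}")
```

In this made-up sample the baseline reaches high accuracy while its F1 on the fraudulent class is zero, which is why a metric focused on the minority class is the more useful one to compare models on.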