Cost-Sensitive Online Active Learning with application to malicious URL detection

Malicious Uniform Resource Locator (URL) detection is an important problem in web search and mining and plays a critical role in internet security. Many existing studies formulate the problem as a regular supervised binary classification task, which typically aims...

Bibliographic Details
Main Authors: ZHAO, Peilin, HOI, Steven C. H.
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2013
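Subjects: Active learning; Cost-sensitive learning; Malicious URL detection; Online learning; Computer Sciences; Databases and Information Systems; Information Security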
Online Access: https://ink.library.smu.edu.sg/sis_research/2324
https://ink.library.smu.edu.sg/context/sis_research/article/3324/viewcontent/Cost_SensitiveOnlineActiveLearningwithApplicationMalicio.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-3324
record_format dspace
spelling sg-smu-ink.sis_research-3324 2020-04-01T02:55:17Z
2013-08-01T07:00:00Z
text application/pdf
https://ink.library.smu.edu.sg/sis_research/2324
info:doi/10.1145/2487575.2487647
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
Institutional Knowledge at Singapore Management University
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Active learning
Cost-sensitive learning
Malicious URL detection
Online learning
Computer Sciences
Databases and Information Systems
Information Security
description Malicious Uniform Resource Locator (URL) detection is an important problem in web search and mining and plays a critical role in internet security. Many existing studies formulate the problem as a regular supervised binary classification task, which typically aims to optimize prediction accuracy. However, in real-world malicious URL detection the ratio of malicious to legitimate URLs is highly imbalanced, so simply optimizing prediction accuracy is inappropriate. Another key limitation of existing work is the assumption that a large amount of training data is available, which is impractical because human labeling can be expensive. To address these issues, we present a novel framework of Cost-Sensitive Online Active Learning (CSOAL), which queries only a small fraction of the training data for labeling and directly optimizes two cost-sensitive measures to address the class-imbalance issue. In particular, we propose two CSOAL algorithms and analyze their theoretical performance in terms of cost-sensitive bounds. We conduct an extensive set of experiments to examine the empirical performance of the proposed algorithms on a large-scale, challenging malicious URL detection task. The encouraging results show that the proposed technique, querying an extremely small amount of labeled data (about 0.5% of 1 million instances), can achieve better or highly comparable classification performance compared with state-of-the-art cost-insensitive and cost-sensitive online classification algorithms that use a huge amount of labeled data.
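The two ideas in the description (accuracy is misleading under class imbalance, and labels can be requested selectively) can be illustrated with a minimal Python sketch. This is not the CSOAL algorithms from the paper: the function names, the 10-in-1000 class ratio, the query rule delta / (delta + |score|), the cost values, and the perceptron-style update are all assumptions chosen for illustration.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def weighted_sum(y_true, y_pred, eta_p=0.5):
    """Weighted sum of sensitivity and specificity for labels in {+1, -1}."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    return eta_p * tp / pos + (1 - eta_p) * tn / neg


def active_cost_sensitive_pass(stream, dim, delta=1.0,
                               cost_pos=0.9, cost_neg=0.1, seed=0):
    """One pass of a hypothetical margin-based active learner.

    The label is consulted only when the learner decides to query it,
    and updates are scaled by an assumed per-class cost.
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    queried = 0
    preds = []
    for x, y in stream:
        score = sum(wi * xi for wi, xi in zip(w, x))
        preds.append(1 if score >= 0 else -1)
        # Query more often when the score is close to the decision boundary.
        if rng.random() < delta / (delta + abs(score)):
            queried += 1
            cost = cost_pos if y == 1 else cost_neg
            # Update only when the cost-scaled margin is violated.
            if y * score < cost:
                w = [wi + cost * y * xi for wi, xi in zip(w, x)]
    return w, queried, preds


# Hypothetical imbalanced data: 10 malicious (+1) among 1000 URLs.
y_true = [1] * 10 + [-1] * 990
always_benign = [-1] * 1000
print(accuracy(y_true, always_benign))      # 0.99
print(weighted_sum(y_true, always_benign))  # 0.5
```

A classifier that labels every URL as benign scores 99% accuracy here yet detects no malicious URL, which the weighted sum of sensitivity and specificity exposes. The learner returns the number of labels it requested, so the labeling budget can be compared against the stream length.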
format text
author ZHAO, Peilin
HOI, Steven C. H.
title Cost-Sensitive Online Active Learning with application to malicious URL detection
publisher Institutional Knowledge at Singapore Management University
publishDate 2013
url https://ink.library.smu.edu.sg/sis_research/2324
https://ink.library.smu.edu.sg/context/sis_research/article/3324/viewcontent/Cost_SensitiveOnlineActiveLearningwithApplicationMalicio.pdf
_version_ 1770572098643689472