Cost-Sensitive Online Active Learning with application to malicious URL detection
Malicious Uniform Resource Locator (URL) detection is an important problem in web search and mining, and it plays a critical role in internet security. In the literature, many existing studies have formulated the problem as a regular supervised binary classification task, which typically aims...
Main Authors: | ZHAO, Peilin; HOI, Steven C. H. |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2013 |
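Subjects: | Active learning; Cost-sensitive learning; Malicious URL detection; Online learning; Computer Sciences; Databases and Information Systems; Information Security |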
Online Access: | https://ink.library.smu.edu.sg/sis_research/2324 https://ink.library.smu.edu.sg/context/sis_research/article/3324/viewcontent/Cost_SensitiveOnlineActiveLearningwithApplicationMalicio.pdf |
Institution: | Singapore Management University |
id |
sg-smu-ink.sis_research-3324 |
---|---|
record_format |
dspace |
spelling |
sg-smu-ink.sis_research-3324 2020-04-01T02:55:17Z Cost-Sensitive Online Active Learning with application to malicious URL detection ZHAO, Peilin HOI, Steven C. H. Malicious Uniform Resource Locator (URL) detection is an important problem in web search and mining, and it plays a critical role in internet security. In the literature, many existing studies have formulated the problem as a regular supervised binary classification task, which typically aims to optimize prediction accuracy. However, in a real-world malicious URL detection task, the ratio between the number of malicious URLs and legitimate URLs is highly imbalanced, which makes simply optimizing prediction accuracy inappropriate. Another key limitation of existing work is the assumption that a large amount of training data is available, which is impractical because human labeling can be quite expensive. To address these issues, in this paper we present a novel framework of Cost-Sensitive Online Active Learning (CSOAL), which queries only a small fraction of the training data for labeling and directly optimizes two cost-sensitive measures to address the class-imbalance issue. In particular, we propose two CSOAL algorithms and analyze their theoretical performance in terms of cost-sensitive bounds. We conduct an extensive set of experiments to examine the empirical performance of the proposed algorithms on a large-scale, challenging malicious URL detection task. The encouraging results show that the proposed technique, by querying an extremely small amount of labeled data (about 0.5% of 1 million instances), can achieve better or highly comparable classification performance compared with state-of-the-art cost-insensitive and cost-sensitive online classification algorithms that use a huge amount of labeled data.
2013-08-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/2324 info:doi/10.1145/2487575.2487647 https://ink.library.smu.edu.sg/context/sis_research/article/3324/viewcontent/Cost_SensitiveOnlineActiveLearningwithApplicationMalicio.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Active learning Cost-sensitive learning Malicious URL detection Online learning Computer Sciences Databases and Information Systems Information Security |
institution |
Singapore Management University |
building |
SMU Libraries |
continent |
Asia |
country |
Singapore |
content_provider |
SMU Libraries |
collection |
InK@SMU |
language |
English |
topic |
Active learning Cost-sensitive learning Malicious URL detection Online learning Computer Sciences Databases and Information Systems Information Security |
spellingShingle |
Active learning Cost-sensitive learning Malicious URL detection Online learning Computer Sciences Databases and Information Systems Information Security ZHAO, Peilin HOI, Steven C. H. Cost-Sensitive Online Active Learning with application to malicious URL detection |
description |
Malicious Uniform Resource Locator (URL) detection is an important problem in web search and mining, and it plays a critical role in internet security. In the literature, many existing studies have formulated the problem as a regular supervised binary classification task, which typically aims to optimize prediction accuracy. However, in a real-world malicious URL detection task, the ratio between the number of malicious URLs and legitimate URLs is highly imbalanced, which makes simply optimizing prediction accuracy inappropriate. Another key limitation of existing work is the assumption that a large amount of training data is available, which is impractical because human labeling can be quite expensive. To address these issues, in this paper we present a novel framework of Cost-Sensitive Online Active Learning (CSOAL), which queries only a small fraction of the training data for labeling and directly optimizes two cost-sensitive measures to address the class-imbalance issue. In particular, we propose two CSOAL algorithms and analyze their theoretical performance in terms of cost-sensitive bounds. We conduct an extensive set of experiments to examine the empirical performance of the proposed algorithms on a large-scale, challenging malicious URL detection task. The encouraging results show that the proposed technique, by querying an extremely small amount of labeled data (about 0.5% of 1 million instances), can achieve better or highly comparable classification performance compared with state-of-the-art cost-insensitive and cost-sensitive online classification algorithms that use a huge amount of labeled data. |
format |
text |
author |
ZHAO, Peilin HOI, Steven C. H. |
author_facet |
ZHAO, Peilin HOI, Steven C. H. |
author_sort |
ZHAO, Peilin |
title |
Cost-Sensitive Online Active Learning with application to malicious URL detection |
title_short |
Cost-Sensitive Online Active Learning with application to malicious URL detection |
title_full |
Cost-Sensitive Online Active Learning with application to malicious URL detection |
title_fullStr |
Cost-Sensitive Online Active Learning with application to malicious URL detection |
title_full_unstemmed |
Cost-Sensitive Online Active Learning with application to malicious URL detection |
title_sort |
cost-sensitive online active learning with application to malicious url detection |
publisher |
Institutional Knowledge at Singapore Management University |
publishDate |
2013 |
url |
https://ink.library.smu.edu.sg/sis_research/2324 https://ink.library.smu.edu.sg/context/sis_research/article/3324/viewcontent/Cost_SensitiveOnlineActiveLearningwithApplicationMalicio.pdf |
_version_ |
1770572098643689472 |
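
The description field of this record outlines an online learner that requests labels for only a small fraction of incoming instances and weights the two classes differently to cope with imbalance. The following is a minimal sketch of that general idea, not the authors' CSOAL algorithms: the function name `csoal_sketch`, the parameters `rho`, `delta`, and `lr`, the margin-based query probability, and the class-weighted hinge update are all illustrative assumptions, and the synthetic data stands in for real URL features.

```python
import random

def csoal_sketch(stream, oracle, dim, rho=1.0, delta=1.0, lr=0.2, seed=0):
    """Illustrative cost-sensitive online active learner (assumed design).

    stream : sequence of feature lists, each of length `dim`
    oracle : callable t -> label in {+1, -1}; called only on queried rounds
    rho    : hypothetical extra weight on the positive (malicious) class
    delta  : must be > 0; larger values request labels more often
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    predictions, queried = [], 0
    for t, x in enumerate(stream):
        p = sum(wi * xi for wi, xi in zip(w, x))  # signed margin
        predictions.append(1 if p >= 0 else -1)
        # Ask for a label with probability delta / (delta + |p|),
        # so low-margin (uncertain) instances are labeled more often.
        if rng.random() < delta / (delta + abs(p)):
            queried += 1
            y = oracle(t)
            weight = rho if y == 1 else 1.0
            # Gradient step on the class-weighted hinge loss
            # weight * max(0, 1 - y * p); no update when the margin is met.
            if y * p < 1.0:
                step = lr * weight * y
                w = [wi + step * xi for wi, xi in zip(w, x)]
    return w, predictions, queried

# Synthetic, imbalanced stream: roughly 10% positives (assumed toy data).
data_rng = random.Random(1)
labels = [1 if data_rng.random() < 0.1 else -1 for _ in range(500)]
features = [[1.0, y + data_rng.gauss(0, 0.5)] for y in labels]  # bias + noisy signal

w, preds, n_queried = csoal_sketch(
    features, lambda t: labels[t], dim=2, rho=5.0
)
print(f"queried {n_queried} of {len(features)} labels")
```

Raising `rho` makes mistakes on the rare positive class cost more per update, while lowering `delta` reduces how many labels are requested; both knobs are assumptions of this sketch, and the record itself does not specify how the paper sets them.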