URLNet: Learning a URL representation with deep learning for malicious URL detection
Malicious URLs host unsolicited content and are used to perpetrate cybercrimes. It is imperative to detect them in a timely manner. Traditionally, this is done through the usage of blacklists, which cannot be exhaustive, and cannot detect newly generated malicious URLs. To address this, recent years...
Main Authors: | LE, Hung; PHAM, Hong Quang; SAHOO, Doyen; HOI, Steven C. H. |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2018 |
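Subjects: | URLNet; Malicious URL Detection; Deep Learning; Databases and Information Systems; Information Security |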
Online Access: | https://ink.library.smu.edu.sg/sis_research/4135 https://ink.library.smu.edu.sg/context/sis_research/article/5138/viewcontent/UrlNet_2018_wp.pdf |
Institution: | Singapore Management University |
id |
sg-smu-ink.sis_research-5138 |
---|---|
record_format |
dspace |
spelling |
sg-smu-ink.sis_research-5138 2020-06-09T03:14:49Z URLNet: Learning a URL representation with deep learning for malicious URL detection LE, Hung PHAM, Hong Quang SAHOO, Doyen HOI, Steven C. H. Malicious URLs host unsolicited content and are used to perpetrate cybercrimes. It is imperative to detect them in a timely manner. Traditionally, this is done through the usage of blacklists, which cannot be exhaustive, and cannot detect newly generated malicious URLs. To address this, recent years have witnessed several efforts to perform Malicious URL Detection using Machine Learning. The most popular and scalable approaches use lexical properties of the URL string by extracting Bag-of-words like features, followed by applying machine learning models such as SVMs. There are also other features designed by experts to improve the prediction performance of the model. These approaches suffer from several limitations: (i) Inability to effectively capture semantic meaning and sequential patterns in URL strings; (ii) Requiring substantial manual feature engineering; and (iii) Inability to handle unseen features and generalize to test data. To address these challenges, we propose URLNet, an end-to-end deep learning framework to learn a nonlinear URL embedding for Malicious URL Detection directly from the URL. Specifically, we apply Convolutional Neural Networks to both characters and words of the URL String to learn the URL embedding in a jointly optimized framework. This approach allows the model to capture several types of semantic information, which was not possible by the existing models. We also propose advanced word-embeddings to solve the problem of too many rare words observed in this task. We conduct extensive experiments on a large-scale dataset and show a significant performance gain over existing methods. We also conduct ablation studies to evaluate the performance of various components of URLNet.
2018-03-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/4135 https://ink.library.smu.edu.sg/context/sis_research/article/5138/viewcontent/UrlNet_2018_wp.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University URLNet Malicious URL Detection Deep Learning Databases and Information Systems Information Security |
institution |
Singapore Management University |
building |
SMU Libraries |
continent |
Asia |
country |
Singapore |
content_provider |
SMU Libraries |
collection |
InK@SMU |
language |
English |
topic |
URLNet Malicious URL Detection Deep Learning Databases and Information Systems Information Security |
spellingShingle |
URLNet Malicious URL Detection Deep Learning Databases and Information Systems Information Security LE, Hung PHAM, Hong Quang SAHOO, Doyen HOI, Steven C. H. URLNet: Learning a URL representation with deep learning for malicious URL detection |
description |
Malicious URLs host unsolicited content and are used to perpetrate cybercrimes. It is imperative to detect them in a timely manner. Traditionally, this is done through the usage of blacklists, which cannot be exhaustive, and cannot detect newly generated malicious URLs. To address this, recent years have witnessed several efforts to perform Malicious URL Detection using Machine Learning. The most popular and scalable approaches use lexical properties of the URL string by extracting Bag-of-words like features, followed by applying machine learning models such as SVMs. There are also other features designed by experts to improve the prediction performance of the model. These approaches suffer from several limitations: (i) Inability to effectively capture semantic meaning and sequential patterns in URL strings; (ii) Requiring substantial manual feature engineering; and (iii) Inability to handle unseen features and generalize to test data. To address these challenges, we propose URLNet, an end-to-end deep learning framework to learn a nonlinear URL embedding for Malicious URL Detection directly from the URL. Specifically, we apply Convolutional Neural Networks to both characters and words of the URL String to learn the URL embedding in a jointly optimized framework. This approach allows the model to capture several types of semantic information, which was not possible by the existing models. We also propose advanced word-embeddings to solve the problem of too many rare words observed in this task. We conduct extensive experiments on a large-scale dataset and show a significant performance gain over existing methods. We also conduct ablation studies to evaluate the performance of various components of URLNet. |
format |
text |
author |
LE, Hung PHAM, Hong Quang SAHOO, Doyen HOI, Steven C. H. |
author_facet |
LE, Hung PHAM, Hong Quang SAHOO, Doyen HOI, Steven C. H. |
author_sort |
LE, Hung |
title |
URLNet: Learning a URL representation with deep learning for malicious URL detection |
title_short |
URLNet: Learning a URL representation with deep learning for malicious URL detection |
title_full |
URLNet: Learning a URL representation with deep learning for malicious URL detection |
title_fullStr |
URLNet: Learning a URL representation with deep learning for malicious URL detection |
title_full_unstemmed |
URLNet: Learning a URL representation with deep learning for malicious URL detection |
title_sort |
urlnet: learning a url representation with deep learning for malicious url detection |
publisher |
Institutional Knowledge at Singapore Management University |
publishDate |
2018 |
url |
https://ink.library.smu.edu.sg/sis_research/4135 https://ink.library.smu.edu.sg/context/sis_research/article/5138/viewcontent/UrlNet_2018_wp.pdf |
_version_ |
1770574348693798912 |
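
The description above says the model reads each URL both as a sequence of characters and as a sequence of words, and that rare words are a problem in this task. The following is a minimal sketch of that input preparation only, not the authors' implementation. The splitting rule (break on non-alphanumeric characters), the `<UNK>` and `<PAD>` tokens, the `min_count` threshold, the sequence lengths, and the example URLs are all assumptions made for illustration; the record does not specify them.

```python
import re
from collections import Counter

UNK = "<UNK>"  # assumed placeholder for rare or unseen tokens
PAD = "<PAD>"  # assumed padding token for fixed-length sequences


def split_words(url):
    """Split a URL into word tokens on runs of non-alphanumeric characters."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]


def build_vocab(token_lists, min_count=1):
    """Map tokens seen at least min_count times to integer ids."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {PAD: 0, UNK: 1}
    for tok, n in sorted(counts.items()):
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncating or padding to max_len."""
    ids = [vocab.get(t, vocab[UNK]) for t in tokens[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Hypothetical example URLs, not taken from any dataset.
urls = [
    "http://example.com/login",
    "http://example.com/account/login",
]

word_lists = [split_words(u) for u in urls]
char_lists = [list(u) for u in urls]

# Words seen fewer than twice are treated as rare and mapped to UNK.
word_vocab = build_vocab(word_lists, min_count=2)
char_vocab = build_vocab(char_lists)

word_ids = [encode(w, word_vocab, max_len=6) for w in word_lists]
char_ids = [encode(c, char_vocab, max_len=40) for c in char_lists]
```

In a setup like this, `char_ids` and `word_ids` would each feed an embedding layer followed by convolutions, with the two branches trained jointly as the description outlines. Collapsing rare words to a single id keeps the word vocabulary small, at the cost of discarding whatever information those words carried.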