URLNet: Learning a URL representation with deep learning for malicious URL detection

Malicious URLs host unsolicited content and are used to perpetrate cybercrimes. It is imperative to detect them in a timely manner. Traditionally, this is done through the usage of blacklists, which cannot be exhaustive, and cannot detect newly generated malicious URLs. To address this, recent years...

Bibliographic Details
Main Authors: LE, Hung; PHAM, Hong Quang; SAHOO, Doyen; HOI, Steven C. H.
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2018
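Subjects: URLNet; Malicious URL Detection; Deep Learning; Databases and Information Systems; Information Security
Online Access: https://ink.library.smu.edu.sg/sis_research/4135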
https://ink.library.smu.edu.sg/context/sis_research/article/5138/viewcontent/UrlNet_2018_wp.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-5138
record_format dspace
spelling sg-smu-ink.sis_research-5138; 2020-06-09T03:14:49Z; URLNet: Learning a URL representation with deep learning for malicious URL detection; LE, Hung; PHAM, Hong Quang; SAHOO, Doyen; HOI, Steven C. H.; 2018-03-01T08:00:00Z; text; application/pdf; https://ink.library.smu.edu.sg/sis_research/4135; https://ink.library.smu.edu.sg/context/sis_research/article/5138/viewcontent/UrlNet_2018_wp.pdf; http://creativecommons.org/licenses/by-nc-nd/4.0/; Research Collection School Of Computing and Information Systems; eng; Institutional Knowledge at Singapore Management University; URLNet; Malicious URL Detection; Deep Learning; Databases and Information Systems; Information Security
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic URLNet
Malicious URL Detection
Deep Learning
Databases and Information Systems
Information Security
description Malicious URLs host unsolicited content and are used to perpetrate cybercrimes, so it is imperative to detect them in a timely manner. Traditionally, detection relies on blacklists, which cannot be exhaustive and cannot detect newly generated malicious URLs. To address this, recent years have witnessed several efforts to perform malicious URL detection using machine learning. The most popular and scalable approaches use lexical properties of the URL string, extracting bag-of-words-like features and then applying machine learning models such as SVMs. Other expert-designed features are also used to improve the prediction performance of the model. These approaches suffer from several limitations: (i) inability to effectively capture semantic meaning and sequential patterns in URL strings; (ii) the need for substantial manual feature engineering; and (iii) inability to handle unseen features and generalize to test data. To address these challenges, we propose URLNet, an end-to-end deep learning framework that learns a nonlinear URL embedding for malicious URL detection directly from the URL. Specifically, we apply Convolutional Neural Networks to both the characters and the words of the URL string to learn the URL embedding in a jointly optimized framework. This approach allows the model to capture several types of semantic information, which was not possible with existing models. We also propose advanced word embeddings to address the problem of too many rare words observed in this task. We conduct extensive experiments on a large-scale dataset and show a significant performance gain over existing methods. We also conduct ablation studies to evaluate the performance of the various components of URLNet.
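The description mentions feeding both the characters and the words of a URL to the model, and handling rare words. The sketch below only illustrates what such a two-view input could look like before any network is applied. It is not the authors' code: the delimiter rule, the min_count threshold, the reserved tokens, and all function names are assumptions made for this example.

```python
import re
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"  # hypothetical reserved tokens, not from the paper


def char_tokens(url):
    """Character-level view: every character is one token."""
    return list(url)


def word_tokens(url):
    """Word-level view: split on non-alphanumeric characters (assumed rule)."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]


def build_vocab(token_lists, min_count=1):
    """Give an id to each token seen at least min_count times.

    Rarer tokens get no id of their own and are later mapped to UNK.
    """
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {PAD: 0, UNK: 1}
    for tok, n in sorted(counts.items()):
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncating or padding to exactly max_len."""
    ids = [vocab.get(t, vocab[UNK]) for t in tokens[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


urls = ["http://example.com/login", "http://example.com/a9x3kq"]
word_vocab = build_vocab([word_tokens(u) for u in urls], min_count=2)
char_vocab = build_vocab([char_tokens(u) for u in urls])

word_ids = encode(word_tokens(urls[0]), word_vocab, max_len=6)
char_ids = encode(char_tokens(urls[0]), char_vocab, max_len=10)
```

In this toy run, "login" and "a9x3kq" each occur once, so with min_count=2 both collapse to the shared UNK id. That is the rare-word problem the description refers to, and the reason a character-level view of the same URL is useful alongside the word-level one.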
format text
author LE, Hung
PHAM, Hong Quang
SAHOO, Doyen
HOI, Steven C. H.
author_sort LE, Hung
title URLNet: Learning a URL representation with deep learning for malicious URL detection
title_sort urlnet: learning a url representation with deep learning for malicious url detection
publisher Institutional Knowledge at Singapore Management University
publishDate 2018
url https://ink.library.smu.edu.sg/sis_research/4135
https://ink.library.smu.edu.sg/context/sis_research/article/5138/viewcontent/UrlNet_2018_wp.pdf
_version_ 1770574348693798912