URLNet: Learning a URL representation with deep learning for malicious URL detection

Malicious URLs host unsolicited content and are used to perpetrate cybercrimes. It is imperative to detect them in a timely manner. Traditionally, this is done through the usage of blacklists, which cannot be exhaustive, and cannot detect newly generated malicious URLs. To address this, recent years...

Bibliographic Details
Main Authors:	LE, Hung, PHAM, Hong Quang, SAHOO, Doyen, HOI, Steven C. H.
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2018
Subjects:	URLNet Malicious URL Detection Deep Learning Databases and Information Systems Information Security
Online Access:	https://ink.library.smu.edu.sg/sis_research/4135 https://ink.library.smu.edu.sg/context/sis_research/article/5138/viewcontent/UrlNet_2018_wp.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-5138
record_format	dspace
spelling	sg-smu-ink.sis_research-5138 2020-06-09T03:14:49Z 2018-03-01T08:00:00Z text application/pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	URLNet Malicious URL Detection Deep Learning Databases and Information Systems Information Security
description	Malicious URLs host unsolicited content and are used to perpetrate cybercrimes. It is imperative to detect them in a timely manner. Traditionally, this is done through the usage of blacklists, which cannot be exhaustive, and cannot detect newly generated malicious URLs. To address this, recent years have witnessed several efforts to perform Malicious URL Detection using Machine Learning. The most popular and scalable approaches use lexical properties of the URL string by extracting Bag-of-words like features, followed by applying machine learning models such as SVMs. There are also other features designed by experts to improve the prediction performance of the model. These approaches suffer from several limitations: (i) Inability to effectively capture semantic meaning and sequential patterns in URL strings; (ii) Requiring substantial manual feature engineering; and (iii) Inability to handle unseen features and generalize to test data. To address these challenges, we propose URLNet, an end-to-end deep learning framework to learn a nonlinear URL embedding for Malicious URL Detection directly from the URL. Specifically, we apply Convolutional Neural Networks to both characters and words of the URL String to learn the URL embedding in a jointly optimized framework. This approach allows the model to capture several types of semantic information, which was not possible by the existing models. We also propose advanced word-embeddings to solve the problem of too many rare words observed in this task. We conduct extensive experiments on a large-scale dataset and show a significant performance gain over existing methods. We also conduct ablation studies to evaluate the performance of various components of URLNet.
format	text
author	LE, Hung PHAM, Hong Quang SAHOO, Doyen HOI, Steven C. H.
author_sort	LE, Hung
title	URLNet: Learning a URL representation with deep learning for malicious URL detection
title_sort	urlnet: learning a url representation with deep learning for malicious url detection
publisher	Institutional Knowledge at Singapore Management University
publishDate	2018
url	https://ink.library.smu.edu.sg/sis_research/4135 https://ink.library.smu.edu.sg/context/sis_research/article/5138/viewcontent/UrlNet_2018_wp.pdf
_version_	1770574348693798912


Similar Items