Web application vulnerability prediction using hybrid program analysis and machine learning

Due to limited time and resources, web software engineers need support in identifying vulnerable code. A practical approach to predicting vulnerable code would enable them to prioritize security auditing efforts. In this paper, we propose using a set of hybrid (static+dynamic) code attributes that c...

Bibliographic Details
Main Authors: SHAR, Lwin Khin, BRIAND, Lionel, TAN, Hee Beng Kuan
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2014
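Subjects: Vulnerability prediction; security measures; input validation and sanitization; program analysis; empirical study; Information Security; Programming Languages and Compilers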
Online Access: https://ink.library.smu.edu.sg/sis_research/4895
https://ink.library.smu.edu.sg/context/sis_research/article/5898/viewcontent/Web_Application___PV.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-5898
record_format dspace
spelling sg-smu-ink.sis_research-5898 2020-02-13T08:17:35Z Web application vulnerability prediction using hybrid program analysis and machine learning SHAR, Lwin Khin BRIAND, Lionel TAN, Hee Beng Kuan Due to limited time and resources, web software engineers need support in identifying vulnerable code. A practical approach to predicting vulnerable code would enable them to prioritize security auditing efforts. In this paper, we propose using a set of hybrid (static+dynamic) code attributes that characterize input validation and input sanitization code patterns and are expected to be significant indicators of web application vulnerabilities. Because static and dynamic program analyses complement each other, both techniques are used to extract the proposed attributes in an accurate and scalable way. Current vulnerability prediction techniques rely on the availability of data labeled with vulnerability information for training. For many real world applications, past vulnerability data is often not available or at least not complete. Hence, to address both situations where labeled past data is fully available or not, we apply both supervised and semi-supervised learning when building vulnerability predictors based on hybrid code attributes. Given that semi-supervised learning is entirely unexplored in this domain, we describe how to use this learning scheme effectively for vulnerability prediction. We performed empirical case studies on seven open source projects where we built and evaluated supervised and semi-supervised models. When cross validated with fully available labeled data, the supervised models achieve an average of 77 percent recall and 5 percent probability of false alarm for predicting SQL injection, cross site scripting, remote code execution and file inclusion vulnerabilities. With a low amount of labeled data, when compared to the supervised model, the semi-supervised model showed an average improvement of 24 percent higher recall and 3 percent lower probability of false alarm, thus suggesting semi-supervised learning may be a preferable solution for many real world applications where vulnerability data is missing. 2014-11-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/4895 info:doi/10.1109/TDSC.2014.2373377 https://ink.library.smu.edu.sg/context/sis_research/article/5898/viewcontent/Web_Application___PV.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Vulnerability prediction security measures input validation and sanitization program analysis empirical study Information Security Programming Languages and Compilers
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Vulnerability prediction
security measures
input validation and sanitization
program analysis
empirical study
Information Security
Programming Languages and Compilers
description Due to limited time and resources, web software engineers need support in identifying vulnerable code. A practical approach to predicting vulnerable code would enable them to prioritize security auditing efforts. In this paper, we propose using a set of hybrid (static+dynamic) code attributes that characterize input validation and input sanitization code patterns and are expected to be significant indicators of web application vulnerabilities. Because static and dynamic program analyses complement each other, both techniques are used to extract the proposed attributes in an accurate and scalable way. Current vulnerability prediction techniques rely on the availability of data labeled with vulnerability information for training. For many real world applications, past vulnerability data is often not available or at least not complete. Hence, to address both situations where labeled past data is fully available or not, we apply both supervised and semi-supervised learning when building vulnerability predictors based on hybrid code attributes. Given that semi-supervised learning is entirely unexplored in this domain, we describe how to use this learning scheme effectively for vulnerability prediction. We performed empirical case studies on seven open source projects where we built and evaluated supervised and semi-supervised models. When cross validated with fully available labeled data, the supervised models achieve an average of 77 percent recall and 5 percent probability of false alarm for predicting SQL injection, cross site scripting, remote code execution and file inclusion vulnerabilities. With a low amount of labeled data, when compared to the supervised model, the semi-supervised model showed an average improvement of 24 percent higher recall and 3 percent lower probability of false alarm, thus suggesting semi-supervised learning may be a preferable solution for many real world applications where vulnerability data is missing.
format text
author SHAR, Lwin Khin
BRIAND, Lionel
TAN, Hee Beng Kuan
title Web application vulnerability prediction using hybrid program analysis and machine learning
publisher Institutional Knowledge at Singapore Management University
publishDate 2014
url https://ink.library.smu.edu.sg/sis_research/4895
https://ink.library.smu.edu.sg/context/sis_research/article/5898/viewcontent/Web_Application___PV.pdf
_version_ 1770575088748331008