Automated identification of libraries from vulnerability data: can we do better?

Software engineers depend heavily on software libraries and have to update their dependencies once vulnerabilities are found in them. Software Composition Analysis (SCA) helps developers identify vulnerable libraries used by an application. A key challenge is the identification of libraries related...

Bibliographic Details
Main Authors:	HARYONO, Stefanus A.; KANG, Hong Jin; SHARMA, Abhishek; SHARMA, Asankhaya; SANTOSA, Andrew E.; ANG, Ming Yi; LO, David
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2022
Subjects:	Multi-label classification; Machine learning; Vulnerability report; Databases and Information Systems
Online Access:	https://ink.library.smu.edu.sg/sis_research/7690 https://ink.library.smu.edu.sg/context/sis_research/article/8693/viewcontent/automated.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-8693
record_format	dspace
spelling	sg-smu-ink.sis_research-8693 2023-01-10T03:14:54Z Automated identification of libraries from vulnerability data: can we do better? HARYONO, Stefanus A.; KANG, Hong Jin; SHARMA, Abhishek; SHARMA, Asankhaya; SANTOSA, Andrew E.; ANG, Ming Yi; LO, David. Software engineers depend heavily on software libraries and have to update their dependencies once vulnerabilities are found in them. Software Composition Analysis (SCA) helps developers identify vulnerable libraries used by an application. A key challenge is the identification of libraries related to a given reported vulnerability in the National Vulnerability Database (NVD), which may not explicitly indicate the affected libraries. Recently, researchers have tried to address the problem of identifying the libraries from an NVD report by treating it as an extreme multi-label learning (XML) problem, characterized by its large number of possible labels and severe data sparsity. As input, the NVD report is provided, and as output, a set of relevant libraries is returned. In this work, we evaluated multiple XML techniques. While previous work only evaluated a traditional XML technique, FastXML, we trained four other traditional XML models (DiSMEC, Parabel, Bonsai, ExtremeText) as well as two deep learning-based models (XML-CNN and LightXML). We compared both their effectiveness and the time cost of training and using the models for predictions. We find that, other than DiSMEC and XML-CNN, recent XML models outperform the FastXML model by 3%–10% in terms of F1-scores on Top-k (k=1,2,3) predictions. Furthermore, we observe significant improvements in both the training and prediction time of these XML models, with the Bonsai and Parabel models achieving 627x and 589x faster training, respectively, and 12x faster prediction than the FastXML baseline. We discuss the implications of our experimental results and highlight limitations for future work to address.
2022-05-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/7690 info:doi/10.1145/3377813.3381360 https://ink.library.smu.edu.sg/context/sis_research/article/8693/viewcontent/automated.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Multi-label classification Machine learning Vulnerability report Databases and Information Systems
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Multi-label classification; Machine learning; Vulnerability report; Databases and Information Systems
description	Software engineers depend heavily on software libraries and have to update their dependencies once vulnerabilities are found in them. Software Composition Analysis (SCA) helps developers identify vulnerable libraries used by an application. A key challenge is the identification of libraries related to a given reported vulnerability in the National Vulnerability Database (NVD), which may not explicitly indicate the affected libraries. Recently, researchers have tried to address the problem of identifying the libraries from an NVD report by treating it as an extreme multi-label learning (XML) problem, characterized by its large number of possible labels and severe data sparsity. As input, the NVD report is provided, and as output, a set of relevant libraries is returned. In this work, we evaluated multiple XML techniques. While previous work only evaluated a traditional XML technique, FastXML, we trained four other traditional XML models (DiSMEC, Parabel, Bonsai, ExtremeText) as well as two deep learning-based models (XML-CNN and LightXML). We compared both their effectiveness and the time cost of training and using the models for predictions. We find that, other than DiSMEC and XML-CNN, recent XML models outperform the FastXML model by 3%–10% in terms of F1-scores on Top-k (k=1,2,3) predictions. Furthermore, we observe significant improvements in both the training and prediction time of these XML models, with the Bonsai and Parabel models achieving 627x and 589x faster training, respectively, and 12x faster prediction than the FastXML baseline. We discuss the implications of our experimental results and highlight limitations for future work to address.
format	text
author	HARYONO, Stefanus A.; KANG, Hong Jin; SHARMA, Abhishek; SHARMA, Asankhaya; SANTOSA, Andrew E.; ANG, Ming Yi; LO, David
author_facet	HARYONO, Stefanus A.; KANG, Hong Jin; SHARMA, Abhishek; SHARMA, Asankhaya; SANTOSA, Andrew E.; ANG, Ming Yi; LO, David
author_sort	HARYONO, Stefanus A.
title	Automated identification of libraries from vulnerability data: can we do better?
title_sort	automated identification of libraries from vulnerability data: can we do better?
publisher	Institutional Knowledge at Singapore Management University
publishDate	2022
url	https://ink.library.smu.edu.sg/sis_research/7690 https://ink.library.smu.edu.sg/context/sis_research/article/8693/viewcontent/automated.pdf
_version_	1770576414856183808

Similar Items