Automated identification of libraries from vulnerability data

Software Composition Analysis (SCA) has gained traction in recent years, with a number of commercial offerings from various companies. SCA involves a vulnerability curation process in which a group of security researchers, using various data sources, populate a database of open-source library vulnerabilit...

Bibliographic Details
Main Authors: YANG, Chen; SANTOSA, Andrew; SHARMA, Asankhaya; LO, David
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2020
Subjects:
Online Access: https://ink.library.smu.edu.sg/sis_research/5501
https://ink.library.smu.edu.sg/context/sis_research/article/6504/viewcontent/Automated_Identification_of_Libraries_from_Vulnerability_Data.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-6504
record_format dspace
spelling sg-smu-ink.sis_research-6504 2021-05-12T06:25:29Z
2020-05-01T07:00:00Z text application/pdf
info:doi/10.1145/3377813.3381360
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic application security
open source software
machine learning
classifiers ensemble
self training
Software Engineering
description Software Composition Analysis (SCA) has gained traction in recent years, with a number of commercial offerings from various companies. SCA involves a vulnerability curation process in which a group of security researchers, using various data sources, populate a database of open-source library vulnerabilities; a scanner then uses this database to inform end users of vulnerable libraries used by their applications. One of the data sources used is the National Vulnerability Database (NVD). The key challenge faced by the security researchers is figuring out which libraries are related to each vulnerability reported in NVD. In this article, we report our design and implementation of a machine learning system to help identify the libraries related to each vulnerability in NVD. The problem is one of extreme multi-label learning (XML), and we developed our system using the state-of-the-art FastXML algorithm. Our system is executed iteratively, improving the performance of the model over time. At the time of writing, it achieves an F1@1 score of 0.53, with an average F1@k score for k = 1, 2, 3 of 0.51 (F1@k is the harmonic mean of precision@k and recall@k). It has been deployed at Veracode as part of a machine learning system that helps the security researchers identify the likelihood of web data items being vulnerability-related. In addition, we present evaluation results for our feature engineering and for the number of FastXML trees used. Our work formulates, for the first time, library name identification from NVD data as XML, and it is also the first attempt at solving it in a complete production system.
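The F1@k metric mentioned in the description can be illustrated with a minimal Python sketch. It assumes each vulnerability has a ranked list of predicted library names and a set of true library names, and that precision@k and recall@k are averaged over all vulnerabilities before the harmonic mean is taken; the record does not state the averaging scheme, so that choice, the function names, and the sample library names are all assumptions for illustration only.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], truth: Set[str], k: int) -> float:
    """Fraction of the top-k predicted labels that are correct."""
    hits = sum(1 for label in ranked[:k] if label in truth)
    return hits / k


def recall_at_k(ranked: Sequence[str], truth: Set[str], k: int) -> float:
    """Fraction of the true labels that appear among the top-k predictions."""
    if not truth:
        return 0.0
    hits = sum(1 for label in ranked[:k] if label in truth)
    return hits / len(truth)


def f1_at_k(predictions: Sequence[Sequence[str]],
            truths: Sequence[Set[str]], k: int) -> float:
    """Harmonic mean of mean precision@k and mean recall@k.

    Assumption: both metrics are averaged over instances first.
    """
    n = len(predictions)
    p = sum(precision_at_k(r, t, k) for r, t in zip(predictions, truths)) / n
    r = sum(recall_at_k(r_, t, k) for r_, t in zip(predictions, truths)) / n
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


if __name__ == "__main__":
    # Hypothetical ranked predictions and ground-truth labels for two entries.
    predictions = [["struts", "spring", "log4j"], ["openssl", "curl", "zlib"]]
    truths = [{"struts"}, {"curl", "zlib"}]
    for k in (1, 2, 3):
        print(k, round(f1_at_k(predictions, truths, k), 3))
```

An average over k = 1, 2, 3, as reported in the description, would then be the arithmetic mean of the three values returned by f1_at_k.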
_version_ 1770575481419071488