Data driven security analysis of open source software

Bibliographic Details
Main Author: Liu, Chengwei
Other Authors: Liu Yang
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Online Access: https://hdl.handle.net/10356/168554
Institution: Nanyang Technological University
Description
Summary: Third-party libraries (TPLs) with rich functionality have accelerated the development of modern software, leading to the explosive growth of open-source ecosystems and software supply chains. However, the wide reuse of TPLs as dependencies, especially commonly used ones, also poses a new threat: TPLs are black boxes to developers, and their hidden security issues can expose downstream users to attack. With growing global awareness of security in the open-source supply chain, much research has been carried out to identify, understand, and mitigate these risks. However, most existing works and tools operate at a coarse-grained level, i.e., they identify TPL dependencies by reasoning over dependency networks, and they detect and remediate vulnerabilities based on their mere presence while neglecting whether they can actually be triggered, which largely compromises the effectiveness of existing solutions. To fill this gap, we carry out several research works that investigate, measure, and mitigate such security threats from upstream TPLs from different aspects.
First, to understand the vulnerability threats from TPL dependencies, we carry out an empirical study to demystify the vulnerability impact and its evolution in the NPM ecosystem, which is one of the largest ecosystems. Specifically, we first propose and construct a complete Dependency-Vulnerability Knowledge Graph (DVGraph) that captures the dependency relations among NPM packages. Based on it, we design a Dependency Tree Resolution Algorithm (DTResolver) that precisely resolves dependency trees without performing a real installation. Using both, we carry out an ecosystem-wide empirical study to gain insights into how vulnerability impact propagates and evolves in the NPM ecosystem.
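As a rough illustration of resolving a dependency tree from metadata alone, without installing anything, the following Python sketch walks a small in-memory registry and reports the paths that end at a known-vulnerable package version. It is not DTResolver: the registry contents, the package names, the `VULNERABLE` set, and the simplified caret matcher (which ignores npm's special handling of 0.x versions, deduplication, and hoisting) are all assumptions made for this example.

```python
from typing import Dict, FrozenSet, List, Tuple

Version = Tuple[int, int, int]

# Hypothetical registry metadata: name -> version -> {dependency: constraint}.
REGISTRY: Dict[str, Dict[str, Dict[str, str]]] = {
    "app": {"1.0.0": {"web": "^2.0.0", "util": "^1.0.0"}},
    "web": {"2.0.0": {"parser": "^1.0.0"}, "2.1.0": {"parser": "^1.2.0"}},
    "util": {"1.0.0": {}, "1.3.0": {}},
    "parser": {"1.0.0": {}, "1.2.0": {}, "1.2.5": {}},
}
# Hypothetical vulnerability data: (package, version) pairs known to be affected.
VULNERABLE = {("parser", "1.2.5")}


def parse(version: str) -> Version:
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def satisfies(version: str, constraint: str) -> bool:
    """Toy matcher: '^x.y.z' means same major and >= x.y.z; anything else is exact."""
    if constraint.startswith("^"):
        base = parse(constraint[1:])
        candidate = parse(version)
        return candidate[0] == base[0] and candidate >= base
    return version == constraint


def resolve(name: str, constraint: str) -> str:
    """Pick the highest published version that satisfies the constraint."""
    candidates = [v for v in REGISTRY[name] if satisfies(v, constraint)]
    if not candidates:
        raise LookupError(f"no version of {name} satisfies {constraint}")
    return max(candidates, key=parse)


def dependency_tree(name: str, version: str,
                    seen: FrozenSet[Tuple[str, str]] = frozenset()) -> dict:
    """Recursively resolve dependencies from metadata only."""
    node = {"name": name, "version": version, "deps": []}
    if (name, version) in seen:  # guard against dependency cycles
        return node
    for dep, constraint in sorted(REGISTRY[name][version].items()):
        child_version = resolve(dep, constraint)
        node["deps"].append(
            dependency_tree(dep, child_version, seen | {(name, version)}))
    return node


def vulnerable_paths(node: dict, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return every root-to-node path that ends at a vulnerable package version."""
    path = prefix + (f'{node["name"]}@{node["version"]}',)
    found = [path] if (node["name"], node["version"]) in VULNERABLE else []
    for child in node["deps"]:
        found.extend(vulnerable_paths(child, path))
    return found


tree = dependency_tree("app", "1.0.0")
paths = vulnerable_paths(tree)
```

In this toy registry, `web` resolves to its newest matching release, which in turn pulls in the vulnerable `parser` release, so the vulnerability reaches `app` only transitively. This is the package-level view of impact that the study above measures at ecosystem scale.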
Second, since the presence of a vulnerability in a project's dependencies does not necessarily mean that the project is affected by it, we extend the analysis from the package level to the code level and use call graphs to reduce false positives in vulnerability impact assessment. To this end, we implement a static call graph generator (JSReach) that checks the reachability of vulnerabilities for Node.js. It computes static call graphs not only for Node.js projects with full dependencies but also for cases where only dependency paths are provided, so that ecosystem-wide vulnerability impact analysis can be conducted more precisely. Our experiments show that JSReach not only achieves high precision (87%) and recall (95%) when full dependencies are available but also preserves most of the reachable functions (88%) when only dependency paths are provided. Moreover, JSReach successfully excludes 78% of unreachable vulnerabilities without missing any reachable ones.
Third, based on DVGraph and JSReach, we carry out an ecosystem-wide study to re-investigate the impact of vulnerabilities at a finer granularity: the reachability of vulnerabilities. Our findings unveil how vulnerabilities propagate to threaten downstream dependents via API calls in the NPM ecosystem. Based on these findings, we further propose a Reachability metric that indicates how likely a user project is to be affected by a given vulnerability, and we implement a lightweight tool (VREstimate) that prioritizes vulnerability remediation by estimating vulnerability reachability from empirical statistics. Our experiments validate that 90.28% of reachable vulnerabilities are reflected by the Reachability metric and that VREstimate successfully prioritizes reachable vulnerabilities with higher Reachability scores, so it can be adapted to assist traditional software composition analysis (SCA) in prioritizing vulnerability remediation.
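The difference between a vulnerability being present in a dependency and being reachable from the project's own code can be sketched as a graph search over a call graph. The Python example below is a minimal illustration only, not JSReach: it assumes the call graph has already been built (the hard part for JavaScript), and every function name, package name, and vulnerability identifier in it is hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set

# Hypothetical call graph: caller -> list of callees, spanning project and dependencies.
CALL_GRAPH: Dict[str, List[str]] = {
    "app.main": ["app.render", "pkg_a.merge"],
    "app.render": ["pkg_c.compile"],
    "pkg_a.merge": ["pkg_a.base_merge"],
    "pkg_a.base_merge": [],
    "pkg_c.compile": [],
    "pkg_b.load": ["pkg_b.unsafe_eval"],  # installed, but never called by the app
    "pkg_b.unsafe_eval": [],
}

# Hypothetical mapping from vulnerability identifier to the vulnerable function.
VULNERABLE_FUNCTIONS: Dict[str, str] = {
    "VULN-A": "pkg_a.base_merge",
    "VULN-B": "pkg_b.unsafe_eval",
}


def reachable_from(graph: Dict[str, List[str]], entry_points: Iterable[str]) -> Set[str]:
    """Breadth-first search over call edges starting at the project's entry points."""
    visited: Set[str] = set(entry_points)
    queue = deque(visited)
    while queue:
        caller = queue.popleft()
        for callee in graph.get(caller, []):
            if callee not in visited:
                visited.add(callee)
                queue.append(callee)
    return visited


def triage(graph: Dict[str, List[str]], entry_points: Iterable[str],
           vulnerable: Dict[str, str]) -> Dict[str, bool]:
    """Mark each vulnerability as reachable (True) or unreachable (False)."""
    reachable = reachable_from(graph, entry_points)
    return {vuln_id: func in reachable for vuln_id, func in vulnerable.items()}


result = triage(CALL_GRAPH, ["app.main"], VULNERABLE_FUNCTIONS)
```

Here both vulnerable packages count as dependencies at the package level, but only `VULN-A` is reachable from `app.main`. A package-level scanner would report both, whereas a reachability check would let the unreachable one be deprioritized.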
Fourth, beyond vulnerabilities, unreliable maintenance can also result in poor quality and security of TPLs, and such untrustworthiness, especially in critical TPLs, can in turn threaten the wider community. Therefore, from the perspective of securing the development process of critical TPLs, we first propose a systematic method for identifying the most critical packages in the Maven ecosystem based on the Fused Maven Dependency Graph (MavenFG). We then investigate the development process of these critical packages and identify the weak points in their maintenance. Based on these results, we summarize our findings and provide countermeasures that can guide further open-source governance.
In summary, these research works unveil the vulnerability threats in the NPM ecosystem at different granularities and propose well-validated tools that mitigate such threats from different aspects. Implications and further research directions are expected to be explored in future work.