Data driven security analysis of open source software
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/168554 |
Institution: | Nanyang Technological University |
Summary: | Third-party libraries (TPLs) with rich functionalities have facilitated the fast
development of modern software, leading to the explosive growth of open-source
ecosystems and software supply chains. However, the wide reuse of TPLs as
dependencies, especially commonly used ones, also poses a new threat: TPLs are
black boxes to developers, and their hidden security threats could expose
downstream users to attack. With growing global awareness of security in the
open-source supply chain, much research has been carried out to identify,
understand, and mitigate these potential risks. However, most existing works and
tools operate at a coarse-grained level, i.e., they identify TPL dependencies by
reasoning over dependency networks, and they detect and remediate
vulnerabilities by their existence while neglecting their triggerability, which
largely compromises their effectiveness.
To fill this gap, we carry out several research works from different aspects to
investigate, measure, and mitigate such potential security threats from upstream
TPLs.
First, to understand the vulnerability threats from TPL dependencies, we carry
out an empirical study to demystify the vulnerability impact and its evolution in
the NPM ecosystem, one of the largest package ecosystems. Specifically, we first
propose and construct a complete Dependency-Vulnerability Knowledge Graph
(DVGraph) that captures the dependency relations among NPM packages. Based on
it, we design a Dependency Tree Resolution Algorithm (DTResolver) to precisely
resolve dependency trees without actual installation. Building on both, we
further carry out an ecosystem-wide empirical study to gain insights into
vulnerability impact propagation and its evolution in the NPM ecosystem.
Next, since the existence of vulnerabilities in user dependencies does not
necessarily mean that the user projects are affected by them, we extend the
analysis from the package level to the code level and use call graphs to reduce
false positives in vulnerability impact. To this end, we implement a static call
graph generator (JSReach) to check the Reachability of vulnerabilities for
Node.js. It computes static call graphs not only for Node.js projects with full
dependencies but also for cases where only dependency paths are provided, so
that ecosystem-wide vulnerability impact analysis can be conducted more
precisely. Our experiments show that JSReach not only achieves high precision
(87%) and recall (95%) when full dependencies are available but also preserves
most of the reachable functions (88%) when only dependency paths are provided.
Moreover, JSReach can successfully exclude 78% of unreachable vulnerabilities
with no reachable ones missed.
Third, based on DVGraph and JSReach, we carry out an ecosystem-wide study to
re-investigate the impact of vulnerabilities at a finer granularity: the
reachability of vulnerabilities. Our findings unveil the characteristics of how
vulnerabilities propagate to threaten downstream dependents via API calls in the
NPM ecosystem. Based on them, we further propose a Reachability metric to
indicate the possibility of user projects being affected by given vulnerabilities
and implement a lightweight tool (VREstimate) that prioritizes vulnerability
remediation by Estimating Vulnerability Reachability based on empirical
statistics. Our experiments validate that 90.28% of reachable vulnerabilities
can be reflected by the Reachability metric and that VREstimate can successfully
prioritize reachable vulnerabilities with higher Reachability metrics, so it can
be further adapted to assist traditional software composition analysis (SCA) by
prioritizing vulnerability remediation.
Fourth, beyond vulnerabilities, unreliable maintenance could also result in poor
quality and security of TPLs, and such untrustworthiness, especially for
critical TPLs, could further lead to potential threats to the communities.
Therefore, from the perspective of securing the development process of critical
TPLs, we first propose a systematic method for identifying the most critical
packages in the Maven ecosystem based on the Fused Maven Dependency Graph
(MavenFG). Next, we investigate the development process of these critical
packages and summarize the weak points in their maintenance. Based on these
findings, we provide countermeasures that could guide further open-source
governance.
In summary, these research works unveil the vulnerability threats in the NPM
ecosystem at different granularities and propose well-validated tools that
mitigate such threats from different aspects. Implications and further research
directions are left to be explored in future work. |
---|
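
The summary's distinction between a vulnerability that merely exists in a dependency and one that is actually reachable from user code can be illustrated with a small sketch. This is not JSReach or VREstimate: it is a plain breadth-first search over a hand-written call graph, and every name in it (`reachable_functions`, `triage`, the `app.*`/`libA.*`/`libB.*` functions, and the `VULN-*` identifiers) is hypothetical, chosen only for illustration.

```python
from collections import deque


def reachable_functions(call_graph, entry_points):
    """Return every function reachable from the entry points (breadth-first search)."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        caller = queue.popleft()
        # Functions with no recorded outgoing calls are treated as leaves.
        for callee in call_graph.get(caller, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def triage(call_graph, entry_points, vulnerable_functions):
    """Map each vulnerability ID to True if any of its functions is reachable."""
    reachable = reachable_functions(call_graph, entry_points)
    return {
        vuln_id: bool(funcs & reachable)
        for vuln_id, funcs in vulnerable_functions.items()
    }


# Hypothetical toy data: one application and two libraries.
call_graph = {
    "app.main": ["libA.parse", "libA.format"],
    "libA.parse": ["libB.decode"],
    "libB.decode": [],
    "libB.unsafe_eval": ["libB.decode"],  # present in libB, never called by the app
}
vulnerable_functions = {
    "VULN-1": {"libB.decode"},       # on a call path from app.main
    "VULN-2": {"libB.unsafe_eval"},  # exists in the dependency but is unreachable
}

result = triage(call_graph, ["app.main"], vulnerable_functions)
print(result)
```

In this toy case both vulnerabilities exist in the dependency tree, but only `VULN-1` lies on a call path from the application, so an existence-based check would report two findings where a reachability-based check reports one.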