Comprehensively understanding of software project code reuse for Java
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2025 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/181931 |
Institution: | Nanyang Technological University |
Summary: | Software code reuse has become increasingly popular in software development because it simplifies and shortens the development cycle. Unfortunately, reusing open-source software (OSS) also raises security concerns, because vulnerabilities propagate along with the reused OSS. Software composition analysis (SCA) tools have been proposed to detect reused third-party libraries (TPLs) or code blocks and the potential vulnerabilities they introduce. As software functionality grows more complex, SCA tools may encounter various scenarios during dependency resolution, such as diverse artifact formats, diverse dependency imports, and diverse dependency specifications. However, a comprehensive evaluation of SCA tools for Java that takes these scenarios into account is still lacking. This gap could lead to narrow interpretations of tool comparisons and improper use of tools, and could hinder further improvement of the tools. To fill this gap, we proposed an Evaluation Model consisting of Scan Modes, Scan Methods, and SCA Scope for Maven (SSM) for comprehensively assessing the dependency-resolving capabilities and effectiveness of SCA tools. Based on the Evaluation Model, we first qualitatively examined the capabilities of 6 SCA tools. Next, we quantitatively evaluated the accuracy of dependency and vulnerability detection on a large-scale dataset. The results show that most tools do not fully support SSM, which compromises their accuracy. Properly supporting SSM reduces dependency detection false positives by 34.24% and false negatives by 6.91%. This in turn reduces false positives by 18.28% and false negatives by 8.72% in vulnerability reports.
From our first work, we learned that reused code can be introduced in multiple ways. Generally, two categories of code reuse can be identified: reuse introduced by code clones and reuse introduced by package managers (PMs). Although introducing TPLs through clones is prevalent in Java, no clone-based SCA tool is designed specifically for Java. Moreover, directly applying clone-based SCA techniques from tools built for other languages (e.g., C/C++) is problematic. To fill this gap, we introduce JC-Finder, a novel clone-based SCA tool that aims to accurately and comprehensively identify instances of TPL reuse introduced by source code clones in Java projects. JC-Finder achieves both accuracy and efficiency in identifying TPL reuse from code cloning by capturing features at the class level, maintaining inter-function relationships, and excluding trivial or duplicated elements. To evaluate JC-Finder, we applied it to the most popular Maven libraries. The results show that JC-Finder achieved a high F1-score of 0.818, outperforming the state-of-the-art tool by 0.427. The average time taken to resolve TPL reuse is 14.2 seconds, which is approximately 9 times faster than that tool.
Moreover, the granularity of code reuse is not limited to entire files, classes, or functions; cloning of code snippets is even more widespread. These snippets represent a finer granularity of code reuse, where developers often copy pieces containing essential logic to implement their own functionality. To detect cloned code snippets that carry essential logic, we address the specific challenge of detecting “essence clones”, a complex subtype of Type-3 clones that share critical logic despite differing peripheral code. Traditional techniques often fail to detect essence clones because of their syntactic focus. To overcome this limitation, we introduce ECScan, a novel detection tool that leverages information theory to assess the semantic importance of code lines. By assigning a weight to each line based on its information content, ECScan emphasizes core logic over peripheral code differences. Our comprehensive evaluation across various real-world projects shows that ECScan significantly outperforms existing tools in detecting essence clones, achieving an average F1-score of 0.846. It demonstrates robust performance across all clone types and scales well. This study advances clone detection by providing a practical tool for developers to enhance code quality and reduce maintenance burdens, emphasizing the semantic aspects of code through an information-theoretic approach.
To comprehensively understand TPL reuse in Java source code projects, we combine both clone-based and PM-based SCA techniques in our Java SCA tool. Using this tool, we investigate real-world projects to reveal how cloned TPLs and package-manager-imported (PM-imported) TPLs are reused, as well as the relationship between these two types of TPL reuse. The results show that 9.89% of the tested projects introduce TPLs through code cloning, and identifying cloned TPLs uncovers about 26.20% more TPLs than reporting only PM-imported TPLs. Additionally, 1.08% of the cloned TPLs overlap with PM-imported TPLs. This underscores the importance of identifying both cloned and PM-imported TPLs for a comprehensive Java SCA. |
---|
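
The summary's description of weighting code lines by their information content can be illustrated with a small sketch. This is not ECScan's actual algorithm, which the record does not specify. It assumes a simple scheme in which a line's weight is its self-information, -log2(p), estimated from line frequencies in a reference corpus, and in which two snippets are compared with a weighted Jaccard score over their sets of lines. The class name `WeightedLineSimilarity`, its methods, and the sample corpus are all hypothetical.

```java
import java.util.*;

// Hypothetical illustration only: lines that are rare in a corpus get more weight
// than common boilerplate lines when two snippets are compared.
class WeightedLineSimilarity {

    // Trim and collapse whitespace (a simple stand-in for real tokenization).
    static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    // Count how often each normalized, non-empty line occurs in the corpus.
    static Map<String, Integer> lineFrequencies(List<String> corpus) {
        Map<String, Integer> freq = new HashMap<>();
        for (String raw : corpus) {
            String line = normalize(raw);
            if (!line.isEmpty()) freq.merge(line, 1, Integer::sum);
        }
        return freq;
    }

    // Self-information in bits, -log2(p), with add-one smoothing so unseen lines stay finite.
    static double weight(String line, Map<String, Integer> freq, int total) {
        double p = (freq.getOrDefault(line, 0) + 1.0) / (total + 1.0);
        return -Math.log(p) / Math.log(2);
    }

    // Convert a snippet into its set of normalized, non-empty lines.
    static Set<String> lineSet(List<String> snippet) {
        Set<String> set = new HashSet<>();
        for (String raw : snippet) {
            String line = normalize(raw);
            if (!line.isEmpty()) set.add(line);
        }
        return set;
    }

    // Weighted Jaccard: weight of shared lines divided by weight of all lines in either snippet.
    static double weightedSimilarity(List<String> a, List<String> b, Map<String, Integer> freq) {
        int total = freq.values().stream().mapToInt(Integer::intValue).sum();
        Set<String> setA = lineSet(a), setB = lineSet(b);
        Set<String> union = new HashSet<>(setA);
        union.addAll(setB);
        double shared = 0, all = 0;
        for (String line : union) {
            double w = weight(line, freq, total);
            all += w;
            if (setA.contains(line) && setB.contains(line)) shared += w;
        }
        return all == 0 ? 0 : shared / all;
    }

    // Plain Jaccard over line sets, for comparison with the weighted score.
    static double plainSimilarity(List<String> a, List<String> b) {
        Set<String> setA = lineSet(a), setB = lineSet(b);
        Set<String> union = new HashSet<>(setA);
        union.addAll(setB);
        Set<String> inter = new HashSet<>(setA);
        inter.retainAll(setB);
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // Made-up corpus: four boilerplate lines occur often, the core-logic line occurs once.
        List<String> corpus = new ArrayList<>();
        for (String common : List.of("int result = 0;", "return result;", "}", "log.debug(\"start\");")) {
            corpus.addAll(Collections.nCopies(4, common));
        }
        String core = "result = (a * 31 + b) ^ (a >>> 7);";
        corpus.add(core);
        Map<String, Integer> freq = lineFrequencies(corpus);

        // Two snippets that share only the core line and differ in peripheral lines.
        List<String> a = List.of("int result = 0;", core, "return result;");
        List<String> b = List.of("log.debug(\"start\");", core, "}");

        System.out.println("plain    = " + plainSimilarity(a, b));
        System.out.println("weighted = " + weightedSimilarity(a, b, freq));
    }
}
```

In this made-up example the two snippets share only the rare core line, so the weighted score comes out higher than the plain Jaccard score. That is the intuition behind emphasizing core logic over peripheral differences; a real detector would need a far more careful line model and matching strategy.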