Streamlining Java programming: uncovering well-formed idioms with IdioMine

Code idioms are commonly used patterns, techniques, or practices that aid in solving particular problems or specific tasks across multiple software projects. They can improve code quality, performance, and maintainability, and also promote program standardization and reuse across projects. However,...

Bibliographic Details
Main Authors: YANG, Yanming, HU, Xing, XIA, Xin, LO, David, YANG, Xiaohu
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
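Subjects: Code Idiom Mining; Code Pattern; Large Language Model (LLM); Clustering; Programming Languages and Compilers; Software Engineering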
Online Access: https://ink.library.smu.edu.sg/sis_research/9255
https://ink.library.smu.edu.sg/context/sis_research/article/10255/viewcontent/3597503.3639135.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-10255
record_format dspace
spelling sg-smu-ink.sis_research-10255
2024-09-02T06:37:48Z
2024-04-01T07:00:00Z
text
application/pdf
info:doi/10.1145/3597503.3639135
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Code Idiom Mining
Code Pattern
Large Language Model (LLM)
Clustering
Programming Languages and Compilers
Software Engineering
description Code idioms are commonly used patterns, techniques, or practices that aid in solving particular problems or specific tasks across multiple software projects. They can improve code quality, performance, and maintainability, and also promote program standardization and reuse across projects. However, identifying code idioms remains challenging, and existing approaches suffer from three main limitations. First, they have difficulty recognizing idioms that span non-contiguous code lines. Second, they struggle to identify idioms with intricate data flow and code structures. Third, they extract only dataset-specific idioms, so common idioms or well-established code/design patterns that are rarely found in datasets cannot be identified.
To overcome these limitations, we propose a novel approach, named IdioMine, to automatically extract generic and specific idioms from both Java projects and libraries. We perform program analysis on Java functions to transform them into concise program dependence graphs (PDGs), which integrate the data flow and control flow of code fragments. We then develop a novel chain structure, the Data-driven Control Chain (DCC), to extract sub-idioms that possess contiguous semantic meanings from the PDGs. After that, we use GraphCodeBERT to generate code embeddings of these sub-idioms and perform density-based clustering to obtain frequent sub-idioms. We use heuristic rules to identify interrelated sub-idioms among the frequent ones. Finally, we employ ChatGPT to synthesize interrelated sub-idioms into potential code idioms and infer real idioms from them.
We conduct experiments and a user study to evaluate IdioMine's correctness and the practical value of the extracted idioms. Our experimental results show that IdioMine extracts more idioms and performs better on most metrics. Compared with Haggis and ChatGPT, IdioMine improves Idiom Set Precision (ISP) by 22.8% and 35.5%, respectively, and Idiom Coverage (IC) by 9.7% and 22.9%, respectively, when extracting idioms from libraries. The idioms IdioMine extracts are also almost twice the size of those extracted by the baselines, demonstrating its ability to identify complete idioms. Our user study indicates that idioms extracted by IdioMine are well-formed and semantically clear. Moreover, we conduct a qualitative and quantitative analysis to investigate the primary functionalities of the idioms IdioMine extracts from various projects and libraries.
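The clustering step mentioned in the description can be illustrated with a minimal sketch. The abstract says only that sub-idiom embeddings are grouped by density-based clustering; it does not name the algorithm, distance measure, or parameters. The sketch below assumes DBSCAN with cosine distance, and uses toy 2-D vectors as stand-ins for GraphCodeBERT embeddings. The class name, method names, and the values of eps and minPts are all hypothetical and are not taken from IdioMine.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: DBSCAN over embedding vectors (not IdioMine's code).
public class SubIdiomClustering {
    static final int NOISE = -1;      // point in no dense region
    static final int UNVISITED = -2;  // point not processed yet

    // Cosine distance = 1 - cosine similarity (assumed metric).
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Indices of all points within eps of point p (including p itself).
    static List<Integer> neighbors(double[][] pts, int p, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int q = 0; q < pts.length; q++) {
            if (cosineDistance(pts[p], pts[q]) <= eps) {
                result.add(q);
            }
        }
        return result;
    }

    // Returns one cluster label per point; NOISE marks infrequent points.
    static int[] dbscan(double[][] pts, double eps, int minPts) {
        int[] labels = new int[pts.length];
        Arrays.fill(labels, UNVISITED);
        int cluster = 0;
        for (int p = 0; p < pts.length; p++) {
            if (labels[p] != UNVISITED) continue;
            List<Integer> seeds = neighbors(pts, p, eps);
            if (seeds.size() < minPts) {   // not a core point
                labels[p] = NOISE;
                continue;
            }
            labels[p] = cluster;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (labels[q] == NOISE) labels[q] = cluster; // border point
                if (labels[q] != UNVISITED) continue;
                labels[q] = cluster;
                List<Integer> qn = neighbors(pts, q, eps);
                if (qn.size() >= minPts) queue.addAll(qn);   // expand from core
            }
            cluster++;
        }
        return labels;
    }

    public static void main(String[] args) {
        // Toy vectors standing in for sub-idiom embeddings.
        double[][] embeddings = {
            {1.0, 0.0}, {0.99, 0.05}, {0.98, 0.1},  // one group of similar sub-idioms
            {0.0, 1.0}, {0.05, 0.99}, {0.1, 0.98},  // a second group
            {-1.0, -1.0}                            // an isolated sub-idiom
        };
        int[] labels = dbscan(embeddings, 0.02, 3);
        System.out.println(Arrays.toString(labels)); // prints [0, 0, 0, 1, 1, 1, -1]
    }
}
```

In this sketch, a cluster with at least minPts members plays the role of a "frequent" sub-idiom, and points labeled -1 are treated as infrequent. Whether IdioMine applies the same criterion is not stated in the abstract.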
format text
author YANG, Yanming
HU, Xing
XIA, Xin
LO, David
YANG, Xiaohu
author_sort YANG, Yanming
title Streamlining Java programming: uncovering well-formed idioms with IdioMine
title_sort streamlining java programming: uncovering well-formed idioms with idiomine
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/9255
https://ink.library.smu.edu.sg/context/sis_research/article/10255/viewcontent/3597503.3639135.pdf
_version_ 1814047846147555328