Condensing class diagrams with minimal manual labeling cost

Traditionally, to better understand the design of a project, developers can reconstruct a class diagram from source code using reverse engineering techniques. However, the raw diagram is often perplexing because it contains too many classes. Condensing the reverse-engineered class diagram into...


Bibliographic Details
Main Authors: YANG, Xinli; LO, David; XIA, Xin; SUN, Jianling
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2016
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/3566
https://ink.library.smu.edu.sg/context/sis_research/article/4567/viewcontent/CondensingClassDiagramsMinCost_2016.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-4567
record_format dspace
spelling sg-smu-ink.sis_research-45672017-04-10T07:35:39Z Condensing class diagrams with minimal manual labeling cost YANG, Xinli LO, David XIA, Xin SUN, Jianling Traditionally, to better understand the design of a project, developers can reconstruct a class diagram from source code using reverse engineering techniques. However, the raw diagram is often perplexing because it contains too many classes. Condensing the reverse-engineered class diagram into a compact diagram that contains only the important classes would make the corresponding project easier to understand. A number of recent works have proposed supervised machine learning solutions for condensing reverse-engineered class diagrams, given a set of classes manually labeled as important or not. However, one challenge limits the practicality of these solutions: the high cost of manually labeling training samples. More training samples lead to better performance but require more manual labeling, and too much manual labeling defeats the purpose, since the aim is to identify important classes automatically. In this paper, to bridge this research gap, we propose MCCondenser, a novel approach that requires only a small amount of training data yet still achieves reasonably good performance. MCCondenser first selects a small, highly representative proportion of all data as training data in an unsupervised way using k-means clustering. Next, it uses ensemble learning to handle the class imbalance problem so that a suitable classifier can be constructed from the limited training data. To evaluate the performance of MCCondenser, we use datasets from nine open source projects, i.e., ArgoUML, JavaClient, JGAP, JPMC, Mars, Maze, Neuroph, Wro4J and xUML, containing a total of 2,640 classes. We compare MCCondenser with two baseline approaches proposed by Thung et al., both state-of-the-art approaches aimed at reducing manual labeling cost. The experimental results show that MCCondenser achieves an average AUC score of 0.73, improving on the two baselines by nearly 20% and 10%, respectively. 2016-06-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/3566 info:doi/10.1109/COMPSAC.2016.83 https://ink.library.smu.edu.sg/context/sis_research/article/4567/viewcontent/CondensingClassDiagramsMinCost_2016.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Class Diagram Cost Saving Ensemble Learning Manual Labeling Unsupervised Learning Computer Sciences Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Class Diagram
Cost Saving
Ensemble Learning
Manual Labeling
Unsupervised Learning
Computer Sciences
Software Engineering
spellingShingle Class Diagram
Cost Saving
Ensemble Learning
Manual Labeling
Unsupervised Learning
Computer Sciences
Software Engineering
YANG, Xinli
LO, David
XIA, Xin
SUN, Jianling
Condensing class diagrams with minimal manual labeling cost
description Traditionally, to better understand the design of a project, developers can reconstruct a class diagram from source code using reverse engineering techniques. However, the raw diagram is often perplexing because it contains too many classes. Condensing the reverse-engineered class diagram into a compact diagram that contains only the important classes would make the corresponding project easier to understand. A number of recent works have proposed supervised machine learning solutions for condensing reverse-engineered class diagrams, given a set of classes manually labeled as important or not. However, one challenge limits the practicality of these solutions: the high cost of manually labeling training samples. More training samples lead to better performance but require more manual labeling, and too much manual labeling defeats the purpose, since the aim is to identify important classes automatically. In this paper, to bridge this research gap, we propose MCCondenser, a novel approach that requires only a small amount of training data yet still achieves reasonably good performance. MCCondenser first selects a small, highly representative proportion of all data as training data in an unsupervised way using k-means clustering. Next, it uses ensemble learning to handle the class imbalance problem so that a suitable classifier can be constructed from the limited training data. To evaluate the performance of MCCondenser, we use datasets from nine open source projects, i.e., ArgoUML, JavaClient, JGAP, JPMC, Mars, Maze, Neuroph, Wro4J and xUML, containing a total of 2,640 classes. We compare MCCondenser with two baseline approaches proposed by Thung et al., both state-of-the-art approaches aimed at reducing manual labeling cost.
The experimental results show that MCCondenser achieves an average AUC score of 0.73, improving on the two baselines by nearly 20% and 10%, respectively.
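The unsupervised selection step described in the abstract — clustering all classes with k-means and labeling only the most representative ones — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 2-D feature vectors, the choice of k, and the "sample closest to each centroid" selection rule are assumptions for the sketch, and the paper's exact features and procedure may differ.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means over feature vectors; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[idx].append(p)
        # Move each centroid to the mean of its assigned points.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def pick_representatives(points, k):
    """Indices of the samples closest to each centroid — the only
    ones a developer would need to label manually."""
    centroids = kmeans(points, k)
    reps = {min(range(len(points)), key=lambda i: dist2(points[i], c))
            for c in centroids}
    return sorted(reps)
```

With class-level feature vectors (e.g., size or coupling metrics), `pick_representatives(features, k)` yields a small labeling budget of at most k classes; the labeled representatives then serve as training data for the downstream classifier.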
format text
author YANG, Xinli
LO, David
XIA, Xin
SUN, Jianling
author_facet YANG, Xinli
LO, David
XIA, Xin
SUN, Jianling
author_sort YANG, Xinli
title Condensing class diagrams with minimal manual labeling cost
title_short Condensing class diagrams with minimal manual labeling cost
title_full Condensing class diagrams with minimal manual labeling cost
title_fullStr Condensing class diagrams with minimal manual labeling cost
title_full_unstemmed Condensing class diagrams with minimal manual labeling cost
title_sort condensing class diagrams with minimal manual labeling cost
publisher Institutional Knowledge at Singapore Management University
publishDate 2016
url https://ink.library.smu.edu.sg/sis_research/3566
https://ink.library.smu.edu.sg/context/sis_research/article/4567/viewcontent/CondensingClassDiagramsMinCost_2016.pdf
_version_ 1770573329724342272