Enhancing domain knowledge sharing via mining software engineering related web resources

Software development web resources and their artefacts, e.g., method names, tags and programming concepts, are important for software development and maintenance. Semantic mining and extraction of these multi-form software-related artefacts can facilitate in-depth comprehension of other people’s cod...

Main Author: Gao, Sa
Other Authors: Lin Shang-Wei
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2022
Subjects: Engineering::Computer science and engineering::Software::Software engineering
Online Access: https://hdl.handle.net/10356/159310
Institution: Nanyang Technological University
id sg-ntu-dr.10356-159310
record_format dspace
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
topic Engineering::Computer science and engineering::Software::Software engineering
description Software development web resources and their artefacts, e.g., method names, tags and programming concepts, are important for software development and maintenance. Semantic mining and extraction of these multi-form software-related artefacts can facilitate in-depth comprehension of other people’s code as well as technical discussions on online Q&A forums such as Stack Overflow, thereby improving the efficiency of knowledge learning and sharing in the software community. However, online resources are vast and are updated from second to second, which makes it impractical to extract or construct such knowledge manually. In this thesis, we tackle the problem of knowledge retrieval and discovery in the software domain by exploiting the large amount of available online resources with three different operations: retrieve, link, and generate.
Firstly, we represent two heterogeneous forms of online resources: tags and web resources. We propose a neural-language-model-based approach to jointly learn semantic representations of tags and web resources extracted from Q&A websites. Instead of mining the textual content of discussions or relying on statistics such as the co-occurrence of tags and web resources, our approach uses word embedding techniques to map tags and web resources to low-dimensional vectors in a joint vector space, where semantically related items lie close to each other. This allows us to measure the relatedness between a web resource and a technical term efficiently, and turns a complicated clustering or recommendation task into a simple K-nearest-neighbour search in the embedding space. Our experiments on a Stack Overflow data set show that the learned representations capture the semantic relatedness of tags and web resources well, even when they appear in seemingly different contexts.
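The K-nearest-neighbour search over a joint embedding space mentioned in the description can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the vectors are hand-made toy values rather than learned embeddings, the `tag:` and `url:` keys are hypothetical, and cosine similarity is one common relatedness measure, not necessarily the one used in the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbours(query, embeddings, k=3):
    """Return the k items most similar to `query`, as (name, similarity) pairs."""
    query_vec = embeddings[query]
    scored = [
        (name, cosine(query_vec, vec))
        for name, vec in embeddings.items()
        if name != query  # never return the query itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical joint space: tags and web resources share the same dimensions.
# The numbers are invented for this sketch, not taken from the thesis.
EMBEDDINGS = {
    "tag:python": [0.9, 0.1, 0.0],
    "tag:java": [0.1, 0.9, 0.0],
    "url:https://example.org/python-regex-docs": [0.8, 0.2, 0.1],
    "url:https://example.org/java-api-docs": [0.2, 0.8, 0.1],
}

if __name__ == "__main__":
    for name, score in nearest_neighbours("tag:python", EMBEDDINGS, k=2):
        print(f"{name}\t{score:.3f}")
```

Because tags and URLs live in the same space, the same function serves both directions: querying with a tag recommends resources, and querying with a URL suggests related tags.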
Secondly, we extend the task of globally searching for semantically related pairs of tags and web resources to a context-sensitive setting: automatically linking software artefacts mentioned in Q&A threads with web resources. As an easy-to-use mechanism for knowledge sharing, referencing the URLs of external web resources is widely used on Q&A websites. However, there is a lack of effective methods to manage and reuse already-shared web resources. To fill this gap, we formulate the issue as a URL-sense disambiguation problem and propose novel metrics to resolve the linking ambiguity. The proposed metrics consider both the global popularity and the local context relatedness of the URL candidates. To evaluate our metrics, we also build a large knowledge base of already-shared official document URLs from posts on Stack Overflow, and present systematic evaluations showing that a balanced combination of global popularity and local context relatedness significantly improves the performance of resource linking.
Finally, we aim to extend the scope of the study beyond pre-existing content, i.e., artefact names, tags or web resources that have already been created. Finding good names for language constructs is important for code understanding and maintenance. However, this is not a trivial task. Good method names need to be functionally descriptive and at the same time follow naming conventions, which include common rules for lexical choices as well as language-specific conventions. To help developers, especially novices, discover the underlying knowledge in naming, we propose a neural generation network that directly generates conventional method names from a natural language description of the code. In contrast to rule-based or unsupervised methods, we formulate the task of generating high-quality method names from natural language as a sequence-to-sequence problem and propose using an encoder-decoder model to learn the naming conventions and automatically generate method names. To improve generation performance, attention and copying mechanisms are incorporated as well. We conduct comparison experiments with state-of-the-art baseline models, and the results show that the proposed method achieves a significant improvement in method name generation. We also explore the transferability of naming conventions across projects and demonstrate that transferring the knowledge learned from massive open-source data to specific projects is promising.
In summary, our work focuses on modelling and leveraging the semantics of software-related artefacts to enhance developers’ experience in knowledge sharing and discovery. The proposed methods provide novel insights into measuring the semantic relatedness or similarity of different forms of software-related artefacts, and can also be extended to other applications such as web resource dissemination.
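To make the role of the copying mechanism more concrete, the sketch below shows one common formulation, pointer-generator-style mixing, in which a single decoding step blends a distribution over a fixed vocabulary with the attention weights over the input description. This is an assumption for illustration only: the description does not say which copy formulation the thesis uses, and the function name, token lists and probabilities are all hypothetical.

```python
def mix_generate_and_copy(p_gen, vocab_probs, attention, source_tokens):
    """Blend generation and copy distributions for one decoding step.

    p_gen         -- probability of generating from the vocabulary (0..1)
    vocab_probs   -- dict: vocabulary token -> generation probability (sums to 1)
    attention     -- attention weights over source positions (sums to 1)
    source_tokens -- tokens of the natural language description, same length as attention
    """
    if len(attention) != len(source_tokens):
        raise ValueError("attention and source_tokens must have the same length")

    # Generation part, scaled by p_gen.
    final = {token: p_gen * prob for token, prob in vocab_probs.items()}

    # Copy part: each source position contributes its attention mass to its token,
    # so words outside the vocabulary can still be produced.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


if __name__ == "__main__":
    # Hypothetical step while naming a method described as "parse config file".
    step = mix_generate_and_copy(
        p_gen=0.4,
        vocab_probs={"get": 0.5, "parse": 0.3, "set": 0.2},
        attention=[0.2, 0.7, 0.1],
        source_tokens=["parse", "config", "file"],
    )
    best = max(step, key=step.get)
    print(best, round(step[best], 3))
```

In this toy step the word `config` is not in the vocabulary at all, yet it can still be emitted because the copy path moves attention mass onto it. That is the practical appeal of copying for method names, which often reuse domain words from the description.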
spelling sg-ntu-dr.10356-159310 2022-06-24T05:44:57Z
Gao, Sa
Lin Shang-Wei
School of Computer Science and Engineering
shang-wei.lin@ntu.edu.sg
Doctor of Philosophy 2022-06-15T01:35:36Z 2022-06-15T01:35:36Z 2022 Thesis-Doctor of Philosophy
Gao, S. (2022). Enhancing domain knowledge sharing via mining software engineering related web resources. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/159310
en
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
application/pdf