Enhancing domain knowledge sharing via mining software engineering related web resources
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2022 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/159310 |
Institution: | Nanyang Technological University |
Summary: | Software development web resources and their artefacts, e.g., method names, tags and programming concepts, are important for software development and maintenance. Semantic mining and extraction of these multi-form software-related artefacts can facilitate in-depth comprehension of other people’s code as well as technical discussions on online Q&A forums such as Stack Overflow, thereby improving the efficiency of knowledge learning and sharing in the software community. However, online resources are updated second by second and exist in vast quantities, which makes manual extraction or construction of such knowledge impractical.
In this thesis, we tackle the problem of knowledge retrieval and discovery in the software domain by exploiting the large amount of available online resources with three different operations: retrieve, link, and generate. Firstly, we represent two heterogeneous forms of online resources: tags and web resources. We propose a neural-language-model-based approach to jointly learn semantic representations of tags and web resources extracted from Q&A websites. Instead of mining the textual content of discussions or relying on statistics such as the co-occurrence of tags and web resources, our approach uses word embedding techniques to map tags and web resources to low-dimensional vectors in a joint vector space, where semantically related items lie close to each other. This allows us to measure the relatedness between a web resource and a technical term efficiently, and reduces a complicated clustering or recommendation task to a simple K-nearest-neighbour search in the embedding space. Our experiments on a Stack Overflow data set show that the learned representations capture the semantic relatedness of tags and web resources well, even when they appear in seemingly different contexts.
Secondly, we extend the task of globally searching for semantically related pairs of tags and web resources to a context-sensitive setting: automatically linking software artefacts mentioned in Q&A threads with web resources. As an easy-to-use mechanism for knowledge sharing, referencing the URLs of external web resources is widely used on Q&A websites. However, effective methods to manage and reuse already-shared web resources are lacking. To fill this gap, we formulate the issue as a URL-sense disambiguation problem and propose novel metrics to resolve the linking ambiguity. The proposed metrics consider both the global popularity and the local context relatedness of the URL candidates. To evaluate our metrics, we also build a large knowledge base of already-shared official document URLs from posts on Stack Overflow, and present results from systematic evaluations in which a balanced combination of global popularity and local context relatedness is shown to significantly improve the performance of resource linking.
Finally, we aim to extend the scope of the study beyond pre-existing content, i.e., artefact names, tags or web resources that have already been created. Finding good names for language constructs is important for code understanding and maintenance. However, this is not a trivial task. Good method names need to be functionally descriptive and at the same time follow naming conventions, which include common rules for lexical choices as well as language-specific conventions. To help developers, especially novices, discover the underlying knowledge in naming, we propose a neural generation network to directly generate conventional method names from a natural language description of the code. Unlike rule-based or unsupervised methods, we formulate the task of generating high-quality method names from natural language as a sequence-to-sequence problem and propose using an encoder-decoder model to learn the naming conventions and automatically generate method names. To improve generation performance, we also incorporate attention and copying mechanisms. We conduct comparison experiments with state-of-the-art baseline models, and the results show that the proposed method achieves significant improvement in method name generation. We also explore the transferability of naming conventions across different projects and demonstrate that it is promising to transfer the knowledge learned from massive open source data to specific projects.
In summary, our work focuses on modelling and leveraging the semantics of software-related artefacts to enhance developers’ experience in knowledge sharing and discovery. The proposed methods provide novel insights into measuring the semantic relatedness or similarity of different forms of software-related artefacts, and can also be extended to other applications such as web resource dissemination. |
---|
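
The two retrieval ideas in the summary, a K-nearest-neighbour search over a joint embedding space and a link score that balances global popularity against local context relatedness, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the vectors are made-up toy values, the item names are placeholders, and `link_score` is a hypothetical linear blend with weight `alpha`. The summary does not state the thesis's actual metrics or model, so this should not be read as the author's method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, embeddings, k=2):
    """Return the k (name, similarity) pairs closest to the query item."""
    q = embeddings[query]
    scored = [
        (name, cosine(q, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def link_score(popularity, relatedness, alpha=0.5):
    """Hypothetical blend: alpha weights global popularity, the rest context."""
    return alpha * popularity + (1 - alpha) * relatedness


# Toy joint space (assumed values): tags and URLs share the same dimensions.
embeddings = {
    "tag:python": [0.9, 0.1, 0.0],
    "url:python-docs": [0.8, 0.2, 0.1],
    "tag:java": [0.1, 0.9, 0.0],
    "url:java-docs": [0.2, 0.8, 0.1],
}
neighbours = nearest("tag:python", embeddings, k=1)

# Toy URL candidates for one mention: (global popularity, context relatedness).
candidates = {
    "url:candidate-a": (0.9, 0.2),
    "url:candidate-b": (0.4, 0.9),
}
best = max(candidates, key=lambda url: link_score(*candidates[url]))

print(neighbours[0][0], best)
```

With these toy values the nearest neighbour of `tag:python` is `url:python-docs`, and the balanced score prefers `url:candidate-b`, the less popular but more context-relevant candidate. Setting `alpha=1.0` would reduce the choice to popularity alone.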