Automated Construction of a Software-Specific Word Similarity Database

Many automated software engineering approaches, including code search, bug report categorization, and duplicate bug report detection, measure similarities between two documents by analyzing natural language contents. Often different words are used to express the same meaning and thus measuring simil...

Bibliographic Details
Main Authors:	TIAN, Yuan; LO, David; Lawall, Julia
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2014
Subjects:	Computer Sciences; Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/2033 https://ink.library.smu.edu.sg/context/sis_research/article/3032/viewcontent/csmr_wcre14_wordsim_av.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-3032
record_format	dspace
spelling	sg-smu-ink.sis_research-3032 2020-12-04T02:33:47Z Automated Construction of a Software-Specific Word Similarity Database TIAN, Yuan; LO, David; Lawall, Julia 2014-02-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/2033 info:doi/10.1109/CSMR-WCRE.2014.6747213 https://ink.library.smu.edu.sg/context/sis_research/article/3032/viewcontent/csmr_wcre14_wordsim_av.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Computer Sciences; Software Engineering
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Computer Sciences; Software Engineering
description	Many automated software engineering approaches, including code search, bug report categorization, and duplicate bug report detection, measure the similarity between two documents by analyzing their natural language contents. Different words are often used to express the same meaning, so measuring similarity by exact matching of words is insufficient. To solve this problem, past studies have shown the need to measure the similarities between pairs of words. To meet this need, the natural language processing community has built WordNet, a manually constructed lexical database that records semantic relations among words and can be used to measure how similar two words are. However, WordNet is a general-purpose resource and often does not contain software-specific words. Also, the meanings of words in WordNet often differ from their meanings in a software engineering context. Thus, there is a need for a software-specific WordNet-like resource that can measure similarities of words. In this work, we propose an automated approach that builds a software-specific WordNet-like resource, named WordSimSEDB, by leveraging the textual contents of posts in StackOverflow. Our approach measures the similarity of words by computing the similarities of the weighted co-occurrences of these words with three types of words in the textual corpus. We have evaluated our approach on a set of software-specific words and compared it with an existing WordNet-based technique (WordNetres) on returning the top-k most similar words. Human judges evaluated the effectiveness of the two techniques. We find that WordNetres returns no result for 55% of the queries. For the remaining queries, WordNetres returns significantly poorer results.
format	text
author	TIAN, Yuan; LO, David; Lawall, Julia
author_sort	TIAN, Yuan
title	Automated Construction of a Software-Specific Word Similarity Database
publisher	Institutional Knowledge at Singapore Management University
publishDate	2014
url	https://ink.library.smu.edu.sg/sis_research/2033 https://ink.library.smu.edu.sg/context/sis_research/article/3032/viewcontent/csmr_wcre14_wordsim_av.pdf
_version_	1770571777279262720


Similar Items