SEWordSim: Software-Specific Word Similarity Database

Measuring the similarity of words is important in accurately representing and comparing documents, and thus improves the results of many natural language processing (NLP) tasks. The NLP community has proposed various measurements based on WordNet, a lexical database that contains relationships betwe...

Bibliographic Details
Main Authors:	TIAN, Yuan; LO, David; Lawall, Julia
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2014
Subjects:	word similarity java SEWordSim Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/2179 https://ink.library.smu.edu.sg/context/sis_research/article/3179/viewcontent/icse14_wordsimilarity.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-3179
record_format	dspace
spelling	sg-smu-ink.sis_research-3179 2020-12-04T03:26:41Z SEWordSim: Software-Specific Word Similarity Database TIAN, Yuan LO, David Lawall, Julia Measuring the similarity of words is important in accurately representing and comparing documents, and thus improves the results of many natural language processing (NLP) tasks. The NLP community has proposed various measurements based on WordNet, a lexical database that contains relationships between many pairs of words. Recently, a number of techniques have been proposed to address software engineering issues such as code search and fault localization that require understanding natural language documents, and a measure of word similarity could improve their results. However, WordNet only contains information about word senses in general-purpose conversation, which often differ from word senses in a software-engineering context, and the software-specific word similarity resources that have been developed rely on data sources containing only a limited range of words and word uses. In recent work, we have proposed a word similarity resource based on information collected automatically from StackOverflow. We have found that the results of this resource are given scores on a 3-point Likert scale that are over 50% higher than the results of a resource based on WordNet. In this demo paper, we review our data collection methodology and propose a Java API to make the resulting word similarity resource useful in practice. The SEWordSim database and related information can be found at http://goo.gl/BVEAs8. Demo video is available at http://goo.gl/dyNwyb.
2014-06-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/2179 info:doi/10.1145/2591062.2591071 https://ink.library.smu.edu.sg/context/sis_research/article/3179/viewcontent/icse14_wordsimilarity.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University word similarity java SEWordSim Software Engineering
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	word similarity java SEWordSim Software Engineering
spellingShingle	word similarity java SEWordSim Software Engineering TIAN, Yuan LO, David Lawall, Julia SEWordSim: Software-Specific Word Similarity Database
description	Measuring the similarity of words is important in accurately representing and comparing documents, and thus improves the results of many natural language processing (NLP) tasks. The NLP community has proposed various measurements based on WordNet, a lexical database that contains relationships between many pairs of words. Recently, a number of techniques have been proposed to address software engineering issues such as code search and fault localization that require understanding natural language documents, and a measure of word similarity could improve their results. However, WordNet only contains information about word senses in general-purpose conversation, which often differ from word senses in a software-engineering context, and the software-specific word similarity resources that have been developed rely on data sources containing only a limited range of words and word uses. In recent work, we have proposed a word similarity resource based on information collected automatically from StackOverflow. We have found that the results of this resource are given scores on a 3-point Likert scale that are over 50% higher than the results of a resource based on WordNet. In this demo paper, we review our data collection methodology and propose a Java API to make the resulting word similarity resource useful in practice. The SEWordSim database and related information can be found at http://goo.gl/BVEAs8. Demo video is available at http://goo.gl/dyNwyb.
format	text
author	TIAN, Yuan LO, David Lawall, Julia
author_facet	TIAN, Yuan LO, David Lawall, Julia
author_sort	TIAN, Yuan
title	SEWordSim: Software-Specific Word Similarity Database
publisher	Institutional Knowledge at Singapore Management University
publishDate	2014
url	https://ink.library.smu.edu.sg/sis_research/2179 https://ink.library.smu.edu.sg/context/sis_research/article/3179/viewcontent/icse14_wordsimilarity.pdf
_version_	1770571831633248256

