Towards more accurate content categorization of API discussions

Nowadays, software developers often discuss the usage of various APIs in online forums. Automatically assigning pre-defined semantic categories to API discussions in these forums could help manage the forum data and help developers search for useful information. We refer to this pr...

Bibliographic Details
Main Authors:	Zhou, Bo; Xia, Xin; LO, David; Tian, Cong; Wang, Xinyu
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2014
Subjects:	API Discussion; Text Categorization; Composite Method; Cache-Based Method; Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/2420 https://ink.library.smu.edu.sg/context/sis_research/article/3420/viewcontent/p95_zhou.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-3420
record_format	dspace
spelling	sg-smu-ink.sis_research-3420 2015-11-15T14:30:50Z Towards more accurate content categorization of API discussions 2014-06-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/2420 info:doi/10.1145/2597008.2597142 https://ink.library.smu.edu.sg/context/sis_research/article/3420/viewcontent/p95_zhou.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	API Discussion; Text Categorization; Composite Method; Cache-Based Method; Software Engineering
description	Nowadays, software developers often discuss the usage of various APIs in online forums. Automatically assigning pre-defined semantic categories to API discussions in these forums could help manage the forum data and help developers search for useful information. We refer to this process as content categorization of API discussions. To solve this problem, Hou and Mo proposed the use of naive Bayes multinomial, which is an effective classification algorithm. In this paper, we propose a Cache-bAsed compoSitE algorithm, abbreviated as CASE, to automatically categorize API discussions. Because the content of an API discussion contains both textual description and source code, CASE has 3 components that analyze an API discussion in 3 different ways: text, code, and original. In the text component, CASE considers only the textual description; in the code component, CASE considers only the source code; in the original component, CASE considers the original content of an API discussion, which might include both textual description and source code. Next, for each component, since different terms (i.e., words) have different affinities to different categories, CASE caches the subset of terms that have the highest affinity scores for each category and builds a classifier based on the cached terms. Finally, CASE combines all 3 classifiers to achieve a better accuracy score. We evaluate the performance of CASE on 3 datasets that contain a total of 1,035 API discussions. The experimental results show that CASE achieves accuracy scores of 0.69, 0.77, and 0.96 on the 3 datasets, outperforming the state-of-the-art method proposed by Hou and Mo by 11%, 10%, and 2%, respectively.
format	text
author	Zhou, Bo; Xia, Xin; LO, David; Tian, Cong; Wang, Xinyu
author_sort	Zhou, Bo
title	Towards more accurate content categorization of API discussions
publisher	Institutional Knowledge at Singapore Management University
publishDate	2014
url	https://ink.library.smu.edu.sg/sis_research/2420 https://ink.library.smu.edu.sg/context/sis_research/article/3420/viewcontent/p95_zhou.pdf
_version_	1770572140961071104
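
The description field above outlines CASE in three steps: split each discussion into text, code, and original views; cache the highest-affinity terms per category for each view; and combine the three resulting classifiers. The sketch below illustrates that pipeline shape only. The record does not define the affinity score, the per-component classifier, or the combination rule, so everything concrete here is an assumption: code is assumed to be wrapped in `<code>` tags, affinity is taken to be the share of documents containing a term that belong to the category, each component is a simple sum-of-affinities scorer, and the components are combined by adding normalized scores. All function names and the toy posts are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumption: source code inside a post is wrapped in <code>...</code>.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def split_views(post):
    """Return the text-only, code-only, and original views of a post."""
    code = " ".join(CODE_RE.findall(post))
    text = CODE_RE.sub(" ", post)
    original = post.replace("<code>", " ").replace("</code>", " ")
    return {"text": text, "code": code, "original": original}

def tokenize(s):
    return re.findall(r"[a-z_]+", s.lower())

def build_cache(docs, labels, k):
    """Cache the k highest-affinity terms for each category.

    Assumed affinity: (docs in the category containing the term)
    divided by (all docs containing the term).
    """
    df = Counter()
    df_cat = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for term in set(tokenize(doc)):
            df[term] += 1
            df_cat[label][term] += 1
    cache = {}
    for cat, counts in df_cat.items():
        affinity = {t: c / df[t] for t, c in counts.items()}
        # Ties are broken alphabetically so the cache is deterministic.
        top = sorted(affinity, key=lambda t: (-affinity[t], t))[:k]
        cache[cat] = {t: affinity[t] for t in top}
    return cache

def score(cache, doc):
    """Sum cached affinities of terms present in doc, normalized over categories."""
    terms = set(tokenize(doc))
    raw = {cat: sum(w for t, w in cached.items() if t in terms)
           for cat, cached in cache.items()}
    total = sum(raw.values())
    return {cat: (v / total if total else 0.0) for cat, v in raw.items()}

def train(posts, labels, k=20):
    """Build one term cache per component (text, code, original)."""
    views = [split_views(p) for p in posts]
    return {name: build_cache([v[name] for v in views], labels, k)
            for name in ("text", "code", "original")}

def predict(model, post):
    """Combine the three components by adding their normalized scores."""
    views = split_views(post)
    combined = Counter()
    for name, cache in model.items():
        for cat, s in score(cache, views[name]).items():
            combined[cat] += s
    return max(sorted(combined), key=lambda c: combined[c])

# Toy, made-up forum posts for illustration only.
posts = [
    "How do I set the layout of a panel? <code>panel.setLayout(new GridLayout())</code>",
    "Layout manager ignores my preferred size <code>panel.setPreferredSize(dim)</code>",
    "Button click event is not fired <code>button.addActionListener(listener)</code>",
    "Which listener handles a mouse click event? <code>label.addMouseListener(listener)</code>",
]
labels = ["layout", "layout", "event", "event"]
model = train(posts, labels)
print(predict(model, "My click listener never runs <code>button.addActionListener(l)</code>"))
```

Keeping one cache per view means that a term such as `listener` can carry a different weight depending on whether it appears in prose or in code, which is the motivation the description gives for analyzing the three views separately.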