A Comparative Study on the Effectiveness of Part-of-speech Tagging Techniques on Bug Reports

Many software artifacts are written in natural language or contain a substantial amount of natural language content. Thus these artifacts could be analyzed using text analysis techniques from the natural language processing (NLP) community, e.g., the part-of-speech (POS) tagging technique that assign...

Bibliographic Details
Main Authors: TIAN, Yuan; LO, David
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2014
Subjects: Computer Sciences; Software Engineering
Online Access: https://ink.library.smu.edu.sg/sis_research/2861
https://ink.library.smu.edu.sg/context/sis_research/article/3861/viewcontent/SANER2015_ERA_av.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-3861
record_format dspace
spelling sg-smu-ink.sis_research-3861 2020-12-04T03:11:16Z A Comparative Study on the Effectiveness of Part-of-speech Tagging Techniques on Bug Reports TIAN, Yuan LO, David
2014-03-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/2861 info:doi/10.1109/SANER.2015.7081879 https://ink.library.smu.edu.sg/context/sis_research/article/3861/viewcontent/SANER2015_ERA_av.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Computer Sciences Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Computer Sciences
Software Engineering
spellingShingle Computer Sciences
Software Engineering
TIAN, Yuan
LO, David
A Comparative Study on the Effectiveness of Part-of-speech Tagging Techniques on Bug Reports
description Many software artifacts are written in natural language or contain a substantial amount of natural language content. Thus these artifacts could be analyzed using text analysis techniques from the natural language processing (NLP) community, e.g., the part-of-speech (POS) tagging technique, which assigns POS tags (e.g., verb, noun) to words in a sentence. In the literature, several studies have already applied POS tagging techniques to software artifacts to recover important words in them, which are then used for automating various tasks, e.g., locating buggy files for a given bug report. Many POS tagging techniques have been proposed, but they are trained and evaluated on non-software-engineering corpora (documents). Thus it is unknown whether they can correctly identify the POS of a word in a software artifact, and which of them performs best. To fill this gap, in this work, we investigate the effectiveness of seven POS taggers on bug reports. We randomly sample 100 bug reports from the Eclipse and Mozilla projects and create a text corpus that contains 21,713 words. We manually assign POS tags to these words and use them to evaluate the studied POS taggers. Our comparative study shows that the state-of-the-art POS taggers achieve an accuracy of 83.6%-90.5% on bug reports, and that the Stanford POS tagger and the TreeTagger achieve the highest accuracy on the sampled bug reports. Our findings show that researchers could use these POS taggers to analyze software artifacts if an accuracy of 80-90% is acceptable for their specific needs, and we recommend using the Stanford POS tagger or the TreeTagger.
format text
author TIAN, Yuan
LO, David
author_facet TIAN, Yuan
LO, David
author_sort TIAN, Yuan
title A Comparative Study on the Effectiveness of Part-of-speech Tagging Techniques on Bug Reports
title_short A Comparative Study on the Effectiveness of Part-of-speech Tagging Techniques on Bug Reports
title_full A Comparative Study on the Effectiveness of Part-of-speech Tagging Techniques on Bug Reports
title_fullStr A Comparative Study on the Effectiveness of Part-of-speech Tagging Techniques on Bug Reports
title_full_unstemmed A Comparative Study on the Effectiveness of Part-of-speech Tagging Techniques on Bug Reports
title_sort comparative study on the effectiveness of part-of-speech tagging techniques on bug reports
publisher Institutional Knowledge at Singapore Management University
publishDate 2014
url https://ink.library.smu.edu.sg/sis_research/2861
https://ink.library.smu.edu.sg/context/sis_research/article/3861/viewcontent/SANER2015_ERA_av.pdf
_version_ 1770572644110827520