Choosing an NLP library for analyzing software documentation: A systematic literature review and a series of experiments

To uncover interesting and actionable information from natural language documents authored by software developers, many researchers rely on "out-of-the-box" NLP libraries. However, software artifacts written in natural language are different from other textual documents due to the technical language used. In this paper, we first analyze the state of the art through a systematic literature review in which we find that only a small minority of papers justify their choice of an NLP library. We then report on a series of experiments in which we applied four state-of-the-art NLP libraries to publicly available software artifacts from three different sources. Our results show low agreement between different libraries (only between 60% and 71% of tokens were assigned the same part-of-speech tag by all four libraries) as well as differences in accuracy depending on source: For example, spaCy achieved the best accuracy on Stack Overflow data with nearly 90% of tokens tagged correctly, while it was clearly outperformed by Google's SyntaxNet when parsing GitHub ReadMe files. Our work implies that researchers should make an informed decision about the particular NLP library they choose and that customizations to libraries might be necessary to achieve good results when analyzing software artifacts written in natural language.


Bibliographic Details
Main Authors: AL OMRAN, Fouad N. A., TREUDE, Christoph
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2017
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/8850
https://ink.library.smu.edu.sg/context/sis_research/article/9853/viewcontent/msr17.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-9853
record_format dspace
title Choosing an NLP library for analyzing software documentation: A systematic literature review and a series of experiments
author AL OMRAN, Fouad N. A.; TREUDE, Christoph
description To uncover interesting and actionable information from natural language documents authored by software developers, many researchers rely on "out-of-the-box" NLP libraries. However, software artifacts written in natural language are different from other textual documents due to the technical language used. In this paper, we first analyze the state of the art through a systematic literature review in which we find that only a small minority of papers justify their choice of an NLP library. We then report on a series of experiments in which we applied four state-of-the-art NLP libraries to publicly available software artifacts from three different sources. Our results show low agreement between different libraries (only between 60% and 71% of tokens were assigned the same part-of-speech tag by all four libraries) as well as differences in accuracy depending on source: For example, spaCy achieved the best accuracy on Stack Overflow data with nearly 90% of tokens tagged correctly, while it was clearly outperformed by Google's SyntaxNet when parsing GitHub ReadMe files. Our work implies that researchers should make an informed decision about the particular NLP library they choose and that customizations to libraries might be necessary to achieve good results when analyzing software artifacts written in natural language.
format text application/pdf
language English
publisher Institutional Knowledge at Singapore Management University
publishDate 2017-05-01
doi 10.1109/MSR.2017.42
license http://creativecommons.org/licenses/by-nc-nd/4.0/
collection Research Collection School Of Computing and Information Systems, InK@SMU (SMU Libraries)
institution Singapore Management University
topic Natural language processing; NLP libraries; Part-of-Speech tagging; Software documentation; Software Engineering
url https://ink.library.smu.edu.sg/sis_research/8850
url https://ink.library.smu.edu.sg/context/sis_research/article/9853/viewcontent/msr17.pdf
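The token-level agreement metric the abstract reports (60% to 71% of tokens assigned the same part-of-speech tag by all four libraries) can be sketched as below. This is a minimal illustration, not the paper's code: the tag sequences are hypothetical stand-ins for the aligned per-library outputs that tools such as spaCy or SyntaxNet would produce.

```python
def pos_agreement(taggings):
    """Fraction of token positions where every library assigned
    the same POS tag. `taggings` is a list of tag sequences, one
    per library, all aligned to the same tokenization."""
    assert taggings and all(len(t) == len(taggings[0]) for t in taggings)
    same = sum(1 for tags in zip(*taggings) if len(set(tags)) == 1)
    return same / len(taggings[0])

# Hypothetical tags for the tokens "git clone the repo" from three taggers;
# the second tagger disagrees on "clone" (noun vs. verb).
lib_a = ["NN", "VB", "DT", "NN"]
lib_b = ["NN", "NN", "DT", "NN"]
lib_c = ["NN", "VB", "DT", "NN"]
print(pos_agreement([lib_a, lib_b, lib_c]))  # → 0.75
```

Disagreements of exactly this kind (e.g., software-specific words like "clone" read as nouns by one tagger and verbs by another) are why the paper argues the choice of library matters for software artifacts.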