Stop words for processing software engineering documents: Do they matter

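Stop words, which are considered non-predictive, are often eliminated in natural language processing tasks. However, the definition of uninformative vocabulary is vague, so most algorithms use general knowledge-based stop lists to remove stop words. There is an ongoing debate among academics about the usefulness of stop word elimination, especially in domain-specific settings. In this work, we investigate the usefulness of stop word removal in a software engineering context. To do this, we replicate and experiment with three software engineering research tools from related work. Additionally, we construct a corpus of software engineering domain-related text from 10,000 Stack Overflow questions and identify 200 domain-specific stop words using traditional information-theoretic methods. Our results show that the use of domain-specific stop words significantly improved the performance of research tools compared to the use of a general stop list and that 17 out of 19 evaluation measures showed better performance.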
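The full abstract in this record mentions identifying domain-specific stop words from a corpus using information-theoretic methods, without naming the measures. As an illustration only, the sketch below ranks terms by the normalized entropy of their distribution over documents: a term spread evenly across all documents scores near 1 and is a stop word candidate. The entropy criterion, the tokenizer, and the names `tokenize`, `term_entropy`, and `candidate_stop_words` are assumptions made for this example, not the authors' procedure.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def term_entropy(docs):
    """Map each term to the normalized entropy of its distribution over documents.

    1.0 means the term's occurrences are spread evenly over every document;
    0.0 means all of its occurrences fall in a single document.
    """
    per_term = defaultdict(Counter)  # term -> {document index: occurrence count}
    for index, doc in enumerate(docs):
        for token in tokenize(doc):
            per_term[token][index] += 1

    max_entropy = math.log(len(docs)) if len(docs) > 1 else 0.0
    scores = {}
    for term, counts in per_term.items():
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        scores[term] = entropy / max_entropy if max_entropy > 0 else 0.0
    return scores


def candidate_stop_words(docs, k):
    """Return the k terms with the highest entropy; ties are broken alphabetically."""
    scores = term_entropy(docs)
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


if __name__ == "__main__":
    # Toy stand-ins for question texts; a real corpus would be far larger.
    questions = [
        "how do i parse json in python",
        "how do i sort a list in java",
        "how do i fix a null pointer in java",
    ]
    print(candidate_stop_words(questions, 4))  # ['do', 'how', 'i', 'in']
```

On a realistic corpus the cut-off `k` plays the role of the list size (200 in the abstract), and the ranked list would normally be reviewed by hand before use.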


Bibliographic Details
Main Authors: FAN, Yaohou, ARORA, Chetan, TREUDE, Christoph
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
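Subjects: Natural Language Processing (NLP), Software Engineering Documents, Stop Words, Software Engineering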
Online Access: https://ink.library.smu.edu.sg/sis_research/8912
https://ink.library.smu.edu.sg/context/sis_research/article/9915/viewcontent/stop.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9915
record_format dspace
spelling sg-smu-ink.sis_research-9915
2024-06-27T08:08:20Z
2023-05-01T07:00:00Z
text
application/pdf
info:doi/10.1109/NLBSE59153.2023.00016
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Natural Language Processing (NLP)
Software Engineering Documents
Stop Words
Software Engineering
description Stop words, which are considered non-predictive, are often eliminated in natural language processing tasks. However, the definition of uninformative vocabulary is vague, so most algorithms use general knowledge-based stop lists to remove stop words. There is an ongoing debate among academics about the usefulness of stop word elimination, especially in domain-specific settings. In this work, we investigate the usefulness of stop word removal in a software engineering context. To do this, we replicate and experiment with three software engineering research tools from related work. Additionally, we construct a corpus of software engineering domain-related text from 10,000 Stack Overflow questions and identify 200 domain-specific stop words using traditional information-theoretic methods. Our results show that the use of domain-specific stop words significantly improved the performance of research tools compared to the use of a general stop list and that 17 out of 19 evaluation measures showed better performance.
format text
author FAN, Yaohou
ARORA, Chetan
TREUDE, Christoph
title Stop words for processing software engineering documents: Do they matter
publisher Institutional Knowledge at Singapore Management University
publishDate 2023
url https://ink.library.smu.edu.sg/sis_research/8912
https://ink.library.smu.edu.sg/context/sis_research/article/9915/viewcontent/stop.pdf