Predicting good configurations for GitHub and Stack Overflow topic models


Bibliographic Details
Main Authors: TREUDE, Christoph, WAGNER, Markus
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2019
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/8836
https://ink.library.smu.edu.sg/context/sis_research/article/9839/viewcontent/msr19a.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-9839
record_format dspace
spelling sg-smu-ink.sis_research-9839 2024-06-06T08:48:06Z Predicting good configurations for github and stack overflow topic models TREUDE, Christoph WAGNER, Markus Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories.
2019-05-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/8836 info:doi/10.1109/MSR.2019.00022 https://ink.library.smu.edu.sg/context/sis_research/article/9839/viewcontent/msr19a.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Algorithm portfolio Corpus features Topic modelling Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Algorithm portfolio
Corpus features
Topic modelling
Software Engineering
spellingShingle Algorithm portfolio
Corpus features
Topic modelling
Software Engineering
TREUDE, Christoph
WAGNER, Markus
Predicting good configurations for github and stack overflow topic models
description Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories.
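The abstract notes that LDA needs several hyper-parameters and that popular rules of thumb for setting them did not transfer to the GitHub and Stack Overflow corpora studied. As a minimal illustration of what such a configuration looks like, the sketch below computes the widely cited rule-of-thumb LDA priors (alpha = 50/K, beta = 0.01) and pairs them with a tuning grid; the grid values are hypothetical and do not come from the paper.

```python
# Illustrative sketch only: the classic rule-of-thumb LDA priors
# (alpha = 50/K, beta = 0.01) that the paper reports are not a good
# fit for GitHub / Stack Overflow corpora, plus a hypothetical
# per-corpus tuning grid (values assumed, not taken from the paper).

def rule_of_thumb_priors(num_topics: int) -> dict:
    """Return the commonly cited default LDA hyper-parameters for K topics."""
    return {
        "num_topics": num_topics,
        "alpha": 50.0 / num_topics,  # document-topic prior
        "beta": 0.01,                # topic-word prior
    }

# Hypothetical search space for per-corpus configuration tuning.
PARAM_GRID = {
    "num_topics": [10, 20, 50, 100],
    "alpha": [0.01, 0.1, 1.0],
    "beta": [0.01, 0.1, 1.0],
}

if __name__ == "__main__":
    print(rule_of_thumb_priors(20))
    # -> {'num_topics': 20, 'alpha': 2.5, 'beta': 0.01}
```

In practice these values would be passed to an LDA implementation (e.g. the `alpha`/`eta` arguments of a gensim `LdaModel`, or `doc_topic_prior`/`topic_word_prior` in scikit-learn) and scored per corpus, which is the kind of per-corpus configuration the paper predicts.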
format text
author TREUDE, Christoph
WAGNER, Markus
author_facet TREUDE, Christoph
WAGNER, Markus
author_sort TREUDE, Christoph
title Predicting good configurations for github and stack overflow topic models
title_short Predicting good configurations for github and stack overflow topic models
title_full Predicting good configurations for github and stack overflow topic models
title_fullStr Predicting good configurations for github and stack overflow topic models
title_full_unstemmed Predicting good configurations for github and stack overflow topic models
title_sort predicting good configurations for github and stack overflow topic models
publisher Institutional Knowledge at Singapore Management University
publishDate 2019
url https://ink.library.smu.edu.sg/sis_research/8836
https://ink.library.smu.edu.sg/context/sis_research/article/9839/viewcontent/msr19a.pdf
_version_ 1814047570395136000