Who is who in the mailing list? Comparing six disambiguation heuristics to identify multiple addresses of a participant

Many software projects adopt mailing lists for the communication of developers and users. Researchers have been mining the history of such lists to study communities' behavior, organization, and evolution. A potential threat of this kind of study is that users often use multiple email addresses...

Bibliographic Details
Main Authors:	WIESE, Igor Scaliante, DA SILVA, José Teodoro, STEINMACHER, Igor, TREUDE, Christoph, GEROSA, Marco Aurélio
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2016
Subjects:	Apache software foundation; Email address disambiguation; Mailing lists; Mining software repositories; Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/8774 https://ink.library.smu.edu.sg/context/sis_research/article/9777/viewcontent/icsme16b.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-9777
record_format	dspace
spelling	sg-smu-ink.sis_research-9777 2024-05-23T03:51:35Z 2016-10-01T07:00:00Z text application/pdf info:doi/10.1109/ICSME.2016.13 http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Apache software foundation; Email address disambiguation; Mailing lists; Mining software repositories; Software Engineering
description	Many software projects adopt mailing lists for communication among developers and users. Researchers have been mining the history of such lists to study communities' behavior, organization, and evolution. A potential threat to this kind of study is that users often use multiple email addresses to interact in a single mailing list. This can affect results and tools, for example when extracting social networks. This issue is particularly relevant for popular and long-term Open Source Software (OSS) projects, which attract the participation of thousands of people. Researchers have proposed heuristics to identify multiple email addresses from the same participant; however, few studies analyze the effectiveness of these heuristics. In addition, many studies still do not use any heuristic for author disambiguation, which can compromise their results. In this paper, we compare six heuristics from the literature using data from 150 mailing lists from Apache Software Foundation projects. We found that the heuristics proposed by Oliva et al. and a Naïve heuristic outperformed the others in most cases when considering the F-measure metric. We also found that the time window and the size of the dataset influence the effectiveness of each heuristic. These results may help researchers and tool developers choose the most appropriate heuristic, and they highlight the need to deal with identity disambiguation, mainly in open source software communities with a large number of participants.
format	text
author	WIESE, Igor Scaliante; DA SILVA, José Teodoro; STEINMACHER, Igor; TREUDE, Christoph; GEROSA, Marco Aurélio
author_sort	WIESE, Igor Scaliante
title	Who is who in the mailing list? Comparing six disambiguation heuristics to identify multiple addresses of a participant
publisher	Institutional Knowledge at Singapore Management University
publishDate	2016
url	https://ink.library.smu.edu.sg/sis_research/8774 https://ink.library.smu.edu.sg/context/sis_research/article/9777/viewcontent/icsme16b.pdf
_version_	1814047525451071488

Similar Items