Who is who in the mailing list? Comparing six disambiguation heuristics to identify multiple addresses of a participant

Many software projects adopt mailing lists for communication among developers and users. Researchers have been mining the history of such lists to study communities' behavior, organization, and evolution. A potential threat to this kind of study is that users often use multiple email addresses...

Bibliographic Details
Main Authors: WIESE, Igor Scaliante, DA SILVA, José Teodoro, STEINMACHER, Igor, TREUDE, Christoph, GEROSA, Marco Aurélio
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2016
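Subjects: Apache software foundation; Email address disambiguation; Mailing lists; Mining software repositories; Software Engineering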
Online Access: https://ink.library.smu.edu.sg/sis_research/8774
https://ink.library.smu.edu.sg/context/sis_research/article/9777/viewcontent/icsme16b.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9777
record_format dspace
spelling sg-smu-ink.sis_research-9777 2024-05-23T03:51:35Z
2016-10-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/8774 info:doi/10.1109/ICSME.2016.13 https://ink.library.smu.edu.sg/context/sis_research/article/9777/viewcontent/icsme16b.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Apache software foundation Email address disambiguation Mailing lists Mining software repositories Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Apache software foundation
Email address disambiguation
Mailing lists
Mining software repositories
Software Engineering
description Many software projects adopt mailing lists for communication among developers and users. Researchers have been mining the history of such lists to study communities' behavior, organization, and evolution. A potential threat to this kind of study is that users often use multiple email addresses to interact in a single mailing list. This can affect results and tools when, for example, extracting social networks. This issue is particularly relevant for popular and long-term Open Source Software (OSS) projects, which attract the participation of thousands of people. Researchers have proposed heuristics to identify multiple email addresses from the same participant; however, few studies analyze the effectiveness of these heuristics. In addition, many studies still do not use any heuristic for author disambiguation, which can compromise their results. In this paper, we compare six heuristics from the literature using data from 150 mailing lists from Apache Software Foundation projects. We found that the heuristics proposed by Oliva et al. and a Naïve heuristic outperformed the others in most cases when considering the F-measure metric. We also found that the time window and the size of the dataset influence the effectiveness of each heuristic. These results may help researchers and tool developers choose the most appropriate heuristic, and they highlight the need to deal with identity disambiguation, mainly in open source software communities with a large number of participants.
format text
author WIESE, Igor Scaliante
DA SILVA, José Teodoro
STEINMACHER, Igor
TREUDE, Christoph
GEROSA, Marco Aurélio
title Who is who in the mailing list? Comparing six disambiguation heuristics to identify multiple addresses of a participant
publisher Institutional Knowledge at Singapore Management University
publishDate 2016
url https://ink.library.smu.edu.sg/sis_research/8774
https://ink.library.smu.edu.sg/context/sis_research/article/9777/viewcontent/icsme16b.pdf