CrowdLink: An Error-Tolerant Model for Linking Complex Records

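Record linkage (RL) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases), which is a long-standing challenge in database management. Algorithmic approaches have been proposed to improve RL quality, but remain far from perfect. Crowdsourcing offers a more accurate but expensive (and slow) way to bring human insight into the process. In this paper, we propose a new probabilistic model, namely CrowdLink, to tackle the above limitations. In particular, our model gracefully handles the crowd error and the correlation among different pairs, as well as enables us to decompose the records into small pieces (i.e. attributes) so that crowdsourcing workers can easily verify. Further, we develop efficient and effective algorithms to select the most valuable questions, in order to reduce the monetary cost of crowdsourcing. We conducted extensive experiments on both synthetic and real-world datasets. The experimental results verified the effectiveness and the applicability of our model.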

Bibliographic Details
Main Authors: ZHANG, Chen Jason, MENG, Rui, CHEN, Lei, ZHU, Feida
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2015
Subjects: Databases and Information Systems
Online Access: https://ink.library.smu.edu.sg/sis_research/3136
Institution: Singapore Management University
id sg-smu-ink.sis_research-4136
record_format dspace
spelling sg-smu-ink.sis_research-4136 2016-02-25T08:24:07Z CrowdLink: An Error-Tolerant Model for Linking Complex Records ZHANG, Chen Jason MENG, Rui CHEN, Lei ZHU, Feida 2015-05-31T07:00:00Z text https://ink.library.smu.edu.sg/sis_research/3136 info:doi/10.1145/2795218.2795222 Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Databases and Information Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Databases and Information Systems
description Record linkage (RL) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases), which is a long-standing challenge in database management. Algorithmic approaches have been proposed to improve RL quality, but remain far from perfect. Crowdsourcing offers a more accurate but expensive (and slow) way to bring human insight into the process. In this paper, we propose a new probabilistic model, namely CrowdLink, to tackle the above limitations. In particular, our model gracefully handles the crowd error and the correlation among different pairs, as well as enables us to decompose the records into small pieces (i.e. attributes) so that crowdsourcing workers can easily verify. Further, we develop efficient and effective algorithms to select the most valuable questions, in order to reduce the monetary cost of crowdsourcing. We conducted extensive experiments on both synthetic and real-world datasets. The experimental results verified the effectiveness and the applicability of our model.
format text
author ZHANG, Chen Jason
MENG, Rui
CHEN, Lei
ZHU, Feida
title CrowdLink: An Error-Tolerant Model for Linking Complex Records
publisher Institutional Knowledge at Singapore Management University
publishDate 2015
url https://ink.library.smu.edu.sg/sis_research/3136