Metric similarity joins using MapReduce

Given two object sets Q and O , a metric similarity join finds similar object pairs according to a certain criterion. This operation has a wide variety of applications in data cleaning, data mining, to name but a few. However, the rapidly growing volume of data nowadays challenges traditional metric...

Full description

Saved in:

Bibliographic Details
Main Authors:	GAO, Yunjun, YANG, Keyu, CHEN, Lu, ZHENG, Baihua, CHEN, Gang, CHEN, Chun
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2017
Subjects:	Algorithm Similarity Joins Metric Space MapReduce Computer Sciences Databases and Information Systems Theory and Algorithms
Online Access:	https://ink.library.smu.edu.sg/sis_research/3323 https://ink.library.smu.edu.sg/context/sis_research/article/4325/viewcontent/MapReduce_2017.pdf
Tags:	Add Tag No Tags, Be the first to tag this record!
Institution:	Singapore Management University
Language:	English

id	sg-smu-ink.sis_research-4325
record_format	dspace
spelling	sg-smu-ink.sis_research-43252020-01-14T14:26:55Z Metric similarity joins using MapReduce GAO, Yunjun YANG, Keyu CHEN, Lu ZHENG, Baihua CHEN, Gang CHEN, Chun Given two object sets Q and O , a metric similarity join finds similar object pairs according to a certain criterion. This operation has a wide variety of applications in data cleaning, data mining, to name but a few. However, the rapidly growing volume of data nowadays challenges traditional metric similarity join methods, and thus, a distributed method is required. In this paper, we adopt a popular distributed framework, namely, MapReduce, to support scalable metric similarity joins. To ensure the load balancing, we present two sampling based partition methods. One utilizes the pivot and the space-filling curve mappings to cluster the data into one-dimensional space, and then selects high quality centroids to enable equal-sized partitions. The other uses the KD-tree partitioning technique to equally divide the data after the pivot mapping. To avoid unnecessary object pair evaluation, we propose a framework that maps the two involved object sets in order, where the range-object filtering, the double-pivot filtering, the pivot filtering, and the plane sweeping techniques are utilized for pruning. Extensive experiments with both real and synthetic data sets demonstrate that our solutions outperform significantly existing state-of-the-art competitors. 2017-03-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/3323 info:doi/10.1109/TKDE.2016.2631599 https://ink.library.smu.edu.sg/context/sis_research/article/4325/viewcontent/MapReduce_2017.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Algorithm Similarity Joins Metric Space MapReduce Computer Sciences Databases and Information Systems Theory and Algorithms
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Algorithm Similarity Joins Metric Space MapReduce Computer Sciences Databases and Information Systems Theory and Algorithms
spellingShingle	Algorithm Similarity Joins Metric Space MapReduce Computer Sciences Databases and Information Systems Theory and Algorithms GAO, Yunjun YANG, Keyu CHEN, Lu ZHENG, Baihua CHEN, Gang CHEN, Chun Metric similarity joins using MapReduce
description	Given two object sets Q and O , a metric similarity join finds similar object pairs according to a certain criterion. This operation has a wide variety of applications in data cleaning, data mining, to name but a few. However, the rapidly growing volume of data nowadays challenges traditional metric similarity join methods, and thus, a distributed method is required. In this paper, we adopt a popular distributed framework, namely, MapReduce, to support scalable metric similarity joins. To ensure the load balancing, we present two sampling based partition methods. One utilizes the pivot and the space-filling curve mappings to cluster the data into one-dimensional space, and then selects high quality centroids to enable equal-sized partitions. The other uses the KD-tree partitioning technique to equally divide the data after the pivot mapping. To avoid unnecessary object pair evaluation, we propose a framework that maps the two involved object sets in order, where the range-object filtering, the double-pivot filtering, the pivot filtering, and the plane sweeping techniques are utilized for pruning. Extensive experiments with both real and synthetic data sets demonstrate that our solutions outperform significantly existing state-of-the-art competitors.
format	text
author	GAO, Yunjun YANG, Keyu CHEN, Lu ZHENG, Baihua CHEN, Gang CHEN, Chun
author_facet	GAO, Yunjun YANG, Keyu CHEN, Lu ZHENG, Baihua CHEN, Gang CHEN, Chun
author_sort	GAO, Yunjun
title	Metric similarity joins using MapReduce
title_short	Metric similarity joins using MapReduce
title_full	Metric similarity joins using MapReduce
title_fullStr	Metric similarity joins using MapReduce
title_full_unstemmed	Metric similarity joins using MapReduce
title_sort	metric similarity joins using mapreduce
publisher	Institutional Knowledge at Singapore Management University
publishDate	2017
url	https://ink.library.smu.edu.sg/sis_research/3323 https://ink.library.smu.edu.sg/context/sis_research/article/4325/viewcontent/MapReduce_2017.pdf
_version_	1770573113008848896

Metric similarity joins using MapReduce

Similar Items