Column heterogeneity as a measure of data quality

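Data quality is a serious concern in every data management application, and a variety of quality measures have been proposed, including accuracy, freshness and completeness, to capture the common sources of data quality degradation. We identify and focus attention on a novel measure, column heterogeneity, that seeks to quantify the data quality problems that can arise when merging data from different sources. We identify desiderata that a column heterogeneity measure should intuitively satisfy, and discuss a promising direction of research to quantify database column heterogeneity based on using a novel combination of cluster entropy and soft clustering. Finally, we present a few preliminary experimental results, using diverse data sets of semantically different types, to demonstrate that this approach appears to provide a robust mechanism for identifying and quantifying database column heterogeneity.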

Bibliographic Details
Main Authors: DAI, Bing Tian, KOUDAS, Nick, OOI, Beng Chin, SRIVASTAVA, Divesh, VENKATASUBRAMANIAN, Suresh
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2007
Subjects: Databases and Information Systems
Online Access: https://ink.library.smu.edu.sg/sis_research/4165
https://ink.library.smu.edu.sg/context/sis_research/article/5168/viewcontent/Dai2006ColumnHeterogeneityasa.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-5168
record_format dspace
spelling sg-smu-ink.sis_research-5168
2018-11-22T02:45:09Z
Column heterogeneity as a measure of data quality
DAI, Bing Tian
KOUDAS, Nick
OOI, Beng Chin
SRIVASTAVA, Divesh
VENKATASUBRAMANIAN, Suresh
Data quality is a serious concern in every data management application, and a variety of quality measures have been proposed, including accuracy, freshness and completeness, to capture the common sources of data quality degradation. We identify and focus attention on a novel measure, column heterogeneity, that seeks to quantify the data quality problems that can arise when merging data from different sources. We identify desiderata that a column heterogeneity measure should intuitively satisfy, and discuss a promising direction of research to quantify database column heterogeneity based on using a novel combination of cluster entropy and soft clustering. Finally, we present a few preliminary experimental results, using diverse data sets of semantically different types, to demonstrate that this approach appears to provide a robust mechanism for identifying and quantifying database column heterogeneity.
2007-09-01T07:00:00Z
text
application/pdf
https://ink.library.smu.edu.sg/sis_research/4165
https://ink.library.smu.edu.sg/context/sis_research/article/5168/viewcontent/Dai2006ColumnHeterogeneityasa.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
Institutional Knowledge at Singapore Management University
Databases and Information Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Databases and Information Systems
spellingShingle Databases and Information Systems
DAI, Bing Tian
KOUDAS, Nick
OOI, Beng Chin
SRIVASTAVA, Divesh
VENKATASUBRAMANIAN, Suresh
Column heterogeneity as a measure of data quality
description Data quality is a serious concern in every data management application, and a variety of quality measures have been proposed, including accuracy, freshness and completeness, to capture the common sources of data quality degradation. We identify and focus attention on a novel measure, column heterogeneity, that seeks to quantify the data quality problems that can arise when merging data from different sources. We identify desiderata that a column heterogeneity measure should intuitively satisfy, and discuss a promising direction of research to quantify database column heterogeneity based on using a novel combination of cluster entropy and soft clustering. Finally, we present a few preliminary experimental results, using diverse data sets of semantically different types, to demonstrate that this approach appears to provide a robust mechanism for identifying and quantifying database column heterogeneity.
format text
author DAI, Bing Tian
KOUDAS, Nick
OOI, Beng Chin
SRIVASTAVA, Divesh
VENKATASUBRAMANIAN, Suresh
author_facet DAI, Bing Tian
KOUDAS, Nick
OOI, Beng Chin
SRIVASTAVA, Divesh
VENKATASUBRAMANIAN, Suresh
author_sort DAI, Bing Tian
title Column heterogeneity as a measure of data quality
title_short Column heterogeneity as a measure of data quality
title_full Column heterogeneity as a measure of data quality
title_fullStr Column heterogeneity as a measure of data quality
title_full_unstemmed Column heterogeneity as a measure of data quality
title_sort column heterogeneity as a measure of data quality
publisher Institutional Knowledge at Singapore Management University
publishDate 2007
url https://ink.library.smu.edu.sg/sis_research/4165
https://ink.library.smu.edu.sg/context/sis_research/article/5168/viewcontent/Dai2006ColumnHeterogeneityasa.pdf
_version_ 1770574390099968000