Data Transformation Model For Addressing Incomplete And Inconsistent Quality Issues Of Big Data
Main Author: | |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2024 |
Subjects: | |
Online Access: | https://etd.uum.edu.my/11184/1/depositpermission-900601.pdf https://etd.uum.edu.my/11184/2/s900601_01.pdf https://etd.uum.edu.my/11184/ |
Institution: | Universiti Utara Malaysia |
Summary: | Data Quality (DQ) assessment remains one of the major challenges for Big Data (BD) because of the complexity of handling large volumes of data. Traditional data transformation methods such as Extract-Transform-Load (ETL) draw on data sources from a diverse range of devices and locations, which results in incomplete and inconsistent data that may lead to wrong insights and decisions. DQ is therefore vital for the effective operation and management of BD. Understanding DQ, from its definition to its various dimensions, is essential for developing techniques and procedures that improve it. This research focuses on two dimensions of DQ: completeness and consistency. Firstly, an enhanced data transformation model (2CsDQT) is proposed to assess and improve big data quality. A new algorithm using ontology and clustering methods identifies and corrects incomplete and inconsistent data, addressing the availability and comprehensiveness of data, the similarity between data items, and missing attributes. Secondly, a clustering technique is used to analyse DQ and to improve it using the results from the 2CsDQT model. The complete and consistent data are put into clusters, and the designed algorithm predicts the cluster to which each incomplete or inconsistent record should be added, based on its values. The developed model was evaluated and benchmarked against existing data transformation techniques in the literature. The results show that the 2CsDQT model improves BD quality: its data completeness and consistency results outperform those of related articles and benchmark studies on the datasets of two different test cases. The theoretical contribution of this research is insight into the importance of DQ issues in BD and the effect of inconsistency and incompleteness on BD applications. The practical contribution is an enhanced data transformation model for DQ that supports better data analysis and strategic planning. |
---|
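
The summary describes placing complete records into clusters and then assigning an incomplete record to a cluster based on the values it does have. The sketch below illustrates that general idea only. It is not the 2CsDQT algorithm, which this record does not specify, and it omits the ontology step entirely. The function names `kmeans` and `impute`, the sample data, and the choice of plain k-means with centroid-based filling are all assumptions made for illustration.

```python
from math import dist

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. Assumption: centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def impute(record, centroids):
    """Pick the nearest centroid using only observed attributes,
    then fill each missing attribute (None) with that centroid's value."""
    observed = [j for j, v in enumerate(record) if v is not None]

    def partial_sq_distance(c):
        return sum((record[j] - c[j]) ** 2 for j in observed)

    best = min(range(len(centroids)), key=lambda i: partial_sq_distance(centroids[i]))
    filled = [centroids[best][j] if v is None else v for j, v in enumerate(record)]
    return best, filled

# Hypothetical complete records with two numeric attributes (made-up values).
complete = [(1.0, 2.0), (8.0, 9.0), (1.2, 1.8), (8.2, 9.1), (0.8, 2.2), (7.8, 8.9)]
centroids = kmeans(complete, k=2)

# An incomplete record: the second attribute is missing.
cluster, filled = impute((8.1, None), centroids)
print(cluster, filled)
```

Here the incomplete record is matched to the cluster of high-valued records and its missing attribute is taken from that cluster's centroid. A real pipeline would also need a rule for inconsistent records, for example flagging records whose distance to every centroid exceeds a chosen threshold; that rule is likewise an assumption and not drawn from the thesis.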