A new approach for instance-based schema matching

Schema matching is a crucial phase in data integration that aims to find correspondences between schema attributes by using schema information. However, this information is not always available or useful, since it may consist of abbreviations. Consequently, instances could be an alternative c...

Bibliographic Details
Main Author: Mahdi, Osamah Abdul Sattar
Format: Thesis
Language: English
Published: 2014
Online Access: http://psasir.upm.edu.my/id/eprint/40707/13/FSKTM%202014%205%20IR.pdf
http://psasir.upm.edu.my/id/eprint/40707/
Institution: Universiti Putra Malaysia
id my.upm.eprints.40707
record_format eprints
spelling my.upm.eprints.40707 2017-01-17T05:35:33Z http://psasir.upm.edu.my/id/eprint/40707/ A new approach for instance-based schema matching Mahdi, Osamah Abdul Sattar 2014-05 Thesis NonPeerReviewed application/pdf en http://psasir.upm.edu.my/id/eprint/40707/13/FSKTM%202014%205%20IR.pdf Mahdi, Osamah Abdul Sattar (2014) A new approach for instance-based schema matching. Masters thesis, Universiti Putra Malaysia.
institution Universiti Putra Malaysia
building UPM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Putra Malaysia
content_source UPM Institutional Repository
url_provider http://psasir.upm.edu.my/
language English
description Schema matching is a crucial phase in data integration that aims to find correspondences between schema attributes by using schema information. However, this information is not always available or useful, since it may consist of abbreviations. Instances can therefore serve as an alternative to schema information. Various instance-based schema matching approaches have been proposed to discover correspondences between schema attributes, but they treat all instances, including numeric ones, as strings. This prevents them from discovering common patterns or performing statistical computations over the numeric instances. As a consequence, matches go unidentified, especially for attributes with numeric instances, which reduces the quality of the match results. This thesis proposes an efficient approach that identifies attribute matches between schemas by fully exploiting the instances. For numeric and mixed instances, the approach applies pattern recognition by automatically creating regular expressions from the instances. For alphabetic instances, it calculates a semantic similarity score using Google similarity to capture the semantic relationships between instances. The proposed approach consists of five main phases: (i) analysing instances, (ii) classifying schema attributes, (iii) extracting the optimal sample size, (iv) identifying instance similarity, and (v) identifying the match. Three analyses were designed and conducted on two data sets, (i) Restaurant and (ii) Census, with respect to precision (P), recall (R), and F-measure (F). The first analysis identifies the optimal sample size of tuples to be used during the phase of extracting the optimal sample size.
The purpose of identifying the optimal sample size is to reduce the number of comparisons between instances, which in turn reduces the processing time of the matching operation. This analysis showed that the optimal sample size is 50% of the actual table size for both data sets. The second analysis investigates whether combining Google similarity and regular expressions, as in the proposed approach, achieves higher accuracy than using Google similarity or regular expressions separately. The results showed that the proposed approach achieved precision (P), recall (R), and F-measure (F) in the range of 93% to 99% for both data sets, whereas Google similarity and regular expressions applied separately achieved P, R, and F in the range of 36% to 74%. The third analysis compares the performance of the proposed approach with that of previous approaches. The results showed that the proposed approach outperformed the previous approaches even though it uses only a sample of the instances, rather than all instances as in previous work.
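
The ideas named in the description can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (classify, shape, pattern_score, ngd, prf), the rule used to generalise an instance into a regular expression, and the choice of the Normalized Google Distance as the "Google similarity" measure are all assumptions made for illustration, and the hit counts passed to ngd are supplied by the caller rather than fetched from a search engine.

```python
import math
import re

# Splits a value into digit runs, ASCII letter runs, and single other characters.
TOKEN = re.compile(r"(?P<d>[0-9]+)|(?P<a>[A-Za-z]+)|(?P<o>[\s\S])")


def classify(values):
    """Label an attribute 'numeric', 'alphabetic', or 'mixed' from its instances."""
    text = "".join(values)
    has_digit = any(ch.isdigit() for ch in text)
    has_alpha = any(ch.isalpha() for ch in text)
    if has_digit and not has_alpha:
        return "numeric"
    if has_alpha and not has_digit:
        return "alphabetic"
    return "mixed"


def shape(value):
    """Generalise one instance into a regex string (hypothetical rule):
    digit runs become [0-9]{n}, letter runs become [A-Za-z]{n},
    and any other character is kept as an escaped literal."""
    parts = []
    for m in TOKEN.finditer(value):
        if m.lastgroup == "d":
            parts.append("[0-9]{%d}" % len(m.group()))
        elif m.lastgroup == "a":
            parts.append("[A-Za-z]{%d}" % len(m.group()))
        else:
            parts.append(re.escape(m.group()))
    return "".join(parts)


def pattern_score(source, target):
    """Fraction of target instances fully matched by a pattern built from source."""
    compiled = [re.compile(p) for p in {shape(v) for v in source}]
    hits = sum(1 for v in target if any(c.fullmatch(v) for c in compiled))
    return hits / len(target) if target else 0.0


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts fx and fy, the joint
    count fxy, and the index size n (all assumed to be given by the caller)."""
    if fxy == 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def prf(found, truth):
    """Precision, recall, and F-measure of found matches against true matches."""
    tp = len(found & truth)
    p = tp / len(found) if found else 0.0
    r = tp / len(truth) if truth else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Under these assumptions, an attribute labelled numeric or mixed by classify would be compared with pattern_score, an alphabetic one with a similarity derived from ngd, and the resulting set of attribute pairs would be scored against a reference set with prf.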
format Thesis
author Mahdi, Osamah Abdul Sattar
title A new approach for instance-based schema matching
publishDate 2014
url http://psasir.upm.edu.my/id/eprint/40707/13/FSKTM%202014%205%20IR.pdf
http://psasir.upm.edu.my/id/eprint/40707/