A new approach for instance-based schema matching

Schema matching is a crucial phase in data integration that aims to find correspondences between schema attributes by utilizing schema information. However, this information is not always available or useful, since it may consist of abbreviations. Consequently, instances could be an alternative c...

Bibliographic Details
Main Author:	Mahdi, Osamah Abdul Sattar
Format:	Thesis
Language:	English
Published:	2014
Online Access:	http://psasir.upm.edu.my/id/eprint/40707/13/FSKTM%202014%205%20IR.pdf http://psasir.upm.edu.my/id/eprint/40707/
Institution:	Universiti Putra Malaysia

id	my.upm.eprints.40707
record_format	eprints
spelling	my.upm.eprints.40707 2017-01-17T05:35:33Z http://psasir.upm.edu.my/id/eprint/40707/ A new approach for instance-based schema matching Mahdi, Osamah Abdul Sattar 2014-05 Thesis NonPeerReviewed application/pdf en http://psasir.upm.edu.my/id/eprint/40707/13/FSKTM%202014%205%20IR.pdf Mahdi, Osamah Abdul Sattar (2014) A new approach for instance-based schema matching. Masters thesis, Universiti Putra Malaysia.
institution	Universiti Putra Malaysia
building	UPM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Putra Malaysia
content_source	UPM Institutional Repository
url_provider	http://psasir.upm.edu.my/
language	English
description	Schema matching is a crucial phase in data integration that aims to find correspondences between schema attributes by utilizing schema information. However, this information is not always available or useful, since it may consist of abbreviations. Consequently, instances can serve as an alternative to schema information. Various instance-based schema matching approaches have been proposed to discover correspondences between schema attributes, but they treat all instances, including numeric ones, as strings. This prevents discovering common patterns or performing statistical computation among the numeric instances. As a consequence, matches go unidentified, especially for attributes with numeric instances, which further reduces the quality of the match results. This thesis proposes an efficient approach that identifies attribute matches between schemas by fully exploiting the instances. The approach utilizes the concept of pattern recognition to determine attribute matches for numeric and mixed instances. This is achieved by automatically creating regular expressions based on the instances. For alphabetic instances, the approach calculates a semantic similarity score by utilizing Google similarity to capture the semantic relationships between instances. The proposed approach consists of five main phases, namely: (i) analysing instances, (ii) classifying schema attributes, (iii) extracting the optimal sample size, (iv) identifying instance similarity, and (v) identifying the match. Three analyses have been designed and conducted on two different data sets, namely: (i) Restaurant and (ii) Census, with respect to precision (P), recall (R), and F-measure (F). The first analysis aims at identifying the optimal sample size of tuples to be used during the phase of extracting the optimal sample size. The purpose of identifying the optimal sample size is to reduce the number of comparisons between the instances, which in turn reduces the processing time of the matching operation. This analysis showed that the optimal sample size is 50% of the actual table size for both data sets. The second analysis aims to show that combining Google similarity and regular expressions, as in the proposed approach, achieves higher accuracy than utilizing Google similarity or regular expressions separately. The results showed that the proposed approach achieved precision (P), recall (R), and F-measure (F) in the range of 93% - 99% for both data sets. In contrast, Google similarity and regular expressions applied separately achieved precision (P), recall (R), and F-measure (F) in the range of 36% - 74%. The third analysis compares the performance of the proposed approach with that of previous approaches. The results showed that the proposed approach outperformed the previous approaches, even though it uses only a sample of the instances rather than all instances, as the previous works do, during instance-based schema matching.
format	Thesis
author	Mahdi, Osamah Abdul Sattar
title	A new approach for instance-based schema matching
title_sort	new approach for instance-based schema matching
publishDate	2014
url	http://psasir.upm.edu.my/id/eprint/40707/13/FSKTM%202014%205%20IR.pdf http://psasir.upm.edu.my/id/eprint/40707/
_version_	1643832793844678656
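
The description in this record says that matches for numeric and mixed instances are found by automatically creating regular expressions from the instances. The record does not say how the thesis builds those expressions, so the following Python sketch is only one hypothetical way to illustrate the idea: each instance is generalized into a character-class pattern, and two attributes are scored by how many of each other's instances their patterns accept. The names `generalize`, `attribute_regex`, `coverage`, and `pattern_similarity`, and the sample values, are assumptions made for this illustration and do not come from the thesis.

```python
import re

# Tokenizer: a run of digits, a run of ASCII letters, or any other single character.
TOKEN = re.compile(r"(?P<d>[0-9]+)|(?P<a>[A-Za-z]+)|(?P<o>.)", re.DOTALL)


def generalize(value: str) -> str:
    """Turn one instance into a regex that keeps its shape but not its content."""
    parts = []
    for m in TOKEN.finditer(value):
        if m.lastgroup == "d":
            parts.append("[0-9]{%d}" % len(m.group()))
        elif m.lastgroup == "a":
            parts.append("[A-Za-z]{%d}" % len(m.group()))
        else:
            # Punctuation and whitespace are kept literally.
            parts.append(re.escape(m.group()))
    return "".join(parts)


def attribute_regex(instances):
    """Combine the distinct instance patterns of one attribute into a single regex."""
    patterns = sorted({generalize(v) for v in instances})
    return re.compile("(?:" + "|".join(patterns) + ")")


def coverage(regex, instances):
    """Fraction of instances that the regex matches in full."""
    if not instances:
        return 0.0
    return sum(1 for v in instances if regex.fullmatch(v)) / len(instances)


def pattern_similarity(a, b):
    """Symmetric score in [0, 1]: average coverage in both directions."""
    return (coverage(attribute_regex(a), b) + coverage(attribute_regex(b), a)) / 2


# Hypothetical sample instances from two schemas.
phones_a = ["555-1234", "867-5309"]
phones_b = ["212-9999", "310-0000"]
zips = ["90210", "10001"]

print(pattern_similarity(phones_a, phones_b))
print(pattern_similarity(phones_a, zips))
```

In this sketch the two phone columns share one pattern and score high, while the phone and postcode columns share none and score low. A fixed-length pattern is a deliberately strict choice; a real matcher would likely need length ranges and a decision threshold, neither of which is specified in this record.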

Similar Items