Using domain knowledge to improve the quality of query clusters.

Our study hypothesizes that incorporating linguistic knowledge and domain knowledge into query clustering can improve the quality of query clusters to some extent. In our research, we used a database of six months' worth of query logs from the Nanyang Technological University digital library. Using t...

Bibliographic Details
Main Author:	Tan, Swee Peng.
Other Authors:	Goh, Dion Hoe Lian
Format:	Theses and Dissertations
Published:	2008
Subjects:	DRNTU::Library and information science::Libraries::Information retrieval and analysis
Online Access:	http://hdl.handle.net/10356/1814
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-1814
record_format	dspace
spelling	sg-ntu-dr.10356-1814 2019-12-10T14:46:27Z Using domain knowledge to improve the quality of query clusters. Tan, Swee Peng. Goh, Dion Hoe Lian Wee Kim Wee School of Communication and Information DRNTU::Library and information science::Libraries::Information retrieval and analysis Master of Science (Information Studies) 2008-09-10T08:36:27Z 2008-09-10T08:36:27Z 2007 2007 Thesis http://hdl.handle.net/10356/1814 Nanyang Technological University application/pdf
institution	Nanyang Technological University
building	NTU Library
country	Singapore
collection	DR-NTU
topic	DRNTU::Library and information science::Libraries::Information retrieval and analysis
description	Our study hypothesizes that incorporating linguistic knowledge and domain knowledge into query clustering can improve the quality of query clusters to some extent. In our research, we used a database of six months' worth of query logs from the Nanyang Technological University digital library. Using the WordNet lexical database, each query term in the query log was replaced with its corresponding synonym sets (synsets). These synsets serve as the features, in contrast to the content-based approach, where features are constructed from the query terms themselves. The synsets were weighted to reflect their importance in a query, and similarities between pairs of queries were computed using the cosine similarity measure. Our clustering algorithm placed two queries in the same cluster whenever the similarity between them exceeded a certain threshold. In this way, clusters were created for four different thresholds to facilitate comparison between them. The quality of the clusters was evaluated using five performance measures (average cluster size, coverage, precision, recall and the F-measure) against the judgments of two human evaluators on a sample of clusters. A comparison of the current study with the previous study conducted by Chandrani (2004) shows that coverage, precision, recall and F-measure were lower at all four thresholds. We identified two key reasons for the lower values: the additional preprocessing, which reduced the size of the query log, and the fact that the clusters formed were mainly on engineering-related subjects. The evaluation was then extended to incorporate domain knowledge on the part of the evaluators. Three performance measures (average cluster size, coverage and precision) were computed, and the results were compared with those of the current study and of the previous study conducted by Chandrani (2004). Overall, precision improved, which we attribute to the evaluators' domain knowledge. We propose that further preprocessing, together with ways of extracting elements of domain knowledge to feed into the clustering process, could significantly improve precision; this is left as future work.
author2	Goh, Dion Hoe Lian
author_facet	Goh, Dion Hoe Lian Tan, Swee Peng.
format	Theses and Dissertations
author	Tan, Swee Peng.
author_sort	Tan, Swee Peng.
title	Using domain knowledge to improve the quality of query clusters.
publishDate	2008
url	http://hdl.handle.net/10356/1814
_version_	1681035705160040448
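
The clustering procedure summarized in the description above (terms replaced by synsets, cosine similarity between queries, a similarity threshold for cluster membership) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the TOY_SYNSETS table is a hypothetical stand-in for a WordNet lookup, raw synset counts stand in for the unspecified weighting scheme, and treating clusters as transitive groups (connected components) is an assumption, since the record does not say how overlapping pairs were merged.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical stand-in for a WordNet lookup: term -> synset label.
TOY_SYNSETS = {
    "car": "synset:car",
    "automobile": "synset:car",
    "engine": "synset:engine",
    "motor": "synset:engine",
    "poetry": "synset:poetry",
    "verse": "synset:poetry",
}


def synset_vector(query):
    """Replace each query term with its synset label and count occurrences.

    Terms missing from the toy table are kept as their own feature.
    Raw counts are an assumed weighting, used only for illustration.
    """
    terms = query.lower().split()
    return Counter(TOY_SYNSETS.get(t, "term:" + t) for t in terms)


def cosine(a, b):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_queries(queries, threshold):
    """Group queries whose pairwise similarity exceeds the threshold.

    Pairs above the threshold are linked, and linked queries are merged
    transitively with a small union-find structure (an assumption).
    """
    vectors = [synset_vector(q) for q in queries]
    parent = list(range(len(queries)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(queries)), 2):
        if cosine(vectors[i], vectors[j]) > threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, query in enumerate(queries):
        clusters.setdefault(find(i), []).append(query)
    return list(clusters.values())
```

Calling cluster_queries(["car engine", "automobile motor", "poetry", "verse"], 0.5) groups the first two queries together and the last two together, because each pair maps to the same synset features even though the queries share no literal terms. Running the function at several thresholds would mirror the four-threshold comparison described in the record.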
