Using domain knowledge to improve the quality of query clusters.

Our study hypothesizes that incorporating linguistic knowledge and domain knowledge into query clustering can improve the quality of query clusters to some extent. In our research, we used a database of six months’ worth of query logs from the Nanyang Technological University digital library. Using t...

Bibliographic Details
Main Author: Tan, Swee Peng.
Other Authors: Goh, Dion Hoe Lian
Format: Theses and Dissertations
Published: 2008
Subjects: DRNTU::Library and information science::Libraries::Information retrieval and analysis
Online Access: http://hdl.handle.net/10356/1814
Institution: Nanyang Technological University
id sg-ntu-dr.10356-1814
record_format dspace
spelling sg-ntu-dr.10356-1814 2019-12-10T14:46:27Z Using domain knowledge to improve the quality of query clusters. Tan, Swee Peng. Goh, Dion Hoe Lian Wee Kim Wee School of Communication and Information DRNTU::Library and information science::Libraries::Information retrieval and analysis Master of Science (Information Studies) 2008-09-10T08:36:27Z 2008-09-10T08:36:27Z 2007 2007 Thesis http://hdl.handle.net/10356/1814 Nanyang Technological University application/pdf
institution Nanyang Technological University
building NTU Library
country Singapore
collection DR-NTU
topic DRNTU::Library and information science::Libraries::Information retrieval and analysis
spellingShingle DRNTU::Library and information science::Libraries::Information retrieval and analysis
Tan, Swee Peng.
Using domain knowledge to improve the quality of query clusters.
description Our study hypothesizes that incorporating linguistic knowledge and domain knowledge into query clustering can improve the quality of query clusters to some extent. In our research, we used a database of six months’ worth of query logs from the Nanyang Technological University digital library. Using the WordNet lexical database, each query term in the query log was replaced with its corresponding synonym sets (synsets). These synsets serve as the features, in contrast to the content-based approach, where features are constructed from the query terms themselves. The synsets were weighted to reflect their importance in a query, and similarities between pairs of queries were computed using the cosine similarity measure. Our clustering algorithm placed two queries in the same cluster whenever the similarity between them exceeded a certain threshold. In this way, clusters were created for four different thresholds to facilitate comparison between them. The quality of the clusters was evaluated using five performance measures (average cluster size, coverage, precision, recall and the F-measure) against the judgments of two human evaluators on a sample of clusters. A comparison of the current study with the previous study conducted by Chandrani (2004) shows that the performance measures of the current study were lower at all four thresholds in terms of coverage, precision, recall and F-measure. We identified two key reasons for the lower values: the additional preprocessing reduced the query log size, and the clusters formed were mainly on engineering-related subjects. The evaluation was then extended to incorporate an element of domain knowledge on the part of the evaluators. Three performance measures (average cluster size, coverage and precision) were computed, and the results were compared with those of the current study and of the previous study conducted by Chandrani (2004). Overall, precision improved, which we attribute to the evaluators’ domain knowledge. We propose that further preprocessing, together with ways of extracting elements of domain knowledge to feed into the clustering process, can significantly improve precision; this is left as future work.
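The clustering steps described in the abstract (synset features, cosine similarity, a similarity threshold) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the `TOY_SYNSETS` table is a hypothetical stand-in for WordNet lookups, raw counts are assumed as feature weights because the abstract does not give the weighting scheme, and grouping by transitive closure is one possible reading of "placed two queries in the same cluster whenever the similarity between them exceeded a certain threshold". All function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical stand-in for WordNet: maps a query term to a synset label.
# The thesis used the WordNet lexical database; this toy table is an assumption.
TOY_SYNSETS = {
    "car": "synset:auto",
    "automobile": "synset:auto",
    "engine": "synset:motor",
    "motor": "synset:motor",
    "poetry": "synset:verse",
}


def to_features(query):
    """Replace each query term with its synset label and count occurrences.

    Raw counts are an assumed weighting; terms missing from the table
    are kept as their own feature.
    """
    terms = query.lower().split()
    return Counter(TOY_SYNSETS.get(term, term) for term in terms)


def cosine(a, b):
    """Cosine similarity between two sparse feature-weight mappings."""
    dot = sum(weight * b[feat] for feat, weight in a.items() if feat in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(queries, threshold):
    """Group queries so that any pair whose similarity exceeds the
    threshold ends up in the same cluster (transitive closure)."""
    feats = [to_features(q) for q in queries]
    parent = list(range(len(queries)))

    def find(i):
        # Follow parent links to the cluster representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            if cosine(feats[i], feats[j]) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, query in enumerate(queries):
        groups.setdefault(find(i), []).append(query)
    return list(groups.values())


if __name__ == "__main__":
    sample = ["car engine", "automobile motor", "poetry"]
    for members in cluster(sample, threshold=0.5):
        print(members)
```

In this sketch, "car engine" and "automobile motor" share no surface terms but map to the same synset features, so they fall into one cluster, while a purely term-based comparison would keep them apart. Running `cluster` at several threshold values mirrors the comparison across four thresholds mentioned in the abstract.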
author2 Goh, Dion Hoe Lian
author_facet Goh, Dion Hoe Lian
Tan, Swee Peng.
format Theses and Dissertations
author Tan, Swee Peng.
author_sort Tan, Swee Peng.
title Using domain knowledge to improve the quality of query clusters.
title_short Using domain knowledge to improve the quality of query clusters.
title_full Using domain knowledge to improve the quality of query clusters.
title_fullStr Using domain knowledge to improve the quality of query clusters.
title_full_unstemmed Using domain knowledge to improve the quality of query clusters.
title_sort using domain knowledge to improve the quality of query clusters.
publishDate 2008
url http://hdl.handle.net/10356/1814
_version_ 1681035705160040448