Using domain knowledge to improve the quality of query clusters.

Bibliographic Details
Main Author: Tan, Swee Peng.
Other Authors: Goh, Dion Hoe Lian
Format: Theses and Dissertations
Published: 2008
Online Access: http://hdl.handle.net/10356/1814
Institution: Nanyang Technological University
Description
Summary: Our study hypothesizes that incorporating linguistic knowledge and domain knowledge into query clustering can improve, to some extent, the quality of query clusters. We used six months’ worth of query logs from the Nanyang Technological University digital library. Using the WordNet lexical database, each query term in the query log was replaced with its corresponding synonym sets (synsets). These synsets serve as the features, in contrast to the content-based approach, where features are constructed from the query terms themselves. The synsets were weighted to reflect their importance in a query, and similarities between pairs of queries were computed using the cosine similarity measure. Our clustering algorithm placed two queries in the same cluster whenever the similarity between them exceeded a certain threshold. Clusters were created in this way for four different thresholds to facilitate comparison between them. The quality of the clusters was evaluated using five performance measures (average cluster size, coverage, precision, recall and the F-measure) against the judgments of two human evaluators on a sample of clusters. A comparison with the previous study conducted by Chandrani (2004) shows that the current study scored lower at all four thresholds in terms of coverage, precision, recall and F-measure. We identified two key reasons for the lower values: the additional preprocessing reduced the size of the query log, and the clusters formed were mainly on engineering-related subjects. The evaluation was then extended to incorporate domain knowledge on the part of the evaluators. Three performance measures (average cluster size, coverage and precision) were computed and compared with the results of the current study and of the previous study by Chandrani (2004).
Overall, precision improved, which we attribute to the evaluators’ domain knowledge. We propose that further preprocessing, and finding ways to extract elements of domain knowledge and feed them into the clustering process, could significantly improve precision; this is left as future work.
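
The pipeline described in the summary (synset features, cosine similarity, threshold-based grouping) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. Several details are assumptions made only for illustration: the `TOY_SYNSETS` table is a hypothetical stand-in for a WordNet lookup, raw occurrence counts stand in for the unspecified synset weighting, and "same cluster whenever similarity exceeds the threshold" is read as transitive grouping (connected components). The function names are likewise invented here.

```python
import math
from collections import Counter

# Hypothetical stand-in for a WordNet lookup (term -> synset identifier).
# A real system would query the WordNet database instead.
TOY_SYNSETS = {
    "car": "syn_automobile",
    "automobile": "syn_automobile",
    "engine": "syn_motor",
    "motor": "syn_motor",
    "library": "syn_library",
    "hours": "syn_hours",
}


def synset_vector(query):
    """Replace each query term with its synset and count occurrences.

    Raw counts are an assumed weighting; the summary does not say
    which weighting scheme was actually used.
    """
    terms = query.lower().split()
    # Terms without a known synset are kept as their own feature.
    return Counter(TOY_SYNSETS.get(term, term) for term in terms)


def cosine(a, b):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def cluster(queries, threshold):
    """Group queries whose pairwise similarity exceeds the threshold.

    Assumed reading: grouping is transitive, so clusters are the
    connected components of the "similar enough" graph (union-find).
    """
    vectors = [synset_vector(q) for q in queries]
    parent = list(range(len(queries)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, query in enumerate(queries):
        groups.setdefault(find(i), []).append(query)
    return list(groups.values())
```

With the toy table above, `cluster(["car engine", "automobile motor", "library hours"], 0.5)` groups the first two queries together, because both map to the same two synsets, and leaves the third on its own. A purely term-based comparison would see no overlap between the first two queries, which is the effect the synset features are meant to capture.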