Using domain knowledge to improve the quality of query clusters.
Our study hypothesizes that incorporating linguistic knowledge and domain knowledge into query clustering can improve, to some extent, the quality of query clusters. In our research, we used a database of six months' worth of query logs from the Nanyang Technological University digital library. Using t...
Main Author: | Tan, Swee Peng |
---|---|
Other Authors: | Goh, Dion Hoe Lian |
Format: | Theses and Dissertations |
Published: | 2008 |
Subjects: | DRNTU::Library and information science::Libraries::Information retrieval and analysis |
Online Access: | http://hdl.handle.net/10356/1814 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-1814 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-1814 2019-12-10T14:46:27Z Using domain knowledge to improve the quality of query clusters. Tan, Swee Peng. Goh, Dion Hoe Lian Wee Kim Wee School of Communication and Information DRNTU::Library and information science::Libraries::Information retrieval and analysis Master of Science (Information Studies) 2008-09-10T08:36:27Z 2008-09-10T08:36:27Z 2007 2007 Thesis http://hdl.handle.net/10356/1814 Nanyang Technological University application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
country |
Singapore |
collection |
DR-NTU |
topic |
DRNTU::Library and information science::Libraries::Information retrieval and analysis |
spellingShingle |
DRNTU::Library and information science::Libraries::Information retrieval and analysis Tan, Swee Peng. Using domain knowledge to improve the quality of query clusters. |
description |
Our study hypothesizes that incorporating linguistic knowledge and domain knowledge into query clustering can improve, to some extent, the quality of query clusters. In our research, we used a database of six months' worth of query logs from the Nanyang Technological University digital library. Using the WordNet lexical database, each query term in the query log was replaced with its corresponding synonym sets (synsets). These synsets serve as features, in contrast to the content-based approach, where features are constructed from the query terms themselves. The synsets were weighted to reflect their importance in a query, and similarities between pairs of queries were computed using the cosine similarity measure. Our clustering algorithm placed two queries in the same cluster whenever the similarity between them exceeded a certain threshold. In this way, clusters were created at four different thresholds to facilitate comparison between them. The quality of the clusters was evaluated using five performance measures (average cluster size, coverage, precision, recall and the F-measure) against the judgments of two human evaluators on a sample of clusters. A comparison of the current study with the previous study conducted by Chandrani (2004) shows that coverage, precision, recall and F-measure were lower at all four thresholds. We identified two key reasons for the lower values: the additional preprocessing reduced the query log size, and the clusters formed were mainly on engineering-related subjects. The evaluation was then extended to incorporate domain knowledge on the part of the evaluators. Three performance measures (average cluster size, coverage and precision) were computed, and the results were compared with those of the current study and of the previous study conducted by Chandrani (2004). Overall, there was an improvement in precision, attributable to the evaluators' domain knowledge. We propose that further preprocessing, together with ways of extracting elements of domain knowledge to feed into the clustering process, could significantly improve precision; this is left as future work. |
author2 |
Goh, Dion Hoe Lian |
author_facet |
Goh, Dion Hoe Lian Tan, Swee Peng. |
format |
Theses and Dissertations |
author |
Tan, Swee Peng. |
author_sort |
Tan, Swee Peng. |
title |
Using domain knowledge to improve the quality of query clusters. |
title_short |
Using domain knowledge to improve the quality of query clusters. |
title_full |
Using domain knowledge to improve the quality of query clusters. |
title_fullStr |
Using domain knowledge to improve the quality of query clusters. |
title_full_unstemmed |
Using domain knowledge to improve the quality of query clusters. |
title_sort |
using domain knowledge to improve the quality of query clusters. |
publishDate |
2008 |
url |
http://hdl.handle.net/10356/1814 |
_version_ |
1681035705160040448 |
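
Illustrative note (not part of the thesis record): the description above outlines a pipeline of synset features, weighting, cosine similarity and threshold-based clustering. The Python sketch below shows one way such a pipeline could look. Everything specific in it is an assumption: `TOY_SYNSETS` is a hypothetical hand-made map standing in for a WordNet lookup, raw counts stand in for the unspecified weighting scheme, and merging linked queries transitively (via union-find) is only one possible reading of "placed two queries in the same cluster".

```python
import math
from collections import Counter

# Hypothetical toy synonym map standing in for a WordNet synset lookup.
TOY_SYNSETS = {
    "car": "synset:automobile",
    "automobile": "synset:automobile",
    "auto": "synset:automobile",
    "repair": "synset:repair",
    "fix": "synset:repair",
}


def to_features(query):
    """Replace each query term with a synset ID and count occurrences.

    Raw counts are an assumed placeholder for the weighting scheme,
    which the abstract does not specify.
    """
    terms = query.lower().split()
    return Counter(TOY_SYNSETS.get(t, "term:" + t) for t in terms)


def cosine(a, b):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_queries(queries, threshold):
    """Group queries whose pairwise similarity exceeds the threshold.

    Assumption: links are merged transitively (single-link style).
    """
    vectors = [to_features(q) for q in queries]
    parent = list(range(len(queries)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, q in enumerate(queries):
        groups.setdefault(find(i), []).append(q)
    return list(groups.values())
```

Calling `cluster_queries(["car repair", "fix automobile", "protein folding"], 0.5)` groups the first two queries, since they map to the same toy synsets, and leaves the third on its own. Running the function at several thresholds would mirror the four-threshold comparison the description mentions.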