Evaluation of retrieval effectiveness using clustering techniques in Malay document retrieval / Nurazzah Abd Rahman

Bibliographic Details
Main Author: Abd Rahman, Nurazzah
Format: Thesis
Language: English
Published: 2011
Online Access: https://ir.uitm.edu.my/id/eprint/65319/1/65319.pdf
https://ir.uitm.edu.my/id/eprint/65319/
Institution: Universiti Teknologi Mara
Description
Summary: Information Retrieval (IR) deals with the representation, storage, and organization of, and access to, information items. The main function of an information retrieval system is to provide users with tools to search effectively and efficiently. Although IR research has been established for the past thirty years, research on IR for the Malay language only emerged in the mid-1990s. Cluster analysis is a multivariate analysis technique that assigns items to automatically created groups based on a calculated degree of association between items or groups. In cluster-based information retrieval, clustering can be applied to the terms in documents, to all documents in the corpus, to the user queries, or to the retrieval results themselves. Each type of clustering can improve retrieval effectiveness. This thesis focuses on document clustering. The Malay document corpus consists of digitized Malay translations of hadith texts from well-known Islamic scholars, namely Sahih Muslim, Sahih Bukhari, Sunan Ibnu Majjah, Sunan At-Tirmidzi, Sunan Abu Daud and Sunan An-Nasaie. The corpus was developed by scanning, editing and proofreading the Malay text into digital form. The Malay translated hadith texts needed to be pre-processed because most of them are in the Indonesian language. Differences in the meaning of many terms had to be clarified, and the terms converted to Malay, using a dictionary and human experts in both languages. Experts in the hadith domain were consulted to ensure the reliability of the Malay translated hadith text documents. A digitized, updated Malay thesaurus is used in the first experiment to improve the effectiveness of Malay document retrieval. For the cluster analysis, the Malay translated hadith test collection consists of 2028 documents from Sahih Bukhari, each ranging from 13 to 2561 words in length.
The determination of inter-document similarity depends on both the document representation, in terms of the weights assigned to the indexing terms characterizing each document, and the similarity coefficient chosen. This thesis presents the results of applying five hierarchical agglomerative clustering techniques, namely Single Linkage, Complete Linkage, Group Average Linkage, Weighted Median Linkage and Ward's Method, using the Dice, Jaccard and Cosine similarity coefficients on the Malay corpus. The experiments are evaluated using redefined versions of the well-known IR metrics Recall (R), the proportion of relevant documents that are clustered, and Precision (P), the proportion of clustered documents that are relevant. The results of the first experiment show that, with the Dice similarity coefficient, Complete Linkage is the most effective and Average Linkage has the highest precision in clustering Malay translated hadith text documents. With the Jaccard similarity coefficient, Single Linkage is the most effective, while Ward's Method has the highest precision. Lastly, with the Cosine coefficient, Complete Linkage gives the highest precision. Therefore, Complete Linkage combined with the Cosine coefficient is run on a larger Malay hadith corpus in the second experiment, namely Sahih Bukhari, which consists of 2028 text documents. Tests with different numbers of clusters showed that precision increases from 18% to 55% when the corpus is clustered into 100 clusters, compared with 50 and 20 clusters. This led to the conclusion that a larger number of clusters yields higher precision than a smaller number, since each cluster then contains fewer documents. Hence, recall decreases and precision increases.
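
The clustering and evaluation steps described in the summary can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes the commonly used weighted-vector forms of the Dice, Jaccard and Cosine coefficients, a naive Complete Linkage merge loop, and one plausible reading of the redefined metrics (precision and recall computed for a single cluster against a set of relevant documents). All function names, the toy term weights and the relevance set are hypothetical.

```python
from math import sqrt

def _dot(x, y):
    # Inner product of two sparse term-weight vectors (dict: term -> weight).
    return sum(w * y.get(t, 0.0) for t, w in x.items())

def _sq(x):
    # Sum of squared weights of a vector.
    return sum(w * w for w in x.values())

def dice(x, y):
    denom = _sq(x) + _sq(y)
    return 2.0 * _dot(x, y) / denom if denom else 0.0

def jaccard(x, y):
    d = _dot(x, y)
    denom = _sq(x) + _sq(y) - d
    return d / denom if denom else 0.0

def cosine(x, y):
    denom = sqrt(_sq(x) * _sq(y))
    return _dot(x, y) / denom if denom else 0.0

def complete_linkage(docs, sim, k):
    """Agglomerative clustering down to k clusters.

    Under Complete Linkage the similarity of two clusters is the
    similarity of their least similar pair of documents.
    """
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = min(sim(docs[i], docs[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def precision_recall(cluster, relevant):
    # Assumed reading: precision = relevant docs in cluster / cluster size,
    # recall = relevant docs in cluster / all relevant docs.
    hits = len(set(cluster) & set(relevant))
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy documents with made-up term weights.
docs = [
    {"solat": 2.0, "wuduk": 1.0},
    {"solat": 1.0, "wuduk": 2.0},
    {"zakat": 2.0, "harta": 1.0},
    {"zakat": 1.0, "harta": 1.0},
]
clusters = complete_linkage(docs, cosine, k=2)
p, r = precision_recall(clusters[0], relevant=[0, 1, 2])
```

Under this reading, splitting the same corpus into more clusters makes each cluster smaller, so a cluster tends to hold a higher share of relevant documents (higher precision) while covering fewer of all relevant documents (lower recall). The pairwise loop above recomputes similarities on every merge and is only practical for small toy inputs.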