Document reference and citation analysis

Bibliographic Details
Main Author: Yap, Lina.
Other Authors: Sun Aixin
Format: Final Year Project
Language: English
Published: 2013
Subjects:
Online Access: http://hdl.handle.net/10356/52087
Institution: Nanyang Technological University
Description
Summary: As the number of published papers grows rapidly, researchers find it increasingly time-consuming to retrieve papers relevant to their research domain. This project aims to ease the search for relevant papers: a researcher should only need to retrieve one article of interest, and the system should recommend related articles based on it. With this idea in mind, the main goals were, first, to retrieve articles within two hops of the selected document, then to find clusters of similar papers defined by a set of features, and finally to rank and recommend papers. For experimental purposes, the data set was drawn from the field of Biomedical and Life Sciences and downloaded from the PMC Open Access Subset. Three features were selected to cluster the data set: bibliographic coupling degree, the degree of similarity between titles' topic vectors, and the degree of similarity between abstracts' topic vectors. Bibliographic coupling degree is defined as the number of matching outgoing citations between two articles. Topic vectors were obtained through a topic modelling library, and the cosine of the angle between two topic vectors was computed to determine their degree of similarity. The project also used a NoSQL database, MongoDB, to persist articles. One limitation of this project was evaluating the relevance of the recommendations. While the PMC Open Access Subset provided a data set large enough to gather sufficient articles within two hops, evaluating whether the recommended articles were relevant would require personnel highly knowledgeable in the field. Moreover, the prototype was limited to only three features and could be enhanced by using better criteria to select better results. For example, the weight of each citation in the feature vector could be the number of times it is referenced in an article, instead of a binary value.
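The two similarity measures described in the summary can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual code. The function names and the sample identifiers are hypothetical. Combining the count-weighted citation vectors with cosine similarity is an assumption made here for illustration, since the summary only proposes the count weighting and does not say how the weighted vectors would be compared.

```python
import math
from collections import Counter


def coupling_degree(refs_a, refs_b):
    """Bibliographic coupling degree: number of outgoing citations
    that two articles have in common (binary representation)."""
    return len(set(refs_a) & set(refs_b))


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors,
    e.g. the topic vectors of two titles or two abstracts."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as sharing nothing
    return dot / (norm_u * norm_v)


def weighted_coupling(mentions_a, mentions_b):
    """Assumed variant of the suggested enhancement: weight each cited
    paper by how often it is referenced in the article text, then
    compare the two count vectors with cosine similarity."""
    counts_a, counts_b = Counter(mentions_a), Counter(mentions_b)
    cited = sorted(set(counts_a) | set(counts_b))
    return cosine_similarity(
        [counts_a[c] for c in cited],
        [counts_b[c] for c in cited],
    )


# Hypothetical reference lists, identified by made-up paper ids.
shared = coupling_degree(["p1", "p2", "p3"], ["p2", "p3", "p4"])
print(shared)
```

In the binary form, an article that cites a paper once and an article that cites it ten times look identical. The count-weighted form keeps that difference, which is the enhancement the summary suggests.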