Information extraction from bibliography data

Bibliographic Details
Main Author: Ng, Jian Cheng
Other Authors: Ke Yiping, Kelly
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2020
Online Access: https://hdl.handle.net/10356/137909
Institution: Nanyang Technological University
Description
Summary: The Digital Bibliography and Library Project (DBLP) is an online service that provides a large amount of information on Computer Science publications. This project aims to build a sentiment analysis model that classifies the polarity of an author’s comment on a citation, using the publications in the DBLP dataset. The work proceeded in the following steps. Firstly, the DBLP XML file was parsed with a StAX parser to extract relevant features, which were then loaded into a MySQL database. Secondly, data analytics was conducted on the DBLP data to uncover insights, including the distribution of publications, authors’ experience, collaborator analysis and prediction, and topic modelling. Thirdly, the sentiment analysis model was built. Sentiment text was first collected from the publications in the DBLP dataset, and its polarity was determined from direct mentions of another paper or from a list of common positive and negative unigrams and bigrams. Models were then built using several approaches: a lexicon-based approach using TextBlob and VADER Sentiment, a deep learning approach using LSTM, and a machine learning approach using Decision Tree, Logistic Regression and Naïve Bayes. The parameters were tuned for best accuracy, and the models were compared using precision and recall. Lastly, a GUI was built to facilitate querying for publications by name, author, field of study or year of publication. Publicly available PDF files are downloaded and the sentences containing citations are extracted; the polarity of these sentences is then classified by the sentiment analysis model.
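The labelling and evaluation steps described in the summary can be illustrated with a minimal sketch. It is not the project's implementation: the cue lists, the function names `label_sentence` and `precision_recall`, and the rule of counting positive minus negative cues are all assumptions made for illustration, since the record does not give the actual unigrams, bigrams or scoring rule.

```python
import re

# Hypothetical cue lists: the record does not list the project's actual
# positive and negative unigrams and bigrams.
POSITIVE_UNIGRAMS = {"robust", "effective", "efficient", "accurate"}
POSITIVE_BIGRAMS = {"significantly improves", "well suited"}
NEGATIVE_UNIGRAMS = {"poor", "limited", "inaccurate"}
NEGATIVE_BIGRAMS = {"suffers from", "fails to"}


def label_sentence(sentence):
    """Assign a polarity label to a citation sentence by counting cue matches."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    positive = sum(t in POSITIVE_UNIGRAMS for t in tokens)
    positive += sum(b in POSITIVE_BIGRAMS for b in bigrams)
    negative = sum(t in NEGATIVE_UNIGRAMS for t in tokens)
    negative += sum(b in NEGATIVE_BIGRAMS for b in bigrams)
    # Assumed rule: the sign of (positive - negative) decides the label.
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


def precision_recall(gold, predicted, target):
    """Compute precision and recall for one target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Under these assumptions, a sentence such as "The approach of [7] suffers from poor scalability." would be labelled negative, and the labels produced this way could serve as training data for the machine learning and deep learning models that the summary mentions.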