Information extraction from bibliography data
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2020 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/137909 |
Institution: | Nanyang Technological University |
Summary: | The Digital Bibliography and Library Project (DBLP) is an online service that provides a rich body of information on Computer Science publications. This project aims to build a sentiment analysis model that classifies the polarity of an author’s comment on a citation, using the publications in the DBLP dataset. The work proceeded in the following steps.
First, the DBLP XML file was parsed with a StAX parser to extract relevant features, which were then loaded into a MySQL database. Second, data analytics was conducted on the DBLP data to uncover insights, including the distribution of publications, authors’ experience, collaborator analysis and prediction, and topic modelling.
Third, the sentiment analysis model was built using various approaches. Before the model was built, sentiment text was collected from the publications in the DBLP dataset, and its polarity was determined from its direct mentions of another paper, or from a list of common positive and negative unigrams and bigrams.
After the dataset was collected, models were built using several approaches: a lexicon-based approach using TextBlob and VADER Sentiment, a deep learning approach using LSTM, and a machine learning approach using Decision Tree, Logistic Regression and Naïve Bayes. The parameters were fine-tuned for best accuracy, and the models were compared using precision and recall.
Lastly, a GUI was built to support querying publications by name, author, field of study or year of publication. Publicly available PDF files are downloaded so that sentences containing citations can be analysed, and the sentiment analysis model classifies the polarity of each such sentence. |
---|
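The parsing step in the summary can be illustrated with a small streaming parse. The project itself used a StAX parser (a Java API); the sketch below swaps in Python's `xml.etree.ElementTree.iterparse`, which follows the same pull-style idea of reading one record at a time instead of loading the whole file. The record tags, field names, and sample data are assumptions made for illustration, and the real DBLP dump may need extra handling for character entities declared in its accompanying DTD.

```python
import io
import xml.etree.ElementTree as ET

# Record element names assumed for illustration; the real dump has more types.
RECORD_TAGS = {"article", "inproceedings"}


def iter_publications(source):
    """Yield one dict per publication record without building the full tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in RECORD_TAGS:
            yield {
                "key": elem.get("key"),
                "type": elem.tag,
                "title": elem.findtext("title"),
                "year": elem.findtext("year"),
                "authors": [a.text for a in elem.findall("author")],
            }
            root.clear()  # discard processed records so memory use stays flat


# Tiny made-up sample standing in for the DBLP XML file.
SAMPLE = b"""<dblp>
<article key="journals/x/Doe20"><author>Jane Doe</author><author>John Roe</author><title>A Sample Title.</title><year>2020</year></article>
<inproceedings key="conf/y/Roe19"><author>John Roe</author><title>Another Title.</title><year>2019</year></inproceedings>
<www key="homepages/d/Doe"><author>Jane Doe</author></www>
</dblp>"""

records = list(iter_publications(io.BytesIO(SAMPLE)))
for record in records:
    print(record["key"], record["year"], record["authors"])
```

Each yielded dict could then be written to a database table row by row; the database schema is not described in the summary, so that step is left out here.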
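The labelling and evaluation steps in the summary can be sketched in the same hedged way. The first half assigns a polarity to a citation sentence from lists of positive and negative unigrams and bigrams; the second half computes precision and recall for one class. The cue lists, example sentences, and reference labels are all hypothetical, since the summary does not give the project's actual lexicon or data.

```python
import re

# Hypothetical cue lists; the project's actual unigrams and bigrams are not given.
POSITIVE = {"effective", "robust", "significantly improves"}
NEGATIVE = {"fails", "limited", "does not", "suffers from"}


def ngrams(sentence):
    """Return the set of lowercase unigrams and bigrams in a sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return set(tokens) | bigrams


def label(sentence):
    """Label a sentence by counting matched positive and negative cues."""
    grams = ngrams(sentence)
    score = len(grams & POSITIVE) - len(grams & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def precision_recall(gold, predicted, target):
    """Precision and recall for one class, given parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up citation sentences with made-up reference labels.
sentences = [
    "The method of [3] significantly improves recall.",
    "The approach in [7] fails on noisy input.",
    "We use the dataset from [5].",
    "The baseline of [9] is limited to English, yet it is effective and robust.",
]
gold = ["positive", "negative", "neutral", "negative"]
predicted = [label(s) for s in sentences]

for target in ("positive", "negative"):
    print(target, precision_recall(gold, predicted, target))
```

The last sentence shows why a cue-counting heuristic is only a starting point: two positive cues outweigh one negative cue, so the sketch labels it positive even though the reference label is negative. The same `precision_recall` helper could be applied to the predictions of any of the models named in the summary.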