Information extraction from bibliography data

The Digital Bibliography and Library Project (DBLP) is an online service that provides extensive bibliographic information on Computer Science publications. This project aims to build a sentiment analysis model that classifies the polarity of an author’s comment on a citation, using the publications in the...

Bibliographic Details
Main Author: Ng, Jian Cheng
Other Authors: Ke Yiping, Kelly
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2020
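Subjects: Engineering::Computer science and engineering::Data; Engineering::Computer science and engineering::Software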
Online Access: https://hdl.handle.net/10356/137909
Institution: Nanyang Technological University
id sg-ntu-dr.10356-137909
record_format dspace
spelling sg-ntu-dr.10356-137909 2020-04-18T03:23:37Z Information extraction from bibliography data Ng, Jian Cheng Ke Yiping, Kelly School of Computer Science and Engineering ypke@ntu.edu.sg Engineering::Computer science and engineering::Data Engineering::Computer science and engineering::Software Bachelor of Engineering (Computer Science) 2020-04-18T03:23:37Z 2020-04-18T03:23:37Z 2020 Final Year Project (FYP) https://hdl.handle.net/10356/137909 en SCE19-0333 application/pdf Nanyang Technological University
institution Nanyang Technological University
building NTU Library
country Singapore
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Data
Engineering::Computer science and engineering::Software
description The Digital Bibliography and Library Project (DBLP) is an online service that provides extensive bibliographic information on Computer Science publications. This project aims to build a sentiment analysis model that classifies the polarity of an author’s comment on a citation, using the publications in the DBLP dataset. The work proceeded in four steps. First, the DBLP XML file was parsed with a StAX parser to extract relevant features, which were then loaded into a MySQL database. Second, data analytics was conducted to understand the DBLP data and uncover insights, including the distribution of publications, authors’ experience, collaborator analysis and prediction, and topic modelling. Third, the sentiment analysis model was built. Sentiment text was first collected from the publications in the DBLP dataset, and its polarity was determined from direct mentions of another paper or from a list of common positive and negative unigrams and bigrams. Models were then built on this dataset using several approaches: a lexicon-based approach using TextBlob and VADER Sentiment, a deep learning approach using LSTM, and a machine learning approach using Decision Tree, Logistic Regression and Naïve Bayes. The parameters were fine-tuned for best accuracy, and the models were compared using precision and recall. Lastly, a GUI was built to support querying publications by name, author, field of study or year of publication. Publicly available PDF files are downloaded so that sentences containing citations can be analysed, and the polarity of these sentences is classified by the sentiment analysis model.
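The first step in the description, streaming the DBLP XML file through a StAX parser, might look roughly like the sketch below. It is an illustration only, not the project's code: the class name DblpStaxSketch, the Publication holder, and the choice of record tags (article, inproceedings) and fields (key, author, title, year) are assumptions made for the example. It also ignores inline markup inside titles and the DTD-defined character entities that the real DBLP dump relies on.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DblpStaxSketch {

    /** Hypothetical holder for the fields kept from one publication record. */
    public static final class Publication {
        public final String type;   // element name, e.g. "article"
        public final String key;    // value of the record's "key" attribute
        public String title = "";
        public String year = "";
        public final List<String> authors = new ArrayList<>();

        Publication(String type, String key) {
            this.type = type;
            this.key = key;
        }
    }

    // Record-level element names handled by this sketch (an assumption).
    private static final Set<String> RECORD_TAGS = Set.of("article", "inproceedings");

    /** Streams the XML once and returns the records whose tag is in RECORD_TAGS. */
    public static List<Publication> parse(Reader xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        List<Publication> out = new ArrayList<>();
        Publication current = null;               // record being assembled, if any
        StringBuilder text = new StringBuilder(); // text of the current child element
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        String name = reader.getLocalName();
                        if (current == null && RECORD_TAGS.contains(name)) {
                            current = new Publication(name, reader.getAttributeValue(null, "key"));
                        }
                        // Simplification: nested markup inside a field resets the buffer.
                        text.setLength(0);
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        if (current != null) {
                            text.append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT: {
                        if (current == null) {
                            break; // outside any record of interest
                        }
                        String name = reader.getLocalName();
                        String value = text.toString().trim();
                        if (name.equals(current.type)) {
                            out.add(current);     // record closed
                            current = null;
                        } else if (name.equals("author")) {
                            current.authors.add(value);
                        } else if (name.equals("title")) {
                            current.title = value;
                        } else if (name.equals("year")) {
                            current.year = value;
                        }
                        text.setLength(0);
                        break;
                    }
                    default:
                        break;
                }
            }
        } finally {
            reader.close();
        }
        return out;
    }
}
```

Because StAX is a pull parser, only the record currently being assembled is held in memory, which suits a file as large as the DBLP dump. Each returned Publication could then be written to a database table, for example through batched JDBC inserts; that loading step is left out here.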
author2 Ke Yiping, Kelly
format Final Year Project
author Ng, Jian Cheng
title Information extraction from bibliography data
publisher Nanyang Technological University
publishDate 2020
url https://hdl.handle.net/10356/137909
_version_ 1681058562016542720