Information extraction from bibliography data

Bibliographic Details
Main Author: Leong, Kai Ling
Other Authors: Ke Yiping, Kelly
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2022
Subjects:
Online Access: https://hdl.handle.net/10356/156488
Institution: Nanyang Technological University
Description
Summary: Computer Science research and its community have grown exponentially over the past decade. Analysing trends in research topics is a common way to keep track of the increasing volume of research and to provide insights to academia, businesses, government, and other stakeholders. However, the trends observed in the Computer Science community may not be reflected in the general audience. This project analyses the topic trends observed in the field of Computer Science and in the general audience, and examines whether the two follow the same trends. The project adhered to the OSEMN framework. The dblp dataset was downloaded as an XML file from the dblp website, and a StAX parser was implemented to parse it into a MySQL database. The data was cleaned and pre-processed, then used to build an LDA (Latent Dirichlet Allocation) model. After the final parameters of the LDA model were chosen, the topics were finalised based on their keywords, and topic trends were extracted from these topics. Additional data was queried from Google Ngram Viewer to form a second set of topic trends representing the general audience. Trend analysis was performed using the line of best fit, correlations, and the Mann-Kendall trend test. The line of best fit was easy to implement and provided a visual representation of the trends, but was statistically unreliable for drawing conclusions about the topic trends. Although correlations were not intended for trend detection, they were a more statistically reliable approach to determining trends. The Mann-Kendall trend test was the best approach, as it was designed to detect trends, and numerous studies over the years have proposed modified versions to accommodate different limitations and types of data. From the observed trends, about half of the topics have similar trends in both communities. This is not enough evidence to confidently conclude that both communities share the same trends.
As such, this project concludes that those in the field of Computer Science and the general audience do not share the same trends. Given this finding, academia, organisations, and other stakeholders should treat the field of Computer Science and the general audience as separate communities that may have different growth directions in the observed research topics.
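
For readers unfamiliar with the Mann-Kendall trend test named in the summary, the following is a minimal Python sketch of the classic (unmodified) test with a tie correction and a normal approximation. It is not the project's own implementation, which this record does not include. The function name `mann_kendall`, the significance level `alpha=0.05`, and the sample series are assumptions made purely for illustration.

```python
import math
from collections import Counter


def mann_kendall(values, alpha=0.05):
    """Classic Mann-Kendall trend test (illustrative sketch).

    Returns (trend, s, z, p), where trend is "increasing",
    "decreasing", or "no trend" at significance level alpha.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")

    # S statistic: sum of signs of all later-minus-earlier differences.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)

    # Variance of S, corrected for groups of tied values.
    var_s = n * (n - 1) * (2 * n + 5)
    for t in Counter(values).values():
        if t > 1:
            var_s -= t * (t - 1) * (2 * t + 5)
    var_s /= 18.0

    if var_s == 0:  # every value is identical
        return "no trend", s, 0.0, 1.0

    # Continuity-corrected Z score.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0

    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))

    if p < alpha:
        trend = "increasing" if z > 0 else "decreasing"
    else:
        trend = "no trend"
    return trend, s, z, p


# Hypothetical yearly topic shares, made up for illustration only.
topic_share = [0.8, 1.1, 1.0, 1.4, 1.6, 1.9, 2.3, 2.2, 2.8, 3.1]
print(mann_kendall(topic_share))
```

Because the test depends only on the signs of pairwise differences, it makes no assumption that a trend is linear, which is one reason it is often preferred over a line of best fit for this kind of analysis.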