Information extraction from bibliography data

Bibliographic Details
Main Author: Leong, Kai Ling
Other Authors: Ke Yiping, Kelly
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2022
Subjects: Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/156488
Institution: Nanyang Technological University
id sg-ntu-dr.10356-156488
record_format dspace
school School of Computer Science and Engineering
contact ypke@ntu.edu.sg
degree Bachelor of Engineering (Computer Science)
record_dates 2022-04-17T12:56:12Z; 2022-04-17T12:56:13Z
type Final Year Project (FYP)
citation Leong, K. L. (2022). Information extraction from bibliography data. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/156488
file_format application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
description The Computer Science community and its research have grown exponentially over the past decade. Analysing trends in research topics is a common way to keep track of the increasing amount of research and to provide insights to academia, businesses, government, and other stakeholders. However, the trends observed in the Computer Science community may not be reflected in the general audience. This project analyses the topic trends observed in the field of Computer Science and among the general audience, and examines whether the two follow the same trends.
The project followed the OSEMN (Obtain, Scrub, Explore, Model, iNterpret) framework. The dblp dataset was downloaded from the dblp website as an XML file, and a StAX parser was implemented to parse it into a MySQL database. The data was cleaned and pre-processed, then used to fit an LDA (Latent Dirichlet Allocation) model. Once the final parameters of the LDA model were chosen, the topics were finalised based on their keywords and the topic trends were extracted from them. Additional data was queried from Google Ngram Viewer to form a second set of topic trends representing the general audience.
Trend analysis was performed using a line of best fit, correlations, and the Mann-Kendall trend test. The line of best fit was easy to implement and provided a visual representation of the trends, but was statistically unreliable for drawing conclusions about the topic trends. Although correlations are not intended for trend detection, they were a more statistically reliable way to determine trends. The Mann-Kendall trend test was the best approach, as it is designed to detect trends and numerous studies over the years have proposed modified versions to accommodate different limitations and types of data.
From the observed trends, about half of the topics showed similar trends in both communities. This is not enough evidence to conclude with confidence that both communities share the same trends. As such, this project concludes that those in the field of Computer Science and the general audience do not share the same trends. Given this finding, academia, organisations, and other stakeholders should treat the field of Computer Science and the general audience as separate communities that may have different growth directions in the observed research topics.
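The Mann-Kendall trend test named in the description can be illustrated with a minimal sketch. This is not the project's implementation, which is not shown in this record. It assumes a series of yearly values for one topic, uses the standard tie-corrected variance and a continuity-corrected normal approximation (usually considered adequate for about ten or more observations), and does not correct for autocorrelation. The function name `mann_kendall`, the `alpha` parameter, and the sample values are hypothetical.

```python
import math
from collections import Counter


def mann_kendall(series, alpha=0.05):
    """Return (trend, s, z, p) for observations listed in time order."""
    n = len(series)
    if n < 3:
        raise ValueError("need at least 3 observations")

    # S adds +1 for every later value above an earlier one and -1 for every
    # later value below an earlier one; ties contribute 0.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)

    # Variance of S under the no-trend null hypothesis, corrected for ties.
    ties = sum(t * (t - 1) * (2 * t + 5)
               for t in Counter(series).values() if t > 1)
    var_s = (n * (n - 1) * (2 * n + 5) - ties) / 18.0

    # Continuity-corrected normal approximation of the test statistic.
    if var_s == 0 or s == 0:
        z = 0.0
    elif s > 0:
        z = (s - 1) / math.sqrt(var_s)
    else:
        z = (s + 1) / math.sqrt(var_s)

    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    if p < alpha:
        trend = "increasing" if z > 0 else "decreasing"
    else:
        trend = "no trend"
    return trend, s, z, p


if __name__ == "__main__":
    # Hypothetical yearly share of one topic; invented values, not thesis data.
    yearly_share = [0.021, 0.024, 0.023, 0.027, 0.030,
                    0.029, 0.034, 0.036, 0.035, 0.041]
    print(mann_kendall(yearly_share))
```

Running the same function on a topic's dblp-derived series and on its Google Ngram series would give two trend labels that can be compared, which is one plausible way to check whether both audiences move in the same direction.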
author2 Ke Yiping, Kelly
format Final Year Project
author Leong, Kai Ling
title Information extraction from bibliography data
publisher Nanyang Technological University
publishDate 2022
url https://hdl.handle.net/10356/156488