Information extraction from bibliography data
The Computer Science community and research have grown exponentially over the past decade. Analysing trends in research topics is a common practice to keep track of the increasing amount of research and to provide insights to those in academia, business, government, and other stakeholders. H...
Main Author: | Leong, Kai Ling |
---|---|
Other Authors: | Ke Yiping, Kelly |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2022 |
Subjects: | Engineering::Computer science and engineering |
Online Access: | https://hdl.handle.net/10356/156488 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-156488 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-156488 2022-04-17T12:56:13Z Information extraction from bibliography data Leong, Kai Ling Ke Yiping, Kelly School of Computer Science and Engineering ypke@ntu.edu.sg Engineering::Computer science and engineering Bachelor of Engineering (Computer Science) 2022-04-17T12:56:12Z 2022-04-17T12:56:12Z 2022 Final Year Project (FYP) Leong, K. L. (2022). Information extraction from bibliography data. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/156488 https://hdl.handle.net/10356/156488 en application/pdf Nanyang Technological University |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering |
description |
The Computer Science community and research have grown exponentially over the past decade.
Analysing trends in research topics is a common practice to keep track of the increasing
amount of research and to provide insights to those in academia, business, government,
and other stakeholders. However, the trends observed in the Computer Science community
may not be reflected among the general audience. This project analyses the trends observed
in the field of Computer Science and among the general audience, and examines whether the
two follow the same trends.
The project followed the OSEMN framework (Obtain, Scrub, Explore, Model, iNterpret). The
dblp dataset was downloaded as an XML file from the dblp website. A StAX parser was
implemented to load the dataset into a MySQL database. The data were cleaned and
pre-processed, then used to fit a Latent Dirichlet Allocation (LDA) topic model. After the
final parameters of the LDA model were chosen, the topics were finalised based on their
keywords, and topic trends were extracted from these topics. Additional data were queried
from Google Ngram Viewer as a second set of topic trends representing the general audience.
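
The parsing step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: StAX is a Java API, so the sketch swaps in Python's `xml.etree.ElementTree.iterparse`, which streams the file in a similar event-driven way, and it uses an in-memory SQLite database as a stand-in for MySQL. The sample XML, the `publication` table, and the `iter_records` and `load` helpers are hypothetical names chosen for the example.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical miniature sample; the real dblp dump is far larger and richer.
SAMPLE = b"""<dblp>
<article key="journals/x/A20"><author>A. Author</author><title>Topic models for text.</title><year>2020</year></article>
<inproceedings key="conf/y/B21"><author>B. Author</author><title>Streaming XML parsing.</title><year>2021</year></inproceedings>
</dblp>"""

RECORD_TAGS = {"article", "inproceedings"}  # assumed subset of record types


def iter_records(source):
    """Yield (key, title, year) one record at a time, without building the full tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in RECORD_TAGS:
            title_el = elem.find("title")
            # itertext() flattens any inline markup nested inside the title.
            title = "".join(title_el.itertext()) if title_el is not None else ""
            year = elem.findtext("year")
            yield elem.get("key"), title, int(year) if year else None
            root.clear()  # discard processed records so memory use stays bounded


def load(source, conn):
    """Stream records from the XML source into a (hypothetical) publication table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS publication "
        "(pub_key TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO publication VALUES (?, ?, ?)", iter_records(source))
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
load(io.BytesIO(SAMPLE), conn)
rows = conn.execute(
    "SELECT year, COUNT(*) FROM publication GROUP BY year ORDER BY year"
).fetchall()
print(rows)
```

Processing one record at a time and clearing it afterwards is what makes a streaming parser suitable for a file too large to load as a single tree. A full dump may also need handling for character entities declared in an accompanying DTD, which this sketch does not attempt.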
Trend analysis was performed using a line of best fit, correlations, and the Mann-Kendall
trend test. The line of best fit was easy to implement and provided a visual representation
of the trends, but was statistically unreliable for drawing conclusions about the topic
trends. Although correlations are not intended for trend detection, they were a more
statistically reliable way to determine trends. The Mann-Kendall trend test was the best
approach because it was designed to detect trends, and numerous studies over the years have
proposed modified versions to accommodate different limitations and types of data.
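
For readers unfamiliar with the Mann-Kendall test, the following is a minimal sketch of its basic (unmodified) form using the usual normal approximation with a tie correction. It is an illustration under stated assumptions, not the project's code: the function name `mann_kendall`, the 0.05 significance level, and the yearly values are all hypothetical.

```python
import math
from collections import Counter


def mann_kendall(series, alpha=0.05):
    """Basic Mann-Kendall trend test; returns (trend, S, Z, p)."""
    n = len(series)
    if n < 3:
        raise ValueError("need at least three observations")

    # S counts later-minus-earlier increases minus decreases over all pairs.
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )

    # Variance of S, reduced for groups of tied values.
    tie_sizes = Counter(series).values()
    var_s = (
        n * (n - 1) * (2 * n + 5)
        - sum(t * (t - 1) * (2 * t + 5) for t in tie_sizes)
    ) / 18

    # Continuity-corrected standard normal statistic.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0

    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))

    if p >= alpha:
        trend = "no trend"
    else:
        trend = "increasing" if z > 0 else "decreasing"
    return trend, s, z, p


# Hypothetical yearly share of publications assigned to one topic.
shares = [0.10, 0.12, 0.11, 0.15, 0.18, 0.17, 0.21, 0.24]
trend, s, z, p = mann_kendall(shares)
print(trend, s, round(z, 3), round(p, 4))
```

Because the statistic depends only on the ordering of pairs of observations, the test does not assume a linear relationship, which is one reason it is often preferred over a fitted line for deciding whether a trend exists.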
From the observed trends, about half of the topics show similar trends in both communities.
This is not enough evidence to confidently conclude that both communities share the same
trends. As such, this project concludes that those in the field of Computer Science and the
general audience do not share the same trends. With this finding, academia,
organisations, and other stakeholders should consider the field of Computer Science and the
general audience to be separate communities that may have different growth directions in the
observed research topics. |
author2 |
Ke Yiping, Kelly |
author_facet |
Ke Yiping, Kelly Leong, Kai Ling |
format |
Final Year Project |
author |
Leong, Kai Ling |
title |
Information extraction from bibliography data |
publisher |
Nanyang Technological University |
publishDate |
2022 |
url |
https://hdl.handle.net/10356/156488 |