Information extraction from bibliography data

The research community for the computer science field has grown exponentially in recent years, yet there is a skills shortage in the Information Communication Technology (ICT) industry. The academic system acts as a “sorting hat” in determining which field of work individuals eventually land in. Studyi...

Bibliographic Details
Main Author: Ang, Yong Loong
Other Authors: Ke Yiping, Kelly
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects: Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/147819
Institution: Nanyang Technological University
id sg-ntu-dr.10356-147819
record_format dspace
spelling sg-ntu-dr.10356-147819 2021-04-15T13:36:29Z Information extraction from bibliography data Ang, Yong Loong Ke Yiping, Kelly School of Computer Science and Engineering ypke@ntu.edu.sg Engineering::Computer science and engineering The computer science research community has grown exponentially in recent years, yet there is a skills shortage in the Information Communication Technology (ICT) industry. The academic system acts as a “sorting hat” in determining which field of work individuals eventually land in. Studying academic trends in Computer Science publications will aid resource planning and thus help address the lack of talent in the ICT industry. This project aimed to account for additional factors, such as the employment market environment, to give academia and policy makers a better understanding with which to evaluate and fulfil the demands of the talent pool in the ICT industry. The project closely followed the OSEMN framework. The DBLP dataset was downloaded from the DBLP website as an XML file. A SAX parser was developed to parse the data from XML into a MySQL database. The data was cleaned and then used for LDA topic modelling. Once the parameters were decided, the final model was built and topics were assigned to each set of keywords. Additional data came from the Occupational Employment Statistics published by the U.S. Bureau of Labor Statistics. An occupation title was then assigned to each topic name from the topic modelling. Pearson’s correlation coefficient was used to measure the correlation between the topic modelling data and these statistics. From the trends shown, the estimated total employment was generally positively correlated with the number of articles published for each topic, while the annual median wage was negatively correlated. With the support of this trend, academia should be convinced that it is able to influence the supply of the workforce. With this finding, academia could work closely with companies and recruiters to understand the demands of the ICT industry and allocate resources to the fields that require more employment or research work. Possible future work includes supervised learning for the topic modelling, taking more forms of data into consideration, and extending the study to other fields.
Bachelor of Engineering (Computer Science) 2021-04-15T13:36:29Z 2021-04-15T13:36:29Z 2021 Final Year Project (FYP) Ang, Y. L. (2021). Information extraction from bibliography data. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/147819 https://hdl.handle.net/10356/147819 en SCSE20-0451 application/pdf Nanyang Technological University
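The abstract states that a SAX parser was used to load the DBLP XML file into a database. The project's own parser is not reproduced in this record, so the following is only a minimal sketch of the general technique, using Python's standard `xml.sax` module. The class name `DblpHandler`, the function `parse_dblp`, the choice of record and field elements, and the sample record are all assumptions made for illustration; the MySQL step is left out and records are collected in a list instead.

```python
import xml.sax

# Assumption: only these publication element types and fields are of interest.
RECORD_TAGS = {"article", "inproceedings"}
FIELD_TAGS = {"author", "title", "year"}


class DblpHandler(xml.sax.ContentHandler):
    """Hypothetical handler: builds one dict per publication while streaming."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being built, or None outside a record
        self._field = None    # name of the field element currently open
        self._buffer = []     # text chunks of the open field

    def startElement(self, name, attrs):
        if name in RECORD_TAGS:
            self._current = {"type": name, "key": attrs.get("key"), "authors": []}
        elif self._current is not None and name in FIELD_TAGS:
            self._field = name
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so buffer them.
        if self._field is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if self._current is None:
            return
        if name == self._field:
            text = "".join(self._buffer).strip()
            if name == "author":
                self._current["authors"].append(text)
            else:
                self._current[name] = text
            self._field = None
        elif name in RECORD_TAGS:
            # A real pipeline would insert the row into the database here.
            self.records.append(self._current)
            self._current = None


def parse_dblp(xml_text):
    """Parse a DBLP-style XML string and return the collected records."""
    handler = DblpHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.records


# Made-up sample record, for illustration only (not taken from DBLP).
SAMPLE = """<dblp>
  <article key="journals/example/Doe20">
    <author>Jane Doe</author>
    <author>John Roe</author>
    <title>An Example Title.</title>
    <year>2020</year>
  </article>
</dblp>"""

records = parse_dblp(SAMPLE)
```

A streaming parser suits this task because the file is processed one element at a time rather than loaded into memory as a whole tree. The real DBLP file also relies on a DTD for character entities, which this sketch does not handle.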
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
author2 Ke Yiping, Kelly
author_facet Ke Yiping, Kelly
Ang, Yong Loong
format Final Year Project
author Ang, Yong Loong
author_sort Ang, Yong Loong
title Information extraction from bibliography data
publisher Nanyang Technological University
publishDate 2021
url https://hdl.handle.net/10356/147819
_version_ 1698713701318656000