Information extraction from bibliography data

Bibliographic Details
Main Author: Ang, Yong Loong
Other Authors: Ke Yiping, Kelly
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access: https://hdl.handle.net/10356/147819
Institution: Nanyang Technological University
Description
Summary: The computer science research community has grown exponentially in recent years, yet there is a skills shortage in the Information Communication Technology (ICT) industry. The academic system acts as a “sorting hat” in determining which field of work individuals eventually land in. Studying academic trends in computer science publications will aid resource planning and thus help address the lack of talent in the ICT industry. This project aimed to account for additional factors, such as the employment market environment, to give academia and policy makers a better understanding of how to evaluate and fulfil the demands of the talent pool in the ICT industry. The project closely followed the OSEMN framework. The DBLP dataset was downloaded from the DBLP website as an XML file. A SAX parser was developed to parse the data from XML into a MySQL database. The data was cleaned and then used for LDA topic modelling. Once the parameters were decided, the final model was built and a topic was assigned to each set of keywords. Additional data from Occupational Employment Statistics was provided by the U.S. Bureau of Labor Statistics. An occupation title was then assigned to each topic name from the topic modelling. Pearson’s correlation coefficient was used to measure the correlation between the topic modelling data and these statistics. From the trends shown, it is generally observed that the estimated total employment had a positive correlation with the number of articles published for each topic, while the annual median wage had a negative correlation. Given this trend, academia has reason to believe that it is able to influence the supply of the workforce. With this finding, academia could work closely with companies and recruiters to understand the demands of the ICT industry and allocate resources to the fields that require more employment or research work.
Possible future work identified in this project includes supervised learning for the topic modelling, taking more forms of data into consideration, and extending the study to other fields.
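
The correlation step described in the summary can be illustrated with a minimal sketch of Pearson's correlation coefficient, written in plain Python. This is not the project's own code: the function name `pearson_r` and the yearly figures below are hypothetical placeholders, chosen only to show how article counts for one topic might be compared against employment and wage series.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder data for a single topic, one value per year.
# These numbers are invented for illustration and are not from the project.
articles_per_year = [120, 150, 180, 210, 260]
total_employment = [1000, 1100, 1250, 1300, 1500]
median_wage = [98000, 97000, 95500, 95000, 93000]

r_employment = pearson_r(articles_per_year, total_employment)
r_wage = pearson_r(articles_per_year, median_wage)
print(f"articles vs employment: r = {r_employment:.3f}")
print(f"articles vs median wage: r = {r_wage:.3f}")
```

A coefficient near +1 indicates that the two series rise together, and a value near -1 indicates that one falls as the other rises. The coefficient measures linear association only and does not by itself establish causation.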