Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents

© 2020 Elsevier Ltd The advancements of search engines for traditional text documents have enabled the effective retrieval of massive textual information in a resource-efficient manner. However, such conventional search methodologies often suffer from poor retrieval accuracy especially when document...

Bibliographic Details
Main Authors: Iqra Safder, Saeed Ul Hassan, Anna Visvizi, Thanapon Noraset, Raheel Nawaz, Suppawong Tuarob
Other Authors: Information Technology University
Format: Article
Published: 2020
Subjects: Computer Science; Decision Sciences; Engineering; Social Sciences
Online Access: https://repository.li.mahidol.ac.th/handle/123456789/57817
Institution: Mahidol University
id th-mahidol.57817
record_format dspace
spelling th-mahidol.57817 2020-08-25T18:51:43Z
Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents
Iqra Safder; Saeed Ul Hassan; Anna Visvizi; Thanapon Noraset; Raheel Nawaz; Suppawong Tuarob
Information Technology University; American College of Greece; Manchester Metropolitan University; Mahidol University
Computer Science; Decision Sciences; Engineering; Social Sciences
2020-08-25T09:35:14Z 2020-08-25T09:35:14Z 2020-11-01
Article
Information Processing and Management. Vol.57, No.6 (2020)
10.1016/j.ipm.2020.102269
03064573
2-s2.0-85085523063
https://repository.li.mahidol.ac.th/handle/123456789/57817
Mahidol University
SCOPUS
https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85085523063&origin=inward
institution Mahidol University
building Mahidol University Library
continent Asia
country Thailand
content_provider Mahidol University Library
collection Mahidol University Institutional Repository
topic Computer Science
Decision Sciences
Engineering
Social Sciences
description © 2020 Elsevier Ltd. Advances in search engines for traditional text documents have enabled effective, resource-efficient retrieval of massive amounts of textual information. However, such conventional search methodologies often suffer from poor retrieval accuracy, especially when documents exhibit unique properties that call for specialized and deeper semantic extraction. Recently, AlgorithmSeer, a search engine for algorithms, has been proposed; it extracts pseudo-codes and shallow textual metadata from scientific publications and treats them as traditional documents so that conventional search engine methodology can be applied. However, such a system fails to support user queries that seek algorithm-specific information, such as the datasets on which algorithms operate, the performance of algorithms, and their runtime complexity. In this paper, a set of enhancements to the previously proposed algorithm search engine is presented. Specifically, we propose a set of methods to automatically identify and extract algorithmic pseudo-codes and the sentences that convey related algorithmic metadata using a set of machine-learning techniques. In an experiment with over 93,000 text lines, we introduce 60 novel features, comprising content-based, font-style-based, and structure-based feature groups, to extract algorithmic pseudo-codes. Our proposed pseudo-code extraction method achieves a 93.32% F1-score, outperforming the state-of-the-art techniques by 28%. Additionally, we propose a method to extract algorithm-related sentences using deep neural networks and achieve an accuracy of 78.5%, outperforming a rule-based model and a support vector machine model by 28% and 16%, respectively.
author2 Information Technology University
format Article
author Iqra Safder
Saeed Ul Hassan
Anna Visvizi
Thanapon Noraset
Raheel Nawaz
Suppawong Tuarob
author_sort Iqra Safder
title Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents
title_sort deep learning-based extraction of algorithmic metadata in full-text scholarly documents
publishDate 2020
url https://repository.li.mahidol.ac.th/handle/123456789/57817
_version_ 1763487506817351680