EXTRACTION OF STATISTICAL METADATA INFORMATION IN SCIENTIFIC RESEARCH ARTICLES USING MACHINE LEARNING ALGORITHMS

Bibliographic Details
Main Author: Winingsih, Dahlia
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/71415
Institution: Institut Teknologi Bandung
Description
Summary: Statistical metadata serves as a reference for planning, conducting, and evaluating statistical activities. It is divided into basic, sectoral, and special statistical metadata, which are distinguished by the purpose of the activity and the party that carries it out. Basic statistics are carried out by BPS, sectoral statistics by government agencies, and special statistics by other organizers such as private institutions and individuals. Special statistical metadata is the least represented in the statistical reference system, with 388 records compared with 3,613 records of sectoral statistical metadata. One way to obtain information about special statistical activities is to search scientific research articles, which researchers and other research organizers use to publicize their work. However, obtaining the information required for statistical metadata from a research article takes a long series of steps. Searching for information in a text document can be done through information extraction. The statistical metadata sought in a research article consists of the title, organizer identity, publication, year of activity, variables, data sources and periods, units of observation, and analytical methods used. The difficulty in applying information extraction here is that each of these items has different characteristics and therefore requires different treatment to be extracted correctly. This study proposes a feature-based design for a statistical metadata extraction model built with machine learning algorithms. The algorithms used are random forest, naïve Bayes, support vector machine, and decision tree. The features cover text-writing characteristics, layout, content, and linguistic patterns in the words and phrases associated with each item of statistical information.
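To make the four feature groups concrete, the following is a minimal sketch of how one candidate phrase from an article might be turned into a feature record. It is not the thesis's feature set: the function name `phrase_features`, the cue-word list `METHOD_CUES`, the year pattern, and every individual feature are assumptions chosen only for illustration.

```python
import re

# Hypothetical cue words for analytical methods; the thesis does not
# publish its vocabulary in this abstract.
METHOD_CUES = {"regression", "anova", "correlation"}

# Assumed pattern for a four-digit year between 1900 and 2099.
YEAR_PATTERN = re.compile(r"\b(19|20)\d{2}\b")


def phrase_features(phrase, line_index, total_lines):
    """Map one candidate phrase to a small, illustrative feature dict."""
    words = phrase.split()
    lowered = [w.strip(".,;:()").lower() for w in words]
    return {
        # Text-writing characteristics: capitalisation of the phrase.
        "is_upper": phrase.isupper(),
        "title_case_ratio": sum(w[:1].isupper() for w in words) / max(len(words), 1),
        # Layout: relative vertical position in the document (0.0 = first line).
        "relative_position": line_index / max(total_lines - 1, 1),
        # Content: length of the phrase and whether it mentions a year.
        "word_count": len(words),
        "has_year": bool(YEAR_PATTERN.search(phrase)),
        # Linguistic pattern: presence of a method cue word.
        "has_method_cue": any(w in METHOD_CUES for w in lowered),
    }


example = phrase_features(
    "Data were analysed using linear regression in 2019.",
    line_index=40,
    total_lines=101,
)
```

Records of this kind could then be passed to any of the four classifiers named in the summary, each predicting which metadata item, if any, the phrase belongs to.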
Performance measurements show that the random forest and decision tree models achieve the highest average F1-score, 0.92, while the naïve Bayes model has the lowest, 0.88.
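For readers unfamiliar with the metric, the sketch below shows one common way an average F1-score is computed: the F1-score of each class is calculated from its precision and recall, and the per-class scores are then averaged without weighting. Whether the thesis averages in exactly this way is an assumption, and the labels in the toy data are invented for illustration, not taken from the study.

```python
def f1_per_class(y_true, y_pred):
    """Return a dict mapping each label to its F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        # F1 is the harmonic mean of precision and recall.
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def average_f1(y_true, y_pred):
    """Unweighted (macro) mean of the per-class F1-scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Invented toy labels, one per candidate phrase.
y_true = ["title", "year", "method", "title"]
y_pred = ["title", "year", "title", "title"]
per_class = f1_per_class(y_true, y_pred)
macro = average_f1(y_true, y_pred)
```

An unweighted mean treats rare metadata items, such as units of observation, as equally important as frequent ones, which is one reason it is often preferred when class sizes differ.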