TOKEN CLASSIFICATION ON INDONESIAN ARTICLE FOR 5W1H EVENT EXTRACTION WITH CNN-BIDIRECTIONAL LSTM

A news article contains information about events: what occurred (what), the participants involved (who), the place (where) and time (when) of the event, and why and how the event occurred. Together this is known as 5W1H information. ...

Bibliographic Details
Main Author: NURDIN (NIM : 23515035), ARLIYANTI
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/21251
Institution: Institut Teknologi Bandung
id id-itb.:21251
spelling id-itb.:21251 2017-09-27T15:37:11Z
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description A news article reports events in terms of what happened (what), the participants involved (who), the place (where) and time (when) of the event, and why and how the event occurred. Together these are known as 5W1H information.

With the 5W1H information of a text or document in hand, its overall content is easy to grasp. This information can be found automatically by applying information extraction techniques.

5W1H information extraction for an Indonesian article can be performed by classifying each token in the article into 13 classes: B-who, I-who, B-what, I-what, B-when, I-when, B-where, I-where, B-why, I-why, B-how, I-how, and Other. Token context at both the lexical level and the sentence level is used to determine each token's label. A Convolutional Neural Network (CNN) extracts syntactic and semantic features from the sentences, while a Bidirectional Long Short-Term Memory network (BLSTM) models the token sequence at the lexical level. The model achieves an average F-measure of 0.808 with a feature set consisting of token features and relative token position within the sentence (SENT), lexical sequence features (LEX), and token location (LOCT and LOCS). The experiments show that the deep learning method CNN-BLSTM outperforms the shallow methods IBk, C4.5, and Naïve Bayes: CNN-BLSTM obtained the best F-measure of 0.808, while IBk, C4.5, and Naïve Bayes obtained 0.655, 0.645, and 0.595, respectively.
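The 13-class labeling scheme in the abstract can be illustrated with a short sketch. The Python below is not taken from the thesis: it only builds the label set named in the abstract and shows one plausible way to turn per-token labels back into 5W1H phrases. The function name `decode_spans`, the lenient handling of a stray `I-` label, and the example sentence with its labels are all assumptions made for illustration.

```python
from typing import Dict, List

# The six 5W1H elements, in the order the abstract lists them.
ELEMENTS = ["who", "what", "when", "where", "why", "how"]

# B-x marks the first token of an element, I-x a continuation token,
# and "Other" any token outside all elements: 6 * 2 + 1 = 13 classes.
LABELS = [f"{prefix}-{element}" for element in ELEMENTS for prefix in ("B", "I")] + ["Other"]


def decode_spans(tokens: List[str], labels: List[str]) -> Dict[str, List[str]]:
    """Group labeled tokens into phrases per 5W1H element (hypothetical helper)."""
    spans: Dict[str, List[str]] = {element: [] for element in ELEMENTS}
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Store the phrase collected so far, if any, and reset the state.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans[current_type].append(" ".join(current_tokens))
        current_type, current_tokens = None, []

    for token, label in zip(tokens, labels):
        if label == "Other":
            flush()
            continue
        prefix, element = label.split("-", 1)
        if prefix == "B" or element != current_type:
            # A B- label, or an I- label that does not continue the open
            # phrase, starts a new phrase (a lenient choice, assumed here).
            flush()
            current_type = element
            current_tokens = [token]
        else:
            current_tokens.append(token)
    flush()
    return spans


# Made-up example sentence and labels, for illustration only.
example_tokens = ["Presiden", "Joko", "Widodo", "meresmikan", "jalan", "tol",
                  "di", "Bandung", "kemarin"]
example_labels = ["B-who", "I-who", "I-who", "B-what", "I-what", "I-what",
                  "Other", "B-where", "B-when"]
result = decode_spans(example_tokens, example_labels)
print(result)
```

In this sketch the classifier itself (the CNN-BLSTM) is out of scope; `example_labels` stands in for whatever labels a trained model would predict.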
format Theses
author NURDIN (NIM : 23515035), ARLIYANTI
title TOKEN CLASSIFICATION ON INDONESIAN ARTICLE FOR 5W1H EVENT EXTRACTION WITH CNN-BIDIRECTIONAL LSTM
url https://digilib.itb.ac.id/gdl/view/21251
_version_ 1822920111739109376