SEQUENCE TO SEQUENCE LEARNING USING WORD AND CHARACTER REPRESENTATION AND NAMED ENTITY INFORMATION FOR MACHINE TRANSLATION

Sequence-to-sequence learning directly maps a sequence of words in the source sentence to a sequence of words in the target sentence. Most sequence-to-sequence models use an RNN in an encoder-decoder framework. Neural Machine Translation (NMT) is one application of sequence...

Bibliographic Details
Main Author: Muhammad Shahih - NIM: 23516084, Khaidzir
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/28459
Institution: Institut Teknologi Bandung
id id-itb.:28459
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
description Sequence-to-sequence learning directly maps a sequence of words in the source sentence to a sequence of words in the target sentence. Most sequence-to-sequence models use an RNN in an encoder-decoder framework. Neural Machine Translation (NMT) is one application of sequence-to-sequence learning. NMT has recently surpassed phrase-based machine translation (PBMT), previously the best-performing machine translation technique. Most proposed NMT models use word representations as encoder input.

Combining word and character representations has been applied to sequence labeling, namely POS tagging and NER, and has been shown to improve performance on both tasks. Word and character representations have also been used in NMT, but not at the same time: character-based representations are used only for out-of-vocabulary (OOV) words. Using both word and character representations together has not previously been published for NMT, in particular for English-to-Indonesian translation. This research focuses on developing an NMT model that combines word and character representations. A word-based NMT model serves as the baseline. The proposed NMT model uses an encoder-decoder framework with a bidirectional LSTM encoder and an LSTM decoder with an attention mechanism. Decoding uses beam search, and several beam sizes are tested.

The character-based word representation is generated with a bidirectional LSTM and/or a CNN. The word representation itself is taken from pre-trained English GloVe word embeddings. The word and character representations are combined by concatenation.
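The concatenation step can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names `merge_char_vectors` and `encoder_input`, the tiny vector sizes, and the hard-coded numbers standing in for GloVe embeddings and for the bidirectional LSTM and CNN outputs are all assumptions made for illustration. The merge modes mirror the addition, averaging, and pointwise multiplication operations named in this description for the models that use both character encoders.

```python
from typing import List

Vector = List[float]


def merge_char_vectors(lstm_vec: Vector, cnn_vec: Vector, mode: str) -> Vector:
    """Merge two character-derived vectors of equal length (hypothetical helper)."""
    if len(lstm_vec) != len(cnn_vec):
        raise ValueError("character vectors must have the same length")
    if mode == "add":
        return [a + b for a, b in zip(lstm_vec, cnn_vec)]
    if mode == "average":
        return [(a + b) / 2.0 for a, b in zip(lstm_vec, cnn_vec)]
    if mode == "multiply":
        return [a * b for a, b in zip(lstm_vec, cnn_vec)]
    raise ValueError(f"unknown merge mode: {mode}")


def encoder_input(word_vec: Vector, char_vec: Vector) -> Vector:
    """Concatenate the word vector with the character-derived vector."""
    return list(word_vec) + list(char_vec)


# Placeholder values; real vectors would come from GloVe and trained encoders.
word_vec = [0.5, -1.0, 2.0]   # stands in for a pre-trained word embedding
lstm_vec = [1.0, 2.0]         # stands in for a bidirectional LSTM output
cnn_vec = [3.0, 4.0]          # stands in for a CNN output

char_vec = merge_char_vectors(lstm_vec, cnn_vec, "average")
print(encoder_input(word_vec, char_vec))
```

With concatenation, the encoder input size is the sum of the word and character vector sizes, whereas the three merge modes keep the character vector at its original size.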
From these, six NMT models were built with different input vector representations: (1) a word-based NMT model; (2) word representation concatenated with a character representation from a bidirectional LSTM; (3) word representation concatenated with a character representation from a CNN; (4) word representation concatenated with a character representation from both a bidirectional LSTM and a CNN, combined by addition; (5) the same, combined by averaging; and (6) the same, combined by pointwise multiplication. The models that concatenate word and character representations obtained BLEU scores 9.14 to 11.65 points higher than the baseline, except for the model that uses both a bidirectional LSTM and a CNN with the addition operation. The highest BLEU score achieved was 42.48, compared with 30.83 for the baseline model.

Nevertheless, the translations are still poor for sentences that contain named entities or numerical words. Named entity information is therefore added by annotating the training data with detected named entities and replacing numerical words with a special token. Two scenarios are tested: binary named entity annotation and n-ary (n=7) named entity annotation. The binary scenario marks each word as a named entity or not, while the 7-ary scenario labels each word with its named entity type (seven types). In both scenarios, after named entity annotation, numerical words are replaced with the special token <num>.
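The two annotation scenarios can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure used in the thesis: the `word|LABEL` output format, the `O` label for non-entities, the example type names `PERSON` and `LOCATION`, and the digit-based rule for spotting numerical words are all hypothetical, since the record does not specify them. The named entity tags are taken as given input rather than detected.

```python
import re
from typing import List

NUM_TOKEN = "<num>"
# Assumption: a "numerical word" is a token made of digits, optionally with
# a sign and thousands or decimal separators. Spelled-out numbers are ignored.
_NUMERIC = re.compile(r"^[+-]?\d+([.,]\d+)*$")


def annotate(tokens: List[str], ne_tags: List[str], binary: bool) -> List[str]:
    """Attach a named entity label to each token, then map numbers to <num>."""
    if len(tokens) != len(ne_tags):
        raise ValueError("tokens and ne_tags must have the same length")
    annotated = []
    for token, tag in zip(tokens, ne_tags):
        if binary:
            # Binary scenario: only record whether the word is a named entity.
            label = "O" if tag == "O" else "NE"
        else:
            # n-ary scenario: keep the named entity type as the label.
            label = tag
        if _NUMERIC.match(token):
            token = NUM_TOKEN
        annotated.append(f"{token}|{label}")
    return annotated


tokens = ["Jokowi", "visited", "Bandung", "in", "2018"]
ne_tags = ["PERSON", "O", "LOCATION", "O", "O"]  # hypothetical detector output

print(annotate(tokens, ne_tags, binary=True))
print(annotate(tokens, ne_tags, binary=False))
```

The only difference between the two scenarios in this sketch is how much of the detector's type information is kept in the label; the number replacement is identical in both.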
In the experiments, these annotations did not improve BLEU scores further over the models built without training data annotation. This is attributed to many alignment errors in the translation results. However, the BLEU scores obtained are still 5.81 to 10.1 points higher than the baseline model.
spelling id-itb.:28459 2018-10-01T10:09:31Z