SEQUENCE TO SEQUENCE LEARNING USING WORD AND CHARACTER REPRESENTATION AND NAMED ENTITY INFORMATION FOR MACHINE TRANSLATION
Sequence-to-sequence learning directly maps a sequence of words in a source sentence to a sequence of words in a target sentence. Most sequence-to-sequence models use an RNN within an encoder-decoder framework. Neural Machine Translation (NMT) is one application of sequence...
Main Author: | Muhammad Shahih - NIM: 23516084 , Khaidzir |
---|---|
Format: | Theses |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/28459 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:28459 |
---|---|
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Sequence-to-sequence learning directly maps a sequence of words in a source sentence to a sequence of words in a target sentence. Most sequence-to-sequence models use an RNN within an encoder-decoder framework. Neural Machine Translation (NMT) is one application of sequence-to-sequence learning. NMT has recently surpassed phrase-based machine translation (PBMT), which was previously the best-performing machine translation technique. Most proposed NMT systems use word representations as the encoder input.
Word and character representations have been combined for sequence labeling tasks such as POS tagging and NER, where the combination has been shown to improve performance. Word and character representations have also been used in NMT, but not at the same time: character-based representations are used only for out-of-vocabulary (OOV) words. The simultaneous use of both representations has not previously been published for NMT, in particular for translation from English to Indonesian. This research focuses on developing an NMT model that combines word and character representations. A word-based NMT model serves as the baseline. The proposed NMT model is built on an encoder-decoder framework with a bidirectional LSTM encoder and an LSTM decoder with an attention mechanism. Decoding uses beam search, and several beam sizes are tested.
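The beam search decoding mentioned above can be illustrated with a minimal sketch. The abstract does not specify the beam search variant used, so the function below is a generic version without length normalization, and the names `beam_search`, `toy_model`, and the `</s>` end token are assumptions made for illustration only.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens so far, summed log-probability)


def beam_search(
    step_log_probs: Callable[[List[str]], Dict[str, float]],
    beam_size: int,
    max_len: int,
    eos: str = "</s>",
) -> Hypothesis:
    """Keep the `beam_size` best partial translations at every step."""
    beams: List[Hypothesis] = [([], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + [token], score + logp))
        # Keep only the top `beam_size` expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for the decoder's next-token distribution."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix == ["a"]:
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if prefix == ["b"]:
        return {"z": math.log(0.9), "w": math.log(0.1)}
    return {"</s>": 0.0}


greedy_tokens, _ = beam_search(toy_model, beam_size=1, max_len=5)
wider_tokens, _ = beam_search(toy_model, beam_size=2, max_len=5)
```

In this toy distribution a beam of size 1 commits to the locally best first token, while a beam of size 2 recovers a sequence with higher overall probability, which is why testing several beam sizes can change the translation output.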
The character-based representation of a word is generated by a bidirectional LSTM and/or a CNN. The word representation itself is taken from pre-trained English GloVe word embeddings. The word and character representations are combined by concatenation. Six types of NMT model were built with different input vector representations: (1) word-based NMT; (2) word representation concatenated with a character representation from a bidirectional LSTM; (3) word representation concatenated with a character representation from a CNN; (4) word representation concatenated with a character representation from both a bidirectional LSTM and a CNN, combined by addition; (5) the same, combined by averaging; and (6) the same, combined by pointwise multiplication. The NMT models that concatenate word and character representations obtained BLEU scores 9.14 to 11.65 points higher than the baseline, with the exception of the model that combines the bidirectional LSTM and CNN by addition. The highest BLEU score achieved was 42.48, compared with 30.83 for the baseline model.
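The six input representations described above can be sketched as follows. This is an illustration under assumptions, not the thesis implementation: it assumes the bidirectional LSTM and CNN character vectors have equal length and are merged with each other before being concatenated with the word vector, and the names `build_input_vector` and `MERGE_OPS` are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]

# Ways of merging the two character-level vectors (assumed interpretation).
MERGE_OPS: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "add": lambda a, b: [x + y for x, y in zip(a, b)],
    "average": lambda a, b: [(x + y) / 2.0 for x, y in zip(a, b)],
    "multiply": lambda a, b: [x * y for x, y in zip(a, b)],
}


def build_input_vector(
    word_vec: Vector,
    char_lstm_vec: Optional[Vector] = None,
    char_cnn_vec: Optional[Vector] = None,
    merge: Optional[str] = None,
) -> Vector:
    """Return the encoder input vector for one word."""
    if char_lstm_vec is None and char_cnn_vec is None:
        return list(word_vec)  # word-based baseline
    if char_lstm_vec is not None and char_cnn_vec is not None:
        if merge not in MERGE_OPS:
            raise ValueError("merge must be 'add', 'average' or 'multiply'")
        if len(char_lstm_vec) != len(char_cnn_vec):
            raise ValueError("character vectors must have the same length")
        char_vec = MERGE_OPS[merge](char_lstm_vec, char_cnn_vec)
    else:
        # Only one character encoder is used.
        char_vec = char_lstm_vec if char_lstm_vec is not None else char_cnn_vec
    # Word and character representations are combined by concatenation.
    return list(word_vec) + list(char_vec)


word = [1.0, 2.0]        # stands in for a GloVe embedding
from_lstm = [0.5, 1.5]   # stands in for the BiLSTM character vector
from_cnn = [1.5, 0.5]    # stands in for the CNN character vector
averaged = build_input_vector(word, from_lstm, from_cnn, merge="average")
```

Concatenation keeps the pre-trained word vector intact and appends the character-level features, so the encoder input dimension grows by the size of the character vector.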
Nevertheless, the resulting translations are still not good enough for sentences that contain named entities or numerical words. Named entity information is therefore added by annotating the training data with detected named entities and replacing numerical words with a special token. Two scenarios are tested: binary named entity annotation and n-ary (n=7) named entity annotation. The binary scenario marks every word as either a named entity or not, while the 7-ary scenario labels every word with its named entity type (7 types of named entity). In both scenarios, after named entity annotation, numerical words are replaced with the special token <num>. In the experiments, these annotations did not improve the BLEU score further compared with the models built without training data annotation. This is attributed to many alignment errors in the translation results. However, the BLEU scores obtained are still 5.81 to 10.1 points higher than the baseline model. |
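The two annotation scenarios in the abstract can be sketched roughly as below. The abstract does not give the annotation format, the seven entity types, or the rule for detecting numerical words, so the `word|TAG` format, the example labels, the regular expression, and the function name `annotate` are all assumptions for illustration; the entity labels are taken as given from some external NER step.

```python
import re
from typing import List

NUM_TOKEN = "<num>"  # special token named in the abstract
# Assumed definition of a numerical word: digits with optional separators.
_NUMERIC = re.compile(r"^[+-]?\d+(?:[.,]\d+)*$")


def annotate(tokens: List[str], ne_labels: List[str], binary: bool) -> List[str]:
    """Attach a named entity tag to each token, then mask numerical words."""
    annotated = []
    for token, label in zip(tokens, ne_labels):
        if binary:
            # Binary scenario: named entity or not.
            tag = "O" if label == "O" else "NE"
        else:
            # n-ary scenario: keep the entity type itself.
            tag = label
        word = NUM_TOKEN if _NUMERIC.match(token) else token
        annotated.append(f"{word}|{tag}")
    return annotated


tokens = ["Obama", "visited", "Jakarta", "in", "2010"]
labels = ["PERSON", "O", "LOCATION", "O", "O"]  # placeholder label names
binary_view = annotate(tokens, labels, binary=True)
typed_view = annotate(tokens, labels, binary=False)
```

Both views replace the numeral with the same token, so the two scenarios differ only in how much entity type information each word carries.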
format |
Theses |
author |
Muhammad Shahih - NIM: 23516084 , Khaidzir |
title |
SEQUENCE TO SEQUENCE LEARNING USING WORD AND CHARACTER REPRESENTATION AND NAMED ENTITY INFORMATION FOR MACHINE TRANSLATION |
url |
https://digilib.itb.ac.id/gdl/view/28459 |
_version_ |
1821995078270320640 |
spelling |
id-itb.:28459 2018-10-01T10:09:31Z text |