POSTPROCESSING OF SUNDANESE INTO INDONESIAN STATISTICAL MACHINE TRANSLATION RESULTS USING MORPHOLOGICAL ANALYSIS
Statistical machine translation trained on a limited parallel corpus (a low-resource setting) leaves many source-language words untranslated in the target language. These words are known as unknown words or out-of-vocabulary (OOV) words. A large number of unknown words leads to poor translat...
Main Author: | Ardiyanti Suryani, Arie |
---|---|
Format: | Dissertations |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/37154 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:37154 |
---|---|
spelling |
id-itb.:37154 2019-03-19T09:52:25Z POSTPROCESSING OF SUNDANESE INTO INDONESIAN STATISTICAL MACHINE TRANSLATION RESULTS USING MORPHOLOGICAL ANALYSIS Ardiyanti Suryani, Arie Indonesia Dissertations morphological analysis, stemming, unknown word, OOV, BLEU score INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/37154 text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Statistical machine translation trained on a limited parallel corpus (a
low-resource setting) leaves many source-language words untranslated in the
target language. These words are known as unknown words or out-of-vocabulary
(OOV) words. A large number of unknown words leads to poor translation
quality. This problem also occurs when translating Sundanese into Indonesian,
because Sundanese currently has no parallel corpus or text processing
tools. The dissertation addresses improving the Sundanese-to-Indonesian text
produced by a statistical machine translation system by translating
OOV words using morphological analysis.
An affixed OOV word is translated by adding a postprocessing stage to the
statistical machine translation system. This postprocessing stage
consists of two processes: identifying the Sundanese affix
patterns and forming the affixed word in Indonesian. The first process is
carried out using a Sundanese morphological analyzer, or stemmer. This
stemmer is rule based and was built from the rules that generate
Sundanese affixed words. The first process produces one or more affix patterns and
the stem contained in an affixed OOV word. In the second process, these affix
patterns are mapped to Indonesian affix patterns and then combined
with the stem to generate an Indonesian affixed word.
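The two postprocessing steps described above can be sketched as follows. This is only a minimal illustration, not the dissertation's stemmer: the affix tables, the stem lexicon, and the function names `analyze` and `translate_oov` are hypothetical, the sketch assumes the stem is translated by a simple lexicon lookup, and it handles at most one prefix and one suffix per word.

```python
from typing import NamedTuple, Optional

# Hypothetical toy rule tables: Sundanese affix -> Indonesian affix.
PREFIX_MAP = {"di": "di"}
SUFFIX_MAP = {"keun": "kan", "an": "an"}
# Hypothetical toy stem lexicon: Sundanese stem -> Indonesian stem.
STEM_LEXICON = {"beuli": "beli", "tulis": "tulis"}


class Analysis(NamedTuple):
    prefix: str
    stem: str
    suffix: str


def analyze(word: str) -> Optional[Analysis]:
    """Step 1: strip at most one known prefix and one known suffix.

    An analysis is accepted only when the remaining stem is in the lexicon.
    The first match wins, so ambiguous words are not handled here.
    """
    for prefix in ["", *PREFIX_MAP]:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ["", *SUFFIX_MAP]:
            if not rest.endswith(suffix):
                continue
            stem = rest[: len(rest) - len(suffix)]
            if stem in STEM_LEXICON:
                return Analysis(prefix, stem, suffix)
    return None


def translate_oov(word: str) -> str:
    """Step 2: map the affixes and rebuild the Indonesian word.

    Words that cannot be analyzed are returned unchanged.
    """
    analysis = analyze(word)
    if analysis is None:
        return word
    return (
        PREFIX_MAP.get(analysis.prefix, "")
        + STEM_LEXICON[analysis.stem]
        + SUFFIX_MAP.get(analysis.suffix, "")
    )


print(translate_oov("dibeulikeun"))  # prints "dibelikan"
```

Because the sketch returns the first analysis it finds, it sidesteps the ambiguity in stemming and stem selection that the abstract lists among the remaining obstacles.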
The performance evaluation was done in two stages. The first observes
translation rule coverage using 4338 unique Sundanese affixed words
covering 106 affix patterns. The second identifies the
improvement that morphological analysis brings to translating OOV words. The second
evaluation is run with two training sets, of 2412 and 1204 sentence pairs. Rule
coverage in the first stage is measured by the accuracy of the
translation produced by each pattern, while the second stage measures
the proximity of the translation results to the reference file using the
BLEU score. In addition, the number of OOV words produced by the baseline (without
morphological analysis) and by the postprocessing scheme is also
counted.
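The OOV count can be illustrated with a small sketch. It assumes that an OOV word is any test token missing from the training vocabulary, and that a word the postprocessing stage can translate no longer counts as OOV. The function name `count_oov` and all tokens below are hypothetical placeholders, not data from the dissertation.

```python
from typing import Iterable, Set


def count_oov(sentences: Iterable[str], vocabulary: Set[str]) -> int:
    """Count tokens (with repetition) that are absent from the vocabulary."""
    return sum(
        1
        for sentence in sentences
        for token in sentence.lower().split()
        if token not in vocabulary
    )


# Placeholder training vocabulary and test sentences.
train_vocab = {"abdi", "buku", "ditulis"}
test_sentences = ["abdi dibeulikeun buku", "buku ditulis"]

# Baseline: every token outside the training vocabulary is OOV.
baseline_oov = count_oov(test_sentences, train_vocab)

# Postprocessing: words resolved by morphological analysis are no longer OOV.
resolved_by_postprocessing = {"dibeulikeun"}
postprocessed_oov = count_oov(test_sentences, train_vocab | resolved_by_postprocessing)

print(baseline_oov, postprocessed_oov)  # prints "1 0"
```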
The first evaluation shows that the rules translate 53% of the 105 affix types used in
the test data, with a translation accuracy of 72%. The second evaluation
gives an increase of 2.17 BLEU points (4.43%) on affixed word translation
and an increase of 3.65 BLEU points (7.45%) on OOV translation of affixed words,
stems, and simple reduplicated words.
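The abstract does not state the baseline BLEU score. Assuming each percentage is the gain relative to the baseline score, the baseline can be backed out from the reported figures. The helper name `implied_baseline` is hypothetical, and the resulting values are inferences, not numbers reported in the source.

```python
def implied_baseline(gain_points: float, relative_gain_percent: float) -> float:
    """Baseline BLEU implied by an absolute gain and its relative percentage."""
    return gain_points / (relative_gain_percent / 100.0)


reported = [
    ("affixed words", 2.17, 4.43),
    ("affixed words, stems, simple reduplication", 3.65, 7.45),
]

for label, gain, percent in reported:
    baseline = implied_baseline(gain, percent)
    print(f"{label}: baseline ~ {baseline:.2f}, after postprocessing ~ {baseline + gain:.2f}")
```

Both pairs of figures imply a baseline of roughly 49 BLEU, which suggests that the two gains are measured against the same baseline system.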
Remaining obstacles include ambiguity in
stemming, in selecting the stem meaning, and in forming the
Indonesian affixed word. In addition, borrowed words and
Indonesian affix patterns used in Sundanese also pose problems for this research. |
format |
Dissertations |
author |
Ardiyanti Suryani, Arie |
title |
POSTPROCESSING OF SUNDANESE INTO INDONESIAN STATISTICAL MACHINE TRANSLATION RESULTS USING MORPHOLOGICAL ANALYSIS |
url |
https://digilib.itb.ac.id/gdl/view/37154 |