POSTPROCESSING OF SUNDANESE INTO INDONESIAN STATISTICAL MACHINE TRANSLATION RESULTS USING MORPHOLOGICAL ANALYSIS

Main Author: Ardiyanti Suryani, Arie
Format: Dissertations
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/37154
Institution: Institut Teknologi Bandung
id id-itb.:37154
timestamp 2019-03-19T09:52:25Z
keywords morphological analysis, stemming, unknown word, OOV, BLEU score
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
description Statistical machine translation trained on a limited parallel corpus (a low-resource setting) leaves many source-language words untranslated in the target language. These words are known as unknown words or out-of-vocabulary (OOV) words, and a large number of them leads to poor translation quality. The problem also occurs in translating Sundanese into Indonesian, for which no parallel corpus or text processing tools are currently available. This dissertation addresses improving the Sundanese-to-Indonesian text translation produced by a statistical machine translation system by translating OOV words using morphological analysis. Affixed OOV words are translated by adding a postprocessing stage to the statistical machine translation system. This postprocessing stage consists of two processes: identifying the Sundanese affix patterns and forming the affixed word in Indonesian. The first process is carried out by a Sundanese morphological analyzer, or stemmer. The stemmer is rule-based and was built from the way Sundanese affixed words are generated. The first process produces one or more affix patterns and the stem word contained in an affixed OOV word. In the second process, these affix patterns are mapped to Indonesian affix patterns and then combined with the stem words to generate an Indonesian affixed word. Performance was evaluated in two stages. The first observes translation rule coverage using 4338 unique Sundanese affixed words covering 106 affix patterns. The second identifies the improvement that morphological analysis brings to OOV translation; this test uses 2412 and 1204 pairs of training sentences, respectively. Rule coverage in the first stage is assessed by measuring the accuracy of the translation produced by each pattern, while the second stage measures the proximity of the translation results to the reference file using the BLEU score. In addition, the number of OOV words produced by the baseline (without morphological analysis) and by the postprocessing scheme is counted. The first evaluation shows that the rules translate 53% of the 105 affix types used in the test data, with a translation accuracy of 72%. The second evaluation gives an increase of 2.17 BLEU points (4.43%) on affixed word translation and an increase of 3.65 BLEU points (7.45%) on OOV translation of affixed words, stems, and simple reduplicated words. The remaining obstacles are ambiguity in the stemming stage, in selecting the stem meaning, and in forming the Indonesian affixed word. In addition, borrowed words and the use of Indonesian affix patterns in Sundanese also pose problems for this research.
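The two-process postprocessing stage described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the affix table `AFFIX_MAP`, the stem dictionary `STEM_DICT`, the functions `stem`, `translate_oov`, and `postprocess`, and the example word are hypothetical and are not taken from the dissertation, whose actual rule set and stemmer are not given in this record.

```python
from __future__ import annotations

from typing import Optional

# Toy, hypothetical rule table (NOT from the dissertation): each Sundanese
# affix pattern (prefix, suffix) maps to an Indonesian (prefix, suffix).
AFFIX_MAP = {
    ("di", "keun"): ("di", "kan"),
    ("di", ""): ("di", ""),
    ("", "keun"): ("", "kan"),
}

# Toy stem dictionary (Sundanese stem -> Indonesian stem), also illustrative.
STEM_DICT = {"baca": "baca"}


def stem(word: str) -> list[tuple[tuple[str, str], str]]:
    """Process 1: return every (affix pattern, stem) analysis of the word
    whose remaining stem is found in the stem dictionary."""
    analyses = []
    for prefix, suffix in AFFIX_MAP:
        if not (word.startswith(prefix) and word.endswith(suffix)):
            continue
        core = word[len(prefix): len(word) - len(suffix)]
        if core in STEM_DICT:
            analyses.append(((prefix, suffix), core))
    return analyses


def translate_oov(word: str) -> Optional[str]:
    """Process 2: map the affix pattern to Indonesian and recombine it with
    the translated stem. Returns None when no rule applies."""
    analyses = stem(word)
    if not analyses:
        return None
    # Naive choice: take the first analysis, with no disambiguation.
    pattern, sundanese_stem = analyses[0]
    ind_prefix, ind_suffix = AFFIX_MAP[pattern]
    return ind_prefix + STEM_DICT[sundanese_stem] + ind_suffix


def postprocess(smt_output: list[str], oov: set[str]) -> list[str]:
    """Replace OOV tokens left untranslated in the SMT output when a rule
    matches; leave every other token unchanged."""
    return [(translate_oov(t) or t) if t in oov else t for t in smt_output]


if __name__ == "__main__":
    print(postprocess(["buku", "dibacakeun"], {"dibacakeun"}))
```

The sketch simply takes the first analysis it finds. The abstract names ambiguity in stemming, stem meaning selection, and Indonesian word formation as open obstacles, so a realistic system would need a disambiguation step at each of those points.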