PUNCTUATION INSERTION USING MACHINE TRANSLATION FOR INDONESIAN SENTENCES

Some texts, such as those produced by automatic speech recognition, usually do not contain punctuation marks. Inserting punctuation marks can improve the readability of the text. There are three approaches to punctuation insertion, namely language modeling, sequence label...

Bibliographic Details
Main Author: Rifaldi Utomo, Rifqi
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/49947
Institution: Institut Teknologi Bandung
id id-itb.:49947
spelling id-itb.:49947 2020-09-21T16:17:20Z PUNCTUATION INSERTION USING MACHINE TRANSLATION FOR INDONESIAN SENTENCES Rifaldi Utomo, Rifqi Indonesia Final Project punctuation insertion, Transformer, word embedding, checkpoint averaging INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/49947 Some texts, such as those produced by automatic speech recognition, usually do not contain punctuation marks. Inserting punctuation marks can improve the readability of the text. There are three approaches to punctuation insertion: language modeling, sequence labeling, and machine translation. This final project focuses on inserting punctuation using the machine translation approach and predicts four punctuation marks: period, comma, exclamation mark, and question mark. The dataset was obtained from speech and interview transcriptions scraped from the websites of the Ministry of State Secretariat of the Republic of Indonesia and the Secretariat of the Cabinet, with 199,019 lines of training data and 22,114 lines of test data. Because a machine translation approach is used, the length of the model's prediction may differ from the length of the input. To handle this difference, the final project cuts or pads the prediction results so that they match the input length. The best model is a Transformer that uses pre-trained word embeddings in the target language with averaging over 16 checkpoints. This model achieves a weighted average F1-score of 0.5292, with F1-scores for periods, commas, exclamation marks, and question marks of 0.5166, 0.5432, 0.2617, and 0.3929, respectively. text
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Some texts, such as those produced by automatic speech recognition, usually do not contain punctuation marks. Inserting punctuation marks can improve the readability of the text. There are three approaches to punctuation insertion: language modeling, sequence labeling, and machine translation. This final project focuses on inserting punctuation using the machine translation approach and predicts four punctuation marks: period, comma, exclamation mark, and question mark. The dataset was obtained from speech and interview transcriptions scraped from the websites of the Ministry of State Secretariat of the Republic of Indonesia and the Secretariat of the Cabinet, with 199,019 lines of training data and 22,114 lines of test data. Because a machine translation approach is used, the length of the model's prediction may differ from the length of the input. To handle this difference, the final project cuts or pads the prediction results so that they match the input length. The best model is a Transformer that uses pre-trained word embeddings in the target language with averaging over 16 checkpoints. This model achieves a weighted average F1-score of 0.5292, with F1-scores for periods, commas, exclamation marks, and question marks of 0.5166, 0.5432, 0.2617, and 0.3929, respectively.
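The length-matching step mentioned in the description can be illustrated with a minimal sketch. The record does not say how the predictions are represented or where padding is placed, so the following are assumptions made only for illustration: one predicted label per input word, a hypothetical "no punctuation" label `O` used as padding, padding added on the right, and the function name `align_to_input`.

```python
from typing import List

# Hypothetical "no punctuation" label used for padding (an assumption,
# not taken from the final project).
PAD_LABEL = "O"


def align_to_input(predicted: List[str], input_length: int,
                   pad: str = PAD_LABEL) -> List[str]:
    """Cut or right-pad `predicted` so it has exactly `input_length` items."""
    if input_length < 0:
        raise ValueError("input_length must be non-negative")
    trimmed = predicted[:input_length]     # cut predictions that run too long
    missing = input_length - len(trimmed)  # positions still unfilled, if any
    return trimmed + [pad] * missing       # pad predictions that are too short


words = ["apa", "kabar", "saya", "baik"]
too_long = ["O", "?", "O", ".", "O"]   # one label more than there are words
too_short = ["O", "?"]                 # two labels fewer than there are words

print(align_to_input(too_long, len(words)))   # ['O', '?', 'O', '.']
print(align_to_input(too_short, len(words)))  # ['O', '?', 'O', 'O']
```

After alignment, every input word has exactly one predicted label, so per-punctuation F1-scores can be computed position by position against the reference.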
format Final Project
author Rifaldi Utomo, Rifqi
spellingShingle Rifaldi Utomo, Rifqi
PUNCTUATION INSERTION USING MACHINE TRANSLATION FOR INDONESIAN SENTENCES
author_facet Rifaldi Utomo, Rifqi
author_sort Rifaldi Utomo, Rifqi
title PUNCTUATION INSERTION USING MACHINE TRANSLATION FOR INDONESIAN SENTENCES
title_sort punctuation insertion using machine translation for indonesian sentences
url https://digilib.itb.ac.id/gdl/view/49947
_version_ 1822928319715213312