PUNCTUATION INSERTION USING MACHINE TRANSLATION FOR INDONESIAN SENTENCES

Bibliographic Details
Main Author: Rifaldi Utomo, Rifqi
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/49947
Institution: Institut Teknologi Bandung
Description
Summary: Some texts, such as the output of automatic speech recognition, usually do not contain punctuation marks. Inserting punctuation marks improves the readability of such text. There are three approaches to punctuation insertion: language modeling, sequence labeling, and machine translation. This final project focuses on the machine translation approach and predicts four punctuation marks: period, comma, exclamation mark, and question mark. The dataset was obtained from speech and interview transcriptions scraped from the websites of the Ministry of State Secretariat of the Republic of Indonesia and the Secretariat of the Cabinet, with 199,019 lines of training data and 22,114 lines of test data. Because a machine translation approach is used, the model's prediction can differ in length from the input. To handle this difference, the final project truncates or pads the prediction so that it matches the input length. The best model is a Transformer that uses pre-trained word embeddings on the target-language side with averaging over 16 checkpoints. This model achieves a weighted average F1-score of 0.5292, with F1-scores for periods, commas, exclamation marks, and question marks of 0.5166, 0.5432, 0.2617, and 0.3929, respectively.
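
The length-matching step described in the summary can be illustrated with a minimal sketch. It is not the project's actual code: it assumes the model emits one punctuation label per input word, and the label names ("O", "QUESTION") and the choice of "O" (no punctuation) as the padding value are hypothetical.

```python
from typing import List

# Assumption: padding with "O" means "no punctuation after this word".
PAD_LABEL = "O"


def align_to_input(pred: List[str], input_len: int, pad: str = PAD_LABEL) -> List[str]:
    """Truncate or right-pad `pred` so it has exactly `input_len` items."""
    if input_len < 0:
        raise ValueError("input_len must be non-negative")
    trimmed = pred[:input_len]              # cut predictions that are too long
    missing = input_len - len(trimmed)      # 0 when nothing needs padding
    return trimmed + [pad] * missing        # pad predictions that are too short


if __name__ == "__main__":
    words = "apa kabar saudara".split()     # 3 input words
    too_long = ["O", "O", "QUESTION", "O"]  # 4 predicted labels
    too_short = ["O"]                       # 1 predicted label
    print(align_to_input(too_long, len(words)))
    print(align_to_input(too_short, len(words)))
```

With the lengths aligned this way, each predicted label can be compared position by position with the reference label, which is what a per-class F1-score needs.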