PUNCTUATION INSERTION USING MACHINE TRANSLATION FOR INDONESIAN SENTENCES
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/49947 |
Institution: | Institut Teknologi Bandung |
Summary: | Some texts, such as those produced by automatic speech recognition, usually
do not contain punctuation marks. Inserting punctuation marks in the text can
improve the readability of the text. There are three approaches for punctuation
insertion, namely language modeling, sequence labeling, and machine translation.
This final project focuses on inserting punctuation using a machine translation
approach.
This final project predicts four punctuation marks: period, comma, exclamation
mark, and question mark. The dataset consists of speech and interview
transcriptions scraped from the websites of the Ministry of State Secretariat of
the Republic of Indonesia and the Secretariat of the Cabinet, comprising 199,019
lines of training data and 22,114 lines of test data. Because
it uses a machine translation approach, there is a possibility of a length difference
between the model's prediction results and the input. To handle the
difference in length, this final project truncates or pads the prediction results
so that they match the input length.
The best model is a Transformer that uses pre-trained word embeddings in the
target language and averages 16 checkpoints. This model achieves a weighted
average F1-score of 0.5292, with F1-scores of 0.5166, 0.5432, 0.2617, and 0.3929
for periods, commas, exclamation marks, and question marks, respectively.
|
---|
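
The summary states that predictions are cut or padded to match the input length. A minimal sketch of that step follows, assuming (this is not confirmed by the record) that the model emits one label per input word and that a hypothetical "O" label means "no punctuation". The function name and label names are illustrative only and are not taken from the final project.

```python
from typing import List

# Hypothetical "no punctuation" label; the thesis's actual label set is not given here.
PAD_LABEL = "O"


def fit_to_input_length(
    predicted: List[str], input_length: int, pad_label: str = PAD_LABEL
) -> List[str]:
    """Cut or right-pad a predicted label sequence to exactly input_length items."""
    if input_length < 0:
        raise ValueError("input_length must be non-negative")
    # Drop any surplus labels beyond the input length.
    trimmed = predicted[:input_length]
    # Fill any shortfall with the padding label.
    return trimmed + [pad_label] * (input_length - len(trimmed))


if __name__ == "__main__":
    words = "selamat pagi apa kabar".split()  # 4 input words
    too_long = ["O", "COMMA", "O", "QUESTION", "O"]
    too_short = ["O", "COMMA"]
    print(fit_to_input_length(too_long, len(words)))
    print(fit_to_input_length(too_short, len(words)))
```

Padding on the right with the "no punctuation" label is one possible choice: it keeps every predicted label aligned with the word at the same position, at the cost of never predicting punctuation for the padded tail.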