PUNCTUATION PREDICTION IN AUTOMATIC SPEECH RECOGNITION SYSTEM RESULTS USING SEQUENCE LABELLING AND MACHINE TRANSLATION-BASED APPROACHES

Bibliographic Details
Main Author: Irfaan Dzakiy, M.
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/72145
Institution: Institut Teknologi Bandung
Description
Summary: Automatic Speech Recognition (ASR) systems output recognized speech as text, which is generally unpunctuated (Ostendorf et al., 2008). Formatting speech recognition results matters for both humans and machines, because it removes ambiguity in sentence meaning and makes the text usable in various NLP tasks. This research aims to add periods, commas, and question marks to ASR output. Punctuation prediction can be done using Language Modeling, Sequence Labelling, and Machine Translation approaches; in previous research, the best F1 scores were obtained with the Sequence Labelling and Machine Translation approaches. The sequence labelling approach uses a Conditional Random Fields (CRF) model with various word range and n_gram configurations (Lu and Ng, 2010). The machine translation approach uses a Neural Machine Translation model with RNN, Bi-RNN, CNN, and Transformer encoders and RNN, CNN, and Transformer decoders (Vandeghinste et al., 2018). The study used the Indo4B corpus and text data from YouTube automatic captions. It also tested which sampling technique best overcomes the imbalance in the number of punctuation marks in the dataset. Experiments varied the sampling method and the architecture configuration to find the best configuration. In these experiments, the best sampling method was Random Undersampling, which produces a dataset with a balanced distribution of punctuation marks. The best model was the CRF model with word range 6 and n_gram 3, with an F-measure of 78.69% for periods, 40.30% for commas, and 81.54% for question marks. In addition, ASR output was simulated at various recognition F1 scores. With the best CRF model on simulated ASR output at a 100% F1 score, the best F-measure was 66.59% for periods, 20.75% for commas, and 40.36% for question marks.
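
The sequence labelling framing and the per-mark F-measure reported in the summary can be illustrated with a minimal Python sketch. It is not the thesis implementation: the label names (`PERIOD`, `COMMA`, `QUESTION`, `O`), the helper functions, and the whitespace tokenization are assumptions made for illustration only. Each word is labelled with the punctuation mark that follows it, and the F-measure is computed separately for each mark from gold and predicted labels.

```python
from collections import Counter

# Hypothetical label names; the summary only states that periods,
# commas, and question marks are predicted.
PUNCT_LABELS = {".": "PERIOD", ",": "COMMA", "?": "QUESTION"}


def to_labelled_tokens(text):
    """Turn punctuated text into (word, label) pairs.

    The label is the mark that directly follows the word, or "O" if none.
    Whitespace tokenization is an assumption of this sketch.
    """
    pairs = []
    for token in text.split():
        label = "O"
        if token[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[token[-1]]
            token = token[:-1]
        if token:  # skip tokens that were only a punctuation mark
            pairs.append((token.lower(), label))
    return pairs


def f_measure_per_label(gold, predicted):
    """Compute the F-measure (F1) for each punctuation label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in PUNCT_LABELS.values():
        pred_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        precision = tp[label] / pred_total if pred_total else 0.0
        recall = tp[label] / gold_total if gold_total else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


pairs = to_labelled_tokens("apa kabar? saya baik, terima kasih.")
gold = [label for _, label in pairs]
# A made-up prediction that misses the comma.
predicted = ["O", "QUESTION", "O", "O", "O", "PERIOD"]
scores = f_measure_per_label(gold, predicted)
```

In this toy example the comma is missed, so its F-measure is zero while the other two marks score perfectly. The made-up sentence and prediction are not data from the study.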