PUNCTUATION PREDICTION IN AUTOMATIC SPEECH RECOGNITION SYSTEM RESULTS USING SEQUENCE LABELLING AND MACHINE TRANSLATION-BASED APPROACHES
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/72145 |
Institution: | Institut Teknologi Bandung |
Summary: | Automatic Speech Recognition (ASR) systems output recognized speech as text,
and this text is generally unpunctuated (Ostendorf et al., 2008). Formatting
speech recognition results matters for both humans and machines, because it
removes ambiguity in sentence meaning and makes the text usable in
various NLP tasks. This research aims to add periods, commas, and question
marks to the output of speech recognition systems.
Punctuation prediction can be done using Language Modeling, Sequence Labelling,
and Machine Translation approaches. In previous research, the best F1 scores were
obtained with the Sequence Labelling and Machine Translation approaches. The
sequence labelling approach uses a Conditional Random Fields (CRF) model with various
word range and n_gram configurations (Lu and Ng, 2010). The machine translation
approach uses a Neural Machine Translation model with RNN, Bi-RNN, CNN, and
Transformer encoders, and RNN, CNN, and Transformer decoders
(Vandeghinste et al., 2018). This study used the Indo4B corpus and text from
YouTube automatic captions. It also tested which sampling technique best
overcomes the imbalance in the number of punctuation marks in the dataset.
Experiments varied the sampling method and the architecture configuration
to find the best configuration. In these experiments, the best sampling
method was Random Undersampling, which
produces a dataset with a balanced distribution of punctuation marks. The best model
was the CRF model configured with word range 6 and n_gram 3. Its
F-measure is 78.69% for periods, 40.30% for commas, and
81.54% for question marks. In addition, ASR recognition at various F1 scores
was simulated. The best F-measure came from simulating ASR
with a 100% F1 score using the best CRF model: 66.59% for periods, 20.75%
for commas, and 40.36% for question marks. |
---|
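The sequence labelling formulation in the summary can be illustrated with a minimal sketch. It assumes each word is labelled with the punctuation mark that follows it, that "word range" means the number of neighbouring words considered on each side, and that "n_gram" means the maximum n-gram length taken from that window. These interpretations, the function names, the label names, and the example sentence are hypothetical and are not taken from the thesis.

```python
import re

# Hypothetical label names for the three punctuation marks in the study.
PUNCT_LABELS = {".": "PERIOD", ",": "COMMA", "?": "QUESTION"}


def text_to_labels(text):
    """Turn punctuated text into (word, label) pairs.

    Each word is labelled with the punctuation mark that follows it,
    or "O" when no mark follows.
    """
    pairs = []
    for token in re.findall(r"[^\s.,?]+|[.,?]", text.lower()):
        if token in PUNCT_LABELS:
            # Attach the mark to the preceding word if it has no label yet.
            if pairs and pairs[-1][1] == "O":
                pairs[-1] = (pairs[-1][0], PUNCT_LABELS[token])
        else:
            pairs.append((token, "O"))
    return pairs


def window_features(words, i, word_range=2, max_n=2):
    """Build n-gram features (n = 1..max_n) from the words within
    word_range positions of index i. Assumed feature design."""
    padded = ["<pad>"] * word_range + list(words) + ["<pad>"] * word_range
    center = i + word_range
    window = padded[center - word_range : center + word_range + 1]
    feats = {}
    for n in range(1, max_n + 1):
        for start in range(len(window) - n + 1):
            offset = start - word_range  # position relative to the current word
            feats[f"{n}gram[{offset}]"] = " ".join(window[start : start + n])
    return feats


pairs = text_to_labels("apa kabar? saya baik, terima kasih.")
words = [w for w, _ in pairs]
labels = [l for _, l in pairs]
features = [window_features(words, i) for i in range(len(words))]
```

Feature dictionaries of this shape, paired with the label sequence, are the kind of input a CRF toolkit typically expects. Under the same assumed interpretation, the best configuration reported above would correspond to `word_range=6` and `max_n=3`.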