DISAMBIGUATION OF STRUCTURAL AMBIGUITY IN INDONESIAN SPEECH BY UTILIZING PROSODIC INFORMATION BASED ON TRANSFORMER FRAMEWORKS FOR SPEECH-TO-TEXT TRANSLATION
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/75260 |
Institution: | Institut Teknologi Bandung |
Summary: | Ambiguity, particularly structural ambiguity, is one of the challenges in natural language that is still overlooked by most Indonesian speech recognition systems. No speech recognition system has utilized prosodic information to address structural ambiguity. Therefore, this study develops the first system capable of disambiguating structurally ambiguous utterances into unambiguous interpretation texts in Indonesian by using prosodic speech information from the utterances.
The contributions of this study are a structurally ambiguous speech corpus and an Indonesian speech disambiguation system. Building the corpus involved generating structurally ambiguous sentences, each with two interpretations, and recording them as speech. Two prosodic cues were used for disambiguation: pause and pitch. Pauses were captured by mel-spectrogram and energy features, and pitch by F0. The disambiguation systems adapt both the cascade and the direct approaches to speech-to-text mapping, as used in speech-to-text translation, and are built on the Transformer framework. The cascade approach combines an ASR system with a new Text Disambiguation (TD) model, while the direct approach consists of a single new Speech Disambiguation (SD) model.
Corpus construction yielded 400 structurally ambiguous sentences and 4,800 structurally ambiguous utterances in Indonesian. The results show that the disambiguation systems produce fairly accurate interpretation texts. The best-performing system is the direct approach with the mel-spectrogram concatenated with F0 and energy as audio input, which achieved an average disambiguation accuracy of 82.2%. The best cascade system, which adds meaning tags and uses the same input combination, performed slightly worse, with an average disambiguation accuracy of 79.6% (2.6 percentage points lower). |
---|
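
The best-performing input named in the summary, a mel-spectrogram concatenated with F0 and energy, can be illustrated at the frame level. The sketch below is not the thesis implementation: the function names `frame_rms_energy` and `concat_features`, the frame length and hop, and the toy values are all assumptions made for illustration, and the mel-spectrogram and F0 streams are assumed to be precomputed by some other tool.

```python
import math
from typing import List, Sequence


def frame_rms_energy(samples: Sequence[float], frame_len: int, hop: int) -> List[float]:
    """Root-mean-square energy per frame; frame_len and hop are in samples.

    Hypothetical helper: low values over several frames would hint at a pause.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def concat_features(
    mel: Sequence[Sequence[float]],
    f0: Sequence[float],
    energy: Sequence[float],
) -> List[List[float]]:
    """Append the F0 and energy values of each frame to that frame's mel bins.

    Assumes all three streams were extracted with the same frame length and
    hop, so that index i refers to the same stretch of audio in each stream.
    """
    if not (len(mel) == len(f0) == len(energy)):
        raise ValueError("all feature streams must have the same number of frames")
    return [list(m) + [p, e] for m, p, e in zip(mel, f0, energy)]


# Toy example: two frames, three mel bins each (values are made up).
samples = [0.0] * 4 + [1.0] * 4           # a silent frame, then a loud frame
energy = frame_rms_energy(samples, frame_len=4, hop=4)
mel = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # assumed precomputed mel-spectrogram
f0 = [0.0, 120.0]                         # assumed F0 in Hz, 0.0 for unvoiced
features = concat_features(mel, f0, energy)
print(features)
```

Each output frame has the mel bins followed by two extra dimensions, so a model consuming these frames would see the pitch and energy values alongside the spectral content of the same time step.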