HANDLING INDONESIAN-ENGLISH CODE-SWITCHING IN MULTILINGUAL TEXT-TO-SPEECH (TTS) USING STEN-TTS

Bibliographic Details
Main Author: Alfani Handoyo, Ahmad
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/83144
Institution: Institut Teknologi Bandung
Description
Summary: A multilingual text-to-speech (TTS) system generates speech from text in multiple languages. A sentence sometimes mixes languages, a phenomenon known as code-switching, which is common in Indonesia, particularly between Indonesian and English. However, no research has yet developed a multilingual TTS system that handles code-switching between these two languages. Autoregressive TTS models based on Tacotron 2 suffer from slow inference and from repeated or skipped words. Non-autoregressive TTS models, meanwhile, produce white noise when synthesizing cross-language speech from a brief reference speaker recording. The STEN-TTS model, which uses a Style-Enhanced Normalization (STEN) approach, eliminates this white noise and gives good results in five languages, including Indonesian and English, but it cannot yet handle code-switching. This research addresses Indonesian-English code-switching in STEN-TTS, which comprises text-to-phoneme conversion, a Style Encoder, an encoder, a language embedding, a variance adaptor, a decoder, and STEN components. The main modification adds a language identification component to the text-to-phoneme conversion, using a fine-tuned BERT model to identify the language of each word, and removes the language embedding component. Experiments show that the code-switching model produces more natural speech: its mean opinion score (MOS) rises by 1.216, to 3.379, relative to the English baseline STEN-TTS, and by 1.538, to 3.379, relative to the Indonesian baseline STEN-TTS. The code-switching model is also more intelligible: its word error rate (WER) falls by 24.75%, to 12.87%, relative to the English baseline STEN-TTS, and by 19.01%, to 12.87%, relative to the Indonesian baseline STEN-TTS.
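
The per-word language identification step described in the summary can be illustrated with a minimal Python sketch. It is not the author's implementation: the fine-tuned BERT tagger is replaced by a toy word-list lookup, and the names `identify_language`, `g2p_id`, `g2p_en`, and `text_to_phonemes` are hypothetical, as are the placeholder "phonemes" (plain letters). The sketch shows only the routing idea: each word receives a language tag, and the tag selects which text-to-phoneme converter handles that word, so no sentence-level language embedding is needed.

```python
import string
from typing import Callable, Dict, List, Tuple

# ASSUMPTION: toy word list standing in for the fine-tuned BERT
# word-level language identifier described in the summary.
ENGLISH_WORDS = {"meeting", "deadline", "update", "file"}


def identify_language(words: List[str]) -> List[str]:
    """Return one language tag ("id" or "en") per word."""
    return ["en" if w.lower() in ENGLISH_WORDS else "id" for w in words]


def g2p_id(word: str) -> List[str]:
    """Placeholder Indonesian converter: lowercase letters as 'phonemes'."""
    return list(word.lower())


def g2p_en(word: str) -> List[str]:
    """Placeholder English converter: uppercase letters as 'phonemes'."""
    return list(word.upper())


# The language tag selects the converter for each word.
G2P: Dict[str, Callable[[str], List[str]]] = {"id": g2p_id, "en": g2p_en}


def text_to_phonemes(text: str) -> List[Tuple[str, str, List[str]]]:
    """Tag every word with a language, then convert it with that language's G2P."""
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    tags = identify_language(words)
    return [(w, t, G2P[t](w)) for w, t in zip(words, tags)]


if __name__ == "__main__":
    for word, tag, phonemes in text_to_phonemes("Saya ada meeting besok."):
        print(word, tag, phonemes)
```

In a real system the word-list lookup would be replaced by a token classification model and the placeholder converters by language-specific phoneme dictionaries or grapheme-to-phoneme models.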