HANDLING INDONESIAN-ENGLISH CODE-SWITCHING IN MULTILINGUAL TEXT-TO-SPEECH (TTS) USING STEN-TTS
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/83144 |
Institution: | Institut Teknologi Bandung |
Summary: | A multilingual text-to-speech (TTS) system generates speech from text in
multiple languages. Sentences sometimes contain parts in different languages,
a phenomenon known as code-switching. It is common in Indonesia, particularly
between Indonesian and English. However, no research has yet developed a
multilingual TTS system that handles code-switching between these two languages.
Autoregressive TTS models based on Tacotron 2 suffer from slow inference and
from repeated or skipped words. Non-autoregressive TTS models, meanwhile,
produce white noise when synthesizing cross-language speech from short
reference speaker recordings. The STEN-TTS model, with its Style-Enhanced
Normalization (STEN) approach, eliminates this white noise and gives good
results in five languages, including Indonesian and English, but does not yet
support code-switching.
This research addresses Indonesian-English code-switching in STEN-TTS, which
comprises the following components: text-to-phoneme conversion, Style Encoder,
encoder, language embedding, variance adaptor, decoder, and STEN. The main
modification to STEN-TTS adds a language identification component to the
text-to-phoneme conversion, using a fine-tuned BERT model to identify the
language of each word, and removes the language embedding component.
Experiments show that the code-switching model produces more natural speech:
its mean opinion score (MOS) rises by 1.216 to 3.379 compared to the English
baseline STEN-TTS, and by 1.538 to 3.379 compared to the Indonesian baseline
STEN-TTS. The code-switching model is also more intelligible: its word error
rate (WER) falls by 24.75% to 12.87% compared to the English baseline STEN-TTS,
and by 19.01% to 12.87% compared to the Indonesian baseline STEN-TTS. |
---|
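
The summary describes identifying the language of each word before text-to-phoneme conversion. The following is a minimal sketch of that routing idea only, not the thesis implementation. The word list `ENGLISH_WORDS`, the function `identify_language`, and the two placeholder phonemizers are all hypothetical: a small lexicon lookup stands in for the fine-tuned BERT tagger, and the phonemizers merely tag each word with a language prefix instead of producing real phonemes.

```python
import string
from typing import Callable, Dict, List, Tuple

# Hypothetical toy lexicon. In the thesis this decision is made by a
# fine-tuned BERT model; a set lookup is used here only for illustration.
ENGLISH_WORDS = {"meeting", "deadline", "download", "please", "the"}


def identify_language(word: str) -> str:
    """Return 'en' or 'id' for a single word (toy stand-in for the tagger)."""
    return "en" if word.lower() in ENGLISH_WORDS else "id"


def phonemize_en(word: str) -> str:
    """Placeholder English phonemizer: only tags the word, no real phonemes."""
    return "en:" + word.lower()


def phonemize_id(word: str) -> str:
    """Placeholder Indonesian phonemizer: only tags the word, no real phonemes."""
    return "id:" + word.lower()


# Route each language label to its (placeholder) phonemizer.
PHONEMIZERS: Dict[str, Callable[[str], str]] = {
    "en": phonemize_en,
    "id": phonemize_id,
}


def text_to_phonemes(sentence: str) -> List[Tuple[str, str, str]]:
    """Label every word with a language, then phonemize it accordingly."""
    result = []
    for token in sentence.split():
        word = token.strip(string.punctuation)
        if not word:
            continue  # token was punctuation only
        lang = identify_language(word)
        result.append((word, lang, PHONEMIZERS[lang](word)))
    return result


if __name__ == "__main__":
    for item in text_to_phonemes("Saya ada meeting jam tiga."):
        print(item)
```

Because the language is decided per word at the text-to-phoneme stage, a sketch like this needs no separate sentence-level language input, which is consistent with the summary's note that the language embedding component was removed.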