HANDLING INDONESIAN-ENGLISH CODE-SWITCHING IN MULTILINGUAL TEXT-TO-SPEECH (TTS) USING STEN-TTS
A multilingual text-to-speech (TTS) is a system that generates speech from text in multiple languages. Sometimes, text sentences contain parts in different languages, known as code-switching. This phenomenon is common in Indonesia, particularly between Indonesian and English. However, no research...
Main Author: | Alfani Handoyo, Ahmad |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/83144 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:83144 |
---|---|
spelling |
id-itb.:83144; 2024-08-03T10:10:04Z; code-switching, multilingual text-to-speech, STEN-TTS, language identification, fine-tuned BERT |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
A multilingual text-to-speech (TTS) system generates speech from text in multiple languages. A single sentence sometimes mixes languages, a phenomenon known as code-switching, which is common in Indonesia, particularly between Indonesian and English. However, no research has yet developed a multilingual TTS system that handles code-switching between these two languages. Autoregressive TTS models based on Tacotron 2 suffer from slow inference and from repeated or skipped words. Non-autoregressive TTS models, meanwhile, produce white noise when synthesizing cross-language speech from a brief reference recording of the speaker. The STEN-TTS model, which uses a Style-Enhanced Normalization (STEN) approach, eliminates the white noise and gives good results in five languages, including Indonesian and English, but it cannot yet handle code-switching.
This research addresses Indonesian-English code-switching in STEN-TTS, which comprises text-to-phoneme conversion, a Style Encoder, an encoder, a language embedding, a variance adaptor, a decoder, and STEN components. The main modification adds a language identification component to the text-to-phoneme conversion, using a fine-tuned BERT model to identify the language of each word, and removes the language embedding component. Experiments show that the code-switching model produces more natural speech, with an increase in mean opinion score (MOS) of 1.216, to 3.379, compared to the English baseline STEN-TTS, and an increase of 1.538, to 3.379, compared to the Indonesian baseline STEN-TTS. The code-switching model is also more intelligible, with a reduction in word error rate (WER) of 24.75%, to 12.87%, compared to the English baseline STEN-TTS, and a reduction of 19.01%, to 12.87%, compared to the Indonesian baseline STEN-TTS. |
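The per-word language identification step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the record does not show the model or its interface, so the tagger below is a hypothetical lexicon lookup standing in for the fine-tuned BERT classifier, and the names `toy_language_id`, `split_by_language`, and `ENGLISH_WORDS` are invented for illustration. The sketch only shows the general idea of tagging each word and grouping neighbouring words of the same language, so that each group could then be passed to a language-specific text-to-phoneme converter.

```python
from itertools import groupby
from typing import Callable, List, Tuple

# Hypothetical stand-in for the fine-tuned BERT tagger: a tiny English lexicon.
# A real system would replace this lookup with a trained per-word classifier.
ENGLISH_WORDS = {"meeting", "deadline", "update", "share", "file"}


def toy_language_id(word: str) -> str:
    """Return 'en' or 'id' for a single word (placeholder classifier)."""
    return "en" if word.lower().strip(".,!?") in ENGLISH_WORDS else "id"


def split_by_language(
    sentence: str, tagger: Callable[[str], str] = toy_language_id
) -> List[Tuple[str, str]]:
    """Tag every word, then merge adjacent words that share a language tag."""
    tagged = [(tagger(word), word) for word in sentence.split()]
    return [
        (lang, " ".join(word for _, word in group))
        for lang, group in groupby(tagged, key=lambda pair: pair[0])
    ]


# Each (language, text) span could be phonemized with that language's rules.
spans = split_by_language("Tolong update file itu sebelum deadline besok")
for lang, text in spans:
    print(lang, text)
```

Passing the tagger in as a parameter keeps the grouping logic independent of how the language of each word is actually predicted.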
format |
Final Project |
author |
Alfani Handoyo, Ahmad |
title |
HANDLING INDONESIAN-ENGLISH CODE-SWITCHING IN MULTILINGUAL TEXT-TO-SPEECH (TTS) USING STEN-TTS |