DEVELOPMENT OF INDONESIAN VOICE CLONING TEXT-TO-SPEECH SYSTEM WITH VALL-E BASED MODEL
In recent years, Text-to-Speech (TTS) technology has continued to evolve. Many studies focus on multi-speaker TTS that can clone human voices by capturing characteristics from the voice. In 2023, Wang et al. proposed a new approach for voice cloning TTS systems using a Transformer-based neural co...
Main Author: | Raditya Pratama Roosadi, Hizkia |
---|---|
Format: | Theses |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/86054 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:86054 |
---|---|
spelling |
id-itb.:86054; 2024-09-13T08:27:09Z; DEVELOPMENT OF INDONESIAN VOICE CLONING TEXT-TO-SPEECH SYSTEM WITH VALL-E BASED MODEL; Raditya Pratama Roosadi, Hizkia; Indonesia; Theses; TTS, voice cloning, Vall-E, transformer, neural codec language model, speech enhancement; INSTITUT TEKNOLOGI BANDUNG; https://digilib.itb.ac.id/gdl/view/86054; text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
In recent years, Text-to-Speech (TTS) technology has continued to evolve. Many
studies focus on multi-speaker TTS that can clone a human voice by capturing
the speaker's vocal characteristics. In 2023, Wang et al. proposed a new approach
to voice cloning TTS using a Transformer-based neural codec language
model, Vall-E, which achieved state-of-the-art performance. For the Indonesian
language, no TTS study has yet used a language model
approach like Vall-E. There is also room to improve the speech
synthesized by systems using the Vall-E model.
This thesis develops a TTS system using the Vall-E model and enhances the
system's synthesis output. The dataset, containing audio-transcript pairs, is taken
from previous research on Indonesian language speech processing. Data
processing and preparation are carried out by converting the audio into audio
codec tokens and the transcripts into phoneme tokens. Afterward, the neural
codec language model is trained following Wang et al. (2023) with the help of
open-source tools (Li, 2023). System components are then assembled to generate
Indonesian speech. As a form of enhancement, this thesis also adds a speech
enhancement component, implemented using the VoiceFixer tool (Liu et al., 2022).
Speech enhancement with VoiceFixer raised the naturalness MOS (mean opinion
score) from 3.34 to 3.95, showing that speech enhancement can improve the
naturalness of synthesized speech. Overall, the TTS system obtained a
naturalness MOS of 3.489 and a similarity MOS of 3.521. The system achieved a
WER (word error rate) of 19.71%, and the similarity of its speaker embedding
vectors can be visualized. These results indicate that a TTS system using the
Vall-E model can generate Indonesian speech with a good resemblance to the
target speaker. The evaluation also highlighted that the number of speakers,
data selection, processing components, modeling, and speech duration during
training are important in determining synthesis quality. |
format |
Theses |
author |
Raditya Pratama Roosadi, Hizkia |
spellingShingle |
Raditya Pratama Roosadi, Hizkia DEVELOPMENT OF INDONESIAN VOICE CLONING TEXT-TO-SPEECH SYSTEM WITH VALL-E BASED MODEL |
author_facet |
Raditya Pratama Roosadi, Hizkia |
author_sort |
Raditya Pratama Roosadi, Hizkia |
title |
DEVELOPMENT OF INDONESIAN VOICE CLONING TEXT-TO-SPEECH SYSTEM WITH VALL-E BASED MODEL |
title_short |
DEVELOPMENT OF INDONESIAN VOICE CLONING TEXT-TO-SPEECH SYSTEM WITH VALL-E BASED MODEL |
title_full |
DEVELOPMENT OF INDONESIAN VOICE CLONING TEXT-TO-SPEECH SYSTEM WITH VALL-E BASED MODEL |
title_fullStr |
DEVELOPMENT OF INDONESIAN VOICE CLONING TEXT-TO-SPEECH SYSTEM WITH VALL-E BASED MODEL |
title_full_unstemmed |
DEVELOPMENT OF INDONESIAN VOICE CLONING TEXT-TO-SPEECH SYSTEM WITH VALL-E BASED MODEL |
title_sort |
development of indonesian voice cloning text-to-speech system with vall-e based model |
url |
https://digilib.itb.ac.id/gdl/view/86054 |
_version_ |
1822999419626192896 |
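
The description above reports a WER (word error rate) of 19.71%. As a minimal sketch of how such a figure is commonly computed, the function below divides the word-level edit distance between a reference transcript and a recognized transcript by the number of reference words. The record does not say which speech recognizer or text normalization the thesis used, so the function name `word_error_rate` and the Indonesian sample sentences are illustrative assumptions, not the author's procedure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; the thesis's own WER pipeline
    (ASR model, normalization) is not described in this record.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Assumed sample pair: one reference word ("makan") is dropped.
    ref_text = "saya suka makan nasi goreng"
    hyp_text = "saya suka nasi goreng"
    print(f"WER = {word_error_rate(ref_text, hyp_text):.2%}")
```

In a TTS evaluation, the hypothesis would typically come from running a speech recognizer on the synthesized audio, so the resulting WER reflects both synthesis intelligibility and recognizer errors.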