DEVELOPMENT OF INDONESIAN VOICE CLONING TEXT-TO-SPEECH SYSTEM WITH VALL-E BASED MODEL
Main Author: | |
---|---|
Format: | Theses |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/86054 |
Institution: | Institut Teknologi Bandung |
Summary: | In recent years, Text-to-Speech (TTS) technology has continued to evolve. Many
studies focus on multi-speaker TTS that can clone human voices by capturing
characteristics from the voice. In 2023, Wang et al. proposed a new approach for
voice cloning TTS systems using a Transformer-based neural codec language
model, VALL-E, which achieved state-of-the-art performance. For the Indonesian
language, no TTS study has yet used a language-model approach such as
VALL-E. The speech synthesized by VALL-E-based systems also leaves room for
improvement.
This thesis develops an Indonesian TTS system based on the VALL-E model and
enhances the system's synthesis output. The dataset of audio-transcript pairs is
taken from previous research on Indonesian speech processing. The data are
prepared by converting the audio into audio codec tokens and the transcripts
into phoneme tokens. The neural codec language model is then trained following
Wang et al. (2023) using open-source tools (Li, 2023), and the system components
are assembled to generate Indonesian speech. To enhance the output, this thesis
also adds a speech enhancement component, implemented with the VoiceFixer tool
(Liu et al., 2022).
Speech enhancement with VoiceFixer raised the naturalness MOS (mean opinion
score) from 3.34 before enhancement to 3.95, which demonstrates that speech
enhancement can improve the naturalness of synthesized speech. Overall, the TTS
system obtained a naturalness MOS of 3.489 and a similarity MOS of 3.521. It
achieved a WER (word error rate) of 19.71%, and the similarity of its speaker
embedding vectors can be seen when they are visualized. These results indicate
that a TTS system using the VALL-E model can generate Indonesian speech that
resembles the speaker well. The evaluation also highlighted the importance of
the number of speakers, data selection, processing components, modeling, and
speech duration during training in determining synthesis quality. |
---|
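
The summary reports a word error rate and a speaker embedding similarity without saying how they were computed. As an illustration only, the sketch below shows one common way to compute both: WER as word-level edit distance divided by the number of reference words, and speaker similarity as the cosine similarity between two embedding vectors. The function names, the whitespace tokenization, and the choice of cosine similarity are assumptions made for this example and are not taken from the thesis.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace and compared case-insensitively.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)
```

In an evaluation of this kind, the hypothesis would typically be a speech recognizer's transcript of the synthesized audio, and the two vectors would come from a speaker encoder applied to the reference and synthesized utterances. Both of those components are outside this sketch.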