DEVELOPMENT OF INDONESIAN VOICE CLONING TEXT-TO-SPEECH SYSTEM WITH VALL-E BASED MODEL

Bibliographic Details
Main Author:	Raditya Pratama Roosadi, Hizkia
Format:	Theses
Language:	Indonesian
Online Access:	https://digilib.itb.ac.id/gdl/view/86054
Institution:	Institut Teknologi Bandung

Description
Summary:	Text-to-Speech (TTS) technology has continued to evolve in recent years, and many studies focus on multi-speaker TTS that clones a human voice by capturing the characteristics of that voice. In 2023, Wang et al. proposed Vall-E, a Transformer-based neural codec language model for voice cloning TTS that achieved state-of-the-art performance. No TTS study for the Indonesian language has yet used a language model approach like Vall-E, and the speech synthesized by Vall-E-based systems still leaves room for improvement. This thesis develops an Indonesian TTS system using the Vall-E model and enhances the system's synthesis output. The dataset of audio-transcript pairs is taken from previous research on Indonesian speech processing. The data is prepared by converting the audio into audio codec tokens and the transcripts into phoneme tokens. The neural codec language model is then trained following Wang et al. (2023) with the help of open-source tools (Li, 2023), and the system components are assembled to generate Indonesian speech. As an enhancement, this thesis adds a speech enhancement component implemented with the VoiceFixer tool (Liu et al., 2022). Speech enhancement with VoiceFixer raised the naturalness MOS (mean opinion score) from 3.34 before enhancement to 3.95, which demonstrates that speech enhancement can improve the naturalness of synthesized speech. Overall, the TTS system obtained a naturalness MOS of 3.489 and a similarity MOS of 3.521. The system achieved a WER (word error rate) of 19.71%, and the similarity between speaker embedding vectors can be shown through visualization. These results indicate that a TTS system using the Vall-E model can generate Indonesian speech with a good resemblance to the speaker.
The evaluation also highlighted how the number of speakers, data selection, processing components, modeling, and speech duration during training determine synthesis quality.
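The two objective measures named in the summary, WER and speaker embedding similarity, can be illustrated with a minimal sketch. The record does not say how the thesis computed them, so the code below makes assumptions: WER is taken as the word-level edit distance divided by the number of reference words, and similarity is taken as the cosine between two embedding vectors. The function names, the example sentences, and the example vectors are hypothetical and are not taken from the thesis.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: simple lowercase + whitespace tokenization; a real
    evaluation would likely normalize punctuation and numbers as well.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical transcript of the input text and a hypothetical
    # recognizer output for the synthesized audio.
    reference = "saya pergi ke pasar hari ini"
    hypothesis = "saya pergi pasar hari itu"
    print(f"WER: {word_error_rate(reference, hypothesis):.2%}")

    # Hypothetical low-dimensional stand-ins for speaker embeddings.
    original = [0.2, 0.7, 0.1]
    synthesized = [0.25, 0.65, 0.05]
    print(f"similarity: {cosine_similarity(original, synthesized):.3f}")
```

In the hypothetical example, one reference word is dropped and one is substituted, so two of six reference words are in error. A lower WER indicates more intelligible synthesis, and a cosine similarity closer to 1 indicates that the synthesized voice lies nearer to the original speaker in embedding space.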


Similar Items