DEVELOPMENT OF INDONESIAN VOICE CLONING TEXT-TO-SPEECH SYSTEM WITH VALL-E BASED MODEL

In recent years, Text-to-Speech (TTS) technology has continued to evolve. Many studies focus on multi-speaker TTS that can clone human voices by capturing characteristics from the voice. In 2023, Wang et al. proposed a new approach for voice cloning TTS systems using a Transformer-based neural co...

Bibliographic Details
Main Author:	Raditya Pratama Roosadi, Hizkia
Format:	Theses
Language:	Indonesia
Online Access:	https://digilib.itb.ac.id/gdl/view/86054
Institution:	Institut Teknologi Bandung

id	id-itb.:86054
spelling	id-itb.:86054; 2024-09-13T08:27:09Z; TTS, voice cloning, Vall-E, transformer, neural codec language model, speech enhancement
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	Text-to-Speech (TTS) technology has continued to evolve in recent years. Many studies focus on multi-speaker TTS that clones a human voice by capturing its characteristics. In 2023, Wang et al. proposed Vall-E, a Transformer-based neural codec language model for voice cloning TTS, which achieved state-of-the-art performance. No TTS study for the Indonesian language has yet used a language model approach like Vall-E, and the speech synthesized by Vall-E-based systems still leaves room for improvement. This thesis develops an Indonesian TTS system using the Vall-E model and enhances the system's synthesis output. The dataset of audio-transcript pairs is taken from previous research on Indonesian speech processing. During data preparation, the audio is converted into audio codec tokens and the transcripts into phoneme tokens. The neural codec language model is then trained following Wang et al. (2023) with the help of open-source tools (Li, 2023), and the system components are assembled to generate Indonesian speech. As an enhancement, the thesis adds a speech enhancement component implemented with the VoiceFixer tool (Liu et al., 2022). Applying VoiceFixer raised the naturalness MOS (mean opinion score) from 3.34 to 3.95, which demonstrates that speech enhancement can improve the naturalness of synthesized speech. Overall, the TTS system obtained a naturalness MOS of 3.489 and a similarity MOS of 3.521. The system achieved a WER (word error rate) of 19.71%, and the similarity of its speaker embedding vectors can be visualized. This indicates that a TTS system using the Vall-E model can generate Indonesian speech with a good resemblance to the speaker. The evaluation also highlighted the importance of the number of speakers, data selection, processing components, modeling, and speech duration during training in determining synthesis quality.
format	Theses
author	Raditya Pratama Roosadi, Hizkia
title	DEVELOPMENT OF INDONESIAN VOICE CLONING TEXT-TO-SPEECH SYSTEM WITH VALL-E BASED MODEL
url	https://digilib.itb.ac.id/gdl/view/86054
_version_	1822999419626192896
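
The abstract reports a WER (word error rate) of 19.71% but does not say how it was computed. As a rough illustration only, the sketch below uses the standard definition: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The function name and the Indonesian example sentences are hypothetical and are not taken from the thesis.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: lowercase and split on whitespace; the thesis may have
    normalized its transcripts differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one word ("ke") is missing from the hypothesis.
    print(word_error_rate("saya pergi ke pasar", "saya pergi pasar"))
```

In a TTS evaluation, the hypothesis would typically come from running a speech recognizer on the synthesized audio; which recognizer the thesis used is not stated in this record.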

Similar Items