DEVELOPMENT OF AN INDONESIAN VOICE CLONING MODEL USING YOURTTS

Voice cloning is the process of speech synthesis that efficiently uses data to produce various speaker voice characteristics. This method was developed to address the limitations of conventional text-to-speech (TTS) models, which typically generate only one type of voice characteristic. The focus...

Bibliographic Details
Main Author:	Syauqy Irsyad, Azka
Format:	Final Project
Language:	Indonesia
Online Access:	https://digilib.itb.ac.id/gdl/view/85056
Institution:	Institut Teknologi Bandung

id	id-itb.:85056
spelling	id-itb.:85056 2024-08-19T14:04:41Z DEVELOPMENT OF AN INDONESIAN VOICE CLONING MODEL USING YOURTTS Syauqy Irsyad, Azka Indonesia Final Project voice cloning, speech synthesis, Indonesian language, YourTTS, zero-shot learning INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/85056 text
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	Voice cloning is a form of speech synthesis that uses data efficiently to reproduce the voice characteristics of different speakers. It was developed to address a limitation of conventional text-to-speech (TTS) models, which typically generate only a single voice. The goal of this work is to build a model capable of zero-shot learning. Achieving high-quality synthesis depends on the availability and quality of the dataset and on choosing an appropriate TTS model as the basis of the voice cloning model. This research develops a voice cloning model for the Indonesian language using YourTTS, aiming for high similarity and naturalness in the synthesized voice. YourTTS was chosen because it can be fine-tuned and supports multiple languages, which suits the limited Indonesian dataset. Two experiments were conducted, producing two voice synthesis models that differ in their spectrogram segment size, inference noise scale, length scale, and duration predictor noise scale. Both models were evaluated objectively with the speaker encoder cosine similarity (SECS) metric and subjectively with the mean opinion score (MOS) metric to assess the similarity and naturalness of the generated voices. The objective evaluation of the two models yielded an average SECS of 0.8413 for seen speakers and 0.8603 for unseen speakers. The subjective evaluation yielded a MOS similarity score of 3.62 for the first model and 3.49 for the second, and a MOS naturalness score of 3.16 for the first model and 3.29 for the second. Based on both evaluations, the developed models demonstrate sufficiently high synthesis quality in terms of voice similarity and naturalness.
format	Final Project
author	Syauqy Irsyad, Azka
spellingShingle	Syauqy Irsyad, Azka DEVELOPMENT OF AN INDONESIAN VOICE CLONING MODEL USING YOURTTS
author_facet	Syauqy Irsyad, Azka
author_sort	Syauqy Irsyad, Azka
title	DEVELOPMENT OF AN INDONESIAN VOICE CLONING MODEL USING YOURTTS
title_short	DEVELOPMENT OF AN INDONESIAN VOICE CLONING MODEL USING YOURTTS
title_full	DEVELOPMENT OF AN INDONESIAN VOICE CLONING MODEL USING YOURTTS
title_fullStr	DEVELOPMENT OF AN INDONESIAN VOICE CLONING MODEL USING YOURTTS
title_full_unstemmed	DEVELOPMENT OF AN INDONESIAN VOICE CLONING MODEL USING YOURTTS
title_sort	development of an indonesian voice cloning model using yourtts
url	https://digilib.itb.ac.id/gdl/view/85056
_version_	1822998904263671808
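
The description field in this record names two evaluation metrics, SECS and MOS, without showing how they are computed. The following is a minimal Python sketch of both, not the author's evaluation code. It assumes the speaker embeddings have already been extracted by some speaker encoder (not shown here) and are plain lists of floats, and it assumes MOS ratings are on a 1 to 5 scale, which the record does not state. The function names and the toy values are hypothetical and are not data from the thesis.

```python
import math
from statistics import mean


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def average_secs(pairs):
    """Mean cosine similarity over (reference, synthesized) embedding pairs."""
    return mean([cosine_similarity(ref, syn) for ref, syn in pairs])


def mean_opinion_score(ratings):
    """Arithmetic mean of listener ratings (assumed 1-5 scale)."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


# Toy, made-up values for illustration only (not results from the thesis).
toy_pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # same direction
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal
]
toy_ratings = [4, 3, 5, 4]

print(average_secs(toy_pairs))
print(mean_opinion_score(toy_ratings))
```

In this sketch a SECS value lies between -1 and 1, with values closer to 1 indicating that the synthesized voice is closer to the reference speaker in the encoder's embedding space. Separate averages for seen and unseen speakers, as reported in the description, would be obtained by calling average_secs on each group of pairs.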

Similar Items