DEVELOPMENT OF AN INDONESIAN VOICE CLONING MODEL USING YOURTTS

Bibliographic Details
Main Author: Syauqy Irsyad, Azka
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/85056
Institution: Institut Teknologi Bandung
id id-itb.:85056
spelling id-itb.:85056 2024-08-19T14:04:41Z
keywords voice cloning, speech synthesis, Indonesian language, YourTTS, zero-shot learning
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
description Voice cloning is a form of speech synthesis that uses data efficiently to reproduce the voice characteristics of different speakers. It was developed to address a limitation of conventional text-to-speech (TTS) models, which typically generate only a single voice. The goal of this work is a model capable of zero-shot learning, that is, of cloning the voices of speakers not seen during training. Achieving high-quality synthesis depends on the availability and quality of the dataset and on selecting an appropriate TTS model as the basis of the voice cloning model. This research develops a voice cloning model for the Indonesian language using YourTTS, aiming for high similarity and naturalness in the synthesized voice. YourTTS was chosen because it can be fine-tuned and supports multiple languages, which suits the limited Indonesian dataset. Two experiments were conducted, producing two voice synthesis models that differ in their spectrogram segment size, inference noise scale, length scale, and duration predictor noise scale. Both models were evaluated objectively with the speaker encoder cosine similarity (SECS) metric and subjectively with the mean opinion score (MOS) metric, covering the similarity and naturalness of the generated voices. In the objective evaluation, the two models achieved an average SECS of 0.8413 for seen speakers and 0.8603 for unseen speakers. In the subjective evaluation, the MOS for similarity was 3.62 for the first model and 3.49 for the second, and the MOS for naturalness was 3.16 for the first model and 3.29 for the second. Based on both evaluations, the developed models demonstrate sufficiently high synthesis quality in terms of voice similarity and naturalness.
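The two metrics named in the description can be illustrated with a minimal sketch. It is not the thesis's evaluation code: the record does not say which speaker encoder was used, so the embeddings below are toy vectors, and the function names (`cosine_similarity`, `average_secs`, `mean_opinion_score`) and the 1 to 5 rating scale are assumptions made for illustration.

```python
import math
from statistics import mean


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def average_secs(pairs):
    """Mean cosine similarity over (reference, synthesized) embedding pairs.

    In a real evaluation each embedding would come from a speaker encoder
    applied to an audio clip; here the caller supplies the vectors directly.
    """
    return mean([cosine_similarity(ref, syn) for ref, syn in pairs])


def mean_opinion_score(ratings, low=1, high=5):
    """Arithmetic mean of listener ratings (scale of 1 to 5 is assumed)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < low or r > high for r in ratings):
        raise ValueError("rating outside the assumed scale")
    return mean(ratings)


# Toy data, not results from the thesis.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
toy_ratings = [4, 3, 5, 4]

if __name__ == "__main__":
    print(f"average SECS: {average_secs(toy_pairs):.4f}")
    print(f"MOS: {mean_opinion_score(toy_ratings):.2f}")
```

SECS ranges from -1 to 1, with higher values meaning the synthesized voice is closer to the reference speaker in the encoder's embedding space. MOS is a plain average of listener ratings, so it is usually reported together with the number of raters.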