DEVELOPMENT OF AN INDONESIAN VOICE CLONING MODEL USING YOURTTS

Bibliographic Details
Main Author: Syauqy Irsyad, Azka
Format: Final Project
Language: Indonesian
Online Access:https://digilib.itb.ac.id/gdl/view/85056
Institution: Institut Teknologi Bandung
Description
Summary: Voice cloning is a speech synthesis technique that uses data efficiently to reproduce the voice characteristics of many different speakers. It was developed to address a limitation of conventional text-to-speech (TTS) models, which typically generate only a single voice. The goal of this work is to build a model capable of zero-shot voice cloning. Achieving high-quality synthesis depends on the availability and quality of the dataset used, as well as the choice of an appropriate TTS model for constructing the voice cloning system.

This research develops a voice cloning model for the Indonesian language using the YourTTS model, aiming for high similarity and naturalness in the synthesized voice. YourTTS was chosen because it supports fine-tuning and multiple languages, which suits the limited Indonesian dataset available. Two experiments were conducted, producing two voice synthesis models that differ in their spectrogram segment size, inference noise scale, length scale, and the noise scale of the duration predictor.

Both models were evaluated objectively using the speaker encoder cosine similarity (SECS) metric and subjectively using the mean opinion score (MOS) metric, assessing the similarity and naturalness of the generated voices. The objective evaluation of the two models showed an average SECS of 0.8413 for seen speakers and 0.8603 for unseen speakers. The subjective evaluation yielded a MOS similarity score of 3.62 for the first model and 3.49 for the second, and a MOS naturalness score of 3.16 for the first model and 3.29 for the second. Based on the subjective and objective evaluations, the developed models demonstrate reasonably high synthesis quality in terms of both voice similarity and naturalness.
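The SECS metric reported above is the cosine similarity between the speaker embeddings of a reference recording and a synthesized utterance. A minimal sketch of that computation is shown below; the `secs` function name and the toy 4-dimensional vectors are illustrative assumptions, since in practice the embeddings would come from a trained speaker encoder (such as the one used with YourTTS) and have far higher dimensionality.

```python
import numpy as np

def secs(emb_ref: np.ndarray, emb_syn: np.ndarray) -> float:
    """Speaker Encoder Cosine Similarity between the speaker embedding
    of a reference utterance and that of a synthesized one.
    Returns a value in [-1, 1]; higher means more similar speakers."""
    return float(
        np.dot(emb_ref, emb_syn)
        / (np.linalg.norm(emb_ref) * np.linalg.norm(emb_syn))
    )

# Toy example with hypothetical low-dimensional embeddings.
ref = np.array([0.2, 0.9, 0.1, 0.4])   # embedding of the reference speaker
syn = np.array([0.25, 0.85, 0.15, 0.35])  # embedding of the cloned voice
print(round(secs(ref, syn), 4))
```

Averaging this score over many reference/synthesis pairs for seen and unseen speakers gives aggregate figures like the 0.8413 and 0.8603 reported in the abstract.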