DEVELOPMENT OF AN INDONESIAN VOICE CLONING MODEL USING YOURTTS
Saved in:
Main Author:
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/85056
Institution: Institut Teknologi Bandung
Summary: Voice cloning is a speech synthesis method that uses data efficiently to reproduce the voice characteristics of many different speakers. It was developed to address a limitation of conventional text-to-speech (TTS) models, which typically generate only a single voice. The focus of this development is to build a model capable of zero-shot learning, i.e., cloning speakers not seen during training. Achieving high-quality voice synthesis requires careful attention to the availability and quality of the dataset, as well as the selection of an appropriate TTS model for constructing the voice cloning system.
This research focuses on developing a voice cloning model for Indonesian by fine-tuning the YourTTS model to achieve high similarity and naturalness in the synthesized voice. YourTTS was chosen because it supports fine-tuning and multiple languages, which suits the limited size of the available Indonesian dataset. Two experiments were conducted, producing two voice synthesis models that differ in spectrogram segment size, inference noise scale, length scale, and the noise scale of the duration predictor. Both models were evaluated objectively using the speaker encoder cosine similarity (SECS) metric and subjectively using the mean opinion score (MOS) metric, assessing the similarity and naturalness of the generated voices.
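SECS, as described above, is the cosine similarity between the speaker embedding of a reference recording and that of the synthesized audio, both extracted by a speaker encoder. A minimal sketch of the metric itself (the function name `secs`, the embedding dimension, and the toy vectors are illustrative, not taken from the thesis):

```python
import numpy as np

def secs(embedding_a: np.ndarray, embedding_b: np.ndarray) -> float:
    """Speaker Encoder Cosine Similarity (SECS) between two speaker
    embeddings, e.g. produced by a pretrained speaker encoder.
    Returns a value in [-1, 1]; higher means more similar voices."""
    a = embedding_a / np.linalg.norm(embedding_a)
    b = embedding_b / np.linalg.norm(embedding_b)
    return float(np.dot(a, b))

# Toy example: an embedding compared with itself is maximally similar.
ref = np.array([0.3, -1.2, 0.8])
print(secs(ref, ref))  # → 1.0 (up to floating-point error)
```

In practice the embeddings would come from the speaker encoder bundled with the TTS system, and the per-speaker SECS scores would be averaged over many synthesized utterances.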
The objective evaluation results of the two developed models showed an average
SECS score of 0.8413 for seen speakers and 0.8603 for unseen speakers. The
subjective evaluation results indicated a MOS similarity score of 3.62 for the first
model and 3.49 for the second model, as well as a MOS naturalness score of 3.16
for the first model and 3.29 for the second model. Based on the subjective and
objective evaluations, the developed models have demonstrated sufficiently high
synthesis quality in terms of both voice similarity and naturalness.
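The MOS figures reported above are arithmetic means of listener ratings on the usual 1 to 5 opinion scale. A small sketch of that computation (the function name and the example ratings are illustrative, not the study's data):

```python
def mean_opinion_score(ratings: list[int]) -> float:
    """Mean Opinion Score: the average of listener ratings,
    where each rating is an integer on the 1-5 opinion scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    return sum(ratings) / len(ratings)

# Hypothetical ratings from five listeners for one synthesized sample
print(mean_opinion_score([4, 3, 4, 3, 4]))  # → 3.6
```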