DEVELOPMENT OF INDONESIAN VOICE CLONING TEXT-TO-SPEECH SYSTEM WITH VALL-E BASED MODEL

Bibliographic Details
Main Author: Raditya Pratama Roosadi, Hizkia
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/86054
Institution: Institut Teknologi Bandung
id id-itb.:86054
spelling id-itb.:86054 2024-09-13T08:27:09Z
keywords TTS, voice cloning, Vall-E, transformer, neural codec language model, speech enhancement
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Text-to-Speech (TTS) technology has continued to evolve in recent years, and many studies focus on multi-speaker TTS that clones a human voice by capturing the speaker's vocal characteristics. In 2023, Wang et al. proposed Vall-E, a voice cloning TTS approach based on a Transformer neural codec language model, which achieved state-of-the-art performance. No Indonesian TTS study has yet used a language model approach like Vall-E, and the speech synthesized by Vall-E based systems still has room for improvement. This thesis develops an Indonesian TTS system with the Vall-E model and enhances its synthesis output. The dataset of audio-transcript pairs is taken from previous research on Indonesian speech processing. During data preparation, the audio is converted into audio codec tokens and the transcripts into phoneme tokens. The neural codec language model is then trained following Wang et al. (2023) with the help of open-source tools (Li, 2023), and the system components are assembled to generate Indonesian speech. The thesis also adds a speech enhancement component implemented with the VoiceFixer tool (Liu et al., 2022). Speech enhancement with VoiceFixer raised the naturalness MOS (mean opinion score) from 3.34 to 3.95, showing that speech enhancement can improve the naturalness of synthesized speech. Overall, the TTS system achieved a naturalness MOS of 3.489 and a similarity MOS of 3.521. It also achieved a WER (word error rate) of 19.71%, and the similarity between speaker embedding vectors can be visualized. These results indicate that a TTS system using the Vall-E model can generate Indonesian speech with a good resemblance to the speaker. The evaluation also highlighted that the number of speakers, data selection, processing components, modeling, and speech duration during training are important in determining synthesis quality.
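The abstract reports a WER and a speaker embedding similarity without saying how either was computed. The sketch below shows one common way such metrics are defined, assuming a word-level edit distance for WER and cosine similarity between embedding vectors. The function names, the example sentences, and the toy vectors are illustrative assumptions and are not taken from the thesis.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embedding vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical transcript pair: the recognizer drops the last word.
    print(word_error_rate("saya suka makan nasi goreng", "saya suka makan nasi"))
    # Hypothetical 3-dimensional speaker embeddings (real ones are much larger).
    print(cosine_similarity([0.2, 0.5, 0.1], [0.25, 0.45, 0.05]))
```

In practice the hypothesis transcript would come from running a speech recognizer on the synthesized audio, and the embeddings from a speaker verification model applied to the reference and synthesized utterances. Which recognizer and embedding model the thesis used is not stated in this record.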