TEXT TO SPEECH SYSTEM FOR DYSPHONIA SPEECH DISORDER USING ADVERSARIAL NETWORKS ARCHITECTURE WITH VOICE CLONING APPROACH
Dysphonia, which can affect anyone, is the second most common speech disorder in the United States. It makes communication difficult and can reduce a speaker's overall quality of life. Moreover, speakers with dysphonia have difficulty producing sound and experience fatigue and throa...
Main Author: | Ulhaq Dewangga, Dhiya |
---|---|
Format: | Theses |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/74896 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:74896 |
---|---|
spelling |
id-itb.:74896; 2023-07-24T11:44:07Z; keywords: speech disorder, dysphonia, text-to-speech, speech synthesis, voice cloning, adversarial network |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesian |
description |
Dysphonia, which can affect anyone, is the second most common speech disorder in
the United States. It makes communication difficult and can reduce a speaker's
overall quality of life. Moreover, speakers with dysphonia have difficulty
producing sound and experience fatigue and throat pain when speaking, which
affects their speech. Dysphonic speech quality can be improved through surgery
or therapy, but both are expensive. An alternative is therefore needed, and one
option is a text-to-speech (TTS) system.
This study develops a TTS system that generates synthesized speech for speakers
with dysphonia to help improve speech quality. The system is built on YourTTS,
an adversarial-network-based architecture, with a voice cloning approach that
produces synthesized speech with high voice similarity from a small data sample.
To address the main weakness of YourTTS, the limited intelligibility of its
synthesized speech, this study proposes content text loss (CTL) as an additional
loss term to improve speech intelligibility.
Evaluation was conducted both subjectively and objectively, covering speaker
voice similarity, speech naturalness, and speech intelligibility. For voice
similarity, the system scored 3.59 in mean opinion score (MOS), 0.883 in cosine
similarity, and 2.910 in perceptual evaluation of speech quality (PESQ). For
speech naturalness, it scored 3.37 in MOS and 3.136 in NISQA-TTS. For speech
intelligibility, assessed using semantically unpredictable sentences (SUS), it
achieved 76.32% word accuracy and 63.12% sentence accuracy, with a NISQA-TTS
score of 3.136. |
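Two of the objective measures named in the description, cosine similarity and SUS word/sentence accuracy, can be illustrated with a minimal sketch. This record does not state how the thesis computed either score, so everything below is an assumption: the speaker-embedding vectors are hypothetical placeholders, word accuracy is taken to be position-wise matching after lowercasing, and sentence accuracy is taken to be exact match of the whole sentence.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors.

    Assumption: in a voice-similarity setting, `a` and `b` would be speaker
    embeddings of the reference and the synthesized utterance.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def sus_accuracy(references, transcripts):
    """Return (word_accuracy, sentence_accuracy) over paired sentences.

    Assumption: a word counts as correct when the listener's transcript has
    the same word at the same position; a sentence counts as correct only
    when every word matches.
    """
    correct_words = total_words = correct_sentences = 0
    for ref, hyp in zip(references, transcripts):
        ref_words = ref.lower().split()
        hyp_words = hyp.lower().split()
        total_words += len(ref_words)
        correct_words += sum(r == h for r, h in zip(ref_words, hyp_words))
        correct_sentences += ref_words == hyp_words
    return correct_words / total_words, correct_sentences / len(references)


# Hypothetical example data, not taken from the thesis.
refs = ["the table walked through blue truth", "a cat sings"]
heard = ["the table walked through blue tooth", "a cat sings"]
word_acc, sent_acc = sus_accuracy(refs, heard)
```

Position-wise matching is the simplest choice but penalizes every later word after an inserted or dropped word; an alignment-based score (for example, one derived from edit distance) would be more forgiving.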
format |
Theses |
author |
Ulhaq Dewangga, Dhiya |
title |
TEXT TO SPEECH SYSTEM FOR DYSPHONIA SPEECH DISORDER USING ADVERSARIAL NETWORKS ARCHITECTURE WITH VOICE CLONING APPROACH |