TEXT TO SPEECH SYSTEM FOR DYSPHONIA SPEECH DISORDER USING ADVERSARIAL NETWORKS ARCHITECTURE WITH VOICE CLONING APPROACH

Bibliographic Details
Main Author: Ulhaq Dewangga, Dhiya
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/74896
Institution: Institut Teknologi Bandung
id id-itb.:74896
spelling id-itb.:74896 2023-07-24T11:44:07Z
keywords speech disorder, dysphonia, text-to-speech, speech synthesis, voice cloning, adversarial network
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Dysphonia is the second most common speech disorder in the United States and can affect anyone. It makes communication difficult and can reduce a speaker's overall quality of life. Speakers with dysphonia have difficulty producing sound and experience fatigue and throat pain when speaking, which degrades their speech. Surgery and therapy can improve speech quality for speakers with dysphonia, but both are expensive. An alternative is therefore needed, and a text-to-speech (TTS) system is one option. This study develops a TTS system that generates synthesized speech for speakers with dysphonia to help improve their speech quality. The system is built on YourTTS, an adversarial network-based architecture, and uses a voice cloning approach to generate synthesized speech with high voice similarity from a small data sample. To address a weakness of the YourTTS model, namely the limited intelligibility of its synthesized speech, this study proposes content text loss (CTL) as an additional loss term to improve speech intelligibility. Subjective and objective evaluations covered speaker voice similarity, speech naturalness, and speech intelligibility. For voice similarity, the system scored 3.59 on the mean opinion score (MOS), 0.883 on cosine similarity, and 2.910 on perceptual evaluation of speech quality (PESQ). For speech naturalness, it scored 3.37 on MOS and 3.136 on NISQA-TTS. For speech intelligibility, assessed with semantically unpredictable sentences (SUS), it achieved 76.32% word accuracy and 63.12% sentence accuracy, along with 3.136 on NISQA-TTS.
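The objective metrics named in the description can be illustrated with a minimal Python sketch. This record does not give the thesis's exact definitions, so the following are assumptions: speaker embeddings are plain lists of floats, word accuracy is taken as 1 minus the word error rate over word-level edit distance, and sentence accuracy is the share of sentences transcribed exactly. All function names are hypothetical and are not taken from the thesis or from YourTTS.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def _words(sentence):
    # Assumption: case-insensitive comparison on whitespace-separated words.
    return sentence.lower().split()


def _edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_accuracy(references, transcripts):
    """Assumed definition: 1 - (total word errors / total reference words)."""
    total = sum(len(_words(r)) for r in references)
    errors = sum(
        _edit_distance(_words(r), _words(t))
        for r, t in zip(references, transcripts)
    )
    return max(0.0, 1.0 - errors / total)


def sentence_accuracy(references, transcripts):
    """Assumed definition: share of sentences transcribed with no word errors."""
    exact = sum(
        1 for r, t in zip(references, transcripts) if _words(r) == _words(t)
    )
    return exact / len(references)
```

In a listening test with semantically unpredictable sentences, `references` would hold the prompted sentences and `transcripts` what listeners wrote down; because the sentences carry no predictable meaning, listeners cannot guess missing words from context.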
_version_ 1822007523134144512