TEXT TO SPEECH SYSTEM FOR DYSPHONIA SPEECH DISORDER USING ADVERSARIAL NETWORKS ARCHITECTURE WITH VOICE CLONING APPROACH

Dysphonia, the second most common speech disorder in the United States, can affect anyone. It makes communication difficult and can reduce a speaker's overall quality of life. Moreover, speakers with dysphonia have difficulty producing sound and experience fatigue and throa...

Bibliographic Details
Main Author:	Ulhaq Dewangga, Dhiya
Format:	Theses
Language:	Indonesia
Online Access:	https://digilib.itb.ac.id/gdl/view/74896
Institution:	Institut Teknologi Bandung

id	id-itb.:74896
spelling	id-itb.:74896 2023-07-24T11:44:07Z TEXT TO SPEECH SYSTEM FOR DYSPHONIA SPEECH DISORDER USING ADVERSARIAL NETWORKS ARCHITECTURE WITH VOICE CLONING APPROACH Ulhaq Dewangga, Dhiya Indonesia Theses speech disorder, dysphonia, text-to-speech, speech synthesis, voice cloning, adversarial network. INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/74896 text
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	Dysphonia, the second most common speech disorder in the United States, can affect anyone. It makes communication difficult and can reduce a speaker's overall quality of life. Moreover, speakers with dysphonia have difficulty producing sound and experience fatigue and throat pain when speaking, which affects their speech. Dysphonic speech quality can be improved through surgery or therapy, but both are expensive. An alternative solution is therefore needed, one of which is a text-to-speech (TTS) system. This study develops a TTS system that generates synthesized speech for speakers with dysphonia to help improve speech quality. The system is built on YourTTS, an adversarial network-based architecture, with a voice cloning approach that generates synthesized speech with high voice similarity from a small data sample. To overcome the weakness of the YourTTS model, namely the limited intelligibility of its synthesized speech, this study proposes content text loss (CTL) as an additional loss term to improve speech intelligibility. Subjective and objective evaluations were conducted to assess speaker voice similarity, speech naturalness, and speech intelligibility. For voice similarity, the system scored 3.59 in mean opinion score (MOS), 0.883 in cosine similarity, and 2.910 in perceptual evaluation of speech quality (PESQ). For speech naturalness, it scored 3.37 in MOS and 3.136 in NISQA-TTS. For speech intelligibility, assessed using semantically unpredictable sentences (SUS), it achieved 76.32% word accuracy and 63.12% sentence accuracy, along with a NISQA-TTS score of 3.136.
format	Theses
author	Ulhaq Dewangga, Dhiya
title	TEXT TO SPEECH SYSTEM FOR DYSPHONIA SPEECH DISORDER USING ADVERSARIAL NETWORKS ARCHITECTURE WITH VOICE CLONING APPROACH
url	https://digilib.itb.ac.id/gdl/view/74896
_version_	1822007523134144512
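
The objective metrics named in the description (cosine similarity for voice similarity, word and sentence accuracy for the SUS intelligibility test) can be illustrated with a minimal Python sketch. This is not the thesis's evaluation code: the record does not say how the embeddings were extracted or how accuracy was scored, so the function names, the text normalization, and the edit-distance definition of word accuracy below are all assumptions made for illustration.

```python
import math
import re


def cosine_similarity(a, b):
    """Cosine of the angle between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def _words(sentence):
    # Assumed normalization: lowercase, keep letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", sentence.lower())


def _edit_distance(ref, hyp):
    # Word-level Levenshtein distance (substitutions, insertions, deletions).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def word_accuracy(references, transcripts):
    """Assumed definition: 1 - (word edits / reference words), floored at 0."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, transcripts):
        ref_words = _words(ref)
        errors += _edit_distance(ref_words, _words(hyp))
        total += len(ref_words)
    if total == 0:
        raise ValueError("no reference words to score")
    return max(0.0, 1.0 - errors / total)


def sentence_accuracy(references, transcripts):
    """Fraction of sentences transcribed with every word correct."""
    pairs = list(zip(references, transcripts))
    if not pairs:
        raise ValueError("no sentences to score")
    correct = sum(1 for ref, hyp in pairs if _words(ref) == _words(hyp))
    return correct / len(pairs)


# Hypothetical SUS-style sentences and listener transcripts.
references = ["The table walked through the blue truth.", "A green idea sleeps."]
transcripts = ["the table walked through the blue tooth", "a green idea sleeps"]
word_acc = word_accuracy(references, transcripts)
sent_acc = sentence_accuracy(references, transcripts)
```

In practice the two vectors passed to `cosine_similarity` would be speaker embeddings of the original and synthesized voice, and the transcripts would come from human listeners or a speech recognizer; both choices are left open here because the record does not specify them.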


Similar Items