TEXT TO SPEECH SYSTEM FOR DYSPHONIA SPEECH DISORDER USING ADVERSARIAL NETWORKS ARCHITECTURE WITH VOICE CLONING APPROACH

Bibliographic Details
Main Author: Ulhaq Dewangga, Dhiya
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/74896
Institution: Institut Teknologi Bandung
Description
Summary: Dysphonia, the second most common speech disorder in the United States, can affect anyone. It makes communication difficult and can reduce a speaker's overall quality of life. Speakers with dysphonia have difficulty producing sound and experience fatigue and throat pain when speaking, which in turn affects their speech. Dysphonic speech quality can be improved through surgery or therapy, but both are expensive. An alternative is therefore needed, and one option is a text-to-speech (TTS) system. This study develops a TTS system that generates synthesized speech for speakers with dysphonia to help improve speech quality. The system is built on YourTTS, an adversarial-network-based architecture, and uses a voice cloning approach to generate synthesized speech with high voice similarity from a small data sample. To address a weakness of the YourTTS model, namely the limited intelligibility of its synthesized speech, this study proposes content text loss (CTL) as an additional loss term to improve speech intelligibility. Subjective and objective evaluations were conducted to assess speaker voice similarity, speech naturalness, and speech intelligibility. For voice similarity, the results are a mean opinion score (MOS) of 3.59, a cosine similarity of 0.883, and a perceptual evaluation of speech quality (PESQ) score of 2.910. For speech naturalness, the results are a MOS of 3.37 and a NISQA-TTS score of 3.136. For speech intelligibility, assessed using semantically unpredictable sentences (SUS), the system achieves 76.32% word accuracy and 63.12% sentence accuracy, along with a NISQA-TTS score of 3.136.