TEXT TO SPEECH SYSTEM FOR DYSPHONIA SPEECH DISORDER USING ADVERSARIAL NETWORKS ARCHITECTURE WITH VOICE CLONING APPROACH
Main Author: | |
---|---|
Format: | Theses |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/74896 |
Institution: | Institut Teknologi Bandung |
Summary: | Dysphonia is the second most common speech disorder in the United States and
can affect anyone. It causes difficulties in communicating and can decrease
speakers' overall quality of life. Moreover, speakers with dysphonia have
difficulty producing sound and experience fatigue and throat pain when speaking,
which affects their speech. Dysphonic speech quality can be improved through
surgery or therapy, but both are expensive. Therefore, an alternative solution is
needed to improve speech quality, one of which is a text-to-speech (TTS) system.
This study develops a TTS system for speakers with dysphonia that generates
synthetic speech and helps improve speech quality. The TTS system is built using an
adversarial network-based architecture called YourTTS, with a voice cloning
approach that generates synthetic speech with high voice similarity from a small
data sample. To overcome a weakness of the YourTTS model, namely the limited
intelligibility of its synthetic speech, this study proposes content text loss (CTL) as an
additional loss term to help improve speech intelligibility.
Evaluation was conducted subjectively and objectively to assess speaker
voice similarity, speech naturalness, and speech intelligibility. The results for voice
similarity are 3.59 for mean opinion score (MOS), 0.883 for cosine similarity, and
2.910 for perceptual evaluation of speech quality (PESQ). The results for speech
naturalness are 3.37 for MOS and 3.136 for NISQA-TTS. Speech intelligibility,
assessed using semantically unpredictable sentences (SUS), reached
76.32% word accuracy and 63.12% sentence accuracy, with 3.136 for
NISQA-TTS. |
---|
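
The summary reports cosine similarity for voice similarity and word and sentence accuracy for the SUS intelligibility test, but does not describe how they were scored. The following is a minimal sketch of how such metrics are commonly computed, not the thesis's own procedure. It assumes that speaker embeddings are plain numeric vectors, that word accuracy is the fraction of reference words matched at the same position in the listener's transcript, and that sentence accuracy is the fraction of transcripts matching the reference exactly. The function names `cosine_similarity` and `sus_scores` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def sus_scores(pairs):
    """Return (word_accuracy, sentence_accuracy) for (reference, transcript) pairs.

    Assumption: words are compared position by position after lowercasing
    and whitespace splitting; no alignment or punctuation handling is done.
    """
    if not pairs:
        raise ValueError("at least one (reference, transcript) pair is required")
    total_words = 0
    correct_words = 0
    exact_sentences = 0
    for reference, transcript in pairs:
        ref = reference.lower().split()
        hyp = transcript.lower().split()
        total_words += len(ref)
        correct_words += sum(1 for r, h in zip(ref, hyp) if r == h)
        if ref == hyp:
            exact_sentences += 1
    if total_words == 0:
        raise ValueError("references contain no words")
    return correct_words / total_words, exact_sentences / len(pairs)
```

In practice the embeddings would come from a speaker encoder applied to the original and synthesized recordings, and the transcripts from listeners in the SUS test; both are outside the scope of this sketch.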