EXPRESSIVE TEXT TO SPEECH SYSTEM TO READ INDONESIA NOVEL BASED ON DEEP NEURAL NETWORK USING GLOBAL STYLE TOKEN AND TACOTRON 2

Bibliographic Details
Main Author: Azhar Dhiaulhaq, Moch.
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/56159
Institution: Institut Teknologi Bandung
id id-itb.:56159
spelling id-itb.:56159; 2021-06-21T13:54:25Z; keywords: Text to Speech System, Tacotron 2, Parallel WaveGAN, Global Style Token, MOS
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
description This research aims to build an expressive text-to-speech (TTS) system for the Indonesian language. The system uses Tacotron 2 with Global Style Tokens (GST) as an additional feature and Parallel WaveGAN as the vocoder. Tacotron 2 extracts linguistic features from the input text. These features are combined with the GST, which act as an emotion representation extracted from reference audio. The Tacotron 2 decoder processes the combined features to produce a spectrogram, which Parallel WaveGAN then converts into expressive output audio. Both the GST + Tacotron 2 model and the Parallel WaveGAN model are trained on the same expressive corpus, which consists of 11,482 text-audio pairs with a total duration of 21 hours 57 minutes and covers angry, happy, sad, and neutral emotions. The GST + Tacotron 2 model is compared with a baseline model: the Tacotron 2 architecture alone, without Global Style Tokens, combined with Parallel WaveGAN as the vocoder. Both models are evaluated using Mean Opinion Score (MOS) and AB testing. The GST + Tacotron 2 model achieves a MOS of 3.90 ± 0.07, higher than the baseline model's 3.33 ± 0.10. In the AB test, most respondents preferred the GST + Tacotron 2 model (65.93%) over the baseline model (34.07%).
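The description says the GST features are "combined" with the linguistic features before decoding, but it does not say how. The sketch below illustrates one common arrangement and is not the author's implementation. It assumes single-head scaled dot-product attention over a bank of style tokens, with the resulting style vector concatenated to every encoder time step. All names (`style_embedding`, `condition_encoder`, the dimensions) are hypothetical, and random numbers stand in for trained parameters and for the reference-audio embedding.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_embedding(query, tokens):
    """Weight each style token by its scaled dot-product similarity to the query.

    `query` stands in for an embedding of the reference audio (assumed to be
    already projected to the token dimension).
    """
    keys = [[math.tanh(v) for v in token] for token in tokens]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    style = [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]
    return style, weights


def condition_encoder(encoder_out, style):
    """Append the same style vector to every encoder time step (assumed combination)."""
    return [frame + style for frame in encoder_out]


random.seed(0)
NUM_TOKENS, TOKEN_DIM, ENC_DIM, TEXT_LEN = 10, 8, 16, 5  # illustrative sizes only

# Random placeholders for the learned token bank, reference embedding, and encoder output.
tokens = [[random.gauss(0, 1) for _ in range(TOKEN_DIM)] for _ in range(NUM_TOKENS)]
query = [random.gauss(0, 1) for _ in range(TOKEN_DIM)]
encoder_out = [[random.gauss(0, 1) for _ in range(ENC_DIM)] for _ in range(TEXT_LEN)]

style, weights = style_embedding(query, tokens)
decoder_in = condition_encoder(encoder_out, style)
print(len(decoder_in), len(decoder_in[0]))  # 5 24
```

In a full system, the decoder would consume `decoder_in` to predict a spectrogram, and a vocoder such as Parallel WaveGAN would turn that spectrogram into audio. Changing the reference audio changes `weights`, which is what lets the same text be rendered with a different emotion.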