EXPRESSIVE TEXT TO SPEECH SYSTEM TO READ INDONESIA NOVEL BASED ON DEEP NEURAL NETWORK USING GLOBAL STYLE TOKEN AND TACOTRON 2
This research aims to build an expressive text-to-speech (TTS) system for the Indonesian language. The system uses Tacotron 2 with Global Style Tokens (GST) as an additional feature and Parallel WaveGAN as the vocoder. Linguistic features are extracted from input text using Taco...
Main Author: | Azhar Dhiaulhaq, Moch. |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/56159 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:56159 |
---|---|
spelling |
id-itb.:56159; 2021-06-21T13:54:25Z; EXPRESSIVE TEXT TO SPEECH SYSTEM TO READ INDONESIA NOVEL BASED ON DEEP NEURAL NETWORK USING GLOBAL STYLE TOKEN AND TACOTRON 2; Azhar Dhiaulhaq, Moch.; Indonesia; Final Project; Text to Speech System, Tacotron 2, Parallel WaveGAN, Global Style Token, MOS; INSTITUT TEKNOLOGI BANDUNG; https://digilib.itb.ac.id/gdl/view/56159; text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
This research aims to build an expressive text-to-speech (TTS) system for the
Indonesian language. The system uses Tacotron 2 with Global Style Tokens (GST) as an
additional feature and Parallel WaveGAN as the vocoder. Tacotron 2 extracts linguistic
features from the input text. These features are combined with the GST, which act as an
emotion representation extracted from reference audio. The Tacotron 2 decoder processes
the combined features to produce a spectrogram, which Parallel WaveGAN then converts
into expressive output audio. Both the GST + Tacotron 2 model and the Parallel WaveGAN
model are trained on the same expressive corpus. The corpus consists of 11,482
text-audio pairs with a total duration of 21 hours 57 minutes and covers angry, happy,
sad, and neutral emotions.
The GST + Tacotron 2 model is compared with a baseline model: a Tacotron 2 architecture
without Global Style Tokens, also paired with Parallel WaveGAN as the vocoder. Both
models are evaluated using Mean Opinion Score (MOS) and A/B testing. The GST + Tacotron 2
model achieves a MOS of 3.90 ± 0.07, higher than the baseline model's 3.33 ± 0.10.
In the A/B test, most respondents preferred the GST + Tacotron 2 model (65.93%) over
the baseline model (34.07%).
|
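The evaluation figures in the description (a MOS reported as mean ± margin, and A/B preference percentages) can be illustrated with a minimal Python sketch. This is not the author's evaluation script. It assumes the ± value is a 95% confidence half-width under a normal approximation, which the record does not state. The function names `mos_with_interval` and `preference_share` and the sample ratings are hypothetical and chosen only for illustration.

```python
import math
import statistics


def mos_with_interval(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-5 opinion scores.

    Assumption: the margin is z * sample_std / sqrt(n), i.e. a normal
    approximation of a 95% confidence interval. The catalog record does
    not say how its "±" values were computed.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def preference_share(votes_a, votes_b):
    """Return the percentage of A/B votes won by each system."""
    total = votes_a + votes_b
    if total == 0:
        raise ValueError("no votes recorded")
    return 100.0 * votes_a / total, 100.0 * votes_b / total


# Hypothetical listener ratings, not data from the study.
ratings = [4, 4, 3, 5, 4, 3, 4, 5]
mean, half_width = mos_with_interval(ratings)
print(f"MOS = {mean:.2f} ± {half_width:.2f}")

# Hypothetical vote counts, not data from the study.
share_a, share_b = preference_share(3, 1)
print(f"A: {share_a:.2f}%  B: {share_b:.2f}%")
```

Because the two preference shares always sum to 100%, reporting one of them is sufficient; the record reports both (65.93% and 34.07%).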
format |
Final Project |
author |
Azhar Dhiaulhaq, Moch. |
spellingShingle |
Azhar Dhiaulhaq, Moch. EXPRESSIVE TEXT TO SPEECH SYSTEM TO READ INDONESIA NOVEL BASED ON DEEP NEURAL NETWORK USING GLOBAL STYLE TOKEN AND TACOTRON 2 |
author_facet |
Azhar Dhiaulhaq, Moch. |
author_sort |
Azhar Dhiaulhaq, Moch. |
title |
EXPRESSIVE TEXT TO SPEECH SYSTEM TO READ INDONESIA NOVEL BASED ON DEEP NEURAL NETWORK USING GLOBAL STYLE TOKEN AND TACOTRON 2 |
title_short |
EXPRESSIVE TEXT TO SPEECH SYSTEM TO READ INDONESIA NOVEL BASED ON DEEP NEURAL NETWORK USING GLOBAL STYLE TOKEN AND TACOTRON 2 |
title_full |
EXPRESSIVE TEXT TO SPEECH SYSTEM TO READ INDONESIA NOVEL BASED ON DEEP NEURAL NETWORK USING GLOBAL STYLE TOKEN AND TACOTRON 2 |
title_fullStr |
EXPRESSIVE TEXT TO SPEECH SYSTEM TO READ INDONESIA NOVEL BASED ON DEEP NEURAL NETWORK USING GLOBAL STYLE TOKEN AND TACOTRON 2 |
title_full_unstemmed |
EXPRESSIVE TEXT TO SPEECH SYSTEM TO READ INDONESIA NOVEL BASED ON DEEP NEURAL NETWORK USING GLOBAL STYLE TOKEN AND TACOTRON 2 |
title_sort |
expressive text to speech system to read indonesia novel based on deep neural network using global style token and tacotron 2 |
url |
https://digilib.itb.ac.id/gdl/view/56159 |
_version_ |
1822002279077642240 |