DEVELOPMENT OF TEXT-TO-SPEECH SYSTEM FOR AN INDONESIAN SMART SPEAKER

Bibliographic Details
Main Author: David Partogi, Ignatius
Format: Final Project
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/72116
Institution: Institut Teknologi Bandung
Description
Summary: Smart speakers are generally operated in English, even though English proficiency among Indonesians is generally low. A smart speaker has three components: Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS). An End-to-End (E2E) TTS system processes input text directly and generates audio from it. An E2E TTS system has two parts: a spectrogram generator and a vocoder. The TTS system in this research was built using Tacotron 2, regarded as the state of the art in TTS, as the spectrogram generator and Parallel WaveGAN as the vocoder. The dataset consists of 3000 pairs of audio clips and their text transcriptions, sourced from an audiobook of Indonesian-language school and college books, with a total duration of 9 hours, 22 minutes, and 30 seconds. Mean Opinion Score (MOS) testing of the system yielded a score of 3.24 ± 0.29, while Semantically Unpredictable Sentence (SUS) testing yielded an accuracy of (91.82 ± 7.63)%.
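
The dataset figures and the "mean ± margin" form of the reported scores can be illustrated with a short Python sketch. It is not taken from the thesis. The function names are hypothetical, the listener ratings are invented sample values, and the summary does not say whether the ± margin is a standard deviation or a confidence interval. The sketch assumes a 95% confidence interval under a normal approximation.

```python
import math
import statistics


def average_clip_seconds(hours, minutes, seconds, n_clips):
    """Average clip length in seconds for a dataset of n_clips recordings."""
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return total_seconds / n_clips


def mean_with_margin(scores, z=1.96):
    """Return (mean, half-width) of a normal-approximation confidence interval.

    ASSUMPTION: z = 1.96 corresponds to a 95% interval; the summary does not
    state how its own +/- margins were computed.
    """
    mean = statistics.fmean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


# Dataset figures from the summary: 3000 clips, 9 h 22 min 30 s in total.
print(average_clip_seconds(9, 22, 30, 3000))  # 11.25 seconds per clip

# HYPOTHETICAL listener ratings on a 1-5 scale, not the thesis data.
ratings = [3, 4, 3, 2, 4, 3, 4, 3]
mean, margin = mean_with_margin(ratings)
print(f"MOS = {mean:.2f} ± {margin:.2f}")
```

With the figures from the summary, the average clip is 11.25 seconds long. The MOS line only shows the shape of the calculation and does not reproduce the reported 3.24 ± 0.29.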