Speech recognition and synthesis


Bibliographic Details
Main Author: Kang, Yi Da
Other Authors: Tan, Yap Peng
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access:https://hdl.handle.net/10356/167681
Description
Summary: Recent advances in text-to-speech have been awe-inspiring, to the point of synthesizing near-human speech. To achieve this, deep neural networks are trained on many sound clips of a single speaker. However, traditional text-to-speech systems require an entirely new dataset, and retraining of the model, to produce the voice of a new speaker. Using a recently developed three-stage system, trained models can clone speakers' voices that were unseen during training: an encoder captures the critical features of a speaker from a short reference clip. Researchers have previously developed models with such capabilities. This project intends to build on that work with newer synthesiser and vocoder implementations to reduce training time and improve naturalness. This paper delves into two such methods and analyses the different models that can be used in such a system.
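
To make the three-stage idea concrete, the following is a minimal PyTorch sketch of the encoder-synthesiser-vocoder pipeline. It is illustrative only, not the project's implementation: every module name, layer size, and the toy transposed-convolution vocoder are assumptions, and real systems use far larger trained networks (e.g. a Tacotron-style synthesiser and a WaveRNN- or GAN-based vocoder).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeakerEncoder(nn.Module):
        """Stage 1: distil a short reference clip into a fixed-size embedding."""
        def __init__(self, n_mels=40, emb_dim=256):
            super().__init__()
            self.lstm = nn.LSTM(n_mels, emb_dim, num_layers=3, batch_first=True)

        def forward(self, ref_mels):                        # (B, T_ref, n_mels)
            _, (h, _) = self.lstm(ref_mels)
            return F.normalize(h[-1], dim=1)                # (B, emb_dim), unit norm

    class Synthesiser(nn.Module):
        """Stage 2: map text, conditioned on the speaker embedding, to a mel spectrogram."""
        def __init__(self, vocab=128, emb_dim=256, n_mels=80):
            super().__init__()
            self.text_emb = nn.Embedding(vocab, emb_dim)
            self.rnn = nn.GRU(emb_dim * 2, 512, batch_first=True)
            self.to_mel = nn.Linear(512, n_mels)

        def forward(self, text_ids, spk_emb):               # (B, T_txt), (B, emb_dim)
            x = self.text_emb(text_ids)                     # (B, T_txt, emb_dim)
            spk = spk_emb.unsqueeze(1).expand(-1, x.size(1), -1)
            out, _ = self.rnn(torch.cat([x, spk], dim=-1))  # condition on speaker
            return self.to_mel(out)                         # (B, T_txt, n_mels)

    class Vocoder(nn.Module):
        """Stage 3: invert the mel spectrogram to a waveform (a crude upsampler
        standing in for a real neural vocoder)."""
        def __init__(self, n_mels=80, hop=256):
            super().__init__()
            self.upsample = nn.ConvTranspose1d(n_mels, 1, kernel_size=hop, stride=hop)

        def forward(self, mels):                            # (B, T, n_mels)
            return self.upsample(mels.transpose(1, 2)).squeeze(1)  # (B, T * hop)

    # Cloning an unseen voice needs no retraining: only a short reference clip.
    encoder, synth, vocoder = SpeakerEncoder(), Synthesiser(), Vocoder()
    ref_clip = torch.randn(1, 100, 40)      # ~1 s of mel frames from a new speaker
    text = torch.randint(0, 128, (1, 50))   # token IDs for the target sentence
    wav = vocoder(synth(text, encoder(ref_clip)))
    print(wav.shape)                        # torch.Size([1, 12800])

The property the sketch highlights is the one the abstract describes: once the three stages are trained, cloning a new voice requires only a short clip passed through the encoder, not a new dataset or a retrained model.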