Speech recognition and synthesis
Main Author:
Other Authors:
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2023
Subjects:
Online Access: https://hdl.handle.net/10356/167681
Institution: Nanyang Technological University
Summary: The recent advances in text-to-speech have been awe-inspiring, to the point of synthesizing near-human speech. To achieve this, deep neural networks are trained on many sound clips of a single speaker. However, traditional text-to-speech systems require an entirely new dataset, and full retraining of the model, to produce the voice of a new speaker.
Using a recently developed three-stage system, a trained model can clone voices unseen during training. An encoder encapsulates the critical features of a speaker's voice from a short reference clip. Researchers have previously developed models with such capabilities; this project builds on that work with newer synthesiser and vocoder implementations to reduce training time and improve naturalness.
This paper delves into two such methods and analyses the different models that can be used in such a system.
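The three-stage split described in the summary can be sketched as follows. This is a minimal, illustrative PyTorch skeleton of the encoder / synthesiser / vocoder pipeline popularised by SV2TTS-style voice cloning; all layer sizes, module names, and dummy tensor shapes below are assumptions for illustration, not the project's actual models.

```python
# A minimal sketch of a three-stage cloning pipeline:
# speaker encoder -> synthesiser -> vocoder.
# Shapes and sizes are illustrative assumptions only.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Stage 1: maps the mel spectrogram of a short reference clip
    to a fixed-size speaker embedding."""
    def __init__(self, n_mels=40, hidden=256, embed_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, mels):                        # (batch, frames, n_mels)
        _, (h, _) = self.lstm(mels)
        emb = self.proj(h[-1])                      # final layer's hidden state
        return emb / emb.norm(dim=1, keepdim=True)  # L2-normalised embedding

class Synthesizer(nn.Module):
    """Stage 2: generates mel frames from text features, conditioned on
    the speaker embedding. A real synthesiser would use attention
    (e.g. Tacotron-style); this stub just concatenates the embedding
    onto every text frame."""
    def __init__(self, text_dim=128, embed_dim=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(text_dim + embed_dim, 512, batch_first=True)
        self.out = nn.Linear(512, n_mels)

    def forward(self, text_feats, spk_emb):         # (batch, T, text_dim)
        cond = spk_emb.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        h, _ = self.rnn(torch.cat([text_feats, cond], dim=-1))
        return self.out(h)                          # (batch, T, n_mels)

class Vocoder(nn.Module):
    """Stage 3: upsamples mel frames to a waveform. Real vocoders
    (WaveRNN, HiFi-GAN, ...) are far more elaborate."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.net = nn.Linear(n_mels, hop)

    def forward(self, mels):                        # (batch, T, n_mels)
        return torch.tanh(self.net(mels)).flatten(1)  # (batch, T * hop)

# Wiring the stages together on dummy data:
ref_mels   = torch.randn(1, 120, 40)    # short reference clip
text_feats = torch.randn(1, 200, 128)   # encoded input text
emb   = SpeakerEncoder()(ref_mels)
mel   = Synthesizer()(text_feats, emb)
audio = Vocoder()(mel)
print(emb.shape, mel.shape, audio.shape)
```

The key design point is that only the encoder ever sees the new speaker: once it produces an embedding from a few seconds of reference audio, the synthesiser and vocoder run unchanged, which is why no retraining is needed to clone an unseen voice.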