Everybody's talkin': let me talk as you want

We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic: it does not assume a person-specific rendering network, yet it is capable of translating one source audio into one randomly chosen video output within a set of speech videos.

Bibliographic Details
Main Authors: Song, Linsen; Wu, Wayne; Qian, Chen; He, Ran; Loy, Chen Change
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2022
Subjects: Engineering::Computer science and engineering; Talking Face Generation; Video Generation
Online Access:https://hdl.handle.net/10356/162986
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-162986
record_format dspace
spelling sg-ntu-dr.10356-162986 2022-11-14T07:56:42Z
Title: Everybody's talkin': let me talk as you want
Authors: Song, Linsen; Wu, Wayne; Qian, Chen; He, Ran; Loy, Chen Change
Affiliations: School of Computer Science and Engineering; S-Laboratory, NTU
Subjects: Engineering::Computer science and engineering; Talking Face Generation; Video Generation
Funding: This work was supported in part by the Beijing Natural Science Foundation under Grant JQ18017, in part by the National Natural Science Foundation of China under Grant U20A20223 and Grant 61721004, in part by the Youth Innovation Promotion Association, Chinese Academy of Sciences (CAS), under Grant Y201929, in part by the RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, and in part by the Cash and In-Kind Contribution from the Industry Partner(s).
Record dates: accessioned/available 2022-11-14T07:56:42Z; issued 2022
Type: Journal Article
Citation: Song, L., Wu, W., Qian, C., He, R. & Loy, C. C. (2022). Everybody's talkin': let me talk as you want. IEEE Transactions on Information Forensics and Security, 17, 585-598. https://dx.doi.org/10.1109/TIFS.2022.3146783
ISSN: 1556-6013
Handle: https://hdl.handle.net/10356/162986
DOI: 10.1109/TIFS.2022.3146783
Scopus EID: 2-s2.0-85123783224
Volume/Pages: 17, 585-598
Language: en
Journal: IEEE Transactions on Information Forensics and Security
Rights: © 2022 IEEE. All rights reserved.
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
Talking Face Generation
Video Generation
description We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic: it does not assume a person-specific rendering network, yet it is capable of translating one source audio into one randomly chosen video output within a set of speech videos. Instead of learning a highly heterogeneous and nonlinear mapping from audio to video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate the source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, thereby preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.
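
The description above outlines the pipeline only at a high level; the record contains no code or implementation details. Purely as an illustrative sketch, the audio-to-expression step it mentions (a recurrent network regressing per-frame expression parameters from audio features) could be organized as below; the module name, feature dimensions, and layer sizes are hypothetical placeholders, not the authors' architecture.

import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Sketch: map a sequence of per-frame audio features to 3DMM expression parameters."""
    def __init__(self, audio_dim=29, hidden_dim=256, expr_dim=64):
        super().__init__()
        # Recurrent encoder over the audio feature sequence (dimensions are illustrative).
        self.rnn = nn.LSTM(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        # Per-frame regression head producing expression coefficients.
        self.head = nn.Linear(hidden_dim, expr_dim)

    def forward(self, audio_feats):           # audio_feats: (batch, T, audio_dim)
        hidden, _ = self.rnn(audio_feats)     # (batch, T, hidden_dim)
        return self.head(hidden)              # (batch, T, expr_dim)

# Usage sketch: the predicted expression parameters would replace the target video's own
# expression track, while its geometry and pose parameters are kept unchanged.
expr = AudioToExpression()(torch.randn(1, 100, 29))   # 100 frames of audio features

The description also mentions a dynamic programming method for building a temporally coherent video, again without giving its formulation. As a generic illustration only, a Viterbi-style selection over K candidate renderings per frame, with hypothetical unary (per-frame) and pairwise (frame-to-frame) costs, could look like:

import numpy as np

def select_coherent_sequence(unary, pairwise):
    """unary: (T, K) per-frame candidate costs; pairwise: (K, K) transition costs.
    Returns the index of the chosen candidate for each of the T frames."""
    T, K = unary.shape
    cost = unary[0].copy()                    # best cost of each candidate at frame 0
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # total[j, k]: best cost of ending frame t-1 at j, then choosing k at frame t
        total = cost[:, None] + pairwise + unary[t][None, :]
        backptr[t] = total.argmin(axis=0)
        cost = total.min(axis=0)
    # Backtrack the minimum-cost path from the best final candidate.
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
print(select_coherent_sequence(rng.random((50, 5)), rng.random((5, 5))))
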
author2 School of Computer Science and Engineering
format Article
author Song, Linsen
Wu, Wayne
Qian, Chen
He, Ran
Loy, Chen Change
title Everybody's talkin': let me talk as you want
publishDate 2022
url https://hdl.handle.net/10356/162986
_version_ 1751548511226167296