Everybody's talkin': let me talk as you want

We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic: it does not assume a person-specific rendering network, yet it is capable of translating one source audio into one randomly chosen video output within a set of speech videos.

Bibliographic Details
Main Authors: Song, Linsen; Wu, Wayne; Qian, Chen; He, Ran; Loy, Chen Change
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2022
Subjects: Engineering::Computer science and engineering; Talking Face Generation; Video Generation
Online Access:https://hdl.handle.net/10356/162986
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-162986
record_format dspace
spelling sg-ntu-dr.10356-162986 2022-11-14T07:56:42Z
Title: Everybody's talkin': let me talk as you want
Authors: Song, Linsen; Wu, Wayne; Qian, Chen; He, Ran; Loy, Chen Change
Affiliations: School of Computer Science and Engineering; S-Laboratory, NTU
Subjects: Engineering::Computer science and engineering; Talking Face Generation; Video Generation
Funding: This work was supported in part by the Beijing Natural Science Foundation under Grant JQ18017, in part by the National Natural Science Foundation of China under Grant U20A20223 and Grant 61721004, in part by the Youth Innovation Promotion Association, Chinese Academy of Sciences (CAS), under Grant Y201929, in part by the RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, and in part by the Cash and In-Kind Contribution from the Industry Partner(s).
Record dates: accessioned/available 2022-11-14T07:56:42Z; issued 2022
Type: Journal Article
Citation: Song, L., Wu, W., Qian, C., He, R. & Loy, C. C. (2022). Everybody's talkin': let me talk as you want. IEEE Transactions on Information Forensics and Security, 17, 585-598. https://dx.doi.org/10.1109/TIFS.2022.3146783
ISSN: 1556-6013
Handle: https://hdl.handle.net/10356/162986
DOI: 10.1109/TIFS.2022.3146783
Scopus EID: 2-s2.0-85123783224
Volume/Pages: 17, 585-598
Language: en
Journal: IEEE Transactions on Information Forensics and Security
Rights: © 2022 IEEE. All rights reserved.
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
Talking Face Generation
Video Generation
description We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic: it does not assume a person-specific rendering network, yet it is capable of translating one source audio into one randomly chosen video output within a set of speech videos. Instead of learning a highly heterogeneous and nonlinear mapping from audio to video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate the source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, thereby preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.
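
The description above outlines the pipeline only at a high level; the record contains no code or implementation details. Purely as an illustrative sketch, the audio-to-expression step it mentions (a recurrent network regressing per-frame expression parameters from audio features) could be organized as below; the module name, feature dimensions, and layer sizes are hypothetical placeholders, not the authors' architecture.

import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Sketch: map a sequence of per-frame audio features to 3DMM expression parameters."""
    def __init__(self, audio_dim=29, hidden_dim=256, expr_dim=64):
        super().__init__()
        # Recurrent encoder over the audio feature sequence (dimensions are illustrative).
        self.rnn = nn.LSTM(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        # Per-frame regression head producing expression coefficients.
        self.head = nn.Linear(hidden_dim, expr_dim)

    def forward(self, audio_feats):           # audio_feats: (batch, T, audio_dim)
        hidden, _ = self.rnn(audio_feats)     # (batch, T, hidden_dim)
        return self.head(hidden)              # (batch, T, expr_dim)

# Usage sketch: the predicted expression parameters would replace the target video's own
# expression track, while its geometry and pose parameters are kept unchanged.
expr = AudioToExpression()(torch.randn(1, 100, 29))   # 100 frames of audio features

The description also mentions a dynamic programming method for building a temporally coherent video, again without giving its formulation. As a generic illustration only, a Viterbi-style selection over K candidate renderings per frame, with hypothetical unary (per-frame) and pairwise (frame-to-frame) costs, could look like:

import numpy as np

def select_coherent_sequence(unary, pairwise):
    """unary: (T, K) per-frame candidate costs; pairwise: (K, K) transition costs.
    Returns the index of the chosen candidate for each of the T frames."""
    T, K = unary.shape
    cost = unary[0].copy()                    # best cost of each candidate at frame 0
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # total[j, k]: best cost of ending frame t-1 at j, then choosing k at frame t
        total = cost[:, None] + pairwise + unary[t][None, :]
        backptr[t] = total.argmin(axis=0)
        cost = total.min(axis=0)
    # Backtrack the minimum-cost path from the best final candidate.
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
print(select_coherent_sequence(rng.random((50, 5)), rng.random((5, 5))))
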
author2 School of Computer Science and Engineering
format Article
author Song, Linsen
Wu, Wayne
Qian, Chen
He, Ran
Loy, Chen Change
title Everybody's talkin': let me talk as you want
publishDate 2022
url https://hdl.handle.net/10356/162986
_version_ 1751548511226167296