Everybody's talkin': let me talk as you want
We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. The method is unique because it is highly dynamic: it does not assume a person-specific rendering network, yet it is capable of translating one source audio into a randomly chosen video output within a set of speech videos. Instead of learning a highly heterogeneous and nonlinear mapping from audio to video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate the source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, thereby preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.
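The abstract outlines a pipeline: monocular 3D face reconstruction factorizes each target frame into expression, geometry, and pose parameters; a recurrent network translates the source audio into expression parameters; and a rendering network plus a dynamic programming step produce the final video. As a rough illustration of the audio-to-expression step only, here is a minimal PyTorch sketch, not the authors' implementation: the class name AudioToExpression, the 80-dimensional mel-spectrogram input, the 64-dimensional expression space, and the network sizes are all assumptions made for this example.

```python
# Minimal sketch (not the paper's code): a recurrent audio-to-expression
# translator. Assumed, for illustration only: per-frame audio features are
# 80-dim log-mel vectors and the 3DMM expression space has 64 coefficients.
import torch
import torch.nn as nn


class AudioToExpression(nn.Module):
    """Maps a sequence of audio features to per-frame 3DMM expression params."""

    def __init__(self, audio_dim: int = 80, expr_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.encoder = nn.LSTM(audio_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, expr_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim) -> (batch, time, expr_dim)
        hidden_states, _ = self.encoder(audio_feats)
        return self.head(hidden_states)


def recombine(expr_pred, geometry, pose):
    """Keep the target frame's reconstructed geometry and pose; swap in the
    audio-driven expression block only."""
    return {"expression": expr_pred, "geometry": geometry, "pose": pose}


if __name__ == "__main__":
    model = AudioToExpression()
    mel = torch.randn(1, 120, 80)      # 120 audio frames, 80-dim features
    expr = model(mel)                  # (1, 120, 64) expression parameters
    frame_params = recombine(expr[:, 0], torch.randn(1, 80), torch.randn(1, 6))
    print(expr.shape, sorted(frame_params.keys()))
```

In this sketch, recombine mirrors the factorize-then-recombine idea described above, replacing only the expression parameters of each target frame; the photo-realistic rendering network and the temporal smoothing step are omitted.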
Main Authors: | Song, Linsen; Wu, Wayne; Qian, Chen; He, Ran; Loy, Chen Change |
---|---|
Other Authors: | School of Computer Science and Engineering |
Format: | Article |
Language: | English |
Published: | 2022 |
Subjects: | Engineering::Computer science and engineering; Talking Face Generation; Video Generation |
Online Access: | https://hdl.handle.net/10356/162986 |
Institution: | Nanyang Technological University |
Language: | English |
id |
sg-ntu-dr.10356-162986 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-1629862022-11-14T07:56:42Z Everybody's talkin': let me talk as you want Song, Linsen Wu, Wayne Qian, Chen He, Ran Loy, Chen Change School of Computer Science and Engineering S-Laboratory, NTU Engineering::Computer science and engineering Talking Face Generation Video Generation We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. The method is unique because it is highly dynamic: it does not assume a person-specific rendering network, yet it is capable of translating one source audio into a randomly chosen video output within a set of speech videos. Instead of learning a highly heterogeneous and nonlinear mapping from audio to video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate the source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, thereby preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio. This work was supported in part by the Beijing Natural Science Foundation under Grant JQ18017, in part by the National Natural Science Foundation of China under Grant U20A20223 and Grant 61721004, in part by the Youth Innovation Promotion Association Chinese Academy of Sciences (CAS) under Grant Y201929, in part by the RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, and in part by the Cash and In-Kind Contribution from the Industry Partner(s). 2022-11-14T07:56:42Z 2022-11-14T07:56:42Z 2022 Journal Article Song, L., Wu, W., Qian, C., He, R. & Loy, C. C. (2022). Everybody's talkin': let me talk as you want. IEEE Transactions on Information Forensics and Security, 17, 585-598. https://dx.doi.org/10.1109/TIFS.2022.3146783 1556-6013 https://hdl.handle.net/10356/162986 10.1109/TIFS.2022.3146783 2-s2.0-85123783224 17 585 598 en IEEE Transactions on Information Forensics and Security © 2022 IEEE. All rights reserved. |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering Talking Face Generation Video Generation |
spellingShingle |
Engineering::Computer science and engineering Talking Face Generation Video Generation Song, Linsen Wu, Wayne Qian, Chen He, Ran Loy, Chen Change Everybody's talkin': let me talk as you want |
description |
We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. The method is unique because it is highly dynamic: it does not assume a person-specific rendering network, yet it is capable of translating one source audio into a randomly chosen video output within a set of speech videos. Instead of learning a highly heterogeneous and nonlinear mapping from audio to video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate the source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, thereby preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio. |
author2 |
School of Computer Science and Engineering |
author_facet |
School of Computer Science and Engineering Song, Linsen Wu, Wayne Qian, Chen He, Ran Loy, Chen Change |
format |
Article |
author |
Song, Linsen Wu, Wayne Qian, Chen He, Ran Loy, Chen Change |
author_sort |
Song, Linsen |
title |
Everybody's talkin': let me talk as you want |
title_short |
Everybody's talkin': let me talk as you want |
title_full |
Everybody's talkin': let me talk as you want |
title_fullStr |
Everybody's talkin': let me talk as you want |
title_full_unstemmed |
Everybody's talkin': let me talk as you want |
title_sort |
everybody's talkin': let me talk as you want |
publishDate |
2022 |
url |
https://hdl.handle.net/10356/162986 |
_version_ |
1751548511226167296 |
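The description above also mentions a dynamic programming method for building a temporally coherent video. The record does not spell out that formulation, so the following is only a generic Viterbi-style sketch under assumed costs: unary[t][k] stands for a hypothetical fitting cost of candidate k at frame t, and pairwise(j, k) for a hypothetical transition penalty that discourages jitter between consecutive frames. The function name select_coherent_sequence is invented for this example.

```python
# Generic sketch (not the paper's exact formulation): dynamic programming
# over per-frame candidates to pick a temporally coherent sequence.
from typing import Callable, List, Sequence


def select_coherent_sequence(
    unary: Sequence[Sequence[float]],
    pairwise: Callable[[int, int], float],
) -> List[int]:
    """Viterbi-style selection: returns one candidate index per time step."""
    T, K = len(unary), len(unary[0])
    cost = [list(unary[0])]          # accumulated cost table, cost[t][k]
    back = [[0] * K]                 # backpointers for path recovery
    for t in range(1, T):
        cost.append([0.0] * K)
        back.append([0] * K)
        for k in range(K):
            best_prev, best_val = 0, float("inf")
            for j in range(K):
                v = cost[t - 1][j] + pairwise(j, k)
                if v < best_val:
                    best_prev, best_val = j, v
            cost[t][k] = best_val + unary[t][k]
            back[t][k] = best_prev
    # Trace back the cheapest path from the last frame.
    path = [min(range(K), key=lambda k: cost[-1][k])]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


if __name__ == "__main__":
    unary = [[0.2, 0.9], [0.8, 0.1], [0.3, 0.4]]          # 3 frames, 2 candidates
    smooth = lambda a, b: 0.0 if a == b else 0.5          # penalize switching
    print(select_coherent_sequence(unary, smooth))        # one index per frame
```

A smaller pairwise penalty lets the selection follow the per-frame costs more closely, while a larger one biases the path toward staying on the same candidate, trading fidelity for temporal smoothness.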