Learning from the master: Distilling cross-modal advanced knowledge for lip reading

Lip reading aims to predict the spoken sentences from silent lip videos. Due to the fact that such a vision task usually performs worse than its counterpart speech recognition, one potential scheme is to distill knowledge from a teacher pretrained by audio signals. However, the latent domain gap bet...

Full description

Saved in:

Bibliographic Details
Main Authors:	REN, Sucheng, DU, Yong, LV, Jianming, HAN, Guoqiang, HE, Shengfeng
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2021
Subjects:	Curricula Modal analysis Speech recognition Audio signal Collaborative framework Cross-modal Interactive strategy Learning designs Lip reading Modal data Performance Teachers' Teaching contents Students Databases and Information Systems Graphics and Human Computer Interfaces
Online Access:	https://ink.library.smu.edu.sg/sis_research/8442 https://ink.library.smu.edu.sg/context/sis_research/article/9445/viewcontent/Ren_Learning_From_the_Master_Distilling_Cross_Modal_Advanced_Knowledge_for_Lip_CVPR_2021_paper.pdf
Tags:	Add Tag No Tags, Be the first to tag this record!
Institution:	Singapore Management University
Language:	English

Description
Summary:	Lip reading aims to predict the spoken sentences from silent lip videos. Due to the fact that such a vision task usually performs worse than its counterpart speech recognition, one potential scheme is to distill knowledge from a teacher pretrained by audio signals. However, the latent domain gap between the cross-modal data could lead to a learning ambiguity and thus limits the performance of lip reading. In this paper, we propose a novel collaborative framework for lip reading, and two aspects of issues are considered: 1) the teacher should understand bi-modal knowledge to possibly bridge the inherent cross-modal gap; 2) the teacher should adjust teaching contents adaptively with the evolution of the student. To these ends, we introduce a trainable “master” network which ingests both audio signals and silent lip videos instead of a pretrained teacher. The master produces logits from three modalities of features: audio modality, video modality, and their combination. To further provide an interactive strategy to fuse these knowledge organically, we regularize the master with the task-specific feedback from the student, in which the requirement of the student is implicitly embedded. Meanwhile, we involve a couple of “tutor” networks into our system as guidance for emphasizing the fruitful knowledge flexibly. In addition, we incorporate a curriculum learning design to ensure a better convergence. Extensive experiments demonstrate that the proposed network outperforms the state-of-the-art methods on several benchmarks, including in both word-level and sentence-level scenarios.

Learning from the master: Distilling cross-modal advanced knowledge for lip reading

Similar Items