Learning from the master: Distilling cross-modal advanced knowledge for lip reading

Lip reading aims to predict the spoken sentences from silent lip videos. Due to the fact that such a vision task usually performs worse than its counterpart speech recognition, one potential scheme is to distill knowledge from a teacher pretrained by audio signals. However, the latent domain gap bet...

Bibliographic Details
Main Authors: REN, Sucheng, DU, Yong, LV, Jianming, HAN, Guoqiang, HE, Shengfeng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2021
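Subjects: Curricula; Modal analysis; Speech recognition; Audio signal; Collaborative framework; Cross-modal; Interactive strategy; Learning designs; Lip reading; Modal data; Performance; Teachers'; Teaching contents; Students; Databases and Information Systems; Graphics and Human Computer Interfaces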
Online Access: https://ink.library.smu.edu.sg/sis_research/8442
https://ink.library.smu.edu.sg/context/sis_research/article/9445/viewcontent/Ren_Learning_From_the_Master_Distilling_Cross_Modal_Advanced_Knowledge_for_Lip_CVPR_2021_paper.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9445
record_format dspace
spelling sg-smu-ink.sis_research-9445 2024-01-04T09:55:15Z 2021-06-01T07:00:00Z text application/pdf info:doi/10.1109/CVPR46437.2021.01312 http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Curricula
Modal analysis
Speech recognition
Audio signal
Collaborative framework
Cross-modal
Interactive strategy
Learning designs
Lip reading
Modal data
Performance
Teachers'
Teaching contents
Students
Databases and Information Systems
Graphics and Human Computer Interfaces
description Lip reading aims to predict spoken sentences from silent lip videos. Because this vision task usually performs worse than its counterpart, speech recognition, one potential scheme is to distill knowledge from a teacher pretrained on audio signals. However, the latent domain gap between the cross-modal data can lead to learning ambiguity and thus limit the performance of lip reading. In this paper, we propose a novel collaborative framework for lip reading that addresses two issues: 1) the teacher should understand bi-modal knowledge so that it may bridge the inherent cross-modal gap; 2) the teacher should adapt its teaching content as the student evolves. To these ends, instead of a pretrained teacher, we introduce a trainable “master” network that ingests both audio signals and silent lip videos. The master produces logits from three kinds of features: audio, video, and their combination. To fuse this knowledge interactively, we regularize the master with task-specific feedback from the student, in which the student's requirements are implicitly embedded. Meanwhile, we add a couple of “tutor” networks to the system as guidance for flexibly emphasizing the most useful knowledge. In addition, we incorporate a curriculum learning design to achieve better convergence. Extensive experiments demonstrate that the proposed network outperforms state-of-the-art methods on several benchmarks, in both word-level and sentence-level scenarios.
format text
author REN, Sucheng
DU, Yong
LV, Jianming
HAN, Guoqiang
HE, Shengfeng
author_sort REN, Sucheng
title Learning from the master: Distilling cross-modal advanced knowledge for lip reading
publisher Institutional Knowledge at Singapore Management University
publishDate 2021
url https://ink.library.smu.edu.sg/sis_research/8442
https://ink.library.smu.edu.sg/context/sis_research/article/9445/viewcontent/Ren_Learning_From_the_Master_Distilling_Cross_Modal_Advanced_Knowledge_for_Lip_CVPR_2021_paper.pdf
_version_ 1787590750247059456