Learning from the master: Distilling cross-modal advanced knowledge for lip reading
Lip reading aims to predict spoken sentences from silent lip videos. Because this vision task usually performs worse than its speech recognition counterpart, one potential scheme is to distill knowledge from a teacher pretrained on audio signals. However, the latent domain gap bet...
Main Authors: | REN, Sucheng; DU, Yong; LV, Jianming; HAN, Guoqiang; HE, Shengfeng |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2021 |
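Subjects: | Curricula; Modal analysis; Speech recognition; Audio signal; Collaborative framework; Cross-modal; Interactive strategy; Learning designs; Lip reading; Modal data; Performance; Teachers'; Teaching contents; Students; Databases and Information Systems; Graphics and Human Computer Interfaces |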
Online Access: | https://ink.library.smu.edu.sg/sis_research/8442 https://ink.library.smu.edu.sg/context/sis_research/article/9445/viewcontent/Ren_Learning_From_the_Master_Distilling_Cross_Modal_Advanced_Knowledge_for_Lip_CVPR_2021_paper.pdf |
Institution: | Singapore Management University |
id |
sg-smu-ink.sis_research-9445 |
---|---|
record_format |
dspace |
spelling |
sg-smu-ink.sis_research-9445 2024-01-04T09:55:15Z 2021-06-01T07:00:00Z text application/pdf info:doi/10.1109/CVPR46437.2021.01312 http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng |
institution |
Singapore Management University |
building |
SMU Libraries |
continent |
Asia |
country |
Singapore |
content_provider |
SMU Libraries |
collection |
InK@SMU |
language |
English |
topic |
Curricula; Modal analysis; Speech recognition; Audio signal; Collaborative framework; Cross-modal; Interactive strategy; Learning designs; Lip reading; Modal data; Performance; Teachers'; Teaching contents; Students; Databases and Information Systems; Graphics and Human Computer Interfaces |
description |
Lip reading aims to predict spoken sentences from silent lip videos. Because this vision task usually performs worse than its speech recognition counterpart, one potential scheme is to distill knowledge from a teacher pretrained on audio signals. However, the latent domain gap between the cross-modal data could lead to learning ambiguity and thus limit lip reading performance. In this paper, we propose a novel collaborative framework for lip reading that addresses two issues: 1) the teacher should understand bi-modal knowledge to help bridge the inherent cross-modal gap; 2) the teacher should adapt its teaching content as the student evolves. To these ends, instead of a pretrained teacher, we introduce a trainable “master” network that ingests both audio signals and silent lip videos. The master produces logits from three kinds of features: audio, video, and their combination. To fuse this knowledge interactively, we regularize the master with task-specific feedback from the student, in which the student's needs are implicitly embedded. Meanwhile, we add a couple of “tutor” networks to the system as guidance for flexibly emphasizing the useful knowledge. In addition, we incorporate a curriculum learning design to ensure better convergence. Extensive experiments demonstrate that the proposed network outperforms state-of-the-art methods on several benchmarks, in both word-level and sentence-level scenarios. |
format |
text |
author |
REN, Sucheng; DU, Yong; LV, Jianming; HAN, Guoqiang; HE, Shengfeng |
author_sort |
REN, Sucheng |
title |
Learning from the master: Distilling cross-modal advanced knowledge for lip reading |
publisher |
Institutional Knowledge at Singapore Management University |
publishDate |
2021 |
url |
https://ink.library.smu.edu.sg/sis_research/8442 https://ink.library.smu.edu.sg/context/sis_research/article/9445/viewcontent/Ren_Learning_From_the_Master_Distilling_Cross_Modal_Advanced_Knowledge_for_Lip_CVPR_2021_paper.pdf |
_version_ |
1787590750247059456 |
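
The description above says the master produces logits from three kinds of features (audio, video, and their combination) and that this knowledge is distilled into the lip-reading student. The record does not give the loss functions, so the following is only a minimal sketch of one common way to distill from several sets of teacher logits: a weighted sum of temperature-scaled KL divergences. The function names, the fixed `weights`, and the `temperature` value are all hypothetical; in the paper the emphasis is set by "tutor" networks rather than by constants, and its actual objective may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, master_logits, weights, temperature=2.0):
    """Weighted sum of KL terms, one per master output (hypothetical form).

    master_logits and weights are dicts sharing the same keys,
    here "audio", "video", and "audio_visual".
    """
    student = softmax(student_logits, temperature)
    loss = 0.0
    for name, logits in master_logits.items():
        target = softmax(logits, temperature)
        loss += weights[name] * kl_divergence(target, student)
    return loss


# Toy example with three classes; all numbers are made up for illustration.
student = [1.0, 0.5, -0.2]
master = {
    "audio": [2.0, 0.1, -1.0],
    "video": [1.2, 0.4, -0.3],
    "audio_visual": [2.5, 0.0, -1.5],
}
weights = {"audio": 1 / 3, "video": 1 / 3, "audio_visual": 1 / 3}  # assumed fixed
loss = distillation_loss(student, master, weights)
```

In this sketch the loss is zero when every master output matches the student's logits and grows as the distributions diverge. Replacing the constant `weights` with values predicted per sample would be one way to approximate the adaptive emphasis the abstract attributes to the tutors.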