CATNet: Cross-modal fusion for audio-visual speech recognition
Automatic speech recognition (ASR) is a pattern recognition technology that converts human speech into text. With the aid of advanced deep learning models, the performance of speech recognition has improved significantly. In particular, emerging Audio–Visual Speech Recognition (AVSR) methods achieve satisfactory performance by combining audio-modal and visual-modal information. However, complex environments, especially noise, limit the effectiveness of existing methods. To address the noise problem, this paper proposes a novel cross-modal audio–visual speech recognition model, named CATNet. First, we devise a cross-modal bidirectional fusion model to analyze the close relationship between the audio and visual modalities. Second, we propose an audio–visual dual-modal network to preprocess audio and visual information, extract significant features, and filter redundant noise. The experimental results demonstrate the effectiveness of CATNet, which achieves excellent WER, CER, and convergence speed, outperforms other benchmark models, and overcomes the challenge posed by noisy environments.
Main Authors: WANG, Xingmei; MI, Jianchen; LI, Boquan; ZHAO, Yixu; MENG, Jiaxiang
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Subjects: Attention mechanism; Audio-visual speech recognition; Cross-modal fusion; Deep learning; Graphics and Human Computer Interfaces; Numerical Analysis and Scientific Computing
Online Access: https://ink.library.smu.edu.sg/sis_research/8645 https://ink.library.smu.edu.sg/context/sis_research/article/9648/viewcontent/CatNet_av.pdf
Institution: Singapore Management University
id | sg-smu-ink.sis_research-9648
record_format | dspace
published | 2024-02-01
doi | 10.1016/j.patrec.2024.01.002
license | http://creativecommons.org/licenses/by-nc-nd/4.0/
series | Research Collection School Of Computing and Information Systems
institution | Singapore Management University
building | SMU Libraries
continent | Asia
country | Singapore
content_provider | SMU Libraries
collection | InK@SMU
language | English
topic | Attention mechanism; Audio-visual speech recognition; Cross-modal fusion; Deep learning; Graphics and Human Computer Interfaces; Numerical Analysis and Scientific Computing
format | text
author | WANG, Xingmei; MI, Jianchen; LI, Boquan; ZHAO, Yixu; MENG, Jiaxiang
title | CATNet: Cross-modal fusion for audio-visual speech recognition
publisher | Institutional Knowledge at Singapore Management University
publishDate | 2024
url | https://ink.library.smu.edu.sg/sis_research/8645 https://ink.library.smu.edu.sg/context/sis_research/article/9648/viewcontent/CatNet_av.pdf