CATNet: Cross-modal fusion for audio-visual speech recognition

Automatic speech recognition (ASR) is a pattern recognition technology that converts human speech into text. With the aid of advanced deep learning models, speech recognition performance has improved significantly. In particular, the emerging Audio–Visual Speech Recognition (AVSR) meth...

Bibliographic Details
Main Authors:	WANG, Xingmei, MI, Jianchen, LI, Boquan, ZHAO, Yixu, MENG, Jiaxiang
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2024
Subjects:	Attention mechanism; Audio-visual speech recognition; Cross-modal fusion; Deep learning; Graphics and Human Computer Interfaces; Numerical Analysis and Scientific Computing
Online Access:	https://ink.library.smu.edu.sg/sis_research/8645 https://ink.library.smu.edu.sg/context/sis_research/article/9648/viewcontent/CatNet_av.pdf
Institution:	Singapore Management University

Similar Items

CROSS-MODALITY COMPLEMENTARITY FOR AUDIO-VISUAL SPEECH RECOGNITION
by: WANG JIADONG
Published: (2024)

Audio-visual modeling for bimodal speech recognition
by: Kaynak, M.N., et al.
Published: (2014)

AV-FDTI: audio-visual fusion for drone threat identification
by: Yang, Yizhuo, et al.
Published: (2024)

PRFusion: toward effective and robust multi-modal place recognition with image and point cloud fusion
by: Wang, Sijie, et al.
Published: (2025)

AUDIO-VISUAL ACTIVE SPEAKER DETECTION AND RECOGNITION
by: TAO RUIJIE
Published: (2023)

Attentive Moment Retrieval in Videos
by: Meng Liu, et al.
Published: (2020)

Multi-stage DNN training for automatic recognition of dysarthric speech
by: Yilmaz E., et al.
Published: (2018)

Learning from the master: Distilling cross-modal advanced knowledge for lip reading
by: REN, Sucheng, et al.
Published: (2021)

Noise-Robust Speech Recognition Using Deep Neural Network
by: LI BO
Published: (2014)

Integration strategies for audio-visual speech processing: Applied to text-dependent speaker recognition
by: Lucey S., et al.
Published: (2018)

Web service for automatic speech recognition
by: LIEW SHIANG CHEN
Published: (2010)

An adaptive network fusing light detection and ranging height-sliced bird’s-eye view and vision for place recognition
by: ZHENG, Rui, et al.
Published: (2024)

Speech to text converter for Filipino language using hybrid artificial neural network/Hidden Markov Model
by: Chan, Aylmer Jason L., et al.
Published: (2007)

Fusing heterogeneous modalities for video and image re-ranking
by: TAN, Hung-Khoon, et al.
Published: (2011)

Emotion recognition in Filipino speech: EMOTICON
by: Chua, Joan L., et al.
Published: (2009)

TranSiam: Aggregating multi-modal visual features with locality for medical image segmentation
by: LI, Xuejian, et al.
Published: (2024)

Audio-based assessment in determining language
by: Africa, Aaron Don M., et al.
Published: (2020)

Deep understanding of cooking procedure for cross-modal recipe retrieval
by: CHEN, Jingjing, et al.
Published: (2018)

Speech comparison using discrete wavelet transform
by: Baello, Kimberly Anne D., et al.
Published: (2003)

Implementing a statistical method for automatic speech recognition
by: Gochuico, Stephany, et al.
Published: (1990)

Speech-controlled human-computer interface for audio-visual breast self-examination guidance system
by: Billones, Robert Kerwin C., et al.
Published: (2016)

Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval
by: Jing-Jing Chen, et al.
Published: (2020)

Learning Using Privileged Information for Food Recognition
by: Lei Meng, et al.
Published: (2020)

Isolated-word speech recognition system
by: Carunungan, Christopher C., et al.
Published: (1992)

Combining Speech with textual methods for arabic diacritization
by: AISHA SIDDIQA AZIM
Published: (2012)

Concept-driven multi-modality fusion for video search
by: WEI, Xiao-Yong, et al.
Published: (2011)

Deep Spiking Neural Networks for Large Vocabulary Automatic Speech Recognition
by: Wu, J., et al.
Published: (2021)

Cross-domain cross-modal food transfer
by: ZHU, Bin, et al.
Published: (2020)

DIRECTED AUDIO TEXTURE SYNTHESIS WITH DEEP LEARNING
by: MUHAMMAD HUZAIFAH BIN MD SHAHRIN
Published: (2021)

k-NN and k-means for emotion detection from speech
by: Pedro, Ana Marian M.
Published: (2010)

Cross-modal recipe retrieval with stacked attention model
by: CHEN, Jing-Jing, et al.
Published: (2018)

Automatic speech recognition and chat bot for air traffic control
by: Low, Ashton Kin Yun
Published: (2024)

Speech link
by: Chan, William C., et al.
Published: (1993)

Speech based emotion classification
by: Nwe, T.L., et al.
Published: (2014)

Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation
by: Xiao, X, et al.
Published: (2020)

On-device implementation of an automatic Filipino speech recognition system
by: Ang, Federico M., et al.
Published: (2008)

A Small vocabulary automatic speech profanity suppression system using Hybrid Hidden Markov Model/Artificial Neural Network (HMM/ANN) keyword spotting framework
by: Ablaza, Fernando I., Jr., et al.
Published: (2010)

Underwater Image Translation via Multi-Scale Generative Adversarial Network
by: YANG, Dongmei, et al.
Published: (2023)

Effects of music and speech on the neural correlates of emotion and attention
by: NICOLAS RENE ESCOFFIER
Published: (2013)

Modeling of non-native speech automatic speech recognition
by: XIONG YUANTING
Published: (2011)