Audio captioning and retrieval with improved cross-modal objectives

Automated Audio Captioning (AAC) is the task of generating descriptive captions from an input audio clip, while Language-Based Audio Retrieval (LBAR) is the task of retrieving the most relevant audio clip based on an input text query. AAC requires a model that is not only able to comprehend the acou...

Bibliographic Details
Main Author:	Koh, Andrew Jin Jie
Other Authors:	Chng Eng Siong
Format:	Thesis-Doctor of Philosophy
Language:	English
Published:	Nanyang Technological University 2023
Subjects:	Engineering::Computer science and engineering
Online Access:	https://hdl.handle.net/10356/172437
Institution:	Nanyang Technological University

Similar Items

Cross-modal graph with meta concepts for video captioning
by: Wang, Hao, et al.
Published: (2022)

Improved image captioning techniques with comparative study
by: He, Cari
Published: (2021)

Audio pattern discovery and retrieval
by: Wang, Lei
Published: (2013)

Text-based image retrieval using image captioning
by: Tan, Kah Hwa
Published: (2019)

Incorporating additional knowledge into image captioners
by: Xu, Yang
Published: (2021)

Paired cross-modal data augmentation for fine-grained image-to-text retrieval
by: Wang, Hao, et al.
Published: (2023)

Neural image and video captioning (NIVC)
by: Lee, Jeremy Kian Kiat
Published: (2022)

Deep learning-based image captioning
by: Chong, Kaydon
Published: (2019)

Deep robust multilevel semantic hashing for multi-label cross-modal retrieval
by: Song, Ge, et al.
Published: (2023)

Evaluations of training paradigms in neural image captioning
by: Lee, Si Min
Published: (2019)

Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature
by: Xu, Yuecong, et al.
Published: (2021)

A vector-based approach to broadcast audio database indexing and retrieval
by: Wang, Lei, et al.
Published: (2013)

Online weighted hashing for cross-modal retrieval
by: Jiang, Zining
Published: (2022)

Deconfounded image captioning: a causal retrospect
by: Yang, Xu, et al.
Published: (2022)

Automatic closed caption generation from video files
by: Tan, Kenneth Chengwei
Published: (2014)

Cross-modal recipe retrieval with stacked attention model
by: CHEN, Jing-Jing, et al.
Published: (2018)

SWORS : a system for the efficient retrieval of relevant spatial web objects
by: Cao, Xin, et al.
Published: (2013)

A framework for efficient spatial web object retrieval
by: Jensen, Christian S., et al.
Published: (2013)

Online cross-modal hashing for web image retrieval
by: XIE, Liang, et al.
Published: (2016)

CATNet: Cross-modal fusion for audio-visual speech recognition
by: WANG, Xingmei, et al.
Published: (2024)

Neural image and video captioning
by: Lam, Ting En
Published: (2024)

Distance metric learning for multi-modal image retrieval and annotation
by: Wu, Pengcheng
Published: (2014)

Whispersync : close caption (live-following) of the read speech in a close cation
by: Lam, Chun Yin
Published: (2015)

Image retrieval with a multi-modality ontology
by: Wang, Huan
Published: (2010)

An object-oriented, logic based approach to document retrieval
by: Tan, Nam Beng.
Published: (2009)

Content-based audio classification and retrieval
by: Liu, Ming Chun
Published: (2008)

Introduction to the special issue on new subjective and objective methodologies for audio and visual signal processing
by: Loizou, Philip C., et al.
Published: (2013)

Automated image captioning
by: Teo, Sabrina Jingya
Published: (2017)

Alleviating the inconsistency of multimodal data in cross-modal retrieval
by: Li, Tieying, et al.
Published: (2024)

Learning to collocate Visual-Linguistic Neural Modules for image captioning
by: Yang, Xu, et al.
Published: (2023)

Audio fingerprint application for media industry
by: Kusuma, Andrew Putra
Published: (2018)

Cross-Modal Self-Taught Hashing for large-scale image retrieval
by: XIE, Liang, et al.
Published: (2016)

Context-aware visual policy network for fine-grained image captioning
by: Zha, Zheng-Jun, et al.
Published: (2022)

Learning decoupled models for cross-modal generation
by: Wang, Hao
Published: (2023)

Cross-modal prediction in audio-visual communication
by: Rao Ram R., et al.
Published: (2018)

Cross-modal recipe retrieval: How to cook this dish?
by: CHEN, Jingjing, et al.
Published: (2017)

Stack-VS : stacked visual-semantic attention for image caption generation
by: Cheng, Ling, et al.
Published: (2021)

Visionary caption: Improving the accessibility of presentation slides through highlighting visualization
by: YIP, Carmen Ji Yan, et al.
Published: (2021)

Cross-modal recipe retrieval with rich food attributes
by: CHEN, Jingjing, et al.
Published: (2017)