Audio captioning and retrieval with improved cross-modal objectives

Bibliographic Details
Main Author: Koh, Andrew Jin Jie
Other Authors: Chng Eng Siong
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects: Engineering::Computer science and engineering
Online Access:https://hdl.handle.net/10356/172437
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-172437
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
spellingShingle Engineering::Computer science and engineering
Koh, Andrew Jin Jie
Audio captioning and retrieval with improved cross-modal objectives
description Automated Audio Captioning (AAC) is the task of generating descriptive captions from an input audio clip, while Language-Based Audio Retrieval (LBAR) is the task of retrieving the most relevant audio clip for an input text query. AAC requires a model that can not only comprehend the acoustic events occurring within an audio clip but also translate that information into natural language. For LBAR, the model must understand the context and meaning of both the audio events and the query text caption so that it can retrieve relevant audio clips for user-specified queries. This is difficult because audio data is often noisy, and the same sound event can vary considerably across sources and recording environments. To overcome these challenges, we propose three self-supervised techniques to strengthen the cross-modal relationship between text and audio representations.
In the first study, we propose Reconstruction Latent Space Similarity Regularization (RLSSR) for AAC, an additional module in the model architecture that is trained in a self-supervised manner and requires no additional annotations. The idea is inspired by computer vision tasks in which a model recreates the original image. Instead of recreating the original audio, a small component reconstructs the audio embeddings from the text embeddings while increasing the similarity between the two. This feedback acts as a form of regularization and improves the overall quality of the generated captions. We also analyze the design of the audio encoder and find that a transformer encoder is beneficial for AAC. Combining both methods allows us to surpass state-of-the-art results (0.242 SPIDEr score) by a significant margin on the Clotho dataset across several metrics and benchmarks.
In the second study, we tackle the new Language-Based Audio Retrieval challenge presented in DCASE 2022. Firstly, we introduce an easy-to-use and scalable architecture, Converging Tied Layers, which uses shared transformer layers to align the audio and text representations in the same subspace. This approach requires minimal training and allows many publicly available models to be used without fine-tuning. Secondly, we demonstrate that this architecture, trained with a self-supervised contrastive loss, exceeds the performance of the baseline model. Lastly, our approach has a low memory requirement for training and allows pre-trained models to be used as is, without fine-tuning. Our evaluation shows that our approach beats the baseline scores by 0.08 (267%) in R@1 and by 0.13 in mAP10 on the Clotho dataset.
In the third study, we present a new algorithm, Epochal Difficult Captions, to aid the training of AAC models. The algorithm adjusts target captions according to a predetermined curriculum whose difficulty level is determined by the current training epoch. It is efficient, self-supervised, can be incorporated into any model architecture, and does not noticeably increase training time. It improves on the keyword estimation method used in earlier work to train the AAC encoder. We evaluated our approach on two different models in three settings and found that using Epochal Difficult Captions consistently improves performance, by as much as 0.013 SPIDEr score on the Clotho dataset.
In addition to the above work, we present two novel papers on word sense disambiguation via transfer learning and on audio tagging. The former uses BERT to reframe word sense disambiguation as a relevance ranking problem, improving performance by 2.6% F1 score on the SE15 dataset. The latter uses label manipulation to convert strong labels to weak labels, mitigating the model's tendency to predict inactive frames, and it outperforms the DCASE 2022 baseline by 45.5% on the real validation set in both aspects of the PSDS metric. Both methods are complementary to audio captioning and retrieval due to the shared need for good cross-modal audio and text representations.
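As a minimal sketch of the RLSSR idea described above (assuming a simple MLP reconstruction head, 512-dimensional embeddings, a cosine-similarity objective, and a 0.1 weighting, none of which are specified here), the regularizer could look roughly like this in PyTorch:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ReconstructionHead(nn.Module):
        """Illustrative module that reconstructs audio embeddings from text embeddings."""
        def __init__(self, text_dim=512, audio_dim=512):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(text_dim, audio_dim),
                nn.ReLU(),
                nn.Linear(audio_dim, audio_dim),
            )

        def forward(self, text_emb):
            return self.proj(text_emb)

    def rlssr_regularizer(audio_emb, text_emb, head, weight=0.1):
        # Self-supervised: only the embeddings already produced during caption
        # training are used, so no extra annotations are required.
        recon = head(text_emb)                               # (B, audio_dim)
        sim = F.cosine_similarity(recon, audio_emb, dim=-1)  # (B,)
        return weight * (1.0 - sim).mean()                   # push similarity toward 1

    # Hypothetical training step: loss = caption_loss + rlssr_regularizer(audio_emb, text_emb, head)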
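The Converging Tied Layers architecture from the second study can be sketched in the same spirit, assuming that token sequences from frozen, pre-trained audio and text encoders pass through one shared transformer stack and are trained with a symmetric InfoNCE-style contrastive loss; the layer count, model dimension, mean pooling, and temperature are illustrative assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvergingTiedLayers(nn.Module):
        """Shared (tied) transformer layers applied to both modalities (illustrative)."""
        def __init__(self, dim=768, n_layers=2, n_heads=8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
            self.tied = nn.TransformerEncoder(layer, num_layers=n_layers)  # one stack serves both modalities

        def forward(self, audio_tokens, text_tokens):
            a = self.tied(audio_tokens).mean(dim=1)  # (B, dim) pooled audio representation
            t = self.tied(text_tokens).mean(dim=1)   # (B, dim) pooled text representation
            return F.normalize(a, dim=-1), F.normalize(t, dim=-1)

    def contrastive_loss(a, t, temperature=0.07):
        # Symmetric cross-entropy over the audio-text similarity matrix:
        # matching pairs (the diagonal) are pulled together, others pushed apart.
        logits = a @ t.T / temperature
        labels = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

If, as assumed here, only the tied layers receive gradients while the underlying encoders stay frozen, the low training-memory footprint and the ability to use publicly available models without fine-tuning follow naturally.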
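Epochal Difficult Captions is described above only at a high level, so the following is a hypothetical reading rather than the thesis's algorithm: the training target for each caption is rewritten by a schedule keyed on the current epoch, starting from estimated keywords and growing toward the full caption. The stopword-based keyword estimate and the linear schedule are assumptions:

    STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "in", "on", "with"}

    def estimate_keywords(caption: str) -> list:
        # Crude stand-in for the keyword estimation used in earlier AAC work:
        # keep content words, drop stopwords.
        return [w for w in caption.lower().split() if w not in STOPWORDS]

    def epochal_caption(caption: str, epoch: int, total_epochs: int) -> str:
        """Return this epoch's training target: a progressively larger portion of
        the caption, so that target difficulty grows as training proceeds."""
        frac = min(1.0, (epoch + 1) / max(1, total_epochs // 2))
        if frac >= 1.0:
            return caption                      # later epochs: the full caption
        keywords = estimate_keywords(caption)
        keep = max(1, int(len(keywords) * frac))
        return " ".join(keywords[:keep])        # early epochs: an easier target

    # Example: epochal_caption("A dog barks loudly in the park", epoch=0, total_epochs=30) -> "dog"

Because the rewrite is a cheap string operation applied on the fly, it needs no extra annotations and adds no noticeable training time, consistent with the properties claimed above.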
author2 Chng Eng Siong
author_facet Chng Eng Siong
Koh, Andrew Jin Jie
format Thesis-Doctor of Philosophy
author Koh, Andrew Jin Jie
author_sort Koh, Andrew Jin Jie
title Audio captioning and retrieval with improved cross-modal objectives
title_short Audio captioning and retrieval with improved cross-modal objectives
title_full Audio captioning and retrieval with improved cross-modal objectives
title_fullStr Audio captioning and retrieval with improved cross-modal objectives
title_full_unstemmed Audio captioning and retrieval with improved cross-modal objectives
title_sort audio captioning and retrieval with improved cross-modal objectives
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/172437
_version_ 1787590720928874496
spelling sg-ntu-dr.10356-172437 2024-01-04T06:32:51Z Audio captioning and retrieval with improved cross-modal objectives Koh, Andrew Jin Jie Chng Eng Siong School of Computer Science and Engineering ASESChng@ntu.edu.sg Engineering::Computer science and engineering Doctor of Philosophy 2023-12-11T06:07:06Z 2023-12-11T06:07:06Z 2023 Thesis-Doctor of Philosophy Koh, A. J. J. (2023). Audio captioning and retrieval with improved cross-modal objectives. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/172437 10.32657/10356/172437 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University