Multimodal few-shot classification without attribute embedding

Multimodal few-shot learning aims to exploit the complementary information inherent in multiple modalities for vision tasks in low-data scenarios. Most current research focuses on finding a suitable embedding space for the various modalities. While embedding-based solutions provide state-of-the-art results, they reduce the interpretability of the model, so that separate visualization approaches are needed to make such models more transparent. In this paper, a multimodal few-shot learning framework that is inherently interpretable is presented. This is achieved by using the textual modality in the form of attributes without embedding them, which enables the model to explain directly which attributes caused it to classify an image into a particular class. The model consists of a variational autoencoder that learns the visual latent representation, combined with a semantic latent representation learnt by a standard autoencoder trained with a semantic loss between its latent representation and a binary attribute vector. A decoder reconstructs the original image from the concatenated latent vectors. The proposed model outperforms other multimodal methods when all test classes are used, e.g., 50 classes in a 50-way 1-shot setting, and is comparable for smaller numbers of ways. Since raw text attributes are used, the model is evaluated on the CUB, SUN and AWA2 datasets. The effectiveness of the interpretability provided by the model is evaluated by analyzing how well it has learnt to identify the attributes.
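
Read as an architecture, the abstract admits a compact sketch. The following is a minimal, hypothetical PyTorch rendering of that description, not the authors' released code: the class name, layer widths, the use of pre-extracted 2048-d image features, and the 312-attribute count (the size of CUB's attribute annotation) are all assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeFewShotModel(nn.Module):
    """Visual VAE + semantic autoencoder sharing one reconstruction decoder."""

    def __init__(self, img_dim=2048, n_attrs=312, z_vis=64):
        super().__init__()
        # Visual branch: a VAE over (pre-extracted) image features.
        self.vis_enc = nn.Linear(img_dim, 256)
        self.mu = nn.Linear(256, z_vis)
        self.logvar = nn.Linear(256, z_vis)
        # Semantic branch: a plain autoencoder whose latent is pushed towards
        # the binary attribute vector by the semantic loss below.
        self.sem_enc = nn.Linear(img_dim, n_attrs)
        # Decoder reconstructs the input from the concatenated latents.
        self.dec = nn.Sequential(
            nn.Linear(z_vis + n_attrs, 256), nn.ReLU(), nn.Linear(256, img_dim)
        )

    def forward(self, x):
        h = F.relu(self.vis_enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z_vis = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        z_sem = torch.sigmoid(self.sem_enc(x))  # in [0, 1], one unit per attribute
        recon = self.dec(torch.cat([z_vis, z_sem], dim=1))
        return recon, mu, logvar, z_sem

def loss_fn(x, attrs, recon, mu, logvar, z_sem):
    rec = F.mse_loss(recon, x)                                      # reconstruction
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # VAE KL term
    sem = F.binary_cross_entropy(z_sem, attrs)                      # semantic loss
    return rec + kld + sem

# Toy usage: a batch of 8 feature vectors with 0/1 attribute annotations.
model = AttributeFewShotModel()
x = torch.randn(8, 2048)
attrs = torch.randint(0, 2, (8, 312)).float()
recon, mu, logvar, z_sem = model(x)
loss = loss_fn(x, attrs, recon, mu, logvar, z_sem)

Because each semantic latent unit is trained against one named attribute rather than an embedding of it, the attributes that drove a prediction can be read directly off the largest activations of z_sem, which is the interpretability property the abstract claims.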

Bibliographic Details
Main Authors: Chang, Jun Qing, Rajan, Deepu, Vun, Nicholas
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2024
Subjects: Computer and Information Science; Multimodal learning; Few-shot classification
Online Access:https://hdl.handle.net/10356/175469
Institution: Nanyang Technological University
Citation: Chang, J. Q., Rajan, D. & Vun, N. (2024). Multimodal few-shot classification without attribute embedding. EURASIP Journal on Image and Video Processing, 2024(1), 4. https://dx.doi.org/10.1186/s13640-024-00620-9
ISSN: 1687-5281
DOI: 10.1186/s13640-024-00620-9
Scopus: 2-s2.0-85181849777
Version: Published version, deposited 2024-04-24
Rights: © The Author(s) 2024. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as appropriate credit is given to the original author(s) and the source, a link to the licence is provided, and any changes are indicated. Material not covered by the licence or whose intended use exceeds statutory regulation requires permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.