Multimodal few-shot classification without attribute embedding
Multimodal few-shot learning aims to exploit complementary information inherent in multiple modalities for vision tasks in low-data scenarios. Most current research focuses on finding a suitable embedding space for the various modalities. While embedding-based solutions provide state-of-the-art results, they reduce the interpretability of the model, and separate visualization approaches are then needed to make such models more transparent. In this paper, a multimodal few-shot learning framework that is inherently interpretable is presented. This is achieved by using the textual modality in the form of attributes without embedding them, which enables the model to directly explain which attributes caused it to classify an image into a particular class. The model consists of a variational autoencoder that learns the visual latent representation, combined with a semantic latent representation learnt by a standard autoencoder that computes a semantic loss between its latent representation and a binary attribute vector. A decoder reconstructs the original image from the concatenated latent vectors. The proposed model outperforms other multimodal methods when all test classes are used, e.g., 50 classes in a 50-way 1-shot setting, and is comparable for smaller numbers of ways. Since raw text attributes are required, the model is evaluated on the CUB, SUN and AWA2 datasets. The effectiveness of the interpretability provided by the model is evaluated by analyzing how well it has learnt to identify the attributes.
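The architecture described in the abstract lends itself to a compact sketch. The following is a minimal, hypothetical PyTorch rendering, not the authors' released code: the layer sizes, the use of pre-extracted image features instead of raw pixels, the BCE form of the semantic loss, the loss weights `beta`/`gamma`, and folding the semantic autoencoder's own decoder into the shared image decoder are all assumptions made for illustration (312 is the number of binary attributes in CUB).

```python
# Minimal sketch (assumed details, not the paper's code): a VAE encodes the
# visual latent, a plain autoencoder produces a semantic latent trained
# against a binary attribute vector, and a decoder reconstructs the input
# from the concatenated latents. For brevity it operates on pre-extracted
# image features (e.g., 2048-d ResNet features) rather than raw pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualVAEEncoder(nn.Module):
    def __init__(self, feat_dim=2048, z_dim=64):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 512)
        self.mu = nn.Linear(512, z_dim)
        self.logvar = nn.Linear(512, z_dim)

    def forward(self, x):
        h = F.relu(self.fc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar

class SemanticEncoder(nn.Module):
    """Plain (non-variational) encoder; each sigmoid unit of its latent is
    trained to match one binary attribute, so the latent stays interpretable."""
    def __init__(self, feat_dim=2048, n_attrs=312):  # 312 = CUB's attribute count
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                 nn.Linear(512, n_attrs), nn.Sigmoid())

    def forward(self, x):
        return self.enc(x)

class Decoder(nn.Module):
    def __init__(self, z_dim=64, n_attrs=312, feat_dim=2048):
        super().__init__()
        self.dec = nn.Sequential(nn.Linear(z_dim + n_attrs, 512), nn.ReLU(),
                                 nn.Linear(512, feat_dim))

    def forward(self, z_vis, z_sem):
        # reconstruct the input from the concatenated visual + semantic latents
        return self.dec(torch.cat([z_vis, z_sem], dim=-1))

def total_loss(x, x_rec, mu, logvar, z_sem, attrs, beta=1.0, gamma=1.0):
    rec = F.mse_loss(x_rec, x)                                      # reconstruction
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # VAE KL term
    sem = F.binary_cross_entropy(z_sem, attrs)                      # semantic loss
    return rec + beta * kld + gamma * sem                           # weights assumed

# Toy forward pass with random data, just to show the wiring.
x = torch.randn(8, 2048)                       # batch of image features
attrs = torch.randint(0, 2, (8, 312)).float()  # binary attribute vectors
venc, senc, dec = VisualVAEEncoder(), SemanticEncoder(), Decoder()
z_vis, mu, logvar = venc(x)
z_sem = senc(x)
loss = total_loss(x, dec(z_vis, z_sem), mu, logvar, z_sem, attrs)
```

Because each unit of `z_sem` is trained against a named attribute, thresholding its activations yields a direct, human-readable account of why an image was assigned to a class, which is the interpretability property the abstract claims.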
Main Authors: | Chang, Jun Qing; Rajan, Deepu; Vun, Nicholas
---|---
Other Authors: | School of Computer Science and Engineering
Format: | Article
Language: | English
Published: | 2024
Subjects: | Computer and Information Science; Multimodal learning; Few-shot classification
Online Access: | https://hdl.handle.net/10356/175469
Institution: | Nanyang Technological University
id: sg-ntu-dr.10356-175469
record_format: dspace
spelling:
Record: sg-ntu-dr.10356-175469 (last modified 2024-04-26T15:38:56Z)
Title: Multimodal few-shot classification without attribute embedding
Authors: Chang, Jun Qing; Rajan, Deepu; Vun, Nicholas
Affiliation: School of Computer Science and Engineering
Subjects: Computer and Information Science; Multimodal learning; Few-shot classification
Version: Published version
Dates: deposited 2024-04-24T07:42:29Z; published 2024
Type: Journal Article
Citation: Chang, J. Q., Rajan, D. & Vun, N. (2024). Multimodal few-shot classification without attribute embedding. EURASIP Journal on Image and Video Processing, 2024(1), 4. https://dx.doi.org/10.1186/s13640-024-00620-9
ISSN: 1687-5281
Handle: https://hdl.handle.net/10356/175469
DOI: 10.1186/s13640-024-00620-9
Scopus: 2-s2.0-85181849777
Source: EURASIP Journal on Image and Video Processing, vol. 2024, no. 1, article 4 (in English)
Rights: © The Author(s) 2024. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third-party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
File format: application/pdf
institution: Nanyang Technological University
building: NTU Library
continent: Asia
country: Singapore
content_provider: NTU Library
collection: DR-NTU
language: English
topic: Computer and Information Science; Multimodal learning; Few-shot classification
author2: School of Computer Science and Engineering
format: Article
author: Chang, Jun Qing; Rajan, Deepu; Vun, Nicholas
author_sort: Chang, Jun Qing
title: Multimodal few-shot classification without attribute embedding
publishDate: 2024
url: https://hdl.handle.net/10356/175469