Resource-efficient learning for vision-capable neural models

The emblem of human intelligence is the ability to address a new task by applying relevant knowledge learned from previous tasks. As a result, humans require only a minimal number of examples from the new task to adapt to it. In contrast, deep learning models still lag behind humans in achieving...

Full description

Saved in:
Bibliographic Details
Main Author: Tiong, Anthony Meng Huat
Other Authors: Li Boyang
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects:
Online Access:https://hdl.handle.net/10356/174637
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-174637
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
Visual language models
Computer vision
Low-resource learning
Long-tailed classification
Visual question answering
Task relationship
description The emblem of human intelligence is the ability to address a new task by applying relevant knowledge learned from previous tasks. As a result, humans require only a minimal number of examples from the new task to adapt to it. In contrast, deep learning models still lag behind humans in generalizing to a new task for which only limited data are available. We term this learning setting resource-efficient learning. In this thesis, we explore resource-efficient problem formulations for vision-capable deep learning models. We begin by investigating vision-only neural models, with application to long-tailed image classification, where training samples are scarce for tail classes but abundant for head classes. This imbalance in the training distribution makes it difficult to learn good tail-class representations. We propose an interpolative centroid contrastive learning (ICCL) approach that improves tail-class representations by leveraging the abundant head-class samples: we create interpolated samples between a head class and a tail class and optimize the representation with a new interpolative centroid contrastive loss. We demonstrate the effectiveness of ICCL on multiple long-tailed evaluation datasets. Next, we extend our study to visual language models (VLMs), which involve both the image and text modalities. We investigate zero-shot VQA, which denies VLMs access to any VQA training samples. We devise PNP-VQA, a modular framework that performs zero-shot VQA without any training. It uses natural language and a network interpretability technique as the interface for combining several pretrained models. Concretely, we first generate multiple question-guided captions by attending to image patches relevant to the question, and then feed the captions as context to a pretrained language model that answers the question. Because the question-guided captions capture detailed visual attributes and often contain the answer words, they help the question answering model arrive at the correct answer. PNP-VQA achieves state-of-the-art results on multiple VQA benchmarks. We conclude by investigating zero-shot evaluation for VLMs. It is crucial that the zero-shot performance of VLMs on test tasks reflects their true generalization capability, so that VLMs can be compared fairly and their progress tracked. When a test task is highly similar to a VLM's training tasks, that VLM is likely to outperform VLMs without such similarity. We therefore perform transfer learning experiments to study the similarity between training and test tasks, which is not considered in current VLM evaluations. Additionally, we discover the underlying vision-language (VL) skills directly from the data by applying factor analysis to the transfer performance, and we demonstrate that factor analysis is an effective data-driven approach that identifies reasonable yet surprising VL skills. Furthermore, we address the lack of VL benchmarks focused on evaluating VLMs in the wild by proposing a new benchmark, OLIVE, which simulates the diverse queries that users pose to VLMs in practical, real-world scenarios.
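The abstract describes three technical components: the interpolative centroid contrastive loss, the PNP-VQA pipeline, and factor analysis over transfer performance. The sketches below are illustrative only and are not taken from the thesis; every function, attribute, and default value that is not part of a standard library (e.g. interpolative_centroid_contrastive_loss, relevant_patches, caption, answer, lam=0.6) is an assumption.

A minimal sketch of how an interpolative centroid contrastive loss could be written, assuming feature-space interpolation between one head-class and one tail-class sample and a mixup-style soft target over class centroids:

```python
import torch
import torch.nn.functional as F

def interpolative_centroid_contrastive_loss(feat_head, feat_tail, centroids,
                                             head_label, tail_label,
                                             lam=0.6, temperature=0.1):
    """Toy interpolative centroid contrastive loss (illustrative, not the thesis code).

    feat_head, feat_tail: L2-normalized features of one head-class and one
    tail-class image, shape [D]. centroids: class centroids, shape [C, D].
    lam: interpolation coefficient. Shapes and defaults are assumptions.
    """
    # Interpolate a virtual sample between the head and tail features.
    feat_mix = F.normalize(lam * feat_head + (1.0 - lam) * feat_tail, dim=0)

    # Similarity of the interpolated feature to every class centroid.
    logits = centroids @ feat_mix / temperature          # shape [C]
    log_prob = F.log_softmax(logits, dim=0)

    # Pull the interpolated feature toward both source centroids, weighted
    # by the mixing coefficient (a mixup-style soft target).
    return -(lam * log_prob[head_label] + (1.0 - lam) * log_prob[tail_label])
```

A skeleton of a plug-and-play zero-shot VQA pipeline in the spirit of PNP-VQA, where three pretrained components are combined without VQA training; the component interfaces and method names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class PnpVqaPipeline:
    """Illustrative skeleton of a plug-and-play zero-shot VQA pipeline."""
    matcher: object      # image-question matching model with an interpretability method
    captioner: object    # pretrained image captioning model
    reader: object       # pretrained question answering language model

    def answer(self, image, question, n_captions=20):
        # 1. Use an interpretability technique to find image patches
        #    relevant to the question.
        patches = self.matcher.relevant_patches(image, question)
        # 2. Generate several captions conditioned on those patches, so the
        #    captions mention question-relevant visual details.
        captions = [self.captioner.caption(image, patches) for _ in range(n_captions)]
        # 3. Feed the captions as textual context to the QA language model.
        context = " ".join(captions)
        return self.reader.answer(question=question, context=context)
```

A small sketch of factor analysis applied to a transfer-performance matrix, using scikit-learn; the matrix here is a random placeholder, whereas the thesis obtains it from transfer learning experiments:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# transfer[i, j]: performance on test task j after fine-tuning on training
# task i (placeholder values).
rng = np.random.default_rng(0)
transfer = rng.random((15, 10))

fa = FactorAnalysis(n_components=4, random_state=0)
fa.fit(transfer)

# Each row of the loading matrix is a latent factor; test tasks with large
# loadings on the same factor plausibly draw on the same VL skill.
print(fa.components_.round(2))
```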
author2 Li Boyang
author_facet Li Boyang
Tiong, Anthony Meng Huat
format Thesis-Doctor of Philosophy
author Tiong, Anthony Meng Huat
author_sort Tiong, Anthony Meng Huat
title Resource-efficient learning for vision-capable neural models
publisher Nanyang Technological University
publishDate 2024
url https://hdl.handle.net/10356/174637
_version_ 1806059779305504768
spelling sg-ntu-dr.10356-174637 2024-05-03T02:58:53Z Resource-efficient learning for vision-capable neural models Tiong, Anthony Meng Huat Li Boyang School of Computer Science and Engineering Salesforce AI Research Singapore Economic Development Board boyang.li@ntu.edu.sg Doctor of Philosophy 2024-04-05T03:03:31Z 2024-04-05T03:03:31Z 2024 Thesis-Doctor of Philosophy Tiong, A. M. H. (2024). Resource-efficient learning for vision-capable neural models. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/174637 https://hdl.handle.net/10356/174637 10.32657/10356/174637 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University