Resource-efficient learning for vision-capable neural models

The emblem of human intelligence is the ability to address a new task by applying relevant knowledge learned from previous tasks. As a result, humans require only a minimal number of examples from the new task to adapt to it. In contrast, deep learning models still lag behind humans in achieving...

Full description

Saved in:
Bibliographic Details
Main Author: Tiong, Anthony Meng Huat
Other Authors: Li Boyang
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects:
Online Access:https://hdl.handle.net/10356/174637
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-174637
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
Visual language models
Computer vision
Low-resource learning
Long-tailed classification
Visual question answering
Task relationship
description The emblem of human intelligence is the ability to address a new task by applying relevant knowledge learned from previous tasks. As a result, humans require only a minimal number of examples from the new task to adapt to it. In contrast, deep learning models still lag behind humans in generalizing to a new task for which only limited data are available. We term this learning setting resource-efficient learning. In this thesis, we explore resource-efficient problem formulations for vision-capable deep learning models. We begin by investigating vision-only neural models, with application to long-tailed image classification, where training samples are scarce for tail classes but abundant for head classes. This imbalance in the training distribution makes it difficult to learn good tail-class representations. We propose an interpolative centroid contrastive learning (ICCL) approach that improves tail-class representations by leveraging the abundant head-class samples: we create interpolated samples between a head class and a tail class and optimize the representation with a new interpolative centroid contrastive loss. We demonstrate the effectiveness of ICCL on multiple long-tailed evaluation datasets. Next, we extend our study to visual language models (VLMs), which involve both the image and text modalities. We investigate zero-shot VQA, which denies VLMs access to any VQA training samples. We devise PNP-VQA, a modular framework that performs zero-shot VQA without any training. It uses natural language and a network interpretability technique as the interface for combining several pretrained models. Concretely, we first generate multiple question-guided captions by attending to image patches relevant to the question, and then feed the captions as context to a pretrained language model that answers the question. Because the question-guided captions capture detailed visual attributes and often contain the answer words, they help the question answering model arrive at the correct answer. PNP-VQA achieves state-of-the-art results on multiple VQA benchmarks. We conclude by investigating zero-shot evaluation for VLMs. It is crucial that the zero-shot performance of VLMs on test tasks reflects their true generalization capability, so that VLMs can be compared fairly and their progress tracked. When a test task is highly similar to a VLM's training tasks, that VLM is likely to outperform VLMs without such similarity. We therefore perform transfer learning experiments to study the similarity between training and test tasks, which is not considered in current VLM evaluations. Additionally, we discover the underlying vision-language (VL) skills directly from the data by applying factor analysis to the transfer performance, and we demonstrate that factor analysis is an effective data-driven approach that identifies reasonable yet surprising VL skills. Furthermore, we address the lack of VL benchmarks focused on evaluating VLMs in the wild by proposing a new benchmark, OLIVE, which simulates the diverse queries that users pose to VLMs in practical, real-world scenarios.
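The abstract describes three technical components: the interpolative centroid contrastive loss, the PNP-VQA pipeline, and factor analysis over transfer performance. The sketches below are illustrative only and are not taken from the thesis; every function, attribute, and default value that is not part of a standard library (e.g. interpolative_centroid_contrastive_loss, relevant_patches, caption, answer, lam=0.6) is an assumption.

A minimal sketch of how an interpolative centroid contrastive loss could be written, assuming feature-space interpolation between one head-class and one tail-class sample and a mixup-style soft target over class centroids:

```python
import torch
import torch.nn.functional as F

def interpolative_centroid_contrastive_loss(feat_head, feat_tail, centroids,
                                             head_label, tail_label,
                                             lam=0.6, temperature=0.1):
    """Toy interpolative centroid contrastive loss (illustrative, not the thesis code).

    feat_head, feat_tail: L2-normalized features of one head-class and one
    tail-class image, shape [D]. centroids: class centroids, shape [C, D].
    lam: interpolation coefficient. Shapes and defaults are assumptions.
    """
    # Interpolate a virtual sample between the head and tail features.
    feat_mix = F.normalize(lam * feat_head + (1.0 - lam) * feat_tail, dim=0)

    # Similarity of the interpolated feature to every class centroid.
    logits = centroids @ feat_mix / temperature          # shape [C]
    log_prob = F.log_softmax(logits, dim=0)

    # Pull the interpolated feature toward both source centroids, weighted
    # by the mixing coefficient (a mixup-style soft target).
    return -(lam * log_prob[head_label] + (1.0 - lam) * log_prob[tail_label])
```

A skeleton of a plug-and-play zero-shot VQA pipeline in the spirit of PNP-VQA, where three pretrained components are combined without VQA training; the component interfaces and method names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class PnpVqaPipeline:
    """Illustrative skeleton of a plug-and-play zero-shot VQA pipeline."""
    matcher: object      # image-question matching model with an interpretability method
    captioner: object    # pretrained image captioning model
    reader: object       # pretrained question answering language model

    def answer(self, image, question, n_captions=20):
        # 1. Use an interpretability technique to find image patches
        #    relevant to the question.
        patches = self.matcher.relevant_patches(image, question)
        # 2. Generate several captions conditioned on those patches, so the
        #    captions mention question-relevant visual details.
        captions = [self.captioner.caption(image, patches) for _ in range(n_captions)]
        # 3. Feed the captions as textual context to the QA language model.
        context = " ".join(captions)
        return self.reader.answer(question=question, context=context)
```

A small sketch of factor analysis applied to a transfer-performance matrix, using scikit-learn; the matrix here is a random placeholder, whereas the thesis obtains it from transfer learning experiments:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# transfer[i, j]: performance on test task j after fine-tuning on training
# task i (placeholder values).
rng = np.random.default_rng(0)
transfer = rng.random((15, 10))

fa = FactorAnalysis(n_components=4, random_state=0)
fa.fit(transfer)

# Each row of the loading matrix is a latent factor; test tasks with large
# loadings on the same factor plausibly draw on the same VL skill.
print(fa.components_.round(2))
```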
author2 Li Boyang
author_facet Li Boyang
Tiong, Anthony Meng Huat
format Thesis-Doctor of Philosophy
author Tiong, Anthony Meng Huat
author_sort Tiong, Anthony Meng Huat
title Resource-efficient learning for vision-capable neural models
publisher Nanyang Technological University
publishDate 2024
url https://hdl.handle.net/10356/174637
_version_ 1806059779305504768
spelling sg-ntu-dr.10356-174637 2024-05-03T02:58:53Z Resource-efficient learning for vision-capable neural models Tiong, Anthony Meng Huat Li Boyang School of Computer Science and Engineering Salesforce AI Research Singapore Economic Development Board boyang.li@ntu.edu.sg Doctor of Philosophy 2024-04-05T03:03:31Z 2024-04-05T03:03:31Z 2024 Thesis-Doctor of Philosophy Tiong, A. M. H. (2024). Resource-efficient learning for vision-capable neural models. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/174637 https://hdl.handle.net/10356/174637 10.32657/10356/174637 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University