Resource-efficient learning for vision-capable neural models
Main Author:
Other Authors:
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2024
Subjects:
Online Access: https://hdl.handle.net/10356/174637
Institution: Nanyang Technological University
Summary: The emblem of human intelligence is the ability to address a new task by applying relevant knowledge learned from previous tasks. Hence, humans require only a minimal number of examples from the new task during adaptation. In contrast, deep learning models still lag behind humans in achieving such remarkable generalization to a new task for which data are limited. This learning setting is termed resource-efficient learning.
In this thesis, we explore resource-efficient problem formulations for vision-capable deep learning models.
We begin by investigating vision-only neural models, with application to long-tailed image classification. In long-tailed image classification, training samples are scarce for tail classes but abundant for head classes. This imbalance in the training distribution makes it difficult to learn good tail-class representations.
We propose an interpolative centroid contrastive learning (ICCL) approach that encourages the learning of tail-class representations by leveraging the abundant head-class samples. We create interpolated samples between head and tail classes and optimize the representation using a new interpolative centroid contrastive loss. We demonstrate the effectiveness of ICCL on multiple long-tailed evaluation datasets.
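As a rough illustration of this idea, the following is a minimal PyTorch sketch of an interpolative centroid contrastive loss. It assumes feature-level mixing of one head-class and one tail-class sample and a set of per-class centroids; the function names, the mixing site, and the weighting scheme are illustrative assumptions rather than the exact ICCL formulation.

```python
import torch
import torch.nn.functional as F

def interpolative_centroid_contrastive_loss(
    feat_head,      # (D,) encoder feature of a head-class sample
    feat_tail,      # (D,) encoder feature of a tail-class sample
    centroids,      # (C, D) one centroid per class (e.g. an EMA of class features)
    head_label,     # int, class index of the head sample
    tail_label,     # int, class index of the tail sample
    lam=0.5,        # interpolation coefficient
    tau=0.1,        # temperature
):
    # Interpolated representation between the head and tail samples.
    feat_mix = F.normalize(lam * feat_head + (1.0 - lam) * feat_tail, dim=-1)
    centroids = F.normalize(centroids, dim=-1)

    # Similarity of the interpolated feature to every class centroid.
    log_prob = F.log_softmax(feat_mix @ centroids.T / tau, dim=-1)  # (C,)

    # Pull the interpolated feature toward both source-class centroids,
    # weighted by the interpolation coefficient.
    return -(lam * log_prob[head_label] + (1.0 - lam) * log_prob[tail_label])

# Example usage with random features (illustrative only).
loss = interpolative_centroid_contrastive_loss(
    torch.randn(128), torch.randn(128), torch.randn(100, 128),
    head_label=3, tail_label=97,
)
```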
Next, we extend our study to visual language models (VLMs), which involve both image and text modalities. We investigate zero-shot visual question answering (VQA), in which VLMs have no access to any VQA training samples.
We devise a modular framework, PNP-VQA, which performs zero-shot VQA and requires no training. We use natural language and a network interpretability technique as the interface for combining several pretrained models. Concretely, we first generate multiple question-guided captions by attending to image patches relevant to the question, and then feed the captions as context to a pretrained language model that answers the question. The question-guided captions capture detailed visual attributes and often contain the answer words, which helps the question answering model arrive at the correct answer. PNP-VQA achieves state-of-the-art results on multiple VQA benchmarks.
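Below is a hedged sketch of this caption-then-answer idea using off-the-shelf Hugging Face models. The chosen checkpoints (Salesforce/blip-image-captioning-base, google/flan-t5-base), the prompt format, and the sampling settings are illustrative stand-ins, and the question-guided patch selection step of PNP-VQA (which uses an interpretability technique to focus captioning on question-relevant patches) is omitted for brevity.

```python
from PIL import Image
from transformers import (AutoProcessor, AutoTokenizer,
                          AutoModelForSeq2SeqLM, BlipForConditionalGeneration)

# Illustrative stand-in models: a captioner to describe the image and a
# text-to-text language model to answer from the captions.
cap_proc = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
qa_tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
qa_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def zero_shot_vqa(image: Image.Image, question: str, num_captions: int = 5) -> str:
    # 1) Sample several diverse captions to serve as textual context for the image.
    #    (PNP-VQA additionally conditions captioning on question-relevant patches.)
    cap_inputs = cap_proc(images=image, return_tensors="pt")
    captions = []
    for _ in range(num_captions):
        caption_ids = cap_model.generate(**cap_inputs, do_sample=True,
                                         top_p=0.9, max_new_tokens=30)
        captions.append(cap_proc.decode(caption_ids[0], skip_special_tokens=True))

    # 2) Feed the captions as context to a pretrained language model to answer.
    prompt = f"Context: {' '.join(captions)}\nQuestion: {question}\nShort answer:"
    qa_inputs = qa_tok(prompt, return_tensors="pt")
    answer_ids = qa_model.generate(**qa_inputs, max_new_tokens=10)
    return qa_tok.decode(answer_ids[0], skip_special_tokens=True)

# answer = zero_shot_vqa(Image.open("photo.jpg"), "What color is the car?")
```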
We conclude by investigating the zero-shot evaluation of VLMs. It is crucial that the zero-shot performance of VLMs on test tasks reflects their true generalization capability, so that VLMs can be compared fairly and their progress tracked. When a test task is highly similar to a VLM's training tasks, that VLM is likely to score higher than VLMs without such similarity. Therefore, we perform transfer learning experiments to study the similarity between training and test tasks, which existing VLM evaluations do not take into account. Additionally, we discover the underlying vision-language (VL) skills directly from the data by applying factor analysis to the transfer performance. We demonstrate that factor analysis is an effective data-driven approach that identifies reasonable yet surprising VL skills.
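As a small illustration of this data-driven approach, the sketch below applies scikit-learn's FactorAnalysis with a varimax rotation to a synthetic matrix of transfer scores; the matrix shape, the number of factors, and the random data are assumptions made only for the example, not the thesis's actual experimental setup.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical transfer-performance matrix: rows are training tasks, columns
# are test tasks, and each entry is a transfer score (synthetic here).
rng = np.random.default_rng(0)
scores = rng.normal(size=(12, 8))                  # 12 training x 8 test tasks
scores = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# Fit a small number of latent factors; each factor is then interpreted as a
# candidate vision-language skill shared across tasks.
fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=0)
fa.fit(scores)

# Loadings indicate which test tasks rely on which latent skill.
loadings = fa.components_.T                        # (n_test_tasks, n_factors)
print(np.round(loadings, 2))
```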
Furthermore, we address the lack of VL benchmarks that focus on evaluating VLMs in the wild by proposing a new benchmark, OLIVE, which simulates the diverse queries users pose to VLMs in practical, real-world scenarios.