Towards unbiased, accurate and robust fine-tuning of zero-shot vision models
Format: Thesis (Doctor of Philosophy)
Language: English
Published: Nanyang Technological University, 2024
Online Access: https://hdl.handle.net/10356/181746
Institution: Nanyang Technological University
Summary: A foundational objective of machine learning is to create models that are (1) unbiased, ensuring fair predictions across different classes; (2) accurate, excelling in in-distribution (target) environments; and (3) robust, achieving high performance even under distribution shifts. Recently, vision models pre-trained with language supervision on large-scale data have enabled zero-shot inference through prompting. Such zero-shot models have demonstrated unprecedented robustness across a broad range of distributions. However, the pre-training data often exhibit a skewed label distribution, which contributes to the poor performance of zero-shot models on less frequent classes. Additionally, zero-shot models remain inaccurate on several domain-specific tasks, such as differentiating between car models, flower species, and aircraft variants. Therefore, it is common practice to boost accuracy and correct imbalanced predictions by fine-tuning on downstream labeled data.
However, fine-tuning with few-shot samples sometimes leads to over-fitting, making these models underperform compared to zero-shot models. Moreover, even with abundant downstream data, fine-tuning often comes at the cost of robustness: fine-tuned models readily exploit spurious correlations that hold only on the downstream distribution, resulting in lower performance under distribution shifts compared to zero-shot models. This raises a natural question:
Can fine-tuned zero-shot models achieve unbiased, accurate, and robust predictions all at once?
In this thesis, we answer this question affirmatively through three comprehensive studies.
• To achieve unbiased predictions, we propose Generalized Logit Adjustment (GLA), a simple post-hoc method that removes the label-distribution bias of the zero-shot model by estimating the label distribution of the pre-training dataset. Notably, direct access to pre-training data is often restricted due to privacy or copyright concerns; instead, we use only the downstream data and the zero-shot model to derive an unbiased zero-shot model. Moreover, we prove non-asymptotic convergence guarantees for the label-distribution estimation and demonstrate that ensembling the debiased zero-shot model with an off-the-shelf fine-tuned model yields the Bayes-optimal classifier. A minimal sketch of the logit-adjustment step appears after this list.
• To avoid over-fitting in few-shot adaptation, we present Prompt-aligned Gradient, dubbed ProGrad, which prevents fine-tuning from forgetting the general knowledge of zero-shot models. By leveraging knowledge from the pre-training data to regularize fine-tuning on a specific distribution, ProGrad is robust to distribution shifts. We further justify the proposed method by showing that it offers a lower generalization error bound than plain fine-tuning. A sketch of the gradient-projection rule appears after this list.
• To resolve the undesirable trade-off that persists in prevailing fine-tuning methods, where out-of-distribution (OOD) robustness is at odds with in-distribution (ID) accuracy, we propose a sample-wise ensembling technique that simultaneously attains the best performance on ID and OOD data without trade-offs. Our theoretical analysis shows that it effectively minimizes the variance of the ensemble models, resulting in reduced residual error. A sketch of sample-wise ensembling appears after this list.
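As a rough illustration of the GLA idea from the first study, the sketch below debiases zero-shot logits by subtracting the log of an estimated pre-training label prior. The estimator shown (averaging zero-shot softmax predictions over downstream samples) is a simplified stand-in assumed for illustration; the thesis derives a principled estimator with non-asymptotic convergence guarantees. All function and variable names here are hypothetical.

```python
import torch

def estimate_pretrain_prior(zero_shot_logits: torch.Tensor) -> torch.Tensor:
    """Crude stand-in estimator of the pre-training label prior.

    Averages zero-shot softmax predictions over a batch of downstream
    samples; the thesis uses a principled estimator with non-asymptotic
    convergence guarantees instead.
    """
    probs = torch.softmax(zero_shot_logits, dim=-1)   # (N, C)
    return probs.mean(dim=0)                          # (C,)

def debias_logits(zero_shot_logits: torch.Tensor,
                  prior: torch.Tensor) -> torch.Tensor:
    """Logit-adjustment-style correction: subtract the log of the
    estimated pre-training label prior from every logit."""
    return zero_shot_logits - torch.log(prior + 1e-12)

# Toy usage: 4 downstream samples, 3 classes.
logits = torch.randn(4, 3)
prior = estimate_pretrain_prior(logits)
unbiased_preds = debias_logits(logits, prior).argmax(dim=-1)
```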
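The ProGrad idea from the second study can be sketched as a gradient-projection rule: when the task (fine-tuning) gradient conflicts with a "general knowledge" gradient derived from the zero-shot model, the conflicting component is projected out. The sketch below assumes flattened 1-D gradients and leaves out how the general gradient is obtained (e.g., from a regularization term toward zero-shot predictions).

```python
import torch

def prograd_direction(g_task: torch.Tensor,
                      g_general: torch.Tensor) -> torch.Tensor:
    """Prompt-aligned gradient rule (sketch).

    Keep the task gradient as-is when it agrees with the general
    (zero-shot-knowledge) gradient; otherwise remove the component
    that points against the general direction.
    """
    dot = torch.dot(g_task, g_general)
    if dot >= 0:
        return g_task
    # Conflict: project out the component opposing the general gradient.
    return g_task - dot / g_general.norm().pow(2) * g_general

# Toy usage with flattened gradients of a prompt parameter.
g_task = torch.tensor([1.0, -2.0])
g_general = torch.tensor([0.5, 1.0])
update = prograd_direction(g_task, g_general)  # conflict resolved
```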
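For the third study, a minimal sample-wise ensembling sketch: rather than a single global mixing weight, each test sample receives its own weight for combining zero-shot and fine-tuned outputs. The confidence-based weighting below is an illustrative heuristic, not the exact variance-minimizing rule analyzed in the thesis.

```python
import torch

def sample_wise_ensemble(zs_logits: torch.Tensor,
                         ft_logits: torch.Tensor) -> torch.Tensor:
    """Combine zero-shot and fine-tuned predictions per sample.

    Weights each model by its own softmax confidence (max probability)
    on that sample -- an illustrative stand-in for the variance-minimizing
    weights analyzed in the thesis.
    """
    zs_probs = torch.softmax(zs_logits, dim=-1)
    ft_probs = torch.softmax(ft_logits, dim=-1)
    zs_conf = zs_probs.max(dim=-1, keepdim=True).values  # (N, 1)
    ft_conf = ft_probs.max(dim=-1, keepdim=True).values  # (N, 1)
    w = zs_conf / (zs_conf + ft_conf)                    # per-sample weight
    return w * zs_probs + (1 - w) * ft_probs

# Toy usage: 4 samples, 3 classes.
out = sample_wise_ensemble(torch.randn(4, 3), torch.randn(4, 3))
```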
The three proposed methods are independent and can be combined to create fine-tuned models that are unbiased, accurate, and robust. These methods have been thoroughly evaluated in real-world settings, including many-shot learning with abundant data, few-shot learning, and long-tail classification, a challenging scenario that combines elements of both many-shot and few-shot data. In all these settings, the methods consistently deliver unbiased predictions and achieve state-of-the-art accuracy and robustness.