Towards unbiased, accurate and robust fine-tuning of zero-shot vision models
Format: Thesis (Doctor of Philosophy)
Language: English
Published: Nanyang Technological University, 2024
Online Access: https://hdl.handle.net/10356/181746
Institution: Nanyang Technological University
Summary: A foundational objective of machine learning is to create models that are (1) unbiased, ensuring fair predictions across different classes; (2) accurate, excelling in in-distribution (target) environments; and (3) robust, achieving high performance even under distribution shifts. Recently, vision models pre-trained with language supervision on large-scale data have enabled zero-shot inference through prompting. Such zero-shot models have demonstrated unprecedented robustness across a broad range of distributions. However, the pre-training data often exhibit a skewed label distribution, which contributes to the poor performance of zero-shot models on less frequent classes. Additionally, zero-shot models remain inaccurate on several domain-specific tasks, such as differentiating between car models, flower species, and aircraft variants. Therefore, it is common practice to boost accuracy and correct imbalanced predictions by fine-tuning on downstream labeled data.
However, fine-tuning with few-shot samples sometimes leads to over-fitting, making these models underperform compared to zero-shot models. Moreover, even with abundant downstream data, fine-tuning often comes at the cost of robustness: fine-tuned models readily exploit spurious correlations that hold only on the downstream distribution, resulting in lower performance under distribution shifts compared to zero-shot models. This raises a natural question:
Can fine-tuned zero-shot models achieve unbiased, accurate, and robust predictions all at once?
In this thesis, we answer this question affirmatively through three comprehensive studies.
• To achieve unbiased predictions, we propose Generalized Logit Adjustment (GLA), a simple post-hoc method that removes the label-distribution bias of the zero-shot model by estimating the label distribution of the pre-training dataset. Notably, direct access to pre-training data is often restricted due to privacy or copyright concerns; instead, we use only the downstream data and the zero-shot model to derive an unbiased zero-shot model. Moreover, we prove non-asymptotic convergence guarantees for the label-distribution estimation and demonstrate that ensembling the debiased zero-shot model with an off-the-shelf fine-tuned model yields the Bayes-optimal classifier. A minimal sketch of the logit-adjustment step appears after this list.
• To avoid over-fitting in few-shot adaptation, we present Prompt-aligned Gradient, dubbed ProGrad, which prevents fine-tuning from forgetting the general knowledge of zero-shot models. By leveraging knowledge from the pre-training data to regularize fine-tuning on a specific distribution, ProGrad is robust to distribution shifts. We further justify the proposed method by showing that it offers a lower generalization error bound than plain fine-tuning. A sketch of the gradient-projection rule appears after this list.
• To resolve the undesirable trade-off that persists in prevailing fine-tuning methods, where out-of-distribution (OOD) robustness is at odds with in-distribution (ID) accuracy, we propose a sample-wise ensembling technique that simultaneously attains the best performance on ID and OOD data without trade-offs. Our theoretical analysis shows that it effectively minimizes the variance of the ensemble models, resulting in reduced residual error. A sketch of sample-wise ensembling appears after this list.
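As a rough illustration of the GLA idea from the first study, the sketch below debiases zero-shot logits by subtracting the log of an estimated pre-training label prior. The estimator shown (averaging zero-shot softmax predictions over downstream samples) is a simplified stand-in assumed for illustration; the thesis derives a principled estimator with non-asymptotic convergence guarantees. All function and variable names here are hypothetical.

```python
import torch

def estimate_pretrain_prior(zero_shot_logits: torch.Tensor) -> torch.Tensor:
    """Crude stand-in estimator of the pre-training label prior.

    Averages zero-shot softmax predictions over a batch of downstream
    samples; the thesis uses a principled estimator with non-asymptotic
    convergence guarantees instead.
    """
    probs = torch.softmax(zero_shot_logits, dim=-1)   # (N, C)
    return probs.mean(dim=0)                          # (C,)

def debias_logits(zero_shot_logits: torch.Tensor,
                  prior: torch.Tensor) -> torch.Tensor:
    """Logit-adjustment-style correction: subtract the log of the
    estimated pre-training label prior from every logit."""
    return zero_shot_logits - torch.log(prior + 1e-12)

# Toy usage: 4 downstream samples, 3 classes.
logits = torch.randn(4, 3)
prior = estimate_pretrain_prior(logits)
unbiased_preds = debias_logits(logits, prior).argmax(dim=-1)
```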
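The ProGrad idea from the second study can be sketched as a gradient-projection rule: when the task (fine-tuning) gradient conflicts with a "general knowledge" gradient derived from the zero-shot model, the conflicting component is projected out. The sketch below assumes flattened 1-D gradients and leaves out how the general gradient is obtained (e.g., from a regularization term toward zero-shot predictions).

```python
import torch

def prograd_direction(g_task: torch.Tensor,
                      g_general: torch.Tensor) -> torch.Tensor:
    """Prompt-aligned gradient rule (sketch).

    Keep the task gradient as-is when it agrees with the general
    (zero-shot-knowledge) gradient; otherwise remove the component
    that points against the general direction.
    """
    dot = torch.dot(g_task, g_general)
    if dot >= 0:
        return g_task
    # Conflict: project out the component opposing the general gradient.
    return g_task - dot / g_general.norm().pow(2) * g_general

# Toy usage with flattened gradients of a prompt parameter.
g_task = torch.tensor([1.0, -2.0])
g_general = torch.tensor([0.5, 1.0])
update = prograd_direction(g_task, g_general)  # conflict resolved
```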
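For the third study, a minimal sample-wise ensembling sketch: rather than a single global mixing weight, each test sample receives its own weight for combining zero-shot and fine-tuned outputs. The confidence-based weighting below is an illustrative heuristic, not the exact variance-minimizing rule analyzed in the thesis.

```python
import torch

def sample_wise_ensemble(zs_logits: torch.Tensor,
                         ft_logits: torch.Tensor) -> torch.Tensor:
    """Combine zero-shot and fine-tuned predictions per sample.

    Weights each model by its own softmax confidence (max probability)
    on that sample -- an illustrative stand-in for the variance-minimizing
    weights analyzed in the thesis.
    """
    zs_probs = torch.softmax(zs_logits, dim=-1)
    ft_probs = torch.softmax(ft_logits, dim=-1)
    zs_conf = zs_probs.max(dim=-1, keepdim=True).values  # (N, 1)
    ft_conf = ft_probs.max(dim=-1, keepdim=True).values  # (N, 1)
    w = zs_conf / (zs_conf + ft_conf)                    # per-sample weight
    return w * zs_probs + (1 - w) * ft_probs

# Toy usage: 4 samples, 3 classes.
out = sample_wise_ensemble(torch.randn(4, 3), torch.randn(4, 3))
```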
The three proposed methods are independent and can be combined to create fine-tuned models that are unbiased, accurate, and robust. These methods have been thoroughly evaluated in real-world settings, including many-shot learning with abundant data, few-shot learning, and long-tail classification, a challenging scenario that combines elements of both many-shot and few-shot data. In all these settings, the methods consistently deliver unbiased predictions and achieve state-of-the-art accuracy and robustness.