Towards unbiased, accurate and robust fine-tuning of zero-shot vision models


Bibliographic Details
Main Author: Zhu Beier
Other Authors: Hanwang Zhang
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science
Online Access:https://hdl.handle.net/10356/181746
Institution: Nanyang Technological University
id sg-ntu-dr.10356-181746
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
description A foundational objective of machine learning is to create models that are (1) unbiased, ensuring fair predictions across different classes; (2) accurate, excelling in in-distribution (target) environments; and (3) robust, achieving high performance even under distribution shifts. Recently, vision models pre-trained with language supervision on large-scale data have enabled zero-shot inference through prompting. Such zero-shot models have demonstrated unprecedented robustness across a broad range of distributions. However, the pre-training data often exhibit a skewed label distribution, contributing to poor performance of zero-shot models on less frequent classes. Additionally, zero-shot models are still inaccurate on several domain-specific tasks, such as differentiating between car models, flower species, and aircraft variants. Therefore, it is common practice to boost accuracy and correct imbalanced predictions via fine-tuning on downstream labeled data. However, fine-tuning with few-shot samples sometimes leads to over-fitting, making these models under-perform compared to zero-shot models. Moreover, even with abundant downstream data, fine-tuning often comes at the cost of robustness: fine-tuned models easily exploit spurious correlations that only hold on the downstream distribution, resulting in lower performance under distribution shifts compared to zero-shot models. This raises a natural question: can fine-tuned zero-shot models achieve unbiased, accurate, and robust predictions all at once? In this thesis, we answer the question affirmatively through three comprehensive studies.
• To achieve unbiased predictions, we propose Generalized Logit Adjustment (GLA), a simple post-hoc method that removes the label-distribution bias of the zero-shot model by estimating the label distribution of the pre-training dataset. Notably, direct access to pre-training data is often restricted due to privacy or copyright concerns; instead, we use only the downstream data and the zero-shot model to derive an unbiased zero-shot model. Moreover, we prove non-asymptotic convergence guarantees for the label-distribution estimate and demonstrate that ensembling the debiased zero-shot model with an off-the-shelf fine-tuned model yields the Bayes-optimal classifier.
• To avoid over-fitting in few-shot adaptation, we present Prompt-aligned Gradient, dubbed ProGrad, which prevents fine-tuning from forgetting the general knowledge of zero-shot models. By leveraging knowledge from the pre-training data to regularize fine-tuning on a specific distribution, ProGrad is robust to distribution shifts. We further justify the method by showing that it attains a lower generalization error bound than plain fine-tuning.
• To resolve the undesirable ID-OOD trade-off that persists in prevailing fine-tuning methods, where out-of-distribution (OOD) robustness is at odds with in-distribution (ID) accuracy, we propose a sample-wise ensembling technique that simultaneously attains the best performance on ID and OOD data without trade-offs. Our theoretical analysis shows that it effectively minimizes the variance of the ensemble models, resulting in reduced residual error.
The three proposed methods are independent and can be combined to create fine-tuned models that are unbiased, accurate, and robust. They have been thoroughly evaluated in real-world settings, including many-shot learning with abundant data, few-shot learning, and long-tail classification, a challenging scenario that combines elements of both many-shot and few-shot data. In all these settings, the methods consistently deliver unbiased predictions and achieve state-of-the-art accuracy and robustness.
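The debias-then-ensemble recipe behind GLA can be illustrated with a toy sketch. This is illustrative only, not the thesis's implementation: the subtraction of an estimated log-prior follows the standard logit-adjustment recipe, and every array and value below (`zs_logits`, `est_prior`, `ft_logits`) is made up for the example.

```python
import math

def debias(zs_logits, est_prior, tau=1.0):
    """Remove label-distribution bias by subtracting the estimated log-prior."""
    return [z - tau * math.log(p) for z, p in zip(zs_logits, est_prior)]

zs_logits = [2.0, 1.5]   # zero-shot logits, skewed toward class 0
est_prior = [0.9, 0.1]   # assumed estimate of the pre-training label prior
ft_logits = [0.2, 0.3]   # logits from an off-the-shelf fine-tuned model

debiased = debias(zs_logits, est_prior)
ensembled = [d + f for d, f in zip(debiased, ft_logits)]

argmax = lambda xs: max(range(len(xs)), key=xs.__getitem__)
print(argmax(zs_logits))   # 0: the biased zero-shot prediction
print(argmax(ensembled))   # 1: after debiasing and ensembling
```

Subtracting the log-prior cancels the advantage the frequent class enjoys only because of the skewed pre-training label distribution; the corrected scores are then combined with the fine-tuned model's logits.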
author2 Hanwang Zhang
format Thesis-Doctor of Philosophy
author Zhu Beier
title Towards unbiased, accurate and robust fine-tuning of zero-shot vision models
publisher Nanyang Technological University
publishDate 2024
url https://hdl.handle.net/10356/181746
spelling sg-ntu-dr.10356-181746 2024-12-17T11:56:53Z Towards unbiased, accurate and robust fine-tuning of zero-shot vision models Zhu Beier Hanwang Zhang College of Computing and Data Science hanwangzhang@ntu.edu.sg Computer and Information Science Doctor of Philosophy 2024-12-17T11:56:53Z 2024-12-17T11:56:53Z 2024 Thesis-Doctor of Philosophy Zhu Beier (2024). Towards unbiased, accurate and robust fine-tuning of zero-shot vision models. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181746 https://hdl.handle.net/10356/181746 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University