Towards unbiased, accurate and robust fine-tuning of zero-shot vision models


Bibliographic Details
Main Author: Zhu Beier
Other Authors: Hanwang Zhang
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science
Online Access:https://hdl.handle.net/10356/181746
Institution: Nanyang Technological University
id sg-ntu-dr.10356-181746
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
description A foundational objective of machine learning is to create models that are (1) unbiased, ensuring fair predictions across different classes; (2) accurate, excelling in in-distribution (target) environments; and (3) robust, achieving high performance even under distribution shifts. Recently, vision models pre-trained with language supervision on large-scale data have enabled zero-shot inference through prompting. Such zero-shot models have demonstrated unprecedented robustness across a broad range of distributions. However, the pre-training data often exhibit a skewed label distribution, contributing to poor performance of zero-shot models on less frequent classes. Additionally, zero-shot models are still inaccurate on several domain-specific tasks, such as differentiating between car models, flower species, and aircraft variants. Therefore, it is common practice to boost accuracy and correct imbalanced predictions via fine-tuning on downstream labeled data. However, fine-tuning with few-shot samples sometimes leads to over-fitting, making these models under-perform compared to zero-shot models. Moreover, even with abundant downstream data, fine-tuning often comes at the cost of robustness: fine-tuned models easily exploit spurious correlations that only hold on the downstream distribution, resulting in lower performance under distribution shifts compared to zero-shot models. This raises a natural question: can fine-tuned zero-shot models achieve unbiased, accurate, and robust predictions all at once? In this thesis, we answer the question affirmatively through three comprehensive studies.
• To achieve unbiased predictions, we propose Generalized Logit Adjustment (GLA), a simple post-hoc method that removes the label-distribution bias of the zero-shot model by estimating the label distribution of the pre-training dataset. Notably, direct access to pre-training data is often restricted due to privacy or copyright concerns; instead, we use only the downstream data and the zero-shot model to derive an unbiased zero-shot model. Moreover, we prove non-asymptotic convergence guarantees for the label-distribution estimate and demonstrate that ensembling the debiased zero-shot model with an off-the-shelf fine-tuned model yields the Bayes-optimal classifier.
• To avoid over-fitting in few-shot adaptation, we present Prompt-aligned Gradient, dubbed ProGrad, which prevents fine-tuning from forgetting the general knowledge of zero-shot models. By leveraging knowledge from the pre-training data to regularize fine-tuning on a specific distribution, ProGrad is robust to distribution shifts. We further justify the method by showing that it attains a lower generalization error bound than plain fine-tuning.
• To resolve the undesirable ID-OOD trade-off that persists in prevailing fine-tuning methods, where out-of-distribution (OOD) robustness is at odds with in-distribution (ID) accuracy, we propose a sample-wise ensembling technique that simultaneously attains the best performance on ID and OOD data without trade-offs. Our theoretical analysis shows that it effectively minimizes the variance of the ensemble models, resulting in reduced residual error.
The three proposed methods are independent and can be combined to create fine-tuned models that are unbiased, accurate, and robust. They have been thoroughly evaluated in real-world settings, including many-shot learning with abundant data, few-shot learning, and long-tail classification, a challenging scenario that combines elements of both many-shot and few-shot data. In all these settings, the methods consistently deliver unbiased predictions and achieve state-of-the-art accuracy and robustness.
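The debias-then-ensemble recipe behind GLA can be illustrated with a toy sketch. This is illustrative only, not the thesis's implementation: the subtraction of an estimated log-prior follows the standard logit-adjustment recipe, and every array and value below (`zs_logits`, `est_prior`, `ft_logits`) is made up for the example.

```python
import math

def debias(zs_logits, est_prior, tau=1.0):
    """Remove label-distribution bias by subtracting the estimated log-prior."""
    return [z - tau * math.log(p) for z, p in zip(zs_logits, est_prior)]

zs_logits = [2.0, 1.5]   # zero-shot logits, skewed toward class 0
est_prior = [0.9, 0.1]   # assumed estimate of the pre-training label prior
ft_logits = [0.2, 0.3]   # logits from an off-the-shelf fine-tuned model

debiased = debias(zs_logits, est_prior)
ensembled = [d + f for d, f in zip(debiased, ft_logits)]

argmax = lambda xs: max(range(len(xs)), key=xs.__getitem__)
print(argmax(zs_logits))   # 0: the biased zero-shot prediction
print(argmax(ensembled))   # 1: after debiasing and ensembling
```

Subtracting the log-prior cancels the advantage the frequent class enjoys only because of the skewed pre-training label distribution; the corrected scores are then combined with the fine-tuned model's logits.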
author2 Hanwang Zhang
format Thesis-Doctor of Philosophy
author Zhu Beier
title Towards unbiased, accurate and robust fine-tuning of zero-shot vision models
publisher Nanyang Technological University
publishDate 2024
url https://hdl.handle.net/10356/181746
spelling sg-ntu-dr.10356-181746 2024-12-17T11:56:53Z Towards unbiased, accurate and robust fine-tuning of zero-shot vision models Zhu Beier Hanwang Zhang College of Computing and Data Science hanwangzhang@ntu.edu.sg Computer and Information Science Doctor of Philosophy 2024-12-17T11:56:53Z 2024-12-17T11:56:53Z 2024 Thesis-Doctor of Philosophy Zhu Beier (2024). Towards unbiased, accurate and robust fine-tuning of zero-shot vision models. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181746 https://hdl.handle.net/10356/181746 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University