Self-promoted supervision for few-shot transformer

The few-shot learning ability of vision transformers (ViTs) is rarely investigated, though heavily desired. In this work, we empirically find that, within the same few-shot learning framework, e.g. MetaBaseline, replacing the widely used CNN feature extractor with a ViT model often severely impairs few-shot classification performance. Moreover, our empirical study shows that, in the absence of inductive bias, ViTs often learn low-quality token dependencies under the few-shot learning regime, where only a few labeled training samples are available, and this largely contributes to the above performance degradation. To alleviate this issue, we propose, for the first time, a simple yet effective few-shot training framework for ViTs, namely Self-promoted sUpervisioN (SUN). Specifically, besides the conventional global supervision for global semantic learning, SUN further pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token. This location-specific supervision tells the ViT which patch tokens are similar or dissimilar, and thus accelerates token-dependency learning. It also models the local semantics in each patch token, improving object grounding and recognition and helping the model learn generalizable patterns. To improve the quality of the location-specific supervision, we further propose two techniques: 1) background patch filtration, which filters out background patches and assigns them to an extra background class; and 2) spatial-consistent augmentation, which introduces sufficient diversity for data augmentation while keeping the generated local supervision accurate. Experimental results show that SUN with ViTs significantly surpasses other few-shot learning frameworks built on ViTs, and is the first to achieve higher performance than CNN-based state-of-the-art methods. Our code is publicly available at https://github.com/DongSky/few-shot-vit
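As an illustration of the idea described in the abstract, the sketch below shows how per-patch (location-specific) pseudo labels could be generated from a pretrained ViT's patch tokens, with low-confidence patches reassigned to an extra background class (background patch filtration). This is a minimal sketch based only on the abstract, not the authors' released implementation; the function name, tensor shapes, and the confidence threshold are assumptions.

# Minimal sketch (an illustration, not the released SUN code) of generating
# location-specific supervision: each patch-token feature from a ViT pretrained
# on the few-shot dataset is scored by the global classifier, and patches with
# low confidence are assigned to an extra "background" class.
import torch
import torch.nn.functional as F

def location_specific_supervision(patch_features: torch.Tensor,
                                  classifier_weight: torch.Tensor,
                                  bg_threshold: float = 0.5) -> torch.Tensor:
    # patch_features: (B, N, D) patch-token features from the pretrained ViT
    # classifier_weight: (C, D) weights of the global linear classifier
    # returns: (B, N) pseudo labels in [0, C], where index C is the extra
    #          background class introduced by background patch filtration
    logits = patch_features @ classifier_weight.t()      # (B, N, C)
    probs = F.softmax(logits, dim=-1)
    conf, labels = probs.max(dim=-1)                      # each (B, N)
    num_classes = classifier_weight.shape[0]
    background = torch.full_like(labels, num_classes)     # extra class id C
    return torch.where(conf < bg_threshold, background, labels)

# Toy usage with random tensors, only to show the shapes involved (assumed values).
feats = torch.randn(2, 196, 384)   # e.g. 14x14 patches, 384-dim tokens
w = torch.randn(64, 384)           # e.g. 64 base classes
print(location_specific_supervision(feats, w).shape)  # torch.Size([2, 196])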


Saved in:
Bibliographic Details
Main Authors: DONG, Bowen, ZHOU, Pan, YAN, Shuicheng, ZUO, Wangmeng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2022
Subjects: few-shot learning; location-specific supervision; Graphics and Human Computer Interfaces
Online Access: https://ink.library.smu.edu.sg/sis_research/8984
https://ink.library.smu.edu.sg/context/sis_research/article/9987/viewcontent/2022_ECCV_few_shot__1_.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9987
record_format dspace
record_updated 2024-07-25T08:30:38Z
publication_date 2022-10-01T07:00:00Z
format text application/pdf
doi 10.1007/978-3-031-20044-1_19
license http://creativecommons.org/licenses/by-nc-nd/4.0/
collection Research Collection School Of Computing and Information Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic few-shot learning
location-specific supervision
Graphics and Human Computer Interfaces
format text
author DONG, Bowen
ZHOU, Pan
YAN, Shuicheng
ZUO, Wangmeng
title Self-promoted supervision for few-shot transformer
publisher Institutional Knowledge at Singapore Management University
publishDate 2022
url https://ink.library.smu.edu.sg/sis_research/8984
https://ink.library.smu.edu.sg/context/sis_research/article/9987/viewcontent/2022_ECCV_few_shot__1_.pdf
_version_ 1814047700583186432