Inception transformer

Recent studies show that the Transformer has a strong capability for building long-range dependencies, yet it is incompetent at capturing the high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing high-frequency information onto Transformers. Unlike recent hybrid frameworks, the Inception mixer gains efficiency through a channel-splitting mechanism that adopts parallel convolution/max-pooling paths and a self-attention path as high- and low-frequency mixers, while retaining the flexibility to model discriminative information scattered across a wide frequency range. Considering that bottom layers contribute more to capturing high-frequency details while top layers contribute more to modeling low-frequency global information, we further introduce a frequency ramp structure, i.e., gradually decreasing the dimensions fed to the high-frequency mixer and increasing those fed to the low-frequency mixer, which effectively trades off high- and low-frequency components across different layers. We benchmark the iFormer on a series of vision tasks and show that it achieves impressive performance on image classification, COCO detection, and ADE20K segmentation. For example, our iFormer-S reaches a top-1 accuracy of 83.4% on ImageNet-1K, 3.6% higher than DeiT-S, and even slightly better than the much bigger Swin-B (83.3%) with only 1/4 of the parameters and 1/3 of the FLOPs. Code and models are released at https://github.com/sail-sg/iFormer.
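The abstract describes two concrete mechanisms: an Inception mixer that splits channels between convolution/max-pooling paths (high frequency) and a self-attention path (low frequency), and a frequency ramp that shifts channels from the former to the latter in deeper layers. Below is a minimal PyTorch sketch of that idea; the class name InceptionMixer, the high_ratio argument, and the layer sizes are illustrative assumptions, not the authors' actual code (see https://github.com/sail-sg/iFormer for the official implementation).

import torch
import torch.nn as nn

class InceptionMixer(nn.Module):
    """Channel-split token mixer: conv + max-pool paths for high
    frequencies, self-attention for low frequencies (sketch only)."""

    def __init__(self, dim, high_ratio=0.5, num_heads=4):
        super().__init__()
        self.dim_high = int(dim * high_ratio)   # channels for the high-frequency paths
        self.dim_low = dim - self.dim_high      # channels for the attention path
        self.dim_conv = self.dim_high // 2      # half of high-freq channels go to conv
        self.dim_pool = self.dim_high - self.dim_conv

        # High-frequency path 1: depthwise 3x3 conv captures local detail.
        self.conv_path = nn.Sequential(
            nn.Conv2d(self.dim_conv, self.dim_conv, 3, padding=1, groups=self.dim_conv),
            nn.Conv2d(self.dim_conv, self.dim_conv, 1),
        )
        # High-frequency path 2: max-pooling keeps sharp local responses.
        self.pool_path = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(self.dim_pool, self.dim_pool, 1),
        )
        # Low-frequency path: global self-attention over all spatial tokens.
        self.attn = nn.MultiheadAttention(self.dim_low, num_heads, batch_first=True)
        self.fuse = nn.Conv2d(dim, dim, 1)      # fuse the three branches

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        x_conv, x_pool, x_attn = torch.split(
            x, [self.dim_conv, self.dim_pool, self.dim_low], dim=1)
        y_conv = self.conv_path(x_conv)
        y_pool = self.pool_path(x_pool)
        t = x_attn.flatten(2).transpose(1, 2)   # (B, H*W, dim_low) token sequence
        y_attn, _ = self.attn(t, t, t)
        y_attn = y_attn.transpose(1, 2).reshape(B, self.dim_low, H, W)
        return self.fuse(torch.cat([y_conv, y_pool, y_attn], dim=1))

# Frequency ramp: give early stages more high-frequency channels and
# later stages more attention channels (the ratios here are made up).
mixers = [InceptionMixer(64, high_ratio=r) for r in (0.75, 0.5, 0.25)]
y = mixers[0](torch.randn(1, 64, 56, 56))       # -> shape (1, 64, 56, 56)

In the paper, the high-frequency branches also involve up/down-sampling and the fused output passes through normalization and an MLP, which this sketch omits for brevity.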


Bibliographic Details
Main Authors: SI, Chenyang, YU, Weihao, ZHOU, Pan, ZHOU, Yichen, WANG, Xinchao, YAN, Shuicheng
Format: text
Language:English
Published: Institutional Knowledge at Singapore Management University, 2022
Subjects: Theory and Algorithms
Online Access:https://ink.library.smu.edu.sg/sis_research/9026
https://ink.library.smu.edu.sg/context/sis_research/article/10029/viewcontent/2022_NeurIPS_inception.pdf
Institution: Singapore Management University
Collection: Research Collection School Of Computing and Information Systems
Date: 2022-11-01
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)