Long-term leap attention, short-term periodic shift for video classification

Bibliographic Details
Main Authors:	ZHANG, Hao, CHENG, Lechao, HAO, Yanbin, NGO, Chong-wah
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2022
Subjects:	Video classification; Transformer; Shift; Leap attention; Artificial Intelligence and Robotics; Graphics and Human Computer Interfaces
Online Access:	https://ink.library.smu.edu.sg/sis_research/7507 https://ink.library.smu.edu.sg/context/sis_research/article/8510/viewcontent/3503161.3547908.pdf
Institution:	Singapore Management University

Description
Summary:	A video transformer naturally incurs a heavier computational burden than a static vision transformer, because it processes a sequence T times longer under the current attention of quadratic complexity (T²N²). Existing works treat the temporal axis as a simple extension of the spatial axes, focusing on shortening the spatio-temporal sequence by either generic pooling or local windowing without utilizing temporal redundancy. However, videos naturally contain redundant information between neighboring frames; we could therefore potentially suppress attention on visually similar frames in a dilated manner. Based on this hypothesis, we propose LAPS, a long-term “Leap Attention” (LA), short-term “Periodic Shift” (P-Shift) module for video transformers, with (2TN²) complexity. Specifically, the LA groups long-term frames into pairs, then refactors each discrete pair via attention. The P-Shift exchanges features between temporal neighbors to counter the loss of short-term dynamics. By replacing vanilla 2D attention with LAPS, we can adapt a static transformer into a video one, with zero extra parameters and negligible computation overhead (∼2.6%). Experiments on the standard Kinetics-400 benchmark demonstrate that our LAPS transformer achieves competitive performance in terms of accuracy, FLOPs, and Params among CNN and transformer SOTAs. We open-source our project at https://github.com/VideoNetworks/LAPS-transformer.
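
Illustration (not part of the original record): a minimal sketch of the arithmetic behind the summary, reading its complexity figures as T²N² for joint attention and 2TN² for leap attention, and assuming T is the number of frames and N the number of tokens per frame. The function names, the pairing rule (frame t with frame t + T/2), and the circular TSM-style channel shift are hypothetical stand-ins chosen for clarity; they are not taken from the authors' code and may differ from the actual LA and P-Shift designs.

```python
def joint_attention_cost(T, N):
    """Token-pair interactions when all T*N tokens attend to each other."""
    return (T * N) ** 2  # T^2 * N^2


def leap_attention_cost(T, N):
    """T/2 frame pairs, each attending over 2N tokens: (T/2)*(2N)^2 = 2*T*N^2."""
    assert T % 2 == 0, "this sketch assumes an even number of frames"
    return (T // 2) * (2 * N) ** 2


def leap_pairs(T):
    """Assumed pairing rule: frame t is grouped with the distant frame t + T/2."""
    assert T % 2 == 0, "this sketch assumes an even number of frames"
    half = T // 2
    return [(t, t + half) for t in range(half)]


def temporal_shift(frames, fold):
    """TSM-style stand-in for a short-term shift (an assumption, not P-Shift itself).

    frames: list of T frames, each a list of C channel values.
    The first `fold` channels are taken from the previous frame, the next
    `fold` channels from the next frame (both circular); the rest stay put.
    """
    T = len(frames)
    shifted = []
    for t in range(T):
        prev_frame = frames[(t - 1) % T]
        next_frame = frames[(t + 1) % T]
        row = list(frames[t])
        for c in range(min(fold, len(row))):
            row[c] = prev_frame[c]
        for c in range(fold, min(2 * fold, len(row))):
            row[c] = next_frame[c]
        shifted.append(row)
    return shifted


if __name__ == "__main__":
    T, N = 8, 196  # e.g. 8 frames of 14x14 patches (illustrative values)
    print(joint_attention_cost(T, N), leap_attention_cost(T, N))
    print(leap_pairs(T))
```

Under these assumptions the ratio of the two costs is T/2, so the saving grows linearly with clip length, while the shift step adds no parameters because it only moves existing features between neighboring frames.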


Similar Items