Token shift transformer for video classification
Transformers achieve remarkable success in understanding 1- and 2-dimensional signals (e.g., NLP and image content understanding). As a potential alternative to convolutional neural networks, they share the merits of strong interpretability, high discriminative power on hyper-scale data, and flexibility in processing inputs of varying length. However, their encoders naturally contain computationally intensive operations such as pair-wise self-attention, which incur a heavy computational burden when applied to complex 3-dimensional video signals. This paper presents the Token Shift Module (TokShift), a novel zero-parameter, zero-FLOP operator for modeling temporal relations within each transformer encoder. Specifically, TokShift merely shifts partial [Class] token features back and forth across adjacent frames. We then densely plug the module into each encoder of a plain 2D vision transformer to learn 3D video representations. Notably, the TokShift transformer is a pioneering, purely convolution-free video transformer with computational efficiency for video understanding. Experiments on standard benchmarks verify its robustness, effectiveness, and efficiency. In particular, with input clips of 8/12 frames, the TokShift transformer achieves SOTA precision: 79.83%/80.40% on Kinetics-400, 66.56% on EGTEA-Gaze+, and 96.80% on UCF-101, comparable to or better than existing SOTA convolutional counterparts. Our code is open-sourced at https://github.com/VideoNetworks/TokShift-Transformer.
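The operation described in the abstract is simple enough to sketch directly. Below is a minimal, hypothetical PyTorch sketch of a TokShift-style shift applied only to the [Class] token; it is not the authors' released implementation (see the repository linked above), and the tensor layout (B*T, N, C), the helper name `token_shift`, and the shift fraction are illustrative assumptions.

```python
import torch


def token_shift(x: torch.Tensor, num_frames: int, shift_div: int = 4) -> torch.Tensor:
    """Shift part of the [Class] token's channels across adjacent frames.

    Assumptions (illustrative, not taken from the paper's code):
      x          -- (B*T, N, C) tokens for B clips of T frames; token 0 is [Class]
      num_frames -- T, frames per clip
      shift_div  -- 1/shift_div of channels shift forward, another 1/shift_div backward
    """
    bt, n, c = x.shape
    t = num_frames
    b = bt // t

    cls = x[:, 0, :].reshape(b, t, c)      # collect [Class] tokens per clip
    fold = c // shift_div

    shifted = cls.clone()
    # First chunk of channels: shift forward by one frame, zero-padding frame 0.
    shifted[:, 1:, :fold] = cls[:, :-1, :fold]
    shifted[:, 0, :fold] = 0
    # Second chunk: shift backward by one frame, zero-padding the last frame.
    shifted[:, :-1, fold:2 * fold] = cls[:, 1:, fold:2 * fold]
    shifted[:, -1, fold:2 * fold] = 0
    # Remaining channels stay in place; patch tokens are untouched.

    out = x.clone()
    out[:, 0, :] = shifted.reshape(bt, c)
    return out


# Example: 2 clips of 8 frames, 197 tokens (1 [Class] + 196 patches), 768 channels.
tokens = torch.randn(2 * 8, 197, 768)
tokens = token_shift(tokens, num_frames=8)
```

Because the operator only moves existing features along the time axis (with zero padding at clip boundaries), it introduces no parameters and essentially no FLOPs, which is what allows it to be inserted densely into every encoder of a plain 2D vision transformer.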
Main Authors: | ZHANG, Hao; HAO, Yanbin; NGO, Chong-wah |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2021 |
Subjects: | Self-attention; Shift; Transformer; Video classification; Databases and Information Systems |
Online Access: | https://ink.library.smu.edu.sg/sis_research/6807 https://ink.library.smu.edu.sg/context/sis_research/article/7810/viewcontent/Token_Shift_Transformer_for_Video_Classification.pdf |
DOI: | 10.1145/3474085.3475272 |
License: | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Institution: | Singapore Management University |
id | sg-smu-ink.sis_research-7810
---|---
record_format | dspace
institution | Singapore Management University
building | SMU Libraries
continent | Asia
country | Singapore
content_provider | SMU Libraries
collection | InK@SMU
language | English
topic | Self-attention; Shift; Transformer; Video classification; Databases and Information Systems
format | text
author | ZHANG, Hao; HAO, Yanbin; NGO, Chong-wah
title | Token shift transformer for video classification
publisher | Institutional Knowledge at Singapore Management University
publishDate | 2021
url | https://ink.library.smu.edu.sg/sis_research/6807 https://ink.library.smu.edu.sg/context/sis_research/article/7810/viewcontent/Token_Shift_Transformer_for_Video_Classification.pdf
_version_ | 1770576072633483264