DualFormer: Local-global stratified transformer for efficient video recognition

While transformers have shown great potential for video recognition thanks to their strong capability of capturing long-range dependencies, they often suffer from the high computational cost induced by self-attention over the huge number of 3D tokens. In this paper, we present a new transformer architecture, termed DualFormer, which can efficiently perform space-time attention for video recognition. Concretely, DualFormer stratifies the full space-time attention into two cascaded levels: it first learns fine-grained local interactions among nearby 3D tokens, and then captures coarse-grained global dependencies between the query token and global pyramid contexts. Different from existing methods that apply space-time factorization or restrict attention computation to local windows to improve efficiency, our local-global stratification strategy captures both short- and long-range spatiotemporal dependencies while greatly reducing the number of keys and values in the attention computation, thereby boosting efficiency. Experimental results verify the superiority of DualFormer over existing methods on five video benchmarks. In particular, DualFormer achieves 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with ∼1000G inference FLOPs, at least 3.2× fewer than existing methods with similar performance. We have released the source code at https://github.com/sail-sg/dualformer.
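To make the stratification concrete, below is a minimal PyTorch sketch of the two cascaded attention levels described above: windowed self-attention among nearby 3D tokens, followed by attention from every query token to a small set of pooled pyramid contexts. The window size, pyramid pooling levels, and module structure here are illustrative assumptions rather than the released DualFormer implementation; see the GitHub repository above for the authors' code.

```python
# Minimal sketch of local-global stratified space-time attention.
# NOTE: window sizes, pooling levels, and module layout are assumptions
# for illustration, not the official DualFormer implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalGlobalAttention(nn.Module):
    def __init__(self, dim, num_heads=4, window=(2, 4, 4), pool_sizes=(1, 2, 4)):
        super().__init__()
        self.window = window          # local 3D window (t, h, w); assumed values
        self.pool_sizes = pool_sizes  # pyramid levels for global contexts; assumed values
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, T, H, W) video feature map; T, H, W assumed divisible by the window
        B, C, T, H, W = x.shape
        wt, wh, ww = self.window

        # ---- Level 1: fine-grained local attention inside non-overlapping 3D windows ----
        win = x.reshape(B, C, T // wt, wt, H // wh, wh, W // ww, ww)
        win = win.permute(0, 2, 4, 6, 3, 5, 7, 1)        # (B, nT, nH, nW, wt, wh, ww, C)
        win = win.reshape(-1, wt * wh * ww, C)           # (B*numWindows, windowLen, C)
        win, _ = self.local_attn(win, win, win)
        win = win.reshape(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        x = win.permute(0, 7, 1, 4, 2, 5, 3, 6).reshape(B, C, T, H, W)

        # ---- Level 2: coarse-grained global attention to pooled pyramid contexts ----
        contexts = []
        for p in self.pool_sizes:
            # pool the whole clip into a small grid of context tokens
            ctx = F.adaptive_avg_pool3d(x, (min(p, T), p, p))
            contexts.append(ctx.flatten(2).transpose(1, 2))   # (B, numCtx, C)
        kv = torch.cat(contexts, dim=1)                       # few keys/values overall

        q = x.flatten(2).transpose(1, 2)                       # every 3D token is a query
        out, _ = self.global_attn(q, kv, kv)
        return out.transpose(1, 2).reshape(B, C, T, H, W)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 4, 16, 16)                     # toy clip features
    attn = LocalGlobalAttention(dim=64)
    print(attn(feats).shape)                                   # torch.Size([2, 64, 4, 16, 16])
```

Because the global level attends only to a handful of pooled contexts rather than all T×H×W tokens, the number of keys and values per query stays small, which is the efficiency argument the abstract makes.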


Bibliographic Details
Main Authors: LIANG, Yuxuan; ZHOU, Pan; ZIMMERMANN, Roger; YAN, Shuicheng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2022
Subjects: Efficient video transformer; Local and global attention; Artificial Intelligence and Robotics; Graphics and Human Computer Interfaces
Online Access:https://ink.library.smu.edu.sg/sis_research/8980
https://ink.library.smu.edu.sg/context/sis_research/article/9983/viewcontent/2022_ECCV_DualFormer.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9983
record_format dspace
last_updated 2024-07-25T08:32:21Z
doi info:doi/10.1007/978-3-031-19830-4_33
license http://creativecommons.org/licenses/by-nc-nd/4.0/
date_available 2022-10-01T07:00:00Z
collection Research Collection School Of Computing and Information Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU