PosMLP-Video: Spatial and temporal relative position encoding for efficient video recognition

In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks. However, their inherently dense computational operators, such as self-attention and token-mixing layers, pose significant challenges when applied to spatio-temporal video data. To address this gap, we propose PosMLP-Video, a lightweight yet powerful MLP-like backbone for video recognition.

Bibliographic Details
Main Authors:	HAO, Yanbin; ZHOU, Diansong; WANG, Zhicai; NGO, Chong-wah; HE, Xiangnan; WANG, Meng
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2024
Subjects:	Multi-layer perceptron; Positional encoding; Spatio-temporal modeling; Video recognition; Artificial Intelligence and Robotics; Graphics and Human Computer Interfaces
Online Access:	https://ink.library.smu.edu.sg/sis_research/8256 https://ink.library.smu.edu.sg/context/sis_research/article/9259/viewcontent/PosMLP_preprint_pvoa_cc_by.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-9259
record_format	dspace
spelling	sg-smu-ink.sis_research-9259 2025-01-01T15:25:52Z 2024-06-01T07:00:00Z text application/pdf info:doi/10.21203/rs.3.rs-3485088/v1 http://creativecommons.org/licenses/by/3.0/ Research Collection School Of Computing and Information Systems eng
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Multi-layer perceptron; Positional encoding; Spatio-temporal modeling; Video recognition; Artificial Intelligence and Robotics; Graphics and Human Computer Interfaces
description	In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks. However, their inherently dense computational operators, such as self-attention and token-mixing layers, pose significant challenges when applied to spatio-temporal video data. To address this gap, we propose PosMLP-Video, a lightweight yet powerful MLP-like backbone for video recognition. Instead of dense operators, we use efficient relative positional encoding (RPE) to build pairwise token relations, leveraging small-sized parameterized relative position biases to obtain each relation score. Specifically, to enable spatio-temporal modeling, we extend the image PosMLP’s positional gating unit to temporal, spatial, and spatio-temporal variants, namely PoTGU, PoSGU, and PoSTGU, respectively. These gating units can be combined into three types of spatio-temporal factorized positional MLP blocks, which not only decrease model complexity but also maintain good performance. Additionally, we improve the locality of modeling using window partitioning and enrich relative positional relationships using channel grouping. Experimental results demonstrate that PosMLP-Video achieves competitive speed-accuracy trade-offs compared with previous state-of-the-art models. In particular, PosMLP-Video pre-trained on ImageNet-1K achieves 59.0%/70.3% top-1 accuracy on Something-Something V1/V2 and 82.1% top-1 accuracy on Kinetics-400 while requiring far fewer parameters and FLOPs than other models. The code will be made publicly available.
format	text
author	HAO, Yanbin; ZHOU, Diansong; WANG, Zhicai; NGO, Chong-wah; HE, Xiangnan; WANG, Meng
author_sort	HAO, Yanbin
title	PosMLP-Video: Spatial and temporal relative position encoding for efficient video recognition
publisher	Institutional Knowledge at Singapore Management University
publishDate	2024
url	https://ink.library.smu.edu.sg/sis_research/8256 https://ink.library.smu.edu.sg/context/sis_research/article/9259/viewcontent/PosMLP_preprint_pvoa_cc_by.pdf

