Learning temporal dynamics in videos with image transformer

Temporal dynamics represent the evolution of video content over time and are critical for action recognition. In this paper, we ask the question: can an off-the-shelf image transformer architecture learn temporal dynamics in videos? To this end, we propose the Multidimensional Stacked Image (MSImage), a new arrangement of video data that can be fed to image transformers. Technically, an MSImage is a high-resolution image composed of several evenly sampled video clips stacked along the channel and space dimensions. The frames in each clip are concatenated along the channel dimension so that the transformer can infer short-term dynamics, while the clips are placed at different spatial positions for learning long-term dynamics. On this basis, we propose MSImageFormer, a new variant of the image transformer that takes an MSImage as input and is jointly optimized with a video classification loss and a new dynamics enhancement loss. The optimization attends to the high-frequency component of the MSImage, avoiding overfitting to static visual patterns. We empirically demonstrate the merits of MSImageFormer on six action recognition benchmarks. With only a 2D image transformer as the classifier, MSImageFormer achieves 85.3% and 69.7% top-1 accuracy on the Kinetics-400 and Something-Something V2 datasets, respectively. Despite requiring fewer computations, the results are comparable to those of state-of-the-art 3D CNNs and video transformers.
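
The abstract describes the MSImage arrangement only at a high level. Below is a minimal sketch, not the authors' released code, of how such an arrangement could be built with PyTorch: it assumes 4 evenly sampled clips of 4 RGB frames each and a 2x2 spatial grid, and the function name, clip count, frame count, and grid shape are all illustrative assumptions.

```python
# Illustrative sketch of an MSImage-style arrangement (assumed details, not the paper's code):
# frames within a clip are stacked along the channel axis (short-term dynamics),
# and clips are tiled at different spatial positions (long-term dynamics).
import torch


def build_msimage(video: torch.Tensor, num_clips: int = 4, frames_per_clip: int = 4,
                  grid: tuple = (2, 2)) -> torch.Tensor:
    """video: (num_frames, 3, H, W) tensor -> MSImage: (3*T, grid_h*H, grid_w*W)."""
    num_frames, c, h, w = video.shape
    grid_h, grid_w = grid
    assert grid_h * grid_w == num_clips, "grid must hold exactly num_clips tiles"

    # Evenly sample the start index of each clip across the video.
    starts = torch.linspace(0, num_frames - frames_per_clip, num_clips).long().tolist()

    tiles = []
    for s in starts:
        clip = video[s:s + frames_per_clip]                       # (T, 3, H, W)
        # Concatenate the T frames of the clip along the channel axis.
        tiles.append(clip.reshape(frames_per_clip * c, h, w))     # (3*T, H, W)

    # Arrange the clip tiles on a spatial grid.
    rows = [torch.cat(tiles[r * grid_w:(r + 1) * grid_w], dim=-1) for r in range(grid_h)]
    return torch.cat(rows, dim=-2)                                # (3*T, grid_h*H, grid_w*W)


if __name__ == "__main__":
    dummy = torch.rand(32, 3, 224, 224)    # 32-frame dummy video
    msimage = build_msimage(dummy)
    print(msimage.shape)                   # torch.Size([12, 448, 448])
```

Because the resulting tensor has 3xT channels rather than 3, an off-the-shelf image transformer's patch-embedding layer would need to be widened accordingly; the paper's exact sampling scheme, grid layout, and dynamics enhancement loss are not reproduced here.
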

Bibliographic Details
Main Authors: SHU, Yan; QIU, Z; LONG, Fuchen; YAO, Ting; NGO, Chong-wah; MEI, Tao
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Collection: Research Collection School Of Computing and Information Systems, InK@SMU (SMU Libraries)
Subjects: Neural networks; Video action recognition; Vision transformer; Video transformers; Three-dimensional displays; Optical flow; Visualization; Optimization; Image recognition; Artificial Intelligence and Robotics
DOI: 10.1109/TMM.2024.3383662
Online Access: https://ink.library.smu.edu.sg/sis_research/9860