Learning temporal dynamics in videos with image transformer
Temporal dynamics, the evolution of video content over time, are critical for action recognition. In this paper, we ask: can an off-the-shelf image transformer architecture learn temporal dynamics in videos? To this end, we propose Multidimensional Stacked Image (MSImage)...
Main Authors: | SHU, Yan; QIU, Z; LONG, Fuchen; YAO, Ting; NGO, Chong-wah; MEI, Tao |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2024 |
Subjects: | |
Online Access: | https://ink.library.smu.edu.sg/sis_research/9860 |
Institution: | Singapore Management University |
id |
sg-smu-ink.sis_research-10860 |
---|---|
record_format |
dspace |
spelling |
sg-smu-ink.sis_research-10860; 2024-12-24T02:24:02Z; Learning temporal dynamics in videos with image transformer; SHU, Yan; QIU, Z; LONG, Fuchen; YAO, Ting; NGO, Chong-wah; MEI, Tao. Temporal dynamics, the evolution of video content over time, are critical for action recognition. In this paper, we ask: can an off-the-shelf image transformer architecture learn temporal dynamics in videos? To this end, we propose the Multidimensional Stacked Image (MSImage), a new arrangement of video data that can be fed to image transformers. Technically, an MSImage is a high-resolution image composed of several evenly sampled video clips stacked along the channel and spatial dimensions. The frames in each clip are concatenated along the channel dimension so that the transformer can infer short-term dynamics, while the clips are arranged at different spatial positions for learning long-term dynamics. On this basis, we propose MSImageFormer, a new variant of the image transformer that takes an MSImage as input and is jointly optimized by a video classification loss and a new dynamics enhancement loss. The network optimization attends to the high-frequency component of the MSImage, avoiding overfitting to static visual patterns. We empirically demonstrate the merits of MSImageFormer on six action recognition benchmarks. With only a 2D image transformer as the classifier, MSImageFormer achieves 85.3% and 69.7% top-1 accuracy on the Kinetics-400 and Something-Something V2 datasets, respectively. Despite requiring less computation, the results are comparable to state-of-the-art 3D CNNs and video transformers.
2024-04-11T07:00:00Z; text; https://ink.library.smu.edu.sg/sis_research/9860; info:doi/10.1109/TMM.2024.3383662; Research Collection School Of Computing and Information Systems; eng; Institutional Knowledge at Singapore Management University; Neural networks; Video action recognition; Vision transformer; Video transformers; Three-dimensional displays; Optical flow; Visualization; Optimization; Image recognition; Artificial Intelligence and Robotics |
institution |
Singapore Management University |
building |
SMU Libraries |
continent |
Asia |
country |
Singapore |
content_provider |
SMU Libraries |
collection |
InK@SMU |
language |
English |
topic |
Neural networks; Video action recognition; Vision transformer; Video transformers; Three-dimensional displays; Optical flow; Visualization; Optimization; Image recognition; Artificial Intelligence and Robotics |
description |
Temporal dynamics, the evolution of video content over time, are critical for action recognition. In this paper, we ask: can an off-the-shelf image transformer architecture learn temporal dynamics in videos? To this end, we propose the Multidimensional Stacked Image (MSImage), a new arrangement of video data that can be fed to image transformers. Technically, an MSImage is a high-resolution image composed of several evenly sampled video clips stacked along the channel and spatial dimensions. The frames in each clip are concatenated along the channel dimension so that the transformer can infer short-term dynamics, while the clips are arranged at different spatial positions for learning long-term dynamics. On this basis, we propose MSImageFormer, a new variant of the image transformer that takes an MSImage as input and is jointly optimized by a video classification loss and a new dynamics enhancement loss. The network optimization attends to the high-frequency component of the MSImage, avoiding overfitting to static visual patterns. We empirically demonstrate the merits of MSImageFormer on six action recognition benchmarks. With only a 2D image transformer as the classifier, MSImageFormer achieves 85.3% and 69.7% top-1 accuracy on the Kinetics-400 and Something-Something V2 datasets, respectively. Despite requiring less computation, the results are comparable to state-of-the-art 3D CNNs and video transformers. |
format |
text |
author |
SHU, Yan; QIU, Z; LONG, Fuchen; YAO, Ting; NGO, Chong-wah; MEI, Tao |
author_sort |
SHU, Yan |
title |
Learning temporal dynamics in videos with image transformer |
publisher |
Institutional Knowledge at Singapore Management University |
publishDate |
2024 |
url |
https://ink.library.smu.edu.sg/sis_research/9860 |
_version_ |
1820027801883901952 |
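The MSImage arrangement described in the abstract (frames of each clip concatenated along the channel dimension, clips placed at different spatial positions) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name build_msimage, the 2x2 clip grid, the use of consecutive frames within a clip, the row-major clip placement, and the rounding of evenly spaced start indices are all assumptions made for illustration, since the record does not specify them.

```python
import numpy as np


def build_msimage(video, grid=(2, 2), frames_per_clip=4):
    """Hypothetical sketch: turn a video (T, H, W, C) into one stacked image.

    Returns an array of shape (rows * H, cols * W, C * frames_per_clip).
    The grid layout and clip sampling are assumptions, not the paper's spec.
    """
    rows, cols = grid
    num_clips = rows * cols
    t, h, w, c = video.shape
    if t < frames_per_clip:
        raise ValueError("video has fewer frames than frames_per_clip")

    # Evenly spaced clip start indices across the video (assumed sampling).
    starts = np.linspace(0, t - frames_per_clip, num_clips).round().astype(int)

    tiles = []
    for start in starts:
        clip = video[start:start + frames_per_clip]  # (F, H, W, C)
        # Short-term dynamics: merge the frame axis into the channel axis,
        # giving (H, W, F * C) with channels ordered frame by frame.
        tile = clip.transpose(1, 2, 0, 3).reshape(h, w, frames_per_clip * c)
        tiles.append(tile)

    # Long-term dynamics: place the clips row-major on a spatial grid.
    grid_rows = [
        np.concatenate(tiles[r * cols:(r + 1) * cols], axis=1)
        for r in range(rows)
    ]
    return np.concatenate(grid_rows, axis=0)


# Usage: a dummy 16-frame RGB video at 8x8 resolution.
video = np.zeros((16, 8, 8, 3), dtype=np.float32)
msimage = build_msimage(video)
print(msimage.shape)  # (16, 16, 12)
```

The result is an ordinary multi-channel image, so under these assumptions a 2D image transformer would only need its patch-embedding layer widened to accept C * frames_per_clip input channels.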