Learning temporal dynamics in videos with image transformer

Temporal dynamics represent the evolution of video content over time and are critical for action recognition. In this paper, we ask the question: can the off-the-shelf image transformer architecture learn temporal dynamics in videos? To this end, we propose Multidimensional Stacked Image (MSImage)...

Bibliographic Details
Main Authors:	SHU, Yan; QIU, Z; LONG, Fuchen; YAO, Ting; NGO, Chong-wah; MEI, Tao
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2024
Subjects:	Neural networks; Video action recognition; Vision transformer; Video transformers; Three-dimensional displays; Optical flow; Visualization; Optimization; Image recognition; Artificial Intelligence and Robotics
Online Access:	https://ink.library.smu.edu.sg/sis_research/9860
Institution:	Singapore Management University

Similar Items

TRANSFORMER TECHNIQUES FOR HUMAN ACTION RECOGNITION AND LOCALIZATION
by: CHANG SHUNING
Published: (2024)

Real-time human action recognition by luminance field trajectory analysis
by: Li, Z., et al.
Published: (2014)

Architecture for 3D Convolutional Neural Networks Based on Temporal Similarity Removal
by: WATHUTHANTHRIGE UDARI CHARITHA DE ALWIS, et al.
Published: (2023)

Exploring probabilistic localized video representation for human action recognition
by: Song, Y., et al.
Published: (2014)

A distribution based video representation for human action recognition
by: Song, Y., et al.
Published: (2013)

FOREGROUND-CENTRIC ACTION RECOGNITION
by: TRAN LAM AN
Published: (2018)

Neural networks as applied to vision systems in recognizing 3-D objects in different orientations
by: Choi, Tsun Kit, et al.
Published: (1993)

Enhancing Anomaly Detection in Surveillance Videos with Transfer Learning from Action Recognition
by: Kun Liu, et al.
Published: (2020)

Temporal Redundancy-Based Computation Reduction for 3D Convolutional Neural Networks
by: Udari De Alwis, et al.
Published: (2022)

Annotating Objects and Relations in User-Generated Videos
by: Xindi Shang, et al.
Published: (2020)

Automatic detection and analysis of player action in moving background sports video sequences
by: Li, H., et al.
Published: (2013)

3-D Relation Network for visual relation recognition in videos
by: Qianwen Cao, et al.
Published: (2021)

Activity recognition using dense long-duration trajectories
by: Sun, J., et al.
Published: (2014)

Wave-ViT: Unifying wavelet and transformers for visual representation learning
by: YAO, Ting, et al.
Published: (2022)

A new Iterative-Midpoint-Method for video character gap filling
by: Shivakumara, P., et al.
Published: (2013)

Lecture video enhancement and editing by integrating posture, gesture, and text
by: WANG, Feng, et al.
Published: (2007)

Relation Understanding in Videos: A Grand Challenge Overview
by: Xindi Shang, et al.
Published: (2020)

Recognition of video text through temporal integration
by: Phan, T.Q., et al.
Published: (2014)

Hough-based model for recognizing bar charts in document images
by: Yan Ping Zhou, et al.
Published: (2013)

Token shift transformer for video classification
by: ZHANG Hao, et al.
Published: (2021)

A new gradient based character segmentation method for video text recognition
by: Shivakumara, P., et al.
Published: (2013)

Temporal Spiking Recurrent Neural Network for Action Recognition
by: Wang, W., et al.
Published: (2022)

Trajectory-based modeling of human actions with motion reference points
by: JIANG, Yu-Gang, et al.
Published: (2012)

Self-supervised video representation learning by uncovering spatio-temporal statistics
by: WANG, Jiangliu, et al.
Published: (2022)

A novel ring radius transform for video character reconstruction
by: Shivakumara, P., et al.
Published: (2013)

Performance of the color set partitioning in hierarchical tree scheme (CSPIHT) in video coding
by: Kassim, A.A., et al.
Published: (2014)

Multimodal multipart learning for action recognition in depth videos
by: Shahroudy, Amir, et al.
Published: (2018)

Human Interaction Image (HII) dataset
by: Li Junnan, et al.
Published: (2017)

Tracking and indexing of human actions in video image sequences
by: GAMHEWAGE CHAMINDA DE SILVA
Published: (2010)

Report on the FG 2015 video person recognition evaluation
by: BEVERIDGE, J.R., et al.
Published: (2015)

Multimodal affective computing for video summarization
by: Lew, Lincoln Wai Cheong
Published: (2024)

Coupling alignments with recognition for still-to-video face recognition
by: HUANG, Zhiwu, et al.
Published: (2013)

Spatiotemporal interaction residual networks with pseudo3d for video action recognition
by: Chen, J., et al.
Published: (2021)

3D scene reconstruction using a single monocular image
by: Ablay, Francis Miguel P., et al.
Published: (2012)

3D face recognition under varying expressions using an integrated morphable model
by: SEBASTIEN HENRI BENOIT
Published: (2010)

Highly scalable wavelet-based video codec for very low bit-rate environment
by: Tham, J.Y., et al.
Published: (2014)

A new Fourier-moments based video word and character extraction method for recognition
by: Rajendran, D., et al.
Published: (2013)

Brain-computer interface and voice-controlled 3D printed prosthetic hand
by: Oppus, Carlos M, et al.
Published: (2017)

A robust Hough-based algorithm for partial ellipse detection in broadcast soccer video
by: Yu, X., et al.
Published: (2013)

An FFT twofold subspace-based optimization method for solving electromagnetic inverse scattering problems
by: Zhong, Y., et al.
Published: (2014)