DualFormer: Local-global stratified transformer for efficient video recognition

While transformers have shown great potential on video recognition with their strong capability of capturing long-range dependencies, they often suffer high computational costs induced by the self-attention to the huge number of 3D tokens. In this paper, we present a new transformer architecture ter...

Bibliographic Details
Main Authors:	LIANG, Yuxuan, ZHOU, Pan, ZIMMERMANN, Roger, YAN, Shuicheng
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2022
Subjects:	Efficient video transformer; Local and global attention; Artificial Intelligence and Robotics; Graphics and Human Computer Interfaces
Online Access:	https://ink.library.smu.edu.sg/sis_research/8980 https://ink.library.smu.edu.sg/context/sis_research/article/9983/viewcontent/2022_ECCV_DualFormer.pdf
Institution:	Singapore Management University

Similar Items

Video graph transformer for video question answering
by: XIAO, Junbin, et al.
Published: (2022)

Long-term leap attention, short-term periodic shift for video classification
by: ZHANG, Hao, et al.
Published: (2022)

MetaFormer baselines for vision
by: YU, Weihao, et al.
Published: (2023)

Contrastive video question answering via video graph transformer
by: XIAO, Junbin, et al.
Published: (2023)

MetaFormer is actually what you need for vision
by: YU, Weihao, et al.
Published: (2022)

Video summarization and scene detection by graph modeling
by: NGO, Chong-wah, et al.
Published: (2005)

Token shift transformer for video classification
by: ZHANG, Hao, et al.
Published: (2021)

On the selection of anchors and targets for video hyperlinking
by: CHENG, Zhi-Qi, et al.
Published: (2017)

Cross-modal Moment Localization in Videos
by: LIU, Meng, et al.
Published: (2020)

Watching 360° videos together
by: TANG, Anthony, et al.
Published: (2017)

Recent advances in content-based video analysis
by: NGO, Chong-wah, et al.
Published: (2001)

Energy-efficient mobile video management using smartphones
by: Hao, J., et al.
Published: (2013)

Synchronization of lecture videos and electronic slides by video text analysis
by: WANG, Feng, et al.
Published: (2003)

Video text detection and segmentation for optical character recognition
by: NGO, Chong-wah, et al.
Published: (2005)

Wave-ViT: Unifying wavelet and transformers for visual representation learning
by: YAO, Ting, et al.
Published: (2022)

Vireo @ video browser showdown 2019
by: NGUYEN, Phuong Anh, et al.
Published: (2019)

A benchmark and comparative study of video-based face recognition on cox face database
by: HUANG, Zhiwu, et al.
Published: (2015)

Exploring video streaming in public settings: Shared geocaching over distance using mobile video chat
by: PROCYK, Jason, et al.
Published: (2014)

Stargazer: An interactive camera robot for capturing how-to videos based on subtle instructor cues
by: LI, Jiannan, et al.
Published: (2023)

SwapVid: Integrating video viewing and document exploration with direct manipulation
by: MURAKAMI, Taichi, et al.
Published: (2024)

TRANSFORMER TECHNIQUES FOR HUMAN ACTION RECOGNITION AND LOCALIZATION
by: CHANG SHUNING
Published: (2024)

Deep video demoireing via compact invertible dyadic decomposition
by: QUAN, Yuhui, et al.
Published: (2023)

Exploiting self-adaptive posture-based focus estimation for lecture video editing
by: WANG, Feng, et al.
Published: (2005)

Tourgether360: Collaborative exploration of 360° videos using pseudo-spatial navigation
by: KUMAR, Kartikaeya, et al.
Published: (2022)

Route tapestries: Navigating 360° virtual tour videos using slit-scan visualizations
by: LI, Jiannan, et al.
Published: (2021)

Towards understanding why mask reconstruction pretraining helps in downstream tasks
by: PAN, Jiachun, et al.
Published: (2023)

TOWARDS ATTENTION-AWARE CONCEPT MAP BASED REVIEW IN VIDEO LEARNING
by: ZHANG SHAN
Published: (2023)

Learning to match anchor-target video pairs with dual attentional holographic networks
by: HAO, Yan Bin, et al.
Published: (2021)

Gesture tracking and recognition for lecture video editing
by: WANG, Feng, et al.
Published: (2004)

Video modeling and learning on Riemannian manifold for emotion recognition in the wild
by: LIU, Mengyi, et al.
Published: (2016)

PosMLP-Video: Spatial and temporal relative position encoding for efficient video recognition
by: HAO, Yanbin, et al.
Published: (2024)

On the annotation of web videos by efficient near-duplicate search
by: ZHAO, Wan-Lei, et al.
Published: (2010)

immersivePOV: Filming how-to videos with a head-mounted 360° action camera
by: HUANG, Kevin, et al.
Published: (2022)

Lecture video enhancement and editing by integrating posture, gesture, and text
by: WANG, Feng, et al.
Published: (2007)

Efficient cross-modal video retrieval with meta-optimized frames
by: HAN, Ning, et al.
Published: (2024)

Video hyperlinking: Libraries and tools for threading and visualizing large video collection
by: PANG, Lei, et al.
Published: (2012)

VIREO @ Video Browser Showdown 2020
by: NGUYEN, Phuong Anh, et al.
Published: (2020)

YOUR EYES TELL EVERYTHING IMPROVING THE EFFECTIVENESS OF ONLINE VIDEO ADVERTISING: AN EYE-TRACKING APPROACH
by: LUO CHENG
Published: (2016)

Bilingual effects on deployment of the attention system in linguistically and culturally homogeneous children and adults
by: YANG, Sujin, et al.
Published: (2016)

Text-driven video prediction
by: SONG, Xue, et al.
Published: (2024)