Group contextualization for video recognition

Learning discriminative representations from the complex spatio-temporal dynamic space is essential for video recognition. On top of stylized spatio-temporal computational units, further refining the learned features with axial contexts has proven promising for achieving this goal. However, previous works generally focus on utilizing a single kind of context to calibrate entire feature channels and can hardly cope with diverse video activities. The problem can be tackled by using pair-wise spatio-temporal attention to recompute feature responses with cross-axis contexts, but only at the expense of heavy computation. In this paper, we propose an efficient feature refinement method that decomposes the feature channels into several groups and separately refines them with different axial contexts in parallel. We refer to this lightweight feature calibration as group contextualization (GC). Specifically, we design a family of efficient element-wise calibrators, i.e., ECal-G/S/T/L, whose axial contexts are information dynamics aggregated from the other axes, either globally or locally, to contextualize feature channel groups. The GC module can be densely plugged into each residual layer of off-the-shelf video networks. With little computational overhead, consistent improvements are observed when GC is plugged into different networks. Because the calibrators embed features with four different kinds of contexts in parallel, the learned representation is expected to be more resilient to diverse types of activities. On videos with rich temporal variations, GC empirically boosts the performance of 2D-CNNs (e.g., TSN and TSM) to a level comparable to state-of-the-art video networks. Code is available at https://github.com/haoyanbin918/GroupContextualization.

Bibliographic Details
Main Authors: HAO, Yanbin, ZHANG, Hao, NGO, Chong-wah, HE, Xiangnan
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2022
Subjects: Recognition, detection, categorization, retrieval; Action and event recognition; Deep learning architectures and techniques; Efficient learning and inferences; Artificial Intelligence and Robotics; Graphics and Human Computer Interfaces
Online Access: https://ink.library.smu.edu.sg/sis_research/7504
https://ink.library.smu.edu.sg/context/sis_research/article/8507/viewcontent/Hao_Group_Contextualization_for_Video_Recognition_CVPR_2022_paper.pdf
DOI: 10.1109/CVPR52688.2022.00100
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Collection: Research Collection School Of Computing and Information Systems
Institution: Singapore Management University
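
Note on the method (illustrative): the abstract describes splitting the feature channels into groups and calibrating each group element-wise with a different axial context (global, spatial, temporal, or local), in parallel. The sketch below is not the authors' implementation (that is at the GitHub link above); it is a minimal PyTorch illustration of the grouping-and-calibration idea, and the two toy calibrators (a global squeeze-and-excitation-style gate and a spatially pooled temporal gate) are assumptions standing in for the paper's ECal-G/S/T/L designs.

import torch
import torch.nn as nn


class ECalGlobal(nn.Module):
    # Toy global calibrator: a squeeze-and-excitation-style gate computed
    # from a context pooled over all of T, H, W. Stand-in for ECal-G.
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (N, C, T, H, W)
        ctx = x.mean(dim=(2, 3, 4))              # (N, C) global context
        gate = self.fc(ctx)[:, :, None, None, None]
        return x * gate                          # element-wise recalibration


class ECalTemporal(nn.Module):
    # Toy temporal calibrator: context pooled over space only, so the
    # gate varies along the temporal axis. Stand-in for ECal-T.
    def __init__(self, channels, k=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, k, padding=k // 2)

    def forward(self, x):                        # x: (N, C, T, H, W)
        ctx = x.mean(dim=(3, 4))                 # (N, C, T) spatial pooling
        gate = torch.sigmoid(self.conv(ctx))[:, :, :, None, None]
        return x * gate


class GroupContextualization(nn.Module):
    # Split C channels into one group per calibrator, refine each group
    # with its own axial context in parallel, then concatenate.
    def __init__(self, channels, calibrator_factories):
        super().__init__()
        self.groups = len(calibrator_factories)
        assert channels % self.groups == 0, "channels must divide evenly"
        per_group = channels // self.groups
        self.calibrators = nn.ModuleList(
            factory(per_group) for factory in calibrator_factories
        )

    def forward(self, x):                        # x: (N, C, T, H, W)
        chunks = x.chunk(self.groups, dim=1)     # one chunk per calibrator
        out = [cal(c) for cal, c in zip(self.calibrators, chunks)]
        return torch.cat(out, dim=1)             # channel count unchanged


if __name__ == "__main__":
    # Example: 64 channels split into two 32-channel groups.
    gc = GroupContextualization(64, [ECalGlobal, ECalTemporal])
    video_feat = torch.randn(2, 64, 8, 14, 14)   # (N, C, T, H, W)
    assert gc(video_feat).shape == video_feat.shape

Because the module preserves the channel count, it can be inserted after a residual layer's feature map, consistent with the abstract's claim that GC is densely pluggable into off-the-shelf video networks; the actual calibrator designs and placement should be taken from the linked repository.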