Group contextualization for video recognition

Learning discriminative representations from the complex spatio-temporal dynamic space is essential for video recognition. On top of stylized spatio-temporal computational units, further refining the learned features with axial contexts has proven promising for achieving this goal. However, previous works generally focus on utilizing a single kind of context to calibrate entire feature channels and can hardly cope with diverse video activities. The problem can be tackled by using pair-wise spatio-temporal attention to recompute feature responses with cross-axis contexts, but only at the expense of heavy computation. In this paper, we propose an efficient feature refinement method that decomposes the feature channels into several groups and separately refines them with different axial contexts in parallel. We refer to this lightweight feature calibration as group contextualization (GC). Specifically, we design a family of efficient element-wise calibrators, i.e., ECal-G/S/T/L, whose axial contexts are information dynamics aggregated from the other axes, either globally or locally, to contextualize feature channel groups. The GC module can be densely plugged into each residual layer of off-the-shelf video networks. With little computational overhead, consistent improvements are observed when GC is plugged into different networks. Because the calibrators embed features with four different kinds of contexts in parallel, the learned representation is expected to be more resilient to diverse types of activities. On videos with rich temporal variations, GC empirically boosts the performance of 2D-CNNs (e.g., TSN and TSM) to a level comparable to state-of-the-art video networks. Code is available at https://github.com/haoyanbin918/GroupContextualization.

Bibliographic Details
Main Authors: HAO, Yanbin, ZHANG, Hao, NGO, Chong-wah, HE, Xiangnan
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2022
Subjects: Recognition, detection, categorization, retrieval; Action and event recognition; Deep learning architectures and techniques; Efficient learning and inferences; Artificial Intelligence and Robotics; Graphics and Human Computer Interfaces
Online Access: https://ink.library.smu.edu.sg/sis_research/7504
https://ink.library.smu.edu.sg/context/sis_research/article/8507/viewcontent/Hao_Group_Contextualization_for_Video_Recognition_CVPR_2022_paper.pdf
DOI: 10.1109/CVPR52688.2022.00100
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Collection: Research Collection School Of Computing and Information Systems
Institution: Singapore Management University
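
Note on the method (illustrative): the abstract describes splitting the feature channels into groups and calibrating each group element-wise with a different axial context (global, spatial, temporal, or local), in parallel. The sketch below is not the authors' implementation (that is at the GitHub link above); it is a minimal PyTorch illustration of the grouping-and-calibration idea, and the two toy calibrators (a global squeeze-and-excitation-style gate and a spatially pooled temporal gate) are assumptions standing in for the paper's ECal-G/S/T/L designs.

import torch
import torch.nn as nn


class ECalGlobal(nn.Module):
    # Toy global calibrator: a squeeze-and-excitation-style gate computed
    # from a context pooled over all of T, H, W. Stand-in for ECal-G.
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (N, C, T, H, W)
        ctx = x.mean(dim=(2, 3, 4))              # (N, C) global context
        gate = self.fc(ctx)[:, :, None, None, None]
        return x * gate                          # element-wise recalibration


class ECalTemporal(nn.Module):
    # Toy temporal calibrator: context pooled over space only, so the
    # gate varies along the temporal axis. Stand-in for ECal-T.
    def __init__(self, channels, k=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, k, padding=k // 2)

    def forward(self, x):                        # x: (N, C, T, H, W)
        ctx = x.mean(dim=(3, 4))                 # (N, C, T) spatial pooling
        gate = torch.sigmoid(self.conv(ctx))[:, :, :, None, None]
        return x * gate


class GroupContextualization(nn.Module):
    # Split C channels into one group per calibrator, refine each group
    # with its own axial context in parallel, then concatenate.
    def __init__(self, channels, calibrator_factories):
        super().__init__()
        self.groups = len(calibrator_factories)
        assert channels % self.groups == 0, "channels must divide evenly"
        per_group = channels // self.groups
        self.calibrators = nn.ModuleList(
            factory(per_group) for factory in calibrator_factories
        )

    def forward(self, x):                        # x: (N, C, T, H, W)
        chunks = x.chunk(self.groups, dim=1)     # one chunk per calibrator
        out = [cal(c) for cal, c in zip(self.calibrators, chunks)]
        return torch.cat(out, dim=1)             # channel count unchanged


if __name__ == "__main__":
    # Example: 64 channels split into two 32-channel groups.
    gc = GroupContextualization(64, [ECalGlobal, ECalTemporal])
    video_feat = torch.randn(2, 64, 8, 14, 14)   # (N, C, T, H, W)
    assert gc(video_feat).shape == video_feat.shape

Because the module preserves the channel count, it can be inserted after a residual layer's feature map, consistent with the abstract's claim that GC is densely pluggable into off-the-shelf video networks; the actual calibrator designs and placement should be taken from the linked repository.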