Condensing a sequence to one informative frame for video recognition

Video is complex due to large variations in motion and rich content in fine-grained visual details. Abstracting useful information from such information-intensive media requires exhaustive computing resources. This paper studies a two-step alternative that first condenses the video sequence to an in...

Bibliographic Details
Main Authors: QIU, Zhaofan; YAO, Ting; SHU, Yan; NGO, Chong-wah; MEI, Tao
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2021
Subjects: Databases and Information Systems; Graphics and Human Computer Interfaces
Online Access: https://ink.library.smu.edu.sg/sis_research/6890
https://ink.library.smu.edu.sg/context/sis_research/article/7893/viewcontent/iccv21.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-7893
record_format dspace
spelling sg-smu-ink.sis_research-7893 2022-02-07T11:01:04Z Condensing a sequence to one informative frame for video recognition QIU, Zhaofan; YAO, Ting; SHU, Yan; NGO, Chong-wah; MEI, Tao
2021-10-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/6890 https://ink.library.smu.edu.sg/context/sis_research/article/7893/viewcontent/iccv21.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Databases and Information Systems Graphics and Human Computer Interfaces
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Databases and Information Systems
Graphics and Human Computer Interfaces
description Video is complex due to large variations in motion and rich content in fine-grained visual details. Abstracting useful information from such information-intensive media requires exhaustive computing resources. This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then applies an off-the-shelf image recognition system to the synthetic frame. A valid question is how to define "useful information" and then distill it from a video sequence down to one synthetic frame. This paper presents a novel Informative Frame Synthesis (IFS) architecture that incorporates three objective tasks, i.e., appearance reconstruction, video categorization, and motion estimation, and two regularizers, i.e., adversarial learning and color consistency. Each task equips the synthetic frame with one ability, while each regularizer enhances its visual quality. By jointly learning the frame synthesis with these tasks and regularizers in an end-to-end manner, the generated frame is expected to encapsulate the spatio-temporal information required for video analysis. Extensive experiments are conducted on the large-scale Kinetics dataset. Compared with baseline methods that map a video sequence to a single image, IFS shows superior performance. More remarkably, IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks, and achieves performance comparable to state-of-the-art methods at lower computational cost.
format text
author QIU, Zhaofan
YAO, Ting
SHU, Yan
NGO, Chong-wah
MEI, Tao
title Condensing a sequence to one informative frame for video recognition
publisher Institutional Knowledge at Singapore Management University
publishDate 2021
url https://ink.library.smu.edu.sg/sis_research/6890
https://ink.library.smu.edu.sg/context/sis_research/article/7893/viewcontent/iccv21.pdf
_version_ 1770576114323816448