Condensing a sequence to one informative frame for video recognition

Bibliographic Details
Main Authors:	QIU, Zhaofan; YAO, Ting; SHU, Yan; NGO, Chong-wah; MEI, Tao
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2021
Subjects:	Databases and Information Systems; Graphics and Human Computer Interfaces
Online Access:	https://ink.library.smu.edu.sg/sis_research/6890 https://ink.library.smu.edu.sg/context/sis_research/article/7893/viewcontent/iccv21.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-7893
record_format	dspace
spelling	sg-smu-ink.sis_research-7893 2022-02-07T11:01:04Z 2021-10-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/6890 https://ink.library.smu.edu.sg/context/sis_research/article/7893/viewcontent/iccv21.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Databases and Information Systems Graphics and Human Computer Interfaces
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Databases and Information Systems; Graphics and Human Computer Interfaces
description	Video is complex due to large variations in motion and rich content in fine-grained visual details. Abstracting useful information from such information-intensive media requires extensive computing resources. This paper studies a two-step alternative that first condenses the video sequence into an informative "frame" and then applies an off-the-shelf image recognition system to the synthetic frame. A key question is how to define "useful information" and then distill it from a video sequence into one synthetic frame. This paper presents a novel Informative Frame Synthesis (IFS) architecture that incorporates three objective tasks, i.e., appearance reconstruction, video categorization, and motion estimation, and two regularizers, i.e., adversarial learning and color consistency. Each task equips the synthetic frame with one ability, while each regularizer enhances its visual quality. By jointly learning the frame synthesis end to end with these objectives, the generated frame is expected to encapsulate the spatio-temporal information needed for video analysis. Extensive experiments are conducted on the large-scale Kinetics dataset. Compared with baseline methods that map a video sequence to a single image, IFS shows superior performance. More remarkably, IFS consistently yields clear improvements on both image-based 2D networks and clip-based 3D networks, and achieves performance comparable to state-of-the-art methods at a lower computational cost.
format	text
author	QIU, Zhaofan; YAO, Ting; SHU, Yan; NGO, Chong-wah; MEI, Tao
author_sort	QIU, Zhaofan
title	Condensing a sequence to one informative frame for video recognition
publisher	Institutional Knowledge at Singapore Management University
publishDate	2021
url	https://ink.library.smu.edu.sg/sis_research/6890 https://ink.library.smu.edu.sg/context/sis_research/article/7893/viewcontent/iccv21.pdf
_version_	1770576114323816448
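The description names the IFS training signals (three tasks: appearance reconstruction, video categorization, motion estimation; two regularizers: adversarial learning, color consistency) and says they are learned jointly end to end, but it does not say how they are combined. A common way to train such a multi-objective model is to minimize a weighted sum of the individual losses. The Python sketch below assumes that form; the names `IFSLossWeights` and `total_ifs_loss` and every weight value are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IFSLossWeights:
    # Hypothetical weights: the record does not state how IFS balances its terms.
    reconstruction: float = 1.0  # task: appearance reconstruction
    categorization: float = 1.0  # task: video categorization
    motion: float = 1.0          # task: motion estimation
    adversarial: float = 0.1     # regularizer: adversarial learning
    color: float = 0.1           # regularizer: color consistency


def total_ifs_loss(losses: dict, w: IFSLossWeights = IFSLossWeights()) -> float:
    """Combine per-objective loss values into one scalar.

    Assumed form: a plain weighted sum of the three task losses and the two
    regularizer losses, so that all five are optimized jointly.
    """
    return (
        w.reconstruction * losses["reconstruction"]
        + w.categorization * losses["categorization"]
        + w.motion * losses["motion"]
        + w.adversarial * losses["adversarial"]
        + w.color * losses["color"]
    )


if __name__ == "__main__":
    # Illustrative loss values only; they are not results from the paper.
    example = {
        "reconstruction": 0.5,
        "categorization": 1.2,
        "motion": 0.3,
        "adversarial": 2.0,
        "color": 1.0,
    }
    print(total_ifs_loss(example))
```

Under this assumed formulation, setting a regularizer weight to 0 (for example `IFSLossWeights(adversarial=0.0)`) drops that term, which is one simple way to compare training with and without it.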

Similar Items