Boosting video representation learning with multi-faceted integration

Video content is multifaceted, consisting of objects, scenes, interactions or actions. The existing datasets mostly label only one of the facets for model training, resulting in the video representation that biases to only one facet depending on the training dataset. There is no study yet on how to...

Bibliographic Details
Main Authors:	QIU, Zhaofan, TING, Yao, NGO, Chong-wah, ZHANG, Xiao-Ping, WU, Dong, MEI, Tao
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2021
Subjects:	Databases and Information Systems
Online Access:	https://ink.library.smu.edu.sg/sis_research/6808 https://ink.library.smu.edu.sg/context/sis_research/article/7811/viewcontent/cvpr21.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-7811
record_format	dspace
spelling	sg-smu-ink.sis_research-7811 2022-01-27T08:29:28Z Boosting video representation learning with multi-faceted integration QIU, Zhaofan TING, Yao NGO, Chong-wah ZHANG, Xiao-Ping WU, Dong MEI, Tao Video content is multifaceted, consisting of objects, scenes, interactions or actions. The existing datasets mostly label only one of the facets for model training, resulting in the video representation that biases to only one facet depending on the training dataset. There is no study yet on how to learn a video representation from multifaceted labels, and whether multifaceted information is helpful for video representation learning. In this paper, we propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content. Technically, MUFI formulates the problem as visual-semantic embedding learning, which explicitly maps video representation into a rich semantic embedding space, and jointly optimizes video representation from two perspectives. One is to capitalize on the intra-facet supervision between each video and its own label descriptions, and the second predicts the "semantic representation" of each video from the facets of other datasets as the inter-facet supervision. Extensive experiments demonstrate that learning 3D CNN via our MUFI framework on a union of four large-scale video datasets plus two image datasets leads to superior capability of video representation. The pre-learnt 3D CNN with MUFI also shows clear improvements over other approaches on several downstream video applications. More remarkably, MUFI achieves 98.1%/80.9% on UCF101/HMDB51 for action recognition and 101.5% in terms of CIDEr-D score on MSVD for video captioning.
2021-06-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/6808 info:doi/10.1109/CVPR46437.2021.01381 https://ink.library.smu.edu.sg/context/sis_research/article/7811/viewcontent/cvpr21.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Databases and Information Systems
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Databases and Information Systems
description	Video content is multifaceted, consisting of objects, scenes, interactions or actions. The existing datasets mostly label only one of the facets for model training, resulting in the video representation that biases to only one facet depending on the training dataset. There is no study yet on how to learn a video representation from multifaceted labels, and whether multifaceted information is helpful for video representation learning. In this paper, we propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content. Technically, MUFI formulates the problem as visual-semantic embedding learning, which explicitly maps video representation into a rich semantic embedding space, and jointly optimizes video representation from two perspectives. One is to capitalize on the intra-facet supervision between each video and its own label descriptions, and the second predicts the "semantic representation" of each video from the facets of other datasets as the inter-facet supervision. Extensive experiments demonstrate that learning 3D CNN via our MUFI framework on a union of four large-scale video datasets plus two image datasets leads to superior capability of video representation. The pre-learnt 3D CNN with MUFI also shows clear improvements over other approaches on several downstream video applications. More remarkably, MUFI achieves 98.1%/80.9% on UCF101/HMDB51 for action recognition and 101.5% in terms of CIDEr-D score on MSVD for video captioning.
format	text
author	QIU, Zhaofan TING, Yao NGO, Chong-wah ZHANG, Xiao-Ping WU, Dong MEI, Tao
author_sort	QIU, Zhaofan
title	Boosting video representation learning with multi-faceted integration
title_sort	boosting video representation learning with multi-faceted integration
publisher	Institutional Knowledge at Singapore Management University
publishDate	2021
url	https://ink.library.smu.edu.sg/sis_research/6808 https://ink.library.smu.edu.sg/context/sis_research/article/7811/viewcontent/cvpr21.pdf
_version_	1770576072815935488

Similar Items