Self-supervised video representation learning by uncovering spatio-temporal statistics

This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spa...

Bibliographic Details
Main Authors:	WANG, Jiangliu; JIAO, Jianbo; BAO, Linchao; HE, Shengfeng; LIU, Wei; LIU, Yun-hui
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2022
Subjects:	Task analysis; Three-dimensional displays; Neural networks; Image color analysis; Visualization; Training; Feature extraction; Self-supervised learning; representation learning; video understanding; 3D CNN; Information Security
Online Access:	https://ink.library.smu.edu.sg/sis_research/7839 https://ink.library.smu.edu.sg/context/sis_research/article/8842/viewcontent/self_supervised.pdf

id	sg-smu-ink.sis_research-8842
record_format	dspace
spelling	sg-smu-ink.sis_research-8842 2023-06-15T09:13:27Z Self-supervised video representation learning by uncovering spatio-temporal statistics WANG, Jiangliu JIAO, Jianbo BAO, Linchao HE, Shengfeng LIU, Wei LIU, Yun-hui This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions about rough spatial locations to understand the visual contents. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms the existing approaches across these backbone networks on four downstream video analysis tasks including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts. 
2022-07-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/7839 info:doi/10.1109/TPAMI.2021.3057833 https://ink.library.smu.edu.sg/context/sis_research/article/8842/viewcontent/self_supervised.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Task analysis Three-dimensional displays Neural networks Image color analysis Visualization Training Feature extraction Self-supervised learning representation learning video understanding 3D CNN Information Security
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Task analysis Three-dimensional displays Neural networks Image color analysis Visualization Training Feature extraction Self-supervised learning representation learning video understanding 3D CNN Information Security
spellingShingle	Task analysis Three-dimensional displays Neural networks Image color analysis Visualization Training Feature extraction Self-supervised learning representation learning video understanding 3D CNN Information Security WANG, Jiangliu JIAO, Jianbo BAO, Linchao HE, Shengfeng LIU, Wei LIU, Yun-hui Self-supervised video representation learning by uncovering spatio-temporal statistics
description	This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion and the spatial location and dominant color of the largest color diversity along the temporal axis. A neural network is then built and trained to yield the statistical summaries given the video frames as inputs. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing content in the visual field and needs only impressions of rough spatial locations to understand the visual content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms the existing approaches across these backbone networks on four downstream video analysis tasks: action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts.
format	text
author	WANG, Jiangliu JIAO, Jianbo BAO, Linchao HE, Shengfeng LIU, Wei LIU, Yun-hui
author_facet	WANG, Jiangliu JIAO, Jianbo BAO, Linchao HE, Shengfeng LIU, Wei LIU, Yun-hui
author_sort	WANG, Jiangliu
title	Self-supervised video representation learning by uncovering spatio-temporal statistics
title_short	Self-supervised video representation learning by uncovering spatio-temporal statistics
title_full	Self-supervised video representation learning by uncovering spatio-temporal statistics
title_fullStr	Self-supervised video representation learning by uncovering spatio-temporal statistics
title_full_unstemmed	Self-supervised video representation learning by uncovering spatio-temporal statistics
title_sort	self-supervised video representation learning by uncovering spatio-temporal statistics
publisher	Institutional Knowledge at Singapore Management University
publishDate	2022
url	https://ink.library.smu.edu.sg/sis_research/7839 https://ink.library.smu.edu.sg/context/sis_research/article/8842/viewcontent/self_supervised.pdf
_version_	1770576553845981184
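
Illustrative note: the description above mentions encoding the rough spatial location of the largest motion with a spatial partitioning pattern rather than exact coordinates. The sketch below shows one way such a label could be computed. It is not taken from the authors' repository: the function names `motion_magnitude` and `largest_motion_block`, the 2x2 grid, and the use of plain frame differencing as a stand-in for a real motion estimate are all assumptions made for illustration.

```python
def motion_magnitude(frames):
    """Sum of absolute frame-to-frame differences at each pixel.

    Assumption: frame differencing is only a crude proxy for motion.
    `frames` is a list of equally sized 2D lists of grayscale values.
    """
    height, width = len(frames[0]), len(frames[0][0])
    motion = [[0.0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        for y in range(height):
            for x in range(width):
                motion[y][x] += abs(curr[y][x] - prev[y][x])
    return motion


def largest_motion_block(frames, rows=2, cols=2):
    """Return the row-major index of the grid block with the most motion.

    The block index is a rough location label: it replaces exact (x, y)
    coordinates with one of rows * cols classes.
    """
    motion = motion_magnitude(frames)
    height, width = len(motion), len(motion[0])
    totals = [0.0] * (rows * cols)
    for y in range(height):
        for x in range(width):
            block = (y * rows // height) * cols + (x * cols // width)
            totals[block] += motion[y][x]
    return max(range(len(totals)), key=totals.__getitem__)


# Toy clip: a bright pixel flashes in the bottom-right quadrant of a 4x4 frame.
still = [[0] * 4 for _ in range(4)]
moved = [row[:] for row in still]
moved[3][2] = 9
label = largest_motion_block([still, moved, still])
```

Under these assumptions, `label` would serve as a classification target that a network predicts from the raw frames; a coarse grid keeps the target a small set of classes instead of a coordinate regression.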
