Improving self-supervision in video representation learning

With the rapid advancement of deep learning techniques in computer vision, researchers have achieved high performance in video-related downstream tasks such as action classification and action detection. However, a pressing issue in this field is the scarcity of labeled data. A video contains hundre...

Bibliographic Details
Main Author:	Liu, Hualin
Other Authors:	Zhang Hanwang
Format:	Thesis-Master by Research
Language:	English
Published:	Nanyang Technological University 2021
Subjects:	Engineering::Computer science and engineering
Online Access:	https://hdl.handle.net/10356/152209
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-152209
record_format	dspace
spelling	sg-ntu-dr.10356-1522092021-09-06T02:34:42Z Improving self-supervision in video representation learning Liu, Hualin Zhang Hanwang School of Computer Science and Engineering Salesforce Research Asia hanwangzhang@ntu.edu.sg Engineering::Computer science and engineering With the rapid advancement of deep learning techniques in computer vision, researchers have achieved high performance in video-related downstream tasks such as action classification and action detection. However, a pressing issue in this field is the scarcity of labeled data. A video contains hundreds of frames, and hence it would take a daunting effort for researchers to manually collect and label a large video dataset. There are two promising directions to tackle this problem. One is self-supervised learning and the other is semi-supervised learning. In our research, we focus on improving self-supervised video representation learning methods. Current methods based on instance discrimination tasks suffer from a major limitation: semantically-similar samples are treated as negatives and their representations are forced to be different. To address this limitation, we propose smooth contrastive learning with a weak teacher, where we employ a teacher model to mine additional supervisory signals. Specifically, the teacher model computes a similarity distribution over weakly-augmented negative samples and uses it as an artificial label to smooth the one-hot label. The student is trained on strongly-augmented samples using the smoothed label. We evaluate the learned representation on action recognition and video retrieval tasks. The proposed Weak Teacher outperforms the baseline methods under the same dataset and computation budget. Master of Engineering 2021-07-23T00:34:19Z 2021-07-23T00:34:19Z 2021 Thesis-Master by Research Liu, H. (2021). Improving self-supervision in video representation learning. Master's thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/152209 10.32657/10356/152209 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Computer science and engineering
spellingShingle	Engineering::Computer science and engineering Liu, Hualin Improving self-supervision in video representation learning
description	With the rapid advancement of deep learning techniques in computer vision, researchers have achieved high performance in video-related downstream tasks such as action classification and action detection. However, a pressing issue in this field is the scarcity of labeled data. A video contains hundreds of frames, and hence it would take a daunting effort for researchers to manually collect and label a large video dataset. There are two promising directions to tackle this problem. One is self-supervised learning and the other is semi-supervised learning. In our research, we focus on improving self-supervised video representation learning methods. Current methods based on instance discrimination tasks suffer from a major limitation: semantically-similar samples are treated as negatives and their representations are forced to be different. To address this limitation, we propose smooth contrastive learning with a weak teacher, where we employ a teacher model to mine additional supervisory signals. Specifically, the teacher model computes a similarity distribution over weakly-augmented negative samples and uses it as an artificial label to smooth the one-hot label. The student is trained on strongly-augmented samples using the smoothed label. We evaluate the learned representation on action recognition and video retrieval tasks. The proposed Weak Teacher outperforms the baseline methods under the same dataset and computation budget.
author2	Zhang Hanwang
author_facet	Zhang Hanwang Liu, Hualin
format	Thesis-Master by Research
author	Liu, Hualin
author_sort	Liu, Hualin
title	Improving self-supervision in video representation learning
title_short	Improving self-supervision in video representation learning
title_full	Improving self-supervision in video representation learning
title_fullStr	Improving self-supervision in video representation learning
title_full_unstemmed	Improving self-supervision in video representation learning
title_sort	improving self-supervision in video representation learning
publisher	Nanyang Technological University
publishDate	2021
url	https://hdl.handle.net/10356/152209
_version_	1710686946196455424
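
The label-smoothing idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the function names, the mixing weight `alpha`, the temperatures, and the assumption that teacher and student score the same set of K negatives in the same order are all hypothetical choices made for illustration. The teacher turns its similarities to the negatives into a distribution, that distribution replaces part of the one-hot target, and the student is trained with cross-entropy against the smoothed target.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def smoothed_targets(teacher_query, teacher_negatives, alpha=0.2, tau_teacher=0.1):
    """Build soft labels of shape (N, 1 + K); column 0 is the positive.

    teacher_query:     (N, D) teacher embeddings of weakly-augmented clips.
    teacher_negatives: (K, D) teacher embeddings of weakly-augmented negatives.
    alpha:             hypothetical weight moved from the positive to negatives.
    """
    q = l2_normalize(teacher_query)
    neg = l2_normalize(teacher_negatives)
    # Teacher's similarity distribution over the negatives; each row sums to 1.
    neg_dist = softmax(q @ neg.T / tau_teacher)
    n, k = neg_dist.shape
    targets = np.zeros((n, k + 1))
    targets[:, 0] = 1.0 - alpha          # remaining mass on the true positive
    targets[:, 1:] = alpha * neg_dist    # smoothed mass on similar negatives
    return targets


def smooth_contrastive_loss(student_query, student_key, student_negatives,
                            targets, tau=0.1):
    """Cross-entropy between the smoothed targets and the student's softmax.

    All student embeddings are assumed to come from strongly-augmented clips.
    With alpha = 0 the targets are one-hot and this reduces to InfoNCE.
    """
    q = l2_normalize(student_query)
    k = l2_normalize(student_key)
    neg = l2_normalize(student_negatives)
    pos_logit = np.sum(q * k, axis=-1, keepdims=True) / tau   # (N, 1)
    neg_logits = q @ neg.T / tau                              # (N, K)
    logits = np.concatenate([pos_logit, neg_logits], axis=1)  # (N, 1 + K)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())


# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
targets = smoothed_targets(rng.normal(size=(4, 16)), rng.normal(size=(32, 16)))
loss = smooth_contrastive_loss(rng.normal(size=(4, 16)), rng.normal(size=(4, 16)),
                               rng.normal(size=(32, 16)), targets)
```

Setting `alpha` to 0 recovers the plain instance-discrimination objective, which makes it easy to compare the smoothed and unsmoothed losses on the same batch.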

Similar Items